Command Line Convert From PDF Outputs Gibberish [closed]

asked 2014-05-02 19:40:03 +0200

ImmortalFirefly

updated 2015-09-07 00:42:28 +0200

Alex Kemp

I need to convert some docs on the fly, but converting a PDF doc to anything else just outputs a bunch of crap. I've tried multiple files and multiple output formats. The command I use is:

$ /usr/bin/libreoffice --headless --convert-to html:HTML /path/to/file.pdf --outdir /path/to

On several attempts, the output I get back looks similar to the following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
        <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
        <TITLE></TITLE>
        <META NAME="GENERATOR" CONTENT="LibreOffice 4.0.4.2 (Linux)">
        <META NAME="CREATED" CONTENT="0;0">
        <META NAME="CHANGED" CONTENT="0;0">
</HEAD>
<BODY LANG="en-US" DIR="LTR">
<PRE>%PDF-1.5^M
%����^M
1 0 obj^M
&lt;&lt;/Type/Catalog/Pages 2 0 R/Lang(en-US) /StructTreeRoot 30 0 R/MarkInfo&lt;&lt;/Marked true&gt;&gt;&gt;&gt;^M
endobj^M
2 0 obj^M
&lt;&lt;/Type/Pages/Count 2/Kids[ 3 0 R 19 0 R] &gt;&gt;^M
endobj^M
3 0 obj^M
&lt;&lt;/Type/Page/Parent 2 0 R/Resources&lt;&lt;/Font&lt;&lt;/F1 5 0 R/F2 8 0 R/F3 10 0 R/F4 15 0 R&gt;&gt;/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] &gt;&gt;/Annots[ 7 0 R 17 0 R 18 0 R] /MediaBox[ 0 0 612 792] /Contents 4 0 R/Group&lt;&lt;/Type/Group/S/Transparency/CS/DeviceRGB&gt;&gt;/Tabs/S/StructParents 0&gt;&gt;^M
endobj^M
4 0 obj^M
&lt;&lt;/Filter/FlateDecode/Length 3229&gt;&gt;^M
stream^M
x��[[s<SPAN LANG="hi-IN">۶</SPAN>#~�##�36C�x9��Ա�6m�v#�d2�y�%��Dj(�n���#Q�!%#[��#��o##{�#��7��<SPAN LANG="hi-IN">ݰ����ۛ</SPAN>k�����͏�q���g���3!y��,��(N����,f#���?#?_^M/���oC�#�_�����[#��62�&lt;#�-�����2#\} �!����e6�^M#�mO�T&lt;�;r�#/y&lt;�Bi�B#&gt;#/A�O@#GR�#�����k�,�pK9��TD�P,�t�D+DȜ�2#���P#.��bńg#&lt;�Q����#�Y#s�<SPAN LANG="hi-IN">ﺍ�</SPAN>nc)�����#4�s}#�b��.#��T#'�#u#%�+�#u#e#�A��o�a2`Zx#�Y��0^�h��r�T��+Y��'�|��?-F���&#8209;��ō����C��#�|T6�0#&lt;N~@�}#�/�g�h\�#�<SPAN LANG="hi-IN">ܳ</SPAN>$O%��^r#h�3l���%G��1�#��$��e���l�=k+'��n�w2yno�O
P1��;���x�#�#8SYW^˩(�#�#iO�<SPAN LANG="zh-CN">閏�</SPAN>H���-h�)7
~#��s�^W���|�&gt;W�3�+�fY��#�#&gt;�{(�k|Oyf�̏�/SI#%���7�i<SPAN LANG="hi-IN">ﺕ</SPAN>#�Ͳ`Z        j��@�$#�͓`Z%E�wH7^?|&lt;I�<SPAN LANG="zh-CN">禀</SPAN>&lt;���V�<SPAN LANG="hi-IN">ߓ</SPAN>=�@L�#|�{Di�EJ�m�u#��m�D��&gt;��F�~ո�����#t��{�<SPAN LANG="hi-IN">؋</SPAN>#�T$3W��j��Ȓ�����4����8;�<SPAN LANG="hi-IN">݉�</SPAN>q�H+�3$�x�a2t�<SPAN LANG="hi-IN">ޗ</SPAN>#�U<SPAN LANG="hi-IN">ݰ��</SPAN>kI#��#�eS���ʄ##��

And the bad output goes on and on and on. It roughly doubles the file size of the original document.

I can get ...
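The `<PRE>` block in the output above begins with the literal `%PDF-1.5` header, which suggests the PDF bytes were passed through as plain text rather than parsed. For on-the-fly conversion of user-submitted files, a small check like the following could flag such failures. This is a minimal sketch: `looks_like_raw_pdf` is a hypothetical helper name, and it assumes a `grep` that supports `-a` (GNU and BSD grep do).

```shell
# Hypothetical helper: return 0 if the converted output still contains a
# raw PDF header, i.e. the PDF bytes were copied through as text.
looks_like_raw_pdf() {
    # -a treats binary data as text; -q suppresses output
    grep -aq '%PDF-[0-9]\.[0-9]' "$1"
}
```

For example, `looks_like_raw_pdf /path/to/file.html && echo "conversion failed"`. A legitimate HTML page that happens to quote a PDF header would also match, so treat the result as a hint rather than proof.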


Closed for the following reason: question is not relevant or outdated (by Alex Kemp)
close date 2016-02-21 20:06:02.854655

Comments

What is the nature of the PDF? How was it generated? Only PDF v1.4 is fully supported, so if the PDF is encoded using v1.5+ it will likely experience conversion problems.

oweng ( 2014-05-03 08:26:48 +0200 )
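To see which PDF version a submitted file declares, the first bytes of the file can be inspected. The sketch below assumes the version sits in the standard `%PDF-x.y` header at the very start of the file and that `head -c` is available; `pdf_header_version` is a hypothetical helper name. A PDF may also override its version inside the document catalog, which this check does not look at.

```shell
# Hypothetical helper: print the version from a "%PDF-x.y" file header.
# Returns non-zero if the first 8 bytes are not a PDF header.
pdf_header_version() {
    # Read the first 8 bytes, e.g. "%PDF-1.5"; drop any NUL bytes
    header=$(head -c 8 "$1" | tr -d '\0')
    case "$header" in
        %PDF-[0-9].[0-9]) printf '%s\n' "${header#%PDF-}" ;;
        *) return 1 ;;
    esac
}
```

For example, `pdf_header_version /path/to/file.pdf` prints the declared version for a file whose header matches, and fails otherwise.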

@oweng These are user-submitted PDFs, so they really aren't in our control. I guess I could look at finding a program that converts the PDF to a lower version first, then have LibreOffice parse that. I hadn't thought of that one.

ImmortalFirefly ( 2014-05-05 17:27:10 +0200 )

@oweng So far no dice. I installed Ghostscript 8.7 and used it to downgrade the PDF to v1.4 and even v1.3, and I still get a ton of this:

%PDF-1.4
%�쏢
6 0 obj
<</Length 7 0 R/Filter /FlateDecode>>
stream
x��=ۖ#�q<'o�?��#g
�@#��3�#;qr|,�b�@�n�*#�^?^��Z�����7{�y##�W{?�i�5��#�u]�#o�:�]�����o��������z��_/.���3�)...etcetcetc

The Ghostscript command I used was:

$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=testing.pdf test.pdf

ImmortalFirefly ( 2014-05-05 18:09:50 +0200 )

Those foreign-looking characters are just compressed data. PDF v1.2 included a FlateDecode filter that "Decompresses data encoded using the zlib/deflate compression method, reproducing the original text or binary data." It appears the LO PDF input filter (Draw) may not be decompressing the data stream. You probably need to use the --infilter= parameter to specify Draw to get PDF input support. A look through the list of filters does not show any PDF input filter, though.

oweng ( 2014-05-07 10:27:27 +0200 )
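The `--infilter=` suggestion could be tried along these lines. This is only a sketch under assumptions the thread does not confirm: that the installed LibreOffice build includes the PDF import component, and that it registers an import filter named `writer_pdf_import` (`draw_pdf_import` is another name commonly seen). The `pdf_to_html_lo` function and the `SOFFICE` variable are hypothetical names introduced here for illustration.

```shell
# Hypothetical wrapper: SOFFICE may point at a different binary.
SOFFICE="${SOFFICE:-libreoffice}"

# Convert one PDF to HTML, explicitly naming the PDF import filter
# instead of letting LibreOffice guess the input type.
# ASSUMPTION: the filter name "writer_pdf_import" exists in this build.
pdf_to_html_lo() {
    in=$1      # path to the input PDF
    outdir=$2  # directory for the converted file
    "$SOFFICE" --headless --infilter="writer_pdf_import" \
        --convert-to html --outdir "$outdir" "$in"
}
```

For example, `pdf_to_html_lo /path/to/file.pdf /path/to`. If the output still starts with a raw `%PDF-` header, the filter was likely not found or not applied.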

Another fruitless endeavor :( I googled around, found pdftk, and installed it. I ran pdftk input.pdf output output.pdf uncompress and it did indeed uncompress the file. Then I ran LibreOffice on the result, and no dice. So I downgraded the PDF version before uncompressing, and it still didn't work. Thanks for the ideas, though. I just don't know what pdftohtml is doing that LibreOffice isn't, because pdftohtml works and LibreOffice doesn't on the same PDFs. @oweng

ImmortalFirefly ( 2014-05-07 17:31:26 +0200 )
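Since pdftohtml is reported to work on the same files, it could serve as the converter for PDF input while LibreOffice handles other formats. A minimal sketch, assuming the poppler version of pdftohtml, whose `-noframes` option asks for a single HTML file instead of a frameset; `pdf_to_html_poppler` and the `PDFTOHTML` variable are hypothetical names.

```shell
# Hypothetical wrapper: PDFTOHTML may point at a different binary.
PDFTOHTML="${PDFTOHTML:-pdftohtml}"

# Convert IN (a PDF) to HTML using OUTBASE as the output base name.
pdf_to_html_poppler() {
    in=$1
    outbase=$2
    "$PDFTOHTML" -noframes "$in" "$outbase"
}
```

For example, `pdf_to_html_poppler /path/to/file.pdf /path/to/file`.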