Command Line Convert From PDF Outputs Gibberish

ImmortalFirefly · May 2, 2014, 5:40pm

I need to convert some docs on the fly, but converting a PDF to anything else just outputs a bunch of crap. I’ve tried multiple files and multiple output formats. The command I use is:

$ /usr/bin/libreoffice --headless --convert-to html:HTML /path/to/file.pdf --outdir /path/to

On several attempts, the output I get back looks similar to the following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
        <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
        <TITLE></TITLE>
        <META NAME="GENERATOR" CONTENT="LibreOffice 4.0.4.2 (Linux)">
        <META NAME="CREATED" CONTENT="0;0">
        <META NAME="CHANGED" CONTENT="0;0">
</HEAD>
<BODY LANG="en-US" DIR="LTR">
<PRE>%PDF-1.5^M
%����^M
1 0 obj^M
&lt;&lt;/Type/Catalog/Pages 2 0 R/Lang(en-US) /StructTreeRoot 30 0 R/MarkInfo&lt;&lt;/Marked true&gt;&gt;&gt;&gt;^M
endobj^M
2 0 obj^M
&lt;&lt;/Type/Pages/Count 2/Kids[ 3 0 R 19 0 R] &gt;&gt;^M
endobj^M
3 0 obj^M
&lt;&lt;/Type/Page/Parent 2 0 R/Resources&lt;&lt;/Font&lt;&lt;/F1 5 0 R/F2 8 0 R/F3 10 0 R/F4 15 0 R&gt;&gt;/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] &gt;&gt;/Annots[ 7 0 R 17 0 R 18 0 R] /MediaBox[ 0 0 612 792] /Contents 4 0 R/Group&lt;&lt;/Type/Group/S/Transparency/CS/DeviceRGB&gt;&gt;/Tabs/S/StructParents 0&gt;&gt;^M
endobj^M
4 0 obj^M
&lt;&lt;/Filter/FlateDecode/Length 3229&gt;&gt;^M
stream^M
x��[[s<SPAN LANG="hi-IN">۶</SPAN>#~�##�36C�x9��Ա�6m�v#�d2�y�%��Dj(�n���#Q�!%#[��#��o##{�#��7��<SPAN LANG="hi-IN">ݰ����ۛ</SPAN>k�����͏�q���g���3!y��,��(N����,f#���?#?_^M/���oC�#�_�����[#��62�&lt;#�-�����2#\} �!����e6�^M#�mO�T&lt;�;r�#/y&lt;�Bi�B#&gt;#/A�O@#GR�#�����k�,�pK9��TD�P,�t�D+DȜ�2#���P#.��bńg#&lt;�Q����#�Y#s�<SPAN LANG="hi-IN">ﺍ�</SPAN>nc)�����#4�s}#�b��.#��T#'�#u#%�+�#u#e#�A��o�a2`Zx#�Y��0^�h��r�T��+Y��'�|��?-F���&#8209;��ō����C��#�|T6�0#&lt;N~@�}#�/�g�h\�#�<SPAN LANG="hi-IN">ܳ</SPAN>$O%��^r#h�3l���%G��1�#��$��e���l�=k+'��n�w2yno�O
P1��;���x�#�#8SYW^˩(�#�#iO�<SPAN LANG="zh-CN">閏�</SPAN>H���-h�)7
~#��s�^W���|�&gt;W�3�+�fY��#�#&gt;�{(�k|Oyf�̏�/SI#%���7�i<SPAN LANG="hi-IN">ﺕ</SPAN>#�Ͳ`Z        j��@�$#�͓`Z%E�wH7^?|&lt;I�<SPAN LANG="zh-CN">禀</SPAN>&lt;���V�<SPAN LANG="hi-IN">ߓ</SPAN>=�@L�#|�{Di�EJ�m�u#��m�D��&gt;��F�~ո�����#t��{�<SPAN LANG="hi-IN">؋</SPAN>#�T$3W��j��Ȓ�����4����8;�<SPAN LANG="hi-IN">݉�</SPAN>q�H+�3$�x�a2t�<SPAN LANG="hi-IN">ޗ</SPAN>#�U<SPAN LANG="hi-IN">ݰ��</SPAN>kI#��#�eS���ʄ##��

And the bad output goes on and on and on. The result is roughly double the file size of the original document.

I can get it working with doc/docx/txt/etc. just fine; only PDF gives me grief. I’ve put a temporary band-aid on the problem by installing pdftohtml, but I’m not crazy about its output. I’m on CentOS 6.4 with LibreOffice 4.0.4.2, installed with yum install libreoffice-headless.
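Since the broken output above still starts with the raw %PDF- header inside the <PRE> block, an on-the-fly pipeline could at least detect this failure before serving the file. A minimal sketch; the function name looks_like_raw_pdf is made up for illustration, and it only catches this particular failure mode:

```shell
#!/bin/sh
# Hypothetical helper (not part of LibreOffice): succeeds (exit 0) if the
# converted file still contains a raw PDF header such as "%PDF-1.5",
# which suggests the PDF was dumped as plain text instead of being parsed.
looks_like_raw_pdf() {
  # -a: treat binary data as text; -q: no output, exit status only
  grep -a -q '%PDF-[0-9]' "$1"
}

# Example use (path is a placeholder):
#   if looks_like_raw_pdf /path/to/file.html; then
#     echo "conversion produced raw PDF data" >&2
#   fi
```

A check like this would let the caller fall back to another converter such as pdftohtml only when needed.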

oweng · May 3, 2014, 6:26am

What is the nature of the PDF? How was it generated? Only PDF v1.4 is fully supported, so if the PDF is encoded using v1.5+ it will likely experience conversion problems.
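To see which version a given file declares, the header on its first line can be read directly. A rough sketch, assuming the file begins with the usual %PDF-x.y header; the function name pdf_version is hypothetical. Note that a PDF can also override the header version with a /Version entry in its catalog, which this does not check:

```shell
#!/bin/sh
# Hypothetical helper: print the version declared in the PDF header,
# e.g. "1.5" for a file starting with "%PDF-1.5".
pdf_version() {
  # Only the first 8 bytes are needed: "%PDF-" plus "x.y"
  head -c 8 "$1" | sed -n 's/^%PDF-\([0-9]\.[0-9]\).*/\1/p'
}

# Example use (path is a placeholder):
#   pdf_version /path/to/file.pdf
```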

ImmortalFirefly · May 5, 2014, 3:27pm

@oweng These are user-submitted PDFs, so they really aren’t in our control… I guess I could look at finding a program that converts the PDF to a lower version first, then have LibreOffice parse that. I hadn’t thought of that one…

ImmortalFirefly · May 5, 2014, 4:09pm

@oweng So far no dice. I installed Ghostscript 8.7 and used it to downgrade the PDF to 1.4 and even 1.3, and I still get a ton of this:

%PDF-1.4
%�쏢
6 0 obj
<</Length 7 0 R/Filter /FlateDecode>>
stream
x��=ۖ#�q<'o�?��#g
�@#��3�#;qr|,�b�@�n�*#�^?^��Z�����7{�y##�W{?�i�5��#�u]�#o�:�]�����o��������z��_/.���3�)...etcetcetc

The Ghostscript command I used was:

$ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=testing.pdf test.pdf

oweng · May 7, 2014, 8:27am

Those foreign-looking characters are just compressed data. PDF v1.2 included a FlateDecode filter that “Decompresses data encoded using the zlib/deflate compression method, reproducing the original text or binary data.” It appears the LO PDF input filter (Draw) may not be decompressing the data stream. You probably need to use the --infilter= parameter to specify Draw to get PDF input support. A look through the list of filters does not show any PDF input filter, though.
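For anyone trying the --infilter route, here is a minimal sketch. It assumes the PDF import component is installed (it is often packaged separately from the core) and that it registers a filter named writer_pdf_import; nothing in this thread confirms that for a 4.0.4.2 headless install. The pdf_to_html function name and the SOFFICE override variable are made up for illustration:

```shell
#!/bin/sh
# Hypothetical wrapper: convert a PDF to HTML while forcing an input filter.
# Assumption: a filter named "writer_pdf_import" is available in this install.
# SOFFICE may be set to point at a different binary; defaults to "libreoffice".
pdf_to_html() {
  in_pdf=$1    # path to the source PDF
  out_dir=$2   # directory that receives the converted file
  "${SOFFICE:-libreoffice}" --headless \
    --infilter=writer_pdf_import \
    --convert-to html \
    --outdir "$out_dir" "$in_pdf"
}

# Example use (paths are placeholders):
#   pdf_to_html /path/to/file.pdf /path/to
```

If the filter is not present, the forced filter may simply be ignored or the conversion may fail, so the output is still worth checking.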

ImmortalFirefly · May 7, 2014, 3:31pm

@oweng Another fruitless endeavor. I googled around, found pdftk, and installed it. I ran pdftk input.pdf output output.pdf uncompress and it did indeed uncompress the file. Then I ran LibreOffice, and no dice. So I downgraded the version before uncompressing, and it still didn’t work. Thanks for the ideas though. I just don’t know what pdftohtml is doing that LibreOffice isn’t, because pdftohtml works and LibreOffice doesn’t on the same PDFs.
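One rough way to confirm that the uncompress step did something is to count how many /FlateDecode filter entries remain in the file before and after. This is only a sketch with a made-up function name (count_flate_streams), and it is a plain text search, not a real PDF parse:

```shell
#!/bin/sh
# Hypothetical helper: count occurrences of "/FlateDecode" in a PDF.
# A compressed file should report one or more; an uncompressed copy
# should report fewer (ideally 0).
count_flate_streams() {
  # -a: treat binary as text; -o: one match per output line
  grep -a -o '/FlateDecode' "$1" | wc -l | tr -d ' '
}

# Example use (file names are placeholders):
#   count_flate_streams input.pdf
#   count_flate_streams output.pdf
```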

AlexKemp · February 21, 2016, 7:06pm