I need to convert documents on the fly, but converting a PDF to any other format just outputs garbage. I’ve tried multiple input files and multiple target formats. The command I use is:
$ /usr/bin/libreoffice --headless --convert-to html:HTML /path/to/file.pdf --outdir /path/to
On several attempts, the output I get back looks similar to the following:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
<TITLE></TITLE>
<META NAME="GENERATOR" CONTENT="LibreOffice 4.0.4.2 (Linux)">
<META NAME="CREATED" CONTENT="0;0">
<META NAME="CHANGED" CONTENT="0;0">
</HEAD>
<BODY LANG="en-US" DIR="LTR">
<PRE>%PDF-1.5^M
%����^M
1 0 obj^M
<</Type/Catalog/Pages 2 0 R/Lang(en-US) /StructTreeRoot 30 0 R/MarkInfo<</Marked true>>>>^M
endobj^M
2 0 obj^M
<</Type/Pages/Count 2/Kids[ 3 0 R 19 0 R] >>^M
endobj^M
3 0 obj^M
<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 8 0 R/F3 10 0 R/F4 15 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/Annots[ 7 0 R 17 0 R 18 0 R] /MediaBox[ 0 0 612 792] /Contents 4 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>>^M
endobj^M
4 0 obj^M
<</Filter/FlateDecode/Length 3229>>^M
stream^M
x��[[s<SPAN LANG="hi-IN">۶</SPAN>#~�##�36C�x9��Ա�6m�v#�d2�y�%��Dj(�n���#Q�!%#[��#��o##{�#��7��<SPAN LANG="hi-IN">ݰ����ۛ</SPAN>k�����͏�q���g���3!y��,��(N����,f#���?#?_^M/���oC�#�_�����[#��62�<#�-�����2#\} �!����e6�^M#�mO�T<�;r�#/y<�Bi�B#>#/A�O@#GR�#�����k�,�pK9��TD�P,�t�D+DȜ�2#���P#.��bńg#<�Q����#�Y#s�<SPAN LANG="hi-IN">ﺍ�</SPAN>nc)�����#4�s}#�b��.#��T#'�#u#%�+�#u#e#�A��o�a2`Zx#�Y��0^�h��r�T��+Y��'�|��?-F���‑��ō����C��#�|T6�0#<N~@�}#�/�g�h\�#�<SPAN LANG="hi-IN">ܳ</SPAN>$O%��^r#h�3l���%G��1�#��$��e���l�=k+'��n�w2yno�O
P1��;���x�#�#8SYW^˩(�#�#iO�<SPAN LANG="zh-CN">閏�</SPAN>H���-h�)7
~#��s�^W���|�>W�3�+�fY��#�#>�{(�k|Oyf�̏�/SI#%���7�i<SPAN LANG="hi-IN">ﺕ</SPAN>#�Ͳ`Z j��@�$#�͓`Z%E�wH7^?|<I�<SPAN LANG="zh-CN">禀</SPAN><���V�<SPAN LANG="hi-IN">ߓ</SPAN>=�@L�#|�{Di�EJ�m�u#��m�D��>��F�~ո�����#t��{�<SPAN LANG="hi-IN">؋</SPAN>#�T$3W��j��Ȓ�����4����8;�<SPAN LANG="hi-IN">݉�</SPAN>q�H+�3$�x�a2t�<SPAN LANG="hi-IN">ޗ</SPAN>#�U<SPAN LANG="hi-IN">ݰ��</SPAN>kI#��#�eS���ʄ##��
The garbage continues like this for the rest of the file, and the output ends up roughly double the size of the original document.
Converting doc, docx, txt, and similar formats works fine; only PDF input fails. As a temporary workaround I installed pdftohtml, but I’m not happy with its output. I’m on CentOS 6.4 with LibreOffice 4.0.4.2, installed with yum install libreoffice-headless.
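To make the pdftohtml stopgap concrete, here is a minimal sketch of how the routing could look. It assumes that checking for the `%PDF-` header (the same one visible at the top of the bad output) is enough to spot PDF input, and that poppler's pdftohtml accepts `-noframes` plus an output base name. The function names pick_converter and convert_to_html are hypothetical, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: print which converter to use for a given input file.
pick_converter() {
    # A PDF starts with the literal bytes "%PDF-", so read the first five.
    magic=$(head -c 5 "$1" 2>/dev/null)
    if [ "$magic" = "%PDF-" ]; then
        echo pdftohtml
    else
        echo libreoffice
    fi
}

# Hypothetical wrapper: convert one file to HTML inside the output directory.
convert_to_html() {
    in=$1
    outdir=$2
    case $(pick_converter "$in") in
        pdftohtml)
            # Assumption: with -noframes, pdftohtml writes <base>.html.
            base=$(basename "$in")
            pdftohtml -noframes "$in" "$outdir/${base%.*}"
            ;;
        *)
            # Same command as in the question, for non-PDF input.
            /usr/bin/libreoffice --headless --convert-to html:HTML "$in" --outdir "$outdir"
            ;;
    esac
}
```

It would be called as `convert_to_html /path/to/file.pdf /path/to`. Sniffing the header rather than trusting the extension means a PDF with the wrong file name still goes to pdftohtml.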