Why does LibreOffice convert PDF to Word as text boxes instead of a normal document?

I want to convert PDF to Microsoft Word (doc, docx) from the Ubuntu 18 terminal using LibreOffice 6.1.3.2 10(Build:2) (I actually execute LibreOffice from PHP). But I get a document full of text boxes instead of a normal Word document. Note: I tried the conversion on several VPSes and got the same result.

To understand my problem, I suggest downloading my files here: https://nofile.io/f/DKvQYFRdYZg/pdf2word.rar

I have 4 files:

1.original.doc
2.original-to-pdf.pdf
3.pdf-to-word.doc
4.expected.doc

First I converted original.doc to original-to-pdf.pdf, then I tried to convert it back to Word using the following command:

soffice --infilter="writer_pdf_import" --convert-to docx a.pdf

The file was created successfully, but all content was converted to text boxes rather than a normal document. Then I tried several PDF-to-Word converters such as ilovepdf.com and pdf2doc.com, and got expected.doc.

You can see the difference by downloading my files from the link above or by looking at the images below.

My output:

[screenshot]

ilovepdf and pdf2doc output:

[screenshot]

I tried several filters, including converting PDF to ODT and then ODT to Word, but none of the commands below gave me the expected result:

soffice --infilter="writer_pdf_import" --convert-to docx:"Microsoft Word 2007/2010/2013 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS 2003 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 95" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 97" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 2007 XML" a.pdf

I know about premium software like ABBYY Cloud or Adobe cloud, but I don't think websites like ilovepdf and pdf2doc would use a paid service to provide a free one. My question is: am I missing some LibreOffice dependency needed to convert a PDF to a normal Word document?

Thanks


I was able to get this working:

import subprocess  # lowriter holds the path to the LibreOffice Writer executable
subprocess.run([lowriter, '--infilter=writer_pdf_import', '--convert-to', 'doc:MS Word 97', 'docx2.pdf'])

This converted my PDF to .doc format successfully, but I am still looking for a way to get a .docx file.
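
For a .docx target, a minimal sketch along the same lines might look like this. It only reuses the flags from the commands in the question plus the standard --headless and --outdir options; the function names and the soffice, pdf_path and out_dir parameters are hypothetical, chosen for illustration. The result will presumably still consist of text boxes, since the import filter is the same.

```python
import subprocess
from pathlib import Path


def build_convert_command(soffice, pdf_path, out_dir, target="docx"):
    """Return the argument list for a headless PDF-to-Word conversion.

    soffice, pdf_path and out_dir are illustrative parameter names;
    soffice is the path to the LibreOffice executable.
    """
    return [
        str(soffice),
        "--headless",                     # run without a GUI
        "--infilter=writer_pdf_import",   # open the PDF in Writer, as in the question
        "--convert-to", target,           # e.g. "docx" or "doc:MS Word 97"
        "--outdir", str(out_dir),
        str(pdf_path),
    ]


def convert(soffice, pdf_path, out_dir, target="docx"):
    """Run the conversion and return the expected output path."""
    cmd = build_convert_command(soffice, pdf_path, out_dir, target)
    subprocess.run(cmd, check=True)       # raises CalledProcessError on a non-zero exit
    extension = target.split(":")[0]      # "doc:MS Word 97" -> "doc"
    return Path(out_dir) / (Path(pdf_path).stem + "." + extension)
```

Passing the arguments as a list avoids shell quoting problems with filter names that contain spaces.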

You are not missing anything. LibreOffice tries to preserve the placement of text elements as they are defined in the PDF, including their exact on-page positions and wrapping. Because of how data is stored in a PDF, the text is mapped to text boxes; otherwise the result could not be accurate. The other approach (the one used by the applications you mentioned) is not implemented.
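
To check whether a converted .docx really consists of text boxes, one rough approach is to count the text box content elements in its main XML part. This is a sketch only: the function name is hypothetical, it assumes the file is a standard .docx (a zip archive containing word/document.xml), and the count may be higher than the number of visible boxes if the file stores each box in more than one representation.

```python
import re
import zipfile


def count_text_boxes(docx_file):
    """Count <w:txbxContent> elements in the main document part of a .docx.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # Match only opening tags, with or without attributes.
    return len(re.findall(r"<w:txbxContent[\s>]", xml))
```

A document produced by a converter that reflows text into normal paragraphs should give a count at or near zero, while the LibreOffice output described above would be expected to give a large one.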

See this Q&A for a macro by @Lupp to convert text boxes in Draw to a single text.


https://bugs.documentfoundation.org/show_bug.cgi?id=118370
