Convert pdf to doc in Linux

Fred15 · January 17, 2024, 9:31am

I’m trying to convert a PDF file to a Word document on the command line in Red Hat, but I keep getting the following error:

Error: no export filter

I’m using the command:
soffice --convert-to doc test.pdf

Any help would be greatly appreciated.

ajlittoz · January 17, 2024, 9:56am

What is the relationship with a database? You tagged this question base, which is the tag for questions involving the DB front-end Base.

Regardless of the command line, what you are asking for can’t be done with LO. LO treats PDFs as graphics files (which in fact they are), so they are handled by Draw. The main issue is rebuilding a text document structure from a set of unordered graphical objects, most of which are text boxes. The second issue is exporting this supposedly rebuilt text document to a foreign format, DOC or DOCX, for which LO provides only an approximation because the format is not fully public and differs from ODF.

You should try an OCR program. However, no program will be able to tell which paragraphs are headings, body text, notes, … You can get a good approximation of the plain text, but a lot of manual editing remains your responsibility.
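
For the plain-text route, here is a minimal sketch. It assumes poppler-utils (pdftotext) and OCRmyPDF are installed, which nobody in this thread has confirmed for the original poster's system. It first tries the text layer already embedded in the PDF, which is plain extraction rather than OCR, and only falls back to OCR when that yields nothing. The function names has_text and pdf_to_text are made up for illustration.

```shell
#!/bin/sh
# Sketch: get plain text out of a PDF, with OCR as a fallback.
# Assumes pdftotext (poppler-utils) and ocrmypdf are on the PATH.

# has_text FILE: succeeds if FILE contains at least one non-whitespace character.
has_text() {
    grep -q '[^[:space:]]' "$1"
}

# pdf_to_text FILE.pdf OUT.txt
pdf_to_text() {
    pdf=$1
    txt=$2
    # Try the embedded text layer first; -layout keeps the rough page layout.
    pdftotext -layout "$pdf" "$txt" || return 1
    if ! has_text "$txt"; then
        # Nothing extracted (probably an image-only PDF): run OCR and
        # write the recognised text to the sidecar file.
        ocrmypdf --sidecar "$txt" "$pdf" "${pdf%.pdf}-ocr.pdf"
    fi
}
```

Called as pdf_to_text test.pdf test.txt, this leaves the text in test.txt. Headings, notes and paragraph roles still have to be restored by hand.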

robleyd · January 17, 2024, 9:57am

Try this instead

lowriter --headless --infilter='writer_pdf_import' --convert-to doc:"MS Word 2007 XML" yourfile.pdf

To my surprise, it worked on a random PDF I chose. However, the original and conversion are not identical - expect issues there. Note that if the PDF is a pure image, it may not work.
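
A note on why this differs from the original attempt: without --infilter, the PDF is opened by Draw, which has no Word export filter, hence "no export filter". The writer_pdf_import filter makes Writer open it instead. Also, "MS Word 2007 XML" is the DOCX export filter, so the command above writes DOCX content into a file named .doc. A minimal sketch that keeps extension and filter consistent follows. The function names word_filter and pdf_to_word are hypothetical, and it assumes soffice is on the PATH.

```shell
#!/bin/sh
# Sketch: convert a PDF to a Word file through Writer's PDF import filter.

# word_filter EXT: print the export filter name matching the target extension.
word_filter() {
    case "$1" in
        doc)  printf '%s\n' 'MS Word 97' ;;        # binary .doc format
        docx) printf '%s\n' 'MS Word 2007 XML' ;;  # OOXML .docx format
        *)    printf 'unsupported target: %s\n' "$1" >&2; return 1 ;;
    esac
}

# pdf_to_word FILE.pdf [doc|docx] [OUTDIR]
pdf_to_word() {
    pdf=$1
    ext=${2:-docx}
    outdir=${3:-.}
    filter=$(word_filter "$ext") || return 1
    # --infilter forces Writer (not Draw) to load the PDF.
    soffice --headless \
        --infilter="writer_pdf_import" \
        --convert-to "$ext:$filter" \
        --outdir "$outdir" \
        "$pdf"
}
```

For example, pdf_to_word test.pdf doc should write test.doc to the current directory. The quality caveats mentioned above apply either way.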

Fred15 · January 17, 2024, 10:07am

Thanks robleyd, that seems to work. I’ve only tested it on a single page, but hopefully it will work on a larger document too.

robleyd · January 17, 2024, 10:10am

I just tried another document - initially created in Writer and exported as PDF - and it also had some issues. The PDF and exported file are attached.
testwp.doc (97.1 KB)
testwp.pdf (95.8 KB)

Version: 7.6.4.1 (X86_64) / LibreOffice Community
Build ID: 60(Build:1)
CPU threads: 16; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (wbp_AU.UTF-8); UI: en-US
SlackBuild for 7.6.4 by Eric Hameleers
Calc: threaded

ajlittoz · January 17, 2024, 10:21am

Apparently, it was not converted to a text flow. Text boxes remained text boxes, i.e. they are “decorations” that are useless for manipulating text and can’t be styled.

robleyd · January 17, 2024, 10:26am

And the layout is messed up somewhat; some of the text boxes overflow the right margin. The conversion “works” for very small values of “work”.

ajlittoz · January 17, 2024, 12:33pm

Text boxes usually autosize to their contents. In this case, they overflow perhaps because the font face is not the same as the original one, and also because the effective formatting attributes (weight, kerning, effective space width, …) could not be guessed. These attributes are not encoded in PDF, at least not as directly as in ODF or DOC(X).
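
To check the font hypothesis, one can list the fonts the PDF refers to and compare them with what is installed. This is only a sketch, assuming pdffonts from poppler-utils is available. The function names are made up, and the parsing relies on the usual pdffonts layout (two header lines, font name in the first column).

```shell
#!/bin/sh
# Sketch: list the font names referenced by a PDF.

# Remove the six-letter subset tag (e.g. "ABCDEF+") that embedded subsets carry.
strip_subset_prefix() {
    sed 's/^[A-Z]\{6\}+//'
}

# pdf_font_names FILE.pdf: one font name per line, duplicates removed.
pdf_font_names() {
    # Skip the two header lines of pdffonts output and keep the name column.
    pdffonts "$1" | awk 'NR > 2 { print $1 }' | strip_subset_prefix | sort -u
}
```

Running pdf_font_names testwp.pdf and comparing the result with the output of fc-list gives a rough idea of which faces will be substituted. PDF font names are PostScript names, so they will not always match the installed family names exactly.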

Note also that paragraph structure is not rebuilt.

I think that margin sizes are not correctly transferred. The footnote on page 2 is not a footnote but a text box (in fact a pair of text boxes) erroneously positioned in the bottom margin. Text at mid-height on the right is truncated, and I see no obvious reason for it. The footer is nowhere to be found.

OCR programs would perhaps achieve a better result, at least regarding paragraph structure.