I’m trying to convert a PDF file to a Word document on the command line on Red Hat, but I keep getting the following error:
Error: no export filter
I’m using the command:
soffice --convert-to doc test.pdf
Any help would be greatly appreciated
What is the relationship with a database? You tagged this base, which is the tag for questions involving the database front-end Base.
Independently of the command line, what you request can’t be done with LO. LO treats PDFs as graphics files (which in fact they are), so they are handled by Draw. The main issue is rebuilding a text document structure from a set of unordered graphical objects, most of them text boxes. The second issue is exporting this supposedly rebuilt text document to a foreign format, DOC or DOCX, for which LO provides only an approximation because the format is not fully public and differs from ODF.
You should try an OCR program. However, no program will be able to tell which paragraphs are headings, body text or notes, … You can get a good approximate plain-text equivalent, but a lot of manual editing remains your responsibility.
Try this instead:
lowriter --headless --infilter='writer_pdf_import' --convert-to doc:"MS Word 2007 XML" yourfile.pdf
To my surprise, it worked on a random PDF I chose. However, the original and conversion are not identical - expect issues there. Note that if the PDF is a pure image, it may not work.
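For repeated use, the command above can be wrapped in a small shell function. This is only a sketch under a few assumptions: `soffice` is on the PATH, the function name `pdf_to_word` and its optional output-directory argument are made up for illustration, and the `docx` extension is used because, as far as I know, `MS Word 2007 XML` is the filter name for DOCX (the legacy DOC filter should be `MS Word 97`). The likely reason for the original "no export filter" error is that, without `--infilter`, the PDF is opened in Draw, which has no Word export filter.

```shell
# Sketch: convert one PDF to Word via LibreOffice's Writer PDF import filter.
# Assumptions: "soffice" is on PATH; pdf_to_word and its second argument
# (output directory, default ".") are hypothetical names for illustration.
pdf_to_word() {
    input=$1
    outdir=${2:-.}

    # Refuse to run on a missing file instead of letting soffice fail obscurely.
    if [ ! -f "$input" ]; then
        echo "pdf_to_word: no such file: $input" >&2
        return 1
    fi

    # Force the Writer import filter so the PDF is not opened in Draw,
    # then export with the DOCX filter and a matching .docx extension.
    soffice --headless \
        --infilter="writer_pdf_import" \
        --convert-to 'docx:MS Word 2007 XML' \
        --outdir "$outdir" \
        "$input"
}
```

Calling `pdf_to_word test.pdf` should then write `test.docx` to the current directory; the caveats about layout fidelity still apply.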
Thanks robleyd, that seems to work. I’ve only tested on a single page but hopefully it will work on a larger document too
I just tried another document - initially created in Writer and exported as PDF - and it also had some issues. The PDF and exported file are attached.
testwp.doc (97.1 KB)
testwp.pdf (95.8 KB)
Version: 7.6.4.1 (X86_64) / LibreOffice Community
Build ID: 60(Build:1)
CPU threads: 16; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (wbp_AU.UTF-8); UI: en-US
SlackBuild for 7.6.4 by Eric Hameleers
Calc: threaded
Apparently, it was not converted to a text flow. Text boxes remained text boxes, i.e. they are “decorations” which can’t be used to manipulate text as a flow and can’t be styled.
And the layout is somewhat messed up; some of the text boxes overflow the right margin. The conversion “works” for very small values of “work”.
Text boxes usually autosize to their contents. In this case, they overflow perhaps because the font face is not the same as the original one, and also because the various effective formatting attributes (weight, kerning, effective space width, …) could not be guessed. These attributes are not encoded in PDF, at least not as directly as in ODF or DOC(X).
Note also that paragraph structure is not rebuilt.
I think the margin sizes are not correctly transferred. The footnote on page 2 is not a footnote but a text box (in fact a pair of text boxes) erroneously positioned in the bottom margin. The text at middle height on the right is truncated, and I see no obvious reason for it. The footer is nowhere to be found.
OCR programs would perhaps achieve a better result, at least regarding paragraph structure.