--convert-to "txt:Text (encoded):UTF8" fails

This question has been asked many times, but I haven’t found an answer that works for me on my Mac.

/Applications/LibreOffice.app/Contents/MacOS/soffice --convert-to "txt:Text (encoded):UTF8" --headless Test.pdf

gets the response:

convert /path/to/Test.pdf -> /path/to/Test.txt using filter : Text (encoded):UTF8
Error: Please verify input parameters... (SfxBaseModel::impl_store <file:///path/to/Test.txt> failed: 0xc10(Error Area:Io Class:Write Code:16))
  • I’ve tried with and without --headless, with and without LibreOffice already running. (When not initially running, it temporarily appears in my task bar, then disappears after printing the error.)
  • I’ve also tried with and without an --outdir <dir> to which I have write permissions
  • The command line arguments all match those printed by soffice --help, including the specific output filter, "txt:Text (encoded):UTF8"

Given that the variants I’ve tried are included in already-provided answers to this question, I’m hoping somebody can tell me some missing part of the incantation.

macOS 12.5 (Monterey)
LibreOffice 7.1.6.2 0e133318fcee89abacd6a7d077e292f1145735c3

IMHO you are simply asking for the wrong direction. While it is comparatively easy to convert to PDF (or print to PDF), it can be impossible to do the reverse and recover the text.

On a more technical level: --convert-to loads the file into the default LibreOffice module for its type (PDF is loaded into Draw), then saves with the named filter, which fails if that filter does not belong to the module in use. As another example, you cannot simply convert a text document (a letter, .odt) to a spreadsheet (.ods).
For PDF the default is Draw, because a PDF may be nothing more than graphical instructions for distributing paint on a page. Some of these instructions may draw letters, but nothing enforces the sequence in which we usually read them. And a graphics program like Draw usually cannot save as text.

If you wish to convert PDF to text, first check whether you can copy the text from a PDF viewer. If that is not possible, you may need OCR software. If it is possible, search for pdf2txt or pdf2doc to find specialized software for this.

This is a general topic, and it cannot be solved by switching the OS.

It may be solved by using another tool.
This is a link to a Python program which attempts to do your conversion (but remember: this will also fail if the PDF is only a scanned image of some text):
https://pdfminersix.readthedocs.io/en/latest/tutorial/commandline.html
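The same extraction can also be driven from Python instead of the command line. The following is only a minimal sketch, assuming pdfminer.six is installed (pip install pdfminer.six). extract_text is the library's high-level function; pdf_to_text_file and its extract parameter are hypothetical names introduced here for illustration.

```python
from pathlib import Path


def pdf_to_text_file(pdf_path, txt_path=None, extract=None):
    """Write the text layer of pdf_path to a UTF-8 .txt file and return its path.

    `extract` is a hypothetical hook (any callable taking a path and returning
    a string); by default pdfminer.six's extract_text is used.
    """
    if extract is None:
        # Imported lazily so this module loads even without pdfminer.six.
        from pdfminer.high_level import extract_text as extract
    pdf_path = Path(pdf_path)
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    text = extract(str(pdf_path))
    if not text.strip():
        # No text came back; the PDF may be a scanned image, which needs OCR.
        raise ValueError(f"no extractable text in {pdf_path}")
    txt_path.write_text(text, encoding="utf-8")
    return txt_path
```

Calling pdf_to_text_file("Test.pdf") would write Test.txt next to the PDF. Raising on empty output is a design choice of this sketch, so that a scanned PDF is noticed instead of silently producing an empty file.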

Thanks for the suggestion pdfminer.six. Looks like that can be a very helpful solution.

It has also been answered many times. And the answer is: you can only export a document to file types supported by the module that opened the source file. Which translates to human language as: open the document in interactive mode, just as you ask on the command line (open LibreOffice and, in it, open the PDF); see that the PDF is opened in Draw; try Save As and look at the repertoire of file formats available; realize that TXT is not among them.

Then see that you can explicitly choose PDF (Writer) in the File Open dialog; Writer would indeed be able to save to TXT (try it and see the result, which will surprise you). Similarly, you can explicitly specify the import filter on the command line, using --infilter.

soffice --infilter=writer_pdf_import --convert-to "txt:Text (encoded):UTF8" Test.pdf

and see that, since a PDF is imported as an insane number of text boxes (graphical objects), and TXT export ignores all graphical objects, it works but is useless.
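For scripted use, the command above can be wrapped so that an empty result is detected instead of passing silently. This is a rough sketch only: the soffice path is the macOS one from the question, all flags are the ones already used in this thread, and build_convert_command and convert_and_check are hypothetical helper names.

```python
import subprocess
from pathlib import Path

# Path taken from the question; adjust it for other installations.
SOFFICE = "/Applications/LibreOffice.app/Contents/MacOS/soffice"


def build_convert_command(pdf_path, outdir=".", soffice=SOFFICE):
    """Return the argument list that forces the Writer PDF import filter."""
    return [
        soffice,
        "--headless",
        "--infilter=writer_pdf_import",
        "--convert-to", "txt:Text (encoded):UTF8",
        "--outdir", str(outdir),
        str(pdf_path),
    ]


def convert_and_check(pdf_path, outdir="."):
    """Run the conversion; return True only if a non-empty .txt was written."""
    subprocess.run(build_convert_command(pdf_path, outdir), check=True)
    out = Path(outdir) / (Path(pdf_path).stem + ".txt")
    return out.exists() and out.stat().st_size > 0
```

Given the behaviour described above, convert_and_check is expected to return False for most PDFs, which is exactly the "works but is useless" case.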

Filed tdf#153683.

Thanks for your prompt and clear answer @mikekaganski, which goes a long way toward explaining why I was having little success. I am able to load the PDF into Draw, and it provides handy bounding boxes for each element of text and design, which I can manipulate as I wish. As you noted, from Draw there is no output option for text. I tried the --infilter=writer_pdf_import option you recommended, and no longer get an error. I get an empty .txt file, however.

The Open dialog I get is a standard macOS open-file widget, which does not provide an option for using Writer. The option likely exists, but after much experimentation I haven’t found it.

With these things noted, pdfminer.six, recommended by @Wanderer, will solve my problem. It has an hocr option that provides words and bounding boxes, and an MIT license.

Thanks again.
