How to convert ODT file to TXT keeping the text of the footnotes?

vstepaniuk · May 13, 2020, 11:20am

ajlittoz · May 13, 2020, 1:21pm

Edit your question to tell us what you want to do with the resulting file. .txt might not be the best format.

vstepaniuk · May 13, 2020, 1:31pm

@ajlittoz, I need to convert specifically to TXT

anon73440385 · May 13, 2020, 1:57pm

On Linux: libreoffice --convert-to pdf <name>.odt && pdftotext <name>.pdf (on my openSUSE 15.1 system pdftotext is part of package poppler-tools)

vstepaniuk · May 13, 2020, 2:00pm

@anon73440385, ok, thanks, though, why not an “answer”?

ajlittoz · May 13, 2020, 2:05pm

@vstepaniuk: just out of curiosity, once it is converted to txt, how do you use it (in which context)? Do you want a txt equivalent of Writer formatting? Like old-time README files or IANA RFC specification?

anon73440385 · May 13, 2020, 2:11pm

@vstepaniuk - your wish is my command.

vstepaniuk · May 13, 2020, 3:53pm

@ajlittoz, no need for any formatting, I will use it for text processing later. But if you have a good solution for retaining the text formatting, feel free to add an answer!

ajlittoz · May 13, 2020, 4:08pm

@vstepaniuk: the best way to keep all information of your document is to save it as .odt. This is a zipped file, so it uses minimal disk space (for a format that preserves formatting).

If you want to process document content in another application (awk, macro processor, …) as text not binary, you can save it as .fodt. You get an exact XML representation of the document. But to process it efficiently, you need to know the details of the ODF specification. This is not zipped, so a bit fatter.

You can recover the content without formatting by blindly stripping all XML markup, leaving only the textual content. Notes are preserved in the process, but not their position on the page.

I also explored the idea of using a good old ASCII impact printer but CUPS wouldn’t let me create such an antique device. It insists on PostScript or PDF printers. Dead end. My goal was to “print” the document and retrieve the spool file.

LeroyG · May 13, 2020, 1:06pm

vstepaniuk,

It is not the most elegant solution, but… you can Export as PDF... (File ‣ Export as...), open the PDF file in your PDF viewer, Select all, Copy and Paste in a new document.

vstepaniuk · May 13, 2020, 3:30pm

Thanks! Very nice GUI solution!

anon73440385 · May 13, 2020, 2:10pm

Hello,

On Linux you may use: libreoffice --convert-to pdf <name>.odt && pdftotext <name>.pdf (on my openSUSE 15.1 system pdftotext is part of package poppler-tools)

Note(s):

A short search on the net shows that the tool pdftotext is also available for Windows (but I have absolutely no experience with the tool on Windows)
Using the BASIC function Shell() it should be an easy hack to create a macro for that
Drawback: Number(s) of footnotes are shown first and then the text(s) of the footnotes on separate lines
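
For anyone who would rather script the two steps than type them, here is a minimal Python sketch of the same PDF route. It is not the BASIC macro mentioned above, just a wrapper around the same two commands. The function names are made up for illustration, and the --headless and --outdir options are additions that are not part of the command above. It also assumes the binary is called libreoffice (on some systems it is soffice) and that pdftotext is on the PATH.

```python
import subprocess
from pathlib import Path


def pdf_route_commands(odt_path):
    """Build the two command lines for the ODT -> PDF -> TXT route."""
    odt = Path(odt_path)
    pdf = odt.with_suffix(".pdf")
    txt = odt.with_suffix(".txt")
    # --headless and --outdir are additions to the original command:
    # no GUI is started, and the PDF lands next to the source file.
    convert = ["libreoffice", "--headless", "--convert-to", "pdf",
               "--outdir", str(odt.parent), str(odt)]
    extract = ["pdftotext", str(pdf), str(txt)]
    return convert, extract


def odt_to_txt_via_pdf(odt_path):
    """Run both steps; check=True stops at the first failure, like && does."""
    for command in pdf_route_commands(odt_path):
        subprocess.run(command, check=True)
    return Path(odt_path).with_suffix(".txt")
```

Calling odt_to_txt_via_pdf("file.odt") should leave file.pdf and file.txt next to the input. The drawback about footnote numbers and footnote texts ending up on separate lines applies unchanged.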

Hope that helps.

If the answer helped to solve your problem, please click the check mark (✓) next to the answer.

vstepaniuk · May 13, 2020, 3:39pm

Thanks, nice command-line solution! pdftotext is available either from poppler or from xpdf (see "pdftotext" on Wikipedia).

vstepaniuk · May 13, 2020, 5:58pm

One option would be to save the file in FODT format (Flat XML ODF Text Document) and use the following perl command:

perl -wn0le 'print $1 if /<office:body>([\s\S]*?)<\/office:body>/' file.fodt | perl -p0e 's/<.*?>//g' | perl -p0e 's/^\s+//gm'

It extracts everything from the FODT document between the <office:body> and </office:body> tags, removes all tags from the result, and then strips the leading whitespace from every line, which also removes empty lines.

The footnote text will appear IN PLACE of the footnote anchors!

A full command-line solution (including the conversion to FODT) for file.odt:

in=file.odt; libreoffice --convert-to fodt:"OpenDocument Text Flat XML" "$in" && perl -wn0le 'print $1 if /<office:body>([\s\S]*?)<\/office:body>/' "${in/.odt/.fodt}" | perl -p0e 's/<.*?>//g' | perl -p0e 's/^\s+//gm'
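
The regex pipeline leaves XML entities such as &amp; undecoded and glues the footnote number directly onto the surrounding words. As a rough alternative, here is a Python sketch that parses the FODT with a real XML parser instead. This is a swapped-in technique, not the perl command above. The function names and the " [footnote text]" bracket style are arbitrary choices for illustration, and the footnote number (text:note-citation) is deliberately dropped. Only the namespace URIs and element names come from the ODF specification.

```python
import xml.etree.ElementTree as ET

# Namespace URIs from the OpenDocument specification.
OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

P, H = f"{{{TEXT}}}p", f"{{{TEXT}}}h"
NOTE, NOTE_BODY = f"{{{TEXT}}}note", f"{{{TEXT}}}note-body"
SPACE, TAB, BREAK = f"{{{TEXT}}}s", f"{{{TEXT}}}tab", f"{{{TEXT}}}line-break"
COUNT = f"{{{TEXT}}}c"


def _inline_text(elem):
    """Text of one paragraph, with each footnote inserted at its anchor."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == NOTE:
            body = child.find(NOTE_BODY)
            paragraphs = body if body is not None else []
            note = " ".join(_inline_text(p).strip() for p in paragraphs)
            parts.append(f" [{note}]")  # bracket style is an arbitrary choice
        elif child.tag == SPACE:
            parts.append(" " * int(child.get(COUNT, "1")))  # run of spaces
        elif child.tag == TAB:
            parts.append("\t")
        elif child.tag == BREAK:
            parts.append("\n")
        else:
            parts.append(_inline_text(child))  # spans, links, etc.
        parts.append(child.tail or "")
    return "".join(parts)


def _collect_blocks(elem, out):
    """Collect paragraphs and headings, descending through lists and tables."""
    for child in elem:
        if child.tag in (P, H):
            out.append(_inline_text(child).strip())
        else:
            _collect_blocks(child, out)


def fodt_to_text(xml_source):
    """Return the body text of a FODT document, one paragraph per line."""
    root = ET.fromstring(xml_source)
    body = root.find(f"{{{OFFICE}}}body")
    out = []
    if body is not None:
        _collect_blocks(body, out)
    return "\n".join(line for line in out if line)
```

Usage would be something like print(fodt_to_text(open("file.fodt", "rb").read())). Text inside tracked changes or text boxes is not treated specially here, so check the result on a real document.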

Thanks @ajlittoz for the comments!

ajlittoz · May 13, 2020, 6:33pm

At @vstepaniuk’s request, I am turning my comment into an answer.

The best way to keep all information of your document is to save it as .odt. This is a zipped file, so it uses minimal disk space (for a format that preserves formatting).

If you want to process document content in another application (awk, macro processor, …) as text not binary, you can save it as .fodt. You get an exact XML representation of the document. But to process it efficiently, you need to know the details of the ODF specification. This is not zipped, so a bit fatter.

You can recover the content without formatting by blindly stripping all XML markup, leaving only the textual content. Notes are preserved in the process, but not their position on the page.
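
As a rough illustration of that blind stripping, the Python sketch below reads content.xml straight out of the zipped .odt, so no conversion to .fodt is needed. The function name is made up. It assumes content.xml is UTF-8 and uses the usual office: and text: namespace prefixes, as LibreOffice writes them. Turning the end of each paragraph or heading into a newline is an addition for readability. It is a blunt regex approach, not a proper ODF parser.

```python
import html
import re
import zipfile


def odt_to_plain_text(path):
    """Blindly strip the markup from the content.xml stored inside an .odt."""
    with zipfile.ZipFile(path) as archive:
        xml = archive.read("content.xml").decode("utf-8")  # assumes UTF-8
    # Keep only the document body; fall back to the whole file if not found.
    match = re.search(r"<office:body>(.*?)</office:body>", xml, re.S)
    body = match.group(1) if match else xml
    # End of a paragraph or heading becomes a line break.
    body = re.sub(r"</text:(?:p|h)>", "\n", body)
    # Drop every remaining tag, then decode entities such as &amp;.
    body = re.sub(r"<[^>]+>", "", body)
    text = html.unescape(body)
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())
```

Footnote text stays in the output at the point of its anchor, preceded by the footnote number, but because a footnote is itself a paragraph, the host paragraph gets split right after it.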

I also explored the idea of using a good old ASCII impact printer but CUPS wouldn’t let me create such an antique device. It insists on PostScript or PDF printers. Dead end. My goal was to “print” the document and retrieve the spool file.

To show the community your question has been answered, click the ✓ next to the correct answer, and “upvote” by clicking on the ^ arrow of any helpful answers. These are the mechanisms for communicating the quality of the Q&A on this site. Thanks!

In case you need clarification, edit your question (not an answer) or comment on the relevant answer.