TXT conversion shows text that is not visible in Writer

fiasko · September 19, 2023, 10:23am

Hello everyone,

We’re doing some text analysis, and for that I’m using LibreOffice to convert from .docx → .txt.

The documents we’re analysing contain cross-references and look completely fine in Writer, but once I export to TXT, the cross-reference appears twice.

The text might look like this in Writer:
"In Exhibit (A) we [..]"
But exported to txt it’s
"In Exhibit (A)(A) we [..]".

To reproduce, please download the attached .odt file, open with Writer and Save As… → TXT. You will find the string “Anlage (A)(A)” instead of just “Anlage (A)”.

I’m not sure why this is happening. My original sources are contract docx files and unfortunately this happens very regularly. I cannot reproduce this behaviour by creating a new Writer file and inserting a cross-reference. It only happens when a docx is converted to txt. Examining the content.xml file, I can see the reference twice with different styling, but to be honest I don’t know much about the format or internal document representation.
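For anyone who wants to repeat the content.xml inspection, here is a minimal sketch. It assumes only that an .odt file is a ZIP archive containing content.xml and that bookmark and reference elements live in the ODF text namespace; the function name `count_reference_elements` is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# ODF namespace used by text:bookmark-start, text:bookmark-ref, text:reference-ref, etc.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def count_reference_elements(odt_path):
    """Count bookmark- and reference-related elements in an .odt's content.xml.

    odt_path may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(odt_path) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    counts = Counter()
    prefix = "{" + TEXT_NS + "}"
    for element in root.iter():
        if element.tag.startswith(prefix):
            local_name = element.tag[len(prefix):]
            # Keep only element names that mention bookmarks or references.
            if "bookmark" in local_name or "reference" in local_name:
                counts[local_name] += 1
    return counts
```

Calling `count_reference_elements("P24_D_double_reference.odt")` on the attached file should give a quick overview of how many bookmarks and references the DOCX import produced, without reading the XML by hand.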

Any thoughts about this? Thanks for reading!
P24_D_double_reference.odt (23.5 KB)

ajlittoz · September 19, 2023, 12:21pm

If you carefully move the blinking caret with the arrow keys, you’ll notice there is something invisible to the right of “Anlage (A)”. I copied this and pasted it into an empty paragraph. It reveals “(A)” with a gray background (meaning it is generated through a field). But as soon as I press Enter, it turns invisible.

You have two references to “(A)”, so it is normal that you have two occurrences in .txt because all formatting is wiped out in plain text.

Now, looking at the .fodt, the XML is quite weird: you have 3 (!) bookmarks on “(A)” and these have clearly a Word signature, plus a strange fourth one supposed to be an OLE_LINK.

IMHO, these should not have been bookmarks but simple references. But perhaps, in the XML, location definition is encoded as a bookmark.

When it comes to inserting the reference, i.e. after the word “Anlage”, a field is used. Writer seems to not understand M$ Word encoding (DOCX) because it tags the first reference as “UNHANDLED”. This could be the cause of the invisibility of this reference. XML standards state that if an application can’t understand an XML element, it is silently ignored but left in the XML string.

A second weird point is that the field is qualified with parameters. AFAIK ODF fields have no parameters to alter their behaviour.

In my opinion, using Writer to translate DOCX documents to any alien format is faulty because LO doesn’t know all details of DOCX. You inevitably stumble on compatibility issues. For correct results, use Word. Your documents look like business ones, so your company can afford one licence fee to make the conversion.

fiasko · September 19, 2023, 1:59pm

Hey @ajlittoz, thanks for your detailed answer!

Before posting here I also tried to understand what is happening by inspecting the XML file and using the developer tools in Writer. I agree the XML is super weird and I don’t understand why those tags are there. But what Writer actually shows to the user is what I’d call my “desired output”.

Can you think of a way to extract the text that we can “see” in Writer instead of dumping the text content? A solution using pyuno would also be totally feasible for us to implement.

I would also agree to use the right tool for the job, so use M$ Word to convert those files, but after lots of research, it’s probably not feasible for us for technical and licensing reasons (we’re running a webapp on Linux either in the cloud or on-premises).

ajlittoz · September 19, 2023, 2:14pm

The “history” of your documents is missing: were they created with Word or with Writer? In the second case, the idea was faulty because ODF is not equal to DOCX. They have a non-void intersection but everything not in the intersection must be approximated some way or other.

Fields or list numbering are considered advanced features and don’t convert well. Thus, when combining both list numbering and cross-references, as is the case in your sample document, you can expect difficulties.

The simplest way to extract “visible” unformatted text is to export as plain text, as you did. You now have this duplication of cross-references. This can be solved with a regular expression if you can find a common pattern. If this pattern involves only “text” inside parentheses, you can try something like

(\(.+?\))\(.+?\)

to be replaced by $1. For more safety, you can modify the regexp so that the second “(xx)” is required to be identical to the first one.
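The stricter variant, where the second parenthesised group must be identical to the first, could be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and the pattern assumes the reference text contains no nested parentheses.

```python
import re

# A parenthesised group followed immediately by an identical copy of itself.
# The backreference \1 is what enforces "second one identical to the first".
DOUBLED_PAREN = re.compile(r"(\([^()]+\))\1")


def collapse_doubled_parens(text):
    """Replace "(xx)(xx)" with "(xx)"; leave "(xx)(yy)" untouched."""
    return DOUBLED_PAREN.sub(r"\1", text)


print(collapse_doubled_parens("In Exhibit (A)(A) we [..]"))
```

Note that this collapses every doubled group it finds, including ones that are doubled on purpose in the original document.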

fiasko · September 19, 2023, 2:35pm

Sorry, I wasn’t accurate about this aspect. They are all Word documents. I’m using LO to convert them to both PDF and TXT: the TXT for further analysis of the contents, and the PDF to be able to show the documents in a browser.
Btw, the converted PDF output is also good, and so is the text output from “pdftotext” (poppler-utils), but this leads to other problems down the road…
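For reference, the two conversion routes mentioned here can be scripted roughly like this. It is a sketch under assumptions: the `soffice` and `pdftotext` binaries are on the PATH, the helper names and the `.pdftotext.txt` output suffix are invented for illustration, and the exact filter options used in the real pipeline are not known.

```python
import subprocess
from pathlib import Path


def soffice_command(source, target_format, out_dir):
    """Build a headless LibreOffice conversion command (e.g. "pdf" or "txt:Text")."""
    return ["soffice", "--headless", "--convert-to", target_format,
            "--outdir", str(out_dir), str(source)]


def pdftotext_command(pdf_path, txt_path):
    """Build a poppler-utils command that extracts text while keeping the page layout."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert(source, out_dir):
    """Produce <stem>.txt and <stem>.pdf via LibreOffice, plus text extracted from the PDF."""
    source, out_dir = Path(source), Path(out_dir)
    subprocess.run(soffice_command(source, "txt:Text", out_dir), check=True)
    subprocess.run(soffice_command(source, "pdf", out_dir), check=True)
    pdf_path = out_dir / (source.stem + ".pdf")
    # Separate file name so the LibreOffice TXT export is not overwritten.
    pdf_txt_path = out_dir / (source.stem + ".pdftotext.txt")
    subprocess.run(pdftotext_command(pdf_path, pdf_txt_path), check=True)
    return out_dir / (source.stem + ".txt"), pdf_txt_path
```

Having both text files side by side makes it possible to compare the direct TXT export with the text recovered from the PDF rendering.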

I also thought about catching duplicates of cross-references, but this is a whole world of headaches on its own, because they don’t follow simple conventions. In general, something like Exhibit (E)(E) might be valid. Or, another occurrence where this happened was with “Exhibit 1.10”, which was converted to Exhibit 1.101.10, which in itself could be totally valid.
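One way to sidestep the ambiguity might be to use the PDF-derived text as a second opinion: collapse a doubled token in the TXT export only when the PDF text shows the single form in the same spot and never the doubled one. The sketch below is an assumption-laden illustration, not a tested solution. The function name is hypothetical, the "context" is just the preceding word, and plain substring matching means that, for example, "Exhibit 1.10" would also be confirmed by "Exhibit 1.100" in the PDF text.

```python
import re

# A whitespace-delimited token that consists of some string immediately
# repeated, e.g. "(A)(A)" or "1.101.10", optionally followed by punctuation.
DOUBLED_TOKEN = re.compile(r"(?<!\S)(\S+?)\1(?=[\s.,;:!?]|\Z)")


def collapse_unconfirmed_doubles(txt_export, pdf_text):
    """Collapse doubled tokens that the PDF-derived text does not confirm."""
    # Normalise whitespace so line breaks in the PDF text do not matter.
    reference = " ".join(pdf_text.split())

    def replace(match):
        doubled, single = match.group(0), match.group(1)
        # Use the word before the token as minimal context, if there is one.
        before = txt_export[:match.start()].rsplit(None, 1)
        context = before[-1] + " " if before else ""
        if context + doubled in reference:
            return doubled  # the PDF shows the doubled form too: keep it
        if context + single in reference:
            return single   # the PDF shows only the single form: collapse
        return doubled      # no evidence either way: leave untouched

    return DOUBLED_TOKEN.sub(replace, txt_export)
```

With a TXT export of "See Exhibit 1.101.10 and Anlage (A)(A) here." and PDF text reading "See Exhibit 1.10 and Anlage (A) here.", both doubled tokens are collapsed, while a genuine "Exhibit (E)(E)" that also appears in the PDF text is kept.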

Lastly, the double cross-reference problem only occurs every once in a while.

Any ideas are super welcome, and thanks again for your responses, @ajlittoz

ajlittoz · September 19, 2023, 6:42pm

Since this has to do with DOCX format subtleties, I’m afraid there is no other solution than to use Word.