Representation of text in an ODT

isaacd · February 18, 2022, 1:19am

I was using a script I got from someone on AskUbuntu to search for text in some files when I came across what appears to me to be a strange phenomenon. The script uses this line to search for a string in an ODT:

unzip -ca "test.odt" 2>/dev/null | grep -i "Task 2.11"

If I open the ODT with Writer, I see exactly those words: “Task 2.11”. But grep doesn’t find them. I copied and pasted just that phrase into a test file to isolate it, and grep still doesn’t find the original instance of the phrase. However, if I type “Task 2.11” on the next line and run the command above, grep finds it just fine:

grep: (standard input): binary file matches

It is only that precise instance of the phrase that it doesn’t find. I then saved the file as HTML using Writer and noticed another strange thing: the T in the first instance of “Task 2.11” appears separate from the rest of the phrase. Why would this be?

Thanks in advance.
test.odt (8.8 KB)

mikekaganski · February 18, 2022, 6:08am

Why would this be?

To answer this, just follow the command from your script:

unzip -ca "test.odt" 2>/dev/null | grep -i "Task 2.11"

Doing that, you would see the following in content.xml:

...
        <style:style style:name="T1"
                     style:family="text">
            <style:text-properties style:font-name="Liberation Serif"/>
        </style:style>
        <style:style style:name="T2"
                     style:family="text">
            <style:text-properties style:font-name="Liberation Serif"
                                   officeooo:rsid="00148e8d"/>
        </style:style>
...
            <text:p text:style-name="P1">
                <text:span text:style-name="T2">T</text:span>
                <text:span text:style-name="T1">ask 2.11:</text:span>
            </text:p>

Here you can see that the text is internally split into two pieces, formatted using two different automatic styles that are almost identical except for the officeooo:rsid attribute. That attribute exists to allow comparison of document versions. The split simply shows that you edited the paragraph in a later, separate editing session and replaced or added the capital T, which was not there initially (maybe a lowercase t was there before).
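The effect of that split on grep can be reproduced without the ODT itself. The following is only a minimal sketch: the xml variable holds the paragraph from the excerpt above with whitespace removed, the names naive and stripped are made up for illustration, and the sed expression is a crude tag stripper, not an XML parser.

```shell
# Sample fragment mirroring the two spans from content.xml (not a full ODT).
xml='<text:p text:style-name="P1"><text:span text:style-name="T2">T</text:span><text:span text:style-name="T1">ask 2.11:</text:span></text:p>'

# Naive search: the markup between "T" and "ask" prevents a match.
# "|| true" keeps the script going, since grep exits non-zero on no match.
naive=$(printf '%s\n' "$xml" | grep -ci "Task 2.11" || true)

# Remove everything that looks like a tag, then search the remaining text.
stripped=$(printf '%s\n' "$xml" | sed -e 's/<[^>]*>//g' | grep -ci "Task 2.11" || true)

echo "naive matches: $naive"            # prints "naive matches: 0"
echo "tag-stripped matches: $stripped"  # prints "tag-stripped matches: 1"
```

On a real file, the same filter could presumably be fed from unzip -p test.odt content.xml. It remains a simplistic method: XML entities such as &amp; are not decoded, and text from adjacent paragraphs runs together once the tags are gone.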

Generally, you need to realize that the document format used by LibreOffice (ODF) is not plain text. At any time, text may be split into different pieces in the internal representation because of formatting changes that you may not even be aware of (e.g., the random rsid number mentioned above, tracked changes, formatting differences, or a language applied by a keyboard layout change). You should not rely on such simplistic search methods.

You may find better tools, but LibreOffice itself provides a --cat command-line option that lets you search the body text of text documents (it does not work for spreadsheets and presentations). Note that this means starting a LibreOffice instance each time, with the corresponding overhead.
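A search based on --cat might be wrapped roughly as follows. This is a sketch under assumptions: the function name search_odt_text is hypothetical, and it assumes the launcher is on the PATH as libreoffice (on some installations it is called soffice instead).

```shell
# Hypothetical helper: search the body text of a Writer document.
# $1 = pattern (case-insensitive), $2 = path to the document.
search_odt_text() {
    # --cat dumps the document's text content to standard output;
    # stderr is discarded to hide start-up noise.
    libreoffice --headless --cat "$2" 2>/dev/null | grep -i -- "$1"
}
```

It would be called as search_odt_text "Task 2.11" test.odt. Each call starts a LibreOffice instance, so running it over many files in a loop can be expected to be slow.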

isaacd · February 18, 2022, 7:12am

Thank you for explaining this.

I tried libreoffice --headless --cat but that turned out to be very slow (as you mentioned) in the script. However, I did find a much faster alternative: odt2txt.