Where do the `text:style-name="Tnn"` span tags come from, and how do I get rid of them?

I have a LibreOffice Writer document saved as a “flat” XML file (*.fodt file type). I am trying to apply regular expressions to it, using an external text editor.

My efforts are hampered because the document is littered with dozens of <text:span text:style-name="Tnn"> ... </text:span> wrappers. They seem to appear haphazardly, even between characters of a single word, without any apparent change of “style” in the Writer view of the document itself. The Tnn names (e.g. T10) appear to refer to style declarations that contain something like officeooo:rsid="009e4655".

Of course, this makes it impossible to construct any regex that works across the document as a whole.
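To illustrate the problem, here is a minimal Python sketch. The markup fragment is made up for the example (it is not taken from a real document), and stripping tags with a crude regex is only there to show what the visible text would be.

```python
import re

# Hypothetical fragment: one phrase split by an rsid-style span wrapper.
fragment = (
    '<text:p text:style-name="P1">regu'
    '<text:span text:style-name="T10">lar expres</text:span>'
    'sions</text:p>'
)

pattern = re.compile(r"regular expressions")

# Searching the raw markup fails because the span tags interrupt the phrase.
found_in_markup = pattern.search(fragment) is not None

# Crudely drop every tag to recover the text as Writer would display it.
visible = re.sub(r"<[^>]+>", "", fragment)
found_in_text = pattern.search(visible) is not None

print(found_in_markup, found_in_text)
```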

So, two questions:

  1. What are these wrappers and officeooo:rsid numbers?
  2. Is there an easy way to get rid of them?

Trying to remove them manually would be ridiculously difficult.


Note: this Q&A is related to the following:

· “Regular expressions to move punctuation from after to before superscripts”

· “Writer: clarification needed about character attributes”
 

This matter was discussed in a bug report raised in 2013, fdo#68183. The comment trail explains that the officeooo:rsid attribute of <style:text-properties> (and one or two other related attributes) is part of the “OASIS OpenDocument 1.2 extended” format. These numbers (it appears) carry revision IDs.

Although a “fix” for this was committed as long ago as v. 4.5.0 (in “tdf#68183 sw: config option for disabling the creation of automatic RSID marks”), it isn’t immediately clear how to get rid of them.

There are two approaches, I believe, one that is partial, and one that is more thorough:

  1. Partial: one can select text, and use Format > Clear Direct Formatting, and this will get rid of many (perhaps not all) of these wrappers. It will also get rid of direct formatting, of course (e.g., if “italics” have been applied), and so even this partial solution may come at a cost.

  2. Thorough: It is possible to save the document using a different ODF file-save setting. Go to Tools > Options > Load/Save > General to get to this dialog:

[Screenshot of the Load/Save > General options dialog]

You want to choose one of the options that does not have “extended” in its name. Choose, e.g., 1.2 (plain), then save your document. This should get rid of most of the offending <text:span text:style-name="Tnn"> ... </text:span> wrappers, and in my tests at least it produces a document that can be meaningfully manipulated by regex.
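If re-saving is not an option, the same cleanup could in principle be scripted. What follows is only a sketch in Python and not something LibreOffice provides: the function names are my own, and it assumes that a span is safe to unwrap when its automatic style carries nothing except officeooo rsid attributes. The sample document is invented for the demonstration.

```python
import xml.etree.ElementTree as ET

NS = {
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "officeooo": "http://openoffice.org/2009/office",
}
STYLE = "{%s}" % NS["style"]
TEXT = "{%s}" % NS["text"]
OOO = "{%s}" % NS["officeooo"]


def rsid_only_styles(root):
    """Names of text styles whose only properties are officeooo rsid attributes."""
    names = set()
    for st in root.iter(STYLE + "style"):
        if st.get(STYLE + "family") != "text":
            continue
        if not set(st.attrib) <= {STYLE + "name", STYLE + "family"}:
            continue  # e.g. a parent style is set, so the style may matter
        props = list(st)
        if len(props) != 1 or props[0].tag != STYLE + "text-properties":
            continue
        attrs = props[0].attrib
        if attrs and all(k.startswith(OOO) and k.endswith("rsid") for k in attrs):
            names.add(st.get(STYLE + "name"))
    return names


def _append_text(parent, index, s):
    """Attach string s immediately before position index inside parent."""
    if not s:
        return
    if index == 0:
        parent.text = (parent.text or "") + s
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + s


def unwrap_spans(parent, names):
    """Replace each text:span using a style in names by its own content."""
    i = 0
    while i < len(parent):
        child = parent[i]
        unwrap_spans(child, names)  # handle nested spans first
        if child.tag == TEXT + "span" and child.get(TEXT + "style-name") in names:
            kids = list(child)
            del parent[i]
            _append_text(parent, i, child.text)
            for offset, kid in enumerate(kids):
                parent.insert(i + offset, kid)
            i += len(kids)
            _append_text(parent, i, child.tail)
        else:
            i += 1


# Invented sample: T1 is rsid-only, T2 also carries real formatting (italic).
SAMPLE = """<office:document
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"
    xmlns:officeooo="http://openoffice.org/2009/office">
  <office:automatic-styles>
    <style:style style:name="T1" style:family="text">
      <style:text-properties officeooo:rsid="009e4655"/>
    </style:style>
    <style:style style:name="T2" style:family="text">
      <style:text-properties fo:font-style="italic" officeooo:rsid="00a1b2c3"/>
    </style:style>
  </office:automatic-styles>
  <office:body><office:text>
    <text:p>Hel<text:span text:style-name="T1">lo wo</text:span>rld, <text:span text:style-name="T2">really</text:span>.</text:p>
  </office:text></office:body>
</office:document>"""

root = ET.fromstring(SAMPLE)
names = rsid_only_styles(root)
unwrap_spans(root, names)
para = root.find(".//text:p", NS)
print(sorted(names), repr(para.text), len(para))
```

To try this on a real file, one would parse it with ET.parse, call the two functions on the root, and write the tree back out. ElementTree may rename namespace prefixes on output unless they are registered first, and the now-unused style declarations are left in place, so work on a copy and check that Writer still opens the result.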

While the above account is as far as I have got, if there are other or, especially, better solutions, it would be good to know.

I’m looking at this from the position of wanting to use an external revision control tool on “fodt” files. This requires that style names are “stable”: sequentially numbered anonymous styles are likely to cause storms of differences in the saved file for even minor edits, if the edit causes another “P” style to be created or destroyed. Elsewhere the suggested solution is “don’t use anonymous styles”, which is good provided the anonymous styles are not being silently auto-generated.

On my end this solution was actually pretty effective. I have a long document of about 100k words with extensive formatting and revisions over the years, and it was becoming unwieldy. I feared the problem came from these tags littered all over the place. After re-saving from ODF 1.2 Extended to 1.2 (plain), content.xml went from 3353 KB to 2523 KB. I estimate that around 575 KB of the file is raw text, so the markup alone shrank by almost 30%, and that was useless code as far as I’m concerned. Now I’m on to some further optimization, if at all possible.
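For anyone checking the figures, here is the arithmetic as a small Python sketch. The 575 KB of raw text is my estimate, not a measured value, and the variable names are just for illustration.

```python
before_kb = 3353   # file size with ODF 1.2 Extended
after_kb = 2523    # file size after saving as 1.2 (plain)
text_kb = 575      # estimated raw text (an estimate, not measured)

saved_kb = before_kb - after_kb                      # bytes of markup removed, in KB
pct_of_file = 100 * saved_kb / before_kb             # reduction relative to the whole file
pct_of_markup = 100 * saved_kb / (before_kb - text_kb)  # reduction relative to markup only

print(saved_kb, round(pct_of_file, 1), round(pct_of_markup, 1))
```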