Is there any way to tidy the content.xml file?

When you edit a Writer document, it changes content.xml by adding a text:span, e.g.:

<text:p text:style-name="P1"> One two three 
        <text:span text:style-name="T1">three point five</text:span>
four five. </text:p>

In the above, the original content was “One two three four five.” I saved the document, then added “three point five” after the original “three”.

After many edits, content.xml is a mass of text:spans with varying text:style-name values. Is there any way of tidying up the mass of text:spans? I do not record changes, and the list of tracked changes shows nothing.

I could nuke it to a new document via a text file (select all, copy, paste to text file then paste the text file to a new document), but I’d have to reformat the document.


Your question is the same as this one. A complete discussion is provided here.

You can minimise the number of such <text:span> elements by avoiding direct formatting and using styles consistently.

The question is arguably the same as the first one, but that questioner is using a flat XML file type and I am not. However, I don’t believe that the answer given there actually applies to the question I asked. It has nothing to do with direct formatting or character styles, neither of which I use.

You can recreate the issue as follows: 1) create a new document; 2) add some text and save the document. If you then look at the content.xml in the odt wrapper, all the text is in the paragraph container (text:p). Then 3) reopen the document, add some text in the middle of the existing text, and save the document again. When you look at the content.xml now, there will be a text:span container in the middle of the text:p container. Every single editing change results in a new text:span container with a new text:style-name, even if you only ever use one paragraph style. After a while the content.xml is bloated by the damn things!
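To see how many of these spans a given file has picked up, a short script can count them. This is only a sketch: it assumes an ordinary .odt (a zip archive holding content.xml, not the flat XML type), and the function name count_spans is made up for illustration.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# ODF namespace bound to the "text:" prefix in content.xml
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def count_spans(odt):
    """Return (total text:span count, Counter of their text:style-name values).

    `odt` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(odt) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    styles = Counter(
        span.get(f"{{{TEXT_NS}}}style-name")
        for span in root.iter(f"{{{TEXT_NS}}}span")
    )
    return sum(styles.values()), styles


if __name__ == "__main__" and len(sys.argv) > 1:
    total, styles = count_spans(sys.argv[1])
    print(f"{total} text:span elements using {len(styles)} distinct styles")
```

Running it on the same document after each save (steps 2 and 3 above) should show the count growing with every edit.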

The second reference doesn’t have anything to do with it.

Does the suggestion given in the first reference cure the issue? Namely, going to Tools>Options, Load/Save>General and choosing ODF format version 1.2 without extensions.

The answer by @CyanCG in the second reference explains why these officeooo attributes cause the multiplication of apparently useless markup, which is what led to the clue about the ODF version.
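To check whether the extra styles in a particular content.xml really consist of nothing but officeooo attributes, something like the following could be used. It is a rough sketch, not a tested tool: it assumes the officeooo prefix maps to the namespace http://openoffice.org/2009/office (as declared at the top of content.xml in files I have looked at), the attribute name officeooo:rsid in the comments is from memory, and the function name rsid_only_styles is hypothetical.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
# Assumption: namespace LibreOffice declares for the "officeooo:" prefix
OOO_NS = "http://openoffice.org/2009/office"


def rsid_only_styles(content_xml):
    """List text styles whose only properties are officeooo:* attributes.

    `content_xml` is the raw bytes (or str) of content.xml.
    """
    root = ET.fromstring(content_xml)
    names = []
    for style in root.iter(f"{{{STYLE_NS}}}style"):
        # Only character-level ("text" family) styles are used by text:span
        if style.get(f"{{{STYLE_NS}}}family") != "text":
            continue
        children = list(style)
        if len(children) != 1:
            continue
        props = children[0]
        if props.tag != f"{{{STYLE_NS}}}text-properties":
            continue
        # Keep the style only if every attribute is in the officeooo
        # namespace (e.g. officeooo:rsid), i.e. no visible formatting.
        if props.attrib and all(
            key.startswith(f"{{{OOO_NS}}}") for key in props.attrib
        ):
            names.append(style.get(f"{{{STYLE_NS}}}name"))
    return names
```

The bytes of content.xml can be read from the .odt with Python's zipfile module. Styles that come back from this function carry no visible formatting, so spans using them are the ones that look useless in the markup.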

Thanks @ajlittoz. Using ODF format version 1.2 without extensions does indeed solve the issue for a simple one-sentence document (with changes). I’ve not tried it with a big document yet. But what do I give up by using the ‘without extensions’ version of the format? I don’t completely understand the discussion in the second reference, and I don’t see why inserting text in an existing string causes LO to decide that a wrapper is necessary for the added characters. No change of character style is implied, so far as I can see. If I insert characters at two places in the string, I get two new text:spans and two new text styles! I can see this is an old problem. I came across this comment from 2012! @Michael1 Stahl says that he doesn’t believe that any user will edit and re-edit a string/sentence in a paragraph, but speaking for myself, I often do this! Write the paragraph. Review and change it. Review and change the changes, ad lib.

I too heavily edit my documents and was not aware of the mess this creates. I’ll switch to ODF 1.2 without extensions by default.

According to the discussion and bug report, the only loss seems to be a less extensive document comparison. As I don’t do that frequently, it is acceptable for my case.