Is there a way to "garbage collect" the internal formatting of an .odt document?

AndyBern · June 9, 2023, 2:51am

ODT documents accumulate internal leftover formatting remnants from editing. As a document is repeatedly edited, the file size grows significantly more than is required to render the document. To show what I mean, do the following:

Create a new document consisting of the sentence “This is a test.”
Save the document to “test1.odt”.
Change the word “is” to bold.
Save-As the document to a different file: “test2.odt”.
Close and reopen “test2.odt”.
Unbold the word “is”.
Save the document one more time.

You should have two apparently identical documents, but “test2.odt” is 157 bytes larger than “test1.odt”. This is due to the retained formatting remnants. They can be seen by converting the documents to RTF and comparing them.

When performing many edits on large documents, this can increase the file size to many times what it should be. (The simple edit above alone added 157 bytes.) I also expect it slows down document loading and saving.

I’m working on a Bash script to clean up a document by removing redundant or unnecessary formatting codes, but I wonder if such a tool already exists. If not, Writer should automatically clean up the document as it saves it.

robleyd · June 9, 2023, 3:02am

I assume you are using the toolbar icon, or Ctrl+B to bold the text, i.e. direct formatting?

What happens if you use a style instead, which is the preferred method of formatting a document?

Note you can also see the structure of a Writer document by unzipping the .odt file and examining content.xml in the resulting file structure.
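
For anyone who prefers to script that inspection, here is a minimal Python sketch. It relies only on the fact that an .odt file is a ZIP package containing content.xml; the function names are made up for illustration.

```python
import zipfile


def read_content_xml(odt_path):
    """Return content.xml from an .odt package as a string."""
    with zipfile.ZipFile(odt_path) as package:
        return package.read("content.xml").decode("utf-8")


def member_sizes(odt_path):
    """Map each file inside the package to its uncompressed size in bytes."""
    with zipfile.ZipFile(odt_path) as package:
        return {info.filename: info.file_size for info in package.infolist()}
```

Calling member_sizes() on “test1.odt” and “test2.odt” and comparing the two results should show which member of the package accounts for the size difference.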

Villeroy · June 10, 2023, 9:39pm

For 30+ years, using an office suite has meant working with styles and templates. Styles and templates let you customize thousands of details in all kinds of documents.

AndyBern · June 12, 2023, 7:27pm

The problem did not occur when I used the “emphasis” character style instead of direct formatting. There were no editing artifacts after unemphasizing the text. But direct formatting edits should work the same way. If a sequence of characters is formatted identically throughout, it should no longer be broken up into separately formatted sub-sequences in the document file, regardless of what was needed in the past.
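
To make that expectation concrete, here is a toy Python sketch of the merging idea. It is not how Writer stores text: the (text, formatting) tuples are a hypothetical model, and it assumes each run's formatting has already been resolved to its effective values, so an explicit "normal" weight and an inherited one compare equal.

```python
def merge_runs(runs):
    """Merge adjacent (text, formatting) runs whose formatting is equal."""
    merged = []
    for text, fmt in runs:
        if merged and merged[-1][1] == fmt:
            # Same formatting as the previous run: extend it instead of
            # starting a new one.
            merged[-1] = (merged[-1][0] + text, fmt)
        else:
            merged.append((text, fmt))
    return merged
```

With three runs that all carry the same formatting, such as "This ", "is" and " a test.", the function returns a single run; if the middle run is bold, all three are kept.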

I do use styles when writing more formal documents. I’ve done so for years with M$O as well.

I don’t have change tracking enabled.

I was using RTF only to detect the editing artifacts. It’s easier for me to detect them by eye in RTF than XML.

LeroyG · June 13, 2023, 4:55pm

I think so. But this is not the way that it works, at least in version 7.5.3.2.

ajlittoz · June 9, 2023, 7:18am

There are two issues when you update a document.

Formatting (and your routine)
If you work as M$O Word conditioned the world, you direct-format your text, with the consequence that every “embellished” sequence of characters contains absolutely all attribute descriptions. Editing makes things worse because Writer (or any program) can’t know if you’re replacing, extending or creating formatting for the sequence, potentially ending up with a description per character.

Improvement: learn to work with styles, and all of them: paragraph, character, page, frame and list. Don’t use toolbar buttons or keyboard shortcuts, unless these shortcuts point to styles. Styling only writes the style name in the document. If your text is semantically styled, i.e. style names designate the significance, not the formatting, formatting changes are done by modifying styles only, achieving a dramatic saving in document size.

Change tracking
Even if you don’t enable change tracking, every edit you make in your document is tagged with an ID, contributing to a progressive increase in document size, even if modest per iteration. This is required to ease document comparison.

But, if you don’t need Track changes and rarely compare documents, you can disable the feature in Tools>Options, LibreOffice Writer>Comparison: untick all boxes in the “Random number …” section.

Changing your formatting routine and disabling the comparison feature should solve your concern without the need for a (risky) Bash script, which will necessarily be contorted to act on an .odt file.

PS: exporting a document to RTF is not the ideal way to assess its encoding quality. RTF is a foreign, non-native format involving a conversion, hence approximations and possibly loss of information. If you really want access to Writer internals, save as .fodt (Flat XML ODF Text Document). It is a linearized form of .odt without binary compression.
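
If you want the flat XML form without going through the Save As dialog, the conversion can also be scripted. The Python sketch below assumes the `soffice` executable is on your PATH (its name and location vary by platform) and uses LibreOffice's headless `--convert-to` option; the helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def fodt_command(odt_path, out_dir):
    """Build the soffice command line that converts one file to .fodt."""
    return [
        "soffice",  # assumption: LibreOffice launcher is on PATH
        "--headless",
        "--convert-to", "fodt",
        "--outdir", str(out_dir),
        str(odt_path),
    ]


def convert_to_fodt(odt_path, out_dir="."):
    """Run the conversion and return the expected output path."""
    subprocess.run(fodt_command(odt_path, out_dir), check=True)
    return Path(out_dir) / (Path(odt_path).stem + ".fodt")
```

The resulting .fodt files are plain text, so “test1” and “test2” can be compared with any ordinary diff tool.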

LeroyG · June 11, 2023, 3:21pm

I tested deactivating Random number…, but Compare Documents can’t be deactivated. So, I get bold and normal styles in the content.xml subfile.

<style:style style:name="T1" …> <style:text-properties fo:font-weight="bold" …>
<style:style style:name="T2" …> <style:text-properties fo:font-weight="normal" …>

I only tested with bold (and unbold).
If I Select All in the test2.odt file, Copy and Paste it in a new document, the normal style is not present in the new file.
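
A rough Python sketch for checking this without reading the XML by eye: it lists the automatic text styles (such as T1 and T2 above) with their font weight, and counts how many spans reference each one. It takes the text of content.xml as input and assumes the standard ODF namespaces; the function names are only illustrative.

```python
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "fo": "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0",
}


def automatic_text_styles(content_xml):
    """Return {style name: font weight or None} for automatic text styles."""
    root = ET.fromstring(content_xml)
    result = {}
    for style in root.iterfind("office:automatic-styles/style:style", NS):
        if style.get(f"{{{NS['style']}}}family") != "text":
            continue  # skip paragraph, table and other style families
        name = style.get(f"{{{NS['style']}}}name")
        props = style.find("style:text-properties", NS)
        weight = None if props is None else props.get(f"{{{NS['fo']}}}font-weight")
        result[name] = weight
    return result


def span_style_usage(content_xml):
    """Count how many text:span elements reference each style name."""
    root = ET.fromstring(content_xml)
    counts = {}
    for span in root.iter(f"{{{NS['text']}}}span"):
        name = span.get(f"{{{NS['text']}}}style-name")
        counts[name] = counts.get(name, 0) + 1
    return counts
```

A style that appears in the first result but not in the second is defined and never referenced by a span; a style with weight "normal" that is referenced is the kind of leftover described above.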

mikekaganski · June 11, 2023, 3:26pm

But deactivating this function isn’t the goal. The random number is generated not to enable it, but to enhance it.

And not generating new numbers doesn’t remove the ones already generated and present in the document. To get rid of them, one would need to save as non-extended ODF (once), because this number is an extension, and a non-extended format drops all extensions.

To save to a non-extended ODF, you need to configure it under Options → Load/Save → General: for ODF format version, choose e.g. 1.3 (as opposed to the default 1.3 Extended (recommended)). Then save and reload, and change the option back to restore the usual format and avoid losing features that can only be saved in extended format.