RFE? Final save. Why so many text:span and font-face declarations?

When I:

  1. make a copy of one of my odf/odt documents

  2. rename it as a zip file

  3. unzip it

  4. look inside of the

    • .//content.xml
    • .//meta.xml
    • .//settings.xml
    • .//styles.xml

    data, I see a lot of extra, superfluous markup such as:

I would wonder if th<text:span text:style-name="T92">o</text:span>s<text:span text:style-name="T92">e</text:span> people notice

All that unnecessary, funky markup is carried over into the formatting of the new file when you “save as” RTF.
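The inspection steps above do not strictly need the rename-and-unzip round trip, since an .odt file is already a zip archive. A minimal sketch, assuming the spans are serialized exactly as in the excerpt above (the function name count_spans is made up for illustration):

```python
import re
import zipfile


def count_spans(odt_file):
    """Return {style name: number of text:span elements} found in content.xml.

    odt_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(odt_file) as odt:
        xml = odt.read("content.xml").decode("utf-8")
    counts = {}
    # Plain regex scan: good enough for a quick look, not a real XML parse.
    for name in re.findall(r'<text:span text:style-name="([^"]+)"', xml):
        counts[name] = counts.get(name, 0) + 1
    return counts
```

Calling count_spans("copy-of-document.odt") on a copy of the document gives a rough idea of which automatic styles (T92 and friends) account for most of the spans.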

Even though I have only declared “Liberation Serif”, I see all those font-face declarations in styles.xml:

<office:font-face-decls>
  <style:font-face style:name="Lohit Devanagari1" svg:font-family="'Lohit Devanagari'"/>
  <style:font-face style:name="Liberation Serif" svg:font-family="'Liberation Serif'" style:font-family-generic="roman" style:font-pitch="variable"/>
  <style:font-face style:name="Liberation Sans" svg:font-family="'Liberation Sans'" style:font-family-generic="swiss" style:font-pitch="variable"/>
  <style:font-face style:name="Lohit Devanagari" svg:font-family="'Lohit Devanagari'" style:font-family-generic="system" style:font-pitch="variable"/>
  <style:font-face style:name="WenQuanYi Zen Hei" svg:font-family="'WenQuanYi Zen Hei'" style:font-family-generic="system" style:font-pitch="variable"/>
</office:font-face-decls>

I thought all that extra fluff might be included in documents in case you need to undo edits, but what I see doesn’t make any sense.

There should be a “final save” option that saves your document in a minimal, linted, canonical form.

I work on corpora research with lots of text files. Dealing with all that extra junk doesn’t make any sense.

Why all that junk in ODF-formatted files?

Is there a way to “lint” ODF-formatted files?

lbrtchx

(reformatted by ajlittoz for better readability)
(One single semicolon added to get an opening angle bracket after the orphaned “e”. @Lupp)

Likely because you have edited those characters, which introduced an edit history recorded in the form of random numbers that are useful for comparing file versions (see Document Comparison Options). Uncheck the setting if you do not intend to use file comparison.
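To check whether this is what happened in a given file, one can look for automatic text styles that carry nothing but such a number. A minimal sketch, assuming the numbers are stored as an officeooo:rsid attribute on style:text-properties and that the officeooo prefix maps to the URI below (both are assumptions about what LibreOffice writes, not something taken from the question; the function name is made up):

```python
import xml.etree.ElementTree as ET

STYLE = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
RSID_NS = "http://openoffice.org/2009/office"  # assumed URI behind "officeooo"


def rsid_only_styles(content_xml):
    """Return names of text styles whose only properties are rsid attributes."""
    root = ET.fromstring(content_xml)
    names = []
    for style in root.iter(f"{{{STYLE}}}style"):
        if style.get(f"{{{STYLE}}}family") != "text":
            continue
        children = list(style)
        # Only consider styles with a single text-properties child.
        if len(children) != 1 or children[0].tag != f"{{{STYLE}}}text-properties":
            continue
        attrs = children[0].attrib
        if attrs and all(
            key.startswith(f"{{{RSID_NS}}}") and key.endswith("rsid") for key in attrs
        ):
            names.append(style.get(f"{{{STYLE}}}name"))
    return names
```

If T92 shows up in the returned list, the spans in the question carry no visible formatting at all, only edit-history markers.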

The font-face declarations are unrelated and, IIUC, not redundant. They declare different font family/pitch combinations, and they are likely used in styles, even if they are not used in the text body.

The example of <text:span> … is biased. The markup is necessary because you change the format of every other letter in the word “those”, resulting in “those” with “o” and “e” in bold, if I take T92 to mean “bold”.

If you use direct formatting, it yields this “superfluous” markup and more as it will define one “Txx” style name per occurrence.

The only way to reduce excessive markup transitions is to consistently and methodically format your document with styles. But this is not a 100% guarantee for lean XML. It also depends on editing history. Applying a character style to a sequence creates a “span-split” with <text:span>…</text:span>, giving up to 3 segments. If you later apply to the middle segment the same style as the one used for the surrounding segments, Writer seems to have some difficulty merging all three into a single one (it may work, but I had cases where it didn’t).
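For post-processing outside Writer, the merge step described above can be sketched in a few lines. This is only an illustration under the assumption that two neighbouring text:span elements with identical attributes and no text between them are safe to join; it is not what Writer does internally, and the function name is made up:

```python
import xml.etree.ElementTree as ET

TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SPAN = f"{{{TEXT}}}span"


def merge_adjacent_spans(parent):
    """Merge neighbouring text:span children of parent that share all attributes.

    Returns the number of merges performed.
    """
    merged = 0
    i = 0
    while i + 1 < len(parent):
        first, second = parent[i], parent[i + 1]
        mergeable = (
            first.tag == SPAN
            and second.tag == SPAN
            and first.attrib == second.attrib
            and not first.tail  # no text between the two spans
        )
        if not mergeable:
            i += 1
            continue
        # Attach second's leading text to the end of first's content.
        if len(first):
            first[-1].tail = (first[-1].tail or "") + (second.text or "")
        else:
            first.text = (first.text or "") + (second.text or "")
        first.extend(list(second))  # move any child elements across
        first.tail = second.tail
        parent.remove(second)
        merged += 1  # stay on the same index: first may merge again
    return merged
```

The function handles one parent element at a time, so a caller would typically loop over every text:p and text:h in content.xml. A whitespace-only tail between two spans counts as text and blocks the merge.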

Saving to a non-ODF format will not solve the problem, because the internal representation still contains the “segments”. This formatting will be exported as faithfully as possible.

Regarding the font list, there are many factors to consider.

First, the configuration in Tools>Options is passed into the document. What was “default” on your machine might not be the same on another machine. It is therefore critical to enumerate everything so that the document will look the same on another machine (provided the fonts are installed of course).

There is also consideration for non-Latin scripts. Many styles have configuration variants where one font is preferred for Latin scripts and another one for East Asian scripts.

Usually, this is not really important. It only builds a “dictionary” in the XML. If you don’t use the corresponding features, the only consequence is a slightly larger ODF file. Normally, this is invisible to the end user.
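To see which entries of this “dictionary” are actually referenced, one can cross-check the declarations against the font attributes on the styles. A rough sketch, assuming fonts are referenced through style:font-name, style:font-name-asian and style:font-name-complex (the function name is made up for illustration):

```python
import xml.etree.ElementTree as ET

STYLE = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
FONT_ATTRS = [
    f"{{{STYLE}}}font-name",          # Western/Latin text
    f"{{{STYLE}}}font-name-asian",    # East Asian text
    f"{{{STYLE}}}font-name-complex",  # complex text layout scripts
]


def font_usage(styles_xml):
    """Map each declared font-face name to True if some style references it."""
    root = ET.fromstring(styles_xml)
    declared = [
        face.get(f"{{{STYLE}}}name") for face in root.iter(f"{{{STYLE}}}font-face")
    ]
    used = set()
    for element in root.iter():
        for attr in FONT_ATTRS:
            if attr in element.attrib:
                used.add(element.attrib[attr])
    return {name: name in used for name in declared}
```

A font reported as unused here may still be referenced from content.xml, so the result only describes the one XML string that was passed in.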

If this really matters to you, review all settings in Tools>Options and change those you don’t like or think pollute your files.

Regarding the markup mess, I repeat and insist on style usage, the only way to mitigate it.

I would wonder if those people notice…
that a text where two single letters got a different character format needs a lot of additional markup.
This would also be the case if the applied attributes didn’t attract any attention in the original.
For this text I assumed the markup <text:span text:style-name="T92"> initialized a text span in bold, and this is eye-catching, of course. A half-point change in height may not be.

OK, I went to Tools > Options > LO Writer and made some changes, but I still think that there should be an option to save the file without junk and funk, giving the user the option to remove unnecessary markup.

Likely because you have edited those characters …
The example of text:span … is biased

I don’t think that I did, and what would be the point of individually span(ning) three characters with the same formatting?

… introduced an edit history recorded in form of random numbers

Hmm! random numbers? Like when the TSA says you may be selected “randomly” for further inspection?

… Normally, this is invisible to the end-user.

which could be exploited to watermark documents …

The only way to reduce excessive markup transitions is to consistently and methodically format your document with styles. But this is not a 100% guarantee for a lean XML.
Regarding the markup mess, I repeat and insist on style usage, the only way to mitigate it.

Are there step-by-step style guidelines you would suggest? Do you know of Java SAX parsers/ContentHandlers which would produce a 100% lean XML document? I can’t see a reason why that would be difficult. Of course, the user will have to decide on certain aspects, which could later be used as defaults.

lbrtchx

First, this is not a solution to your problem but a comment to my answer and other comments. So, please, click on the more link at the bottom of it and “repost as a comment”.

Styling is a matter of personal taste as is document formatting. Styles are tools to achieve formatting consistently. To discover the power of the feature, read the Writer Guide as a beginning. Then practice a bit and come back here with precise questions about “obscure” or “disturbing” points.

Some random phrases in this “answer”. What could “TSA says you may be selected “randomly” for further inspection” possibly mean? Anyway, randomly mixing replies to different answers in this way is not something that can reasonably be addressed, described, or answered.