Spurious text:span with Internet_20_link style in content.xml: Why? Safe to remove?

I’ve got an ODT file with 1230 elements

<text:span text:style-name="Internet_20_link">

(without xlink attributes), which occur at random places in the content.xml file. I can’t see any logic, and they even sometimes split words. An example:

<text:span text:style-name="Internet_20_link">
  <text:span text:style-name="T12"><text:tab/>intF</text:span>
</text:span>
<text:span text:style-name="Internet_20_link">
  <text:span text:style-name="T31">orma</text:span>
</text:span>
<text:span text:style-name="Internet_20_link">
  <text:span text:style-name="T848">tOf</text:span>
</text:span>

Is there any purpose? Can I just get rid of all of them (e.g. with a regexp replacement) without any consequence?

Note: The goal is cleanup and to be able to do further cleanup. For instance, the 3 styles T12, T31 and T848 above should actually be the same.

Yes, it’s safe.

This probably comes from some application of the Internet Link character style without an associated link (the space in the name is encoded internally as “_20_”, because a style name in the XML cannot contain a literal space). Internet Link is just another character style, so you can use it freely. Writer applies it implicitly when it detects a hyperlink (and adds the xlink attribute).

What then happened is: someone with advanced skills (or through bad luck) forced a second character style over the Internet Link sequence, probably in several editing steps.

If T12, T31 and T848 have the same definition, then they are direct formatting (applied at three different moments).

Direct formatting can be removed with Format>Clear direct formatting or Ctrl+M. Select the offending sequence before applying the command, so that you don’t clear everything if other direct formatting is important to you.

Unfortunately, character styles can’t be searched for with Edit>Find & Replace. To quickly locate all Internet Link occurrences, modify the style definition by adding background highlighting, such as a flashy yellow. You can then manually clear the style (by applying No Character Style) where you don’t need it. Remember to restore the standard Internet Link definition afterwards.

I’ll need to do all the changes with regexp replacements and/or scripts, because there are too many of them (several hundreds; 1230 in the case of Internet_20_link).

Concerning T12, T31 and T848: T31 is just T12 with “Times New Roman” replaced by “Times New Roman3”, and T848 is just T31 with style:use-window-font-color=“true” replaced by fo:color="#000000". In both cases, the change seems useless and silly. Such changes sometimes yield unexpected results in the generated PDF on some platforms, such as poor font substitution when the new font in the replacement does not exist. A split of a word by different styles can introduce a spurious tiny space between them in the PDF (which breaks copy-paste and searching).
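To find candidates for merging, one option is to group the automatic text styles by their properties. The following is only a minimal sketch: it assumes the standard ODF `style` and `fo` namespaces, looks only at the attributes of `style:text-properties`, and reports exact duplicates only. Styles such as T12, T31 and T848, which differ in one attribute, would still need a manual decision (or a normalisation step) first. The sample fragment and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
FO_NS = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"


def group_identical_text_styles(root):
    """Return lists of text style names whose text properties are identical."""
    groups = defaultdict(list)
    for style in root.iter(f"{{{STYLE_NS}}}style"):
        if style.get(f"{{{STYLE_NS}}}family") != "text":
            continue
        props = style.find(f"{{{STYLE_NS}}}text-properties")
        attrs = props.attrib if props is not None else {}
        # Signature: parent style (if any) plus the sorted property attributes.
        signature = (
            style.get(f"{{{STYLE_NS}}}parent-style-name"),
            tuple(sorted(attrs.items())),
        )
        groups[signature].append(style.get(f"{{{STYLE_NS}}}name"))
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical fragment: T1 and T3 are identical, T2 differs.
SAMPLE = f"""
<styles xmlns:style="{STYLE_NS}" xmlns:fo="{FO_NS}">
  <style:style style:name="T1" style:family="text">
    <style:text-properties fo:font-weight="bold"/>
  </style:style>
  <style:style style:name="T2" style:family="text">
    <style:text-properties fo:font-style="italic"/>
  </style:style>
  <style:style style:name="T3" style:family="text">
    <style:text-properties fo:font-weight="bold"/>
  </style:style>
</styles>
"""

duplicates = group_identical_text_styles(ET.fromstring(SAMPLE))
print(duplicates)  # [['T1', 'T3']]
```

Once a group is known, every reference to the redundant names can be rewritten to a single one, after which adjacent spans with the same style become candidates for merging.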

If you aren’t faint of heart, you can attempt this directly on the XML. Save your document as .fodt and work on this copy. Use a text editor with regexp capability or a full-fledged macro-generator.

Be very careful, because you must match each <text:span …> with its corresponding </text:span> (closest match). You must handle nested applications yourself.
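If matching the tags by regexp turns out to be too fragile, an alternative is to let an XML parser do the matching and "unwrap" the spans in the tree. The sketch below is one possible way, not a tested cleanup tool: it uses Python's standard xml.etree.ElementTree, assumes the standard ODF `text` namespace, and runs on a small fragment modelled on the example in the question. The helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SPAN = f"{{{TEXT_NS}}}span"
STYLE_ATTR = f"{{{TEXT_NS}}}style-name"


def _append_text(parent, index, text):
    """Attach text immediately before position `index` among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def unwrap_spans(elem, style_name="Internet_20_link"):
    """Replace each matching span below `elem` by its content; return the count."""
    removed = 0
    i = 0
    while i < len(elem):
        child = elem[i]
        # Clean the subtree first, so nested matching spans are handled too.
        removed += unwrap_spans(child, style_name)
        if child.tag == SPAN and child.get(STYLE_ATTR) == style_name:
            inner = list(child)
            elem.remove(child)
            _append_text(elem, i, child.text)           # text before the inner nodes
            for offset, node in enumerate(inner):
                elem.insert(i + offset, node)           # hoist the inner nodes
            _append_text(elem, i + len(inner), child.tail)  # text after the span
            removed += 1
            i += len(inner)  # hoisted nodes were already cleaned above
        else:
            i += 1
    return removed


SAMPLE = (
    f'<text:p xmlns:text="{TEXT_NS}">'
    '<text:span text:style-name="Internet_20_link">'
    '<text:span text:style-name="T12">intF</text:span></text:span>'
    '<text:span text:style-name="Internet_20_link">'
    '<text:span text:style-name="T31">orma</text:span></text:span>'
    '<text:span text:style-name="Internet_20_link">'
    '<text:span text:style-name="T848">tOf</text:span></text:span>'
    '</text:p>'
)

root = ET.fromstring(SAMPLE)
removed = unwrap_spans(root)
print(removed)                    # 3
print("".join(root.itertext()))   # intFormatOf
```

On a complete .fodt file, ElementTree rewrites namespace prefixes on output unless each one is registered with ET.register_namespace, so expect a noisy diff; a prefix-preserving library may be more comfortable for the real job.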

Note also that names like “T123” do not necessarily betray direct formatting. I find it unusual that there is a <text:span text:style-name="Internet_20_Link">, because this prevents renaming the style. Using T123 is an indirection through a style definition: the style can then be renamed in the intermediate record, and the change is valid for all occurrences without the burden of tracking them individually.

IMHO, your document has already been manually tampered with.

When you have modified your .fodt, force-reopen it with Writer to see if you’ve broken something. Save as .odt only when you’re sure, preferably under another name so that you can still go back to the original.
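Before reopening the file in Writer, a quick well-formedness check can catch a mismatched <text:span> early. This is only a minimal sketch using Python's standard parser; it says nothing about whether the result is still valid ODF, and the function name is made up.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_data):
    """Return None if xml_data parses as XML, else the parser's error message."""
    try:
        ET.fromstring(xml_data)
    except ET.ParseError as exc:
        return str(exc)
    return None


print(well_formedness_error("<a><b></b></a>"))           # None
print(well_formedness_error("<a><b></a>") is not None)   # True
```

For a real file, the function can be given the raw bytes, e.g. the result of reading the edited .fodt in binary mode.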

Thanks for suggesting the .fodt format for the cleanup. This avoids the difficulty of rebuilding the .odt zip file in a form that LO accepts when working directly on the content.xml file. I’ve already done lots of XML modifications in various contexts, either by working at the text level (always checking that the result is well-formed with xmllint or similar) or by working at the XML level (e.g. with XSLT, xmlstarlet or libxml2), so I don’t think that this should be a problem for me.