Writer: clarification needed about character attributes

My memory is failing me, but I read somewhere (Bugzilla? new release feature page?) that this new use of span elements was meant to facilitate interoperability with OOXML: indeed, in Microsoft’s format, all text portions of a paragraph are part of something called a text run, even if no special formatting is applied to them (one of the many reasons why this format is awful). It might be related to change tracking. In any case, I don’t think this is a good reason to abuse span’s.

I agree with all your comments and I too find this matter of <text:span> elements troubling. ISO/IEC 29500-1:2012(E) § on p.293 outlines the <w:r> (Text Run) element. I can understand that <text:span> is used as a way of mapping to this element to cater for interoperability, but the behaviour of adding to existing text should not necessitate this. It should be checked whether the element is required. I cannot find a related bug or information on this change from v3.x to v4.x.

Update: I found a related LO User ML thread which points to a related bug: fdo#68183. Unfortunately the answer appears to be related to revision tracking (the officeooo:paragraph-rsid property, which is what @CyanCG suggested, i.e. it is an OOXML compatibility feature). In my examples above I did not have revision tracking turned on, so I would think this unnecessary.

This is the bug I had in mind. The comment by Holger Schmithüsen nails it pretty well. Who do we need to convince to see this bug addressed? This officeooo:rsid attribute is a hack and is absolutely unnecessary for those who actually use the OpenDocument format because of its specific virtues. I think that’s how the issue should be presented. OOXML compatibility should never have negative side-effects for those who choose ODF.

Well, I did a bit more research and it appears I was wrong in my initial conjecture that this was an OOXML-related change. The change relates to the feature for comparing documents. I have updated my answer to be clearer about this. Bug fdo#52028 provides the details. This is still little comfort if you would like the underlying XML to be cleaner. All I can suggest is raising a bug to address this, but unfortunately you will need to be incredibly specific in your detail of the problem.

Good, at least this appears to be a better reason for introducing those rsid’s. I might eventually raise a bug and describe the rationale, the use cases, the best practices etc. Maybe I’ll ask for advice on TeX.SE first by asking a question along the lines of “is it advisable to define \newcommands for marking up subsequent additions and revisions to a document?”. That would give us some food for thought!

Applying more than one text (character) style to a portion

I am using LibreOffice 4.1 on OS X 10.8.

It is indeed possible to apply more than one character style to a given portion of text. Take the following example with the two styles you mention (Source Text and Emphasis):

<text:p text:style-name="P1">This is what
<text:span text:style-name="T1">emphasized </text:span>
<text:span text:style-name="Source_20_Text">
<text:span text:style-name="Emphasis">code</text:span></text:span>
looks like.</text:p>

In this first example, I have added the string emphasized afterwards, so that it is enclosed in its own span element with an automatic text style T1. Reminder: styles that apply to text strings are called character styles in LibO and text styles in the ODF spec. The T1 style is defined thus:

<style:style style:name="T1" style:family="text">
  <style:text-properties officeooo:rsid="000c8b60"/>
</style:style>

The only defined attribute is an officeooo:rsid, which shows that this style’s only purpose is document comparison (this makes me grumpy). Apart from that, we can see that it is quite possible to apply two character styles to the same portion. In fact, there are two ways to say it:

  • LibO speak: It is possible to apply multiple different character styles to one text portion;
  • Spec speak: It is possible to enclose a given text node in multiple nested span elements with different text:style-name attributes.
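To make the “spec speak” version concrete, here is a minimal Python sketch (standard library only) that walks the paragraph from the first example and reports, for each text node, the chain of span styles enclosing it. The helper name span_style_chains is my own, and I have added the xmlns:text declaration so that the fragment parses on its own. It illustrates the nesting and is not a description of how LibO reads the file.

```python
import xml.etree.ElementTree as ET

# Namespace of the ODF text: prefix, declared here so the fragment is self-contained.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
STYLE_ATTR = f"{{{TEXT_NS}}}style-name"
SPAN_TAG = f"{{{TEXT_NS}}}span"

SAMPLE = (
    f'<text:p xmlns:text="{TEXT_NS}" text:style-name="P1">This is what '
    '<text:span text:style-name="T1">emphasized </text:span>'
    '<text:span text:style-name="Source_20_Text">'
    '<text:span text:style-name="Emphasis">code</text:span></text:span>'
    ' looks like.</text:p>'
)

def span_style_chains(element, chain=()):
    """Yield (text, enclosing span style names), outermost style first."""
    if element.tag == SPAN_TAG:
        chain = chain + (element.get(STYLE_ATTR),)
    if element.text:
        yield element.text, chain
    for child in element:
        yield from span_style_chains(child, chain)
        if child.tail:
            # Text after a closing tag belongs to the parent, not to the child.
            yield child.tail, chain

chains = list(span_style_chains(ET.fromstring(SAMPLE)))
for text, chain in chains:
    print(repr(text), chain)
```

The word code comes out with the chain ('Source_20_Text', 'Emphasis'), which is the “multiple nested span elements” case described above.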

Remark on LibO’s behaviour: the effects are cumulative, so that the word code in this example is displayed in monospaced oblique type (for a monospaced font, the proper term is oblique, as there is no italic shape to speak of, but of course LibO does what is expected and chooses the oblique font).

Remark on implementation: according to the ODF 1.2 spec (part 1, section 19.770), a text:class-names attribute exists for the purpose of applying more than one text style to a node:

A text:class-names attribute specifies a white space separated list of style names. The referenced styles are applied in the order they are contained in the list.

If both text:style-name and text:class-names are present, the style referenced by the text:style-name attribute is applied before the styles referenced by text:class-names attribute. If a conditional style is specified together with a text:class-names attribute, but without a text:style-name attribute, the text:style-name attribute is assumed to have the value of the first style name in the list defined by the text:class-name attribute.

I find that very useful and desirable. One potential use case is to apply a style that defines the visual presentation of the element with a text:style-name attribute (say, "Emphasis" to denote a change in tone with italic type) and also apply other styles with a purely semantic meaning and without any particular visual distinction, with a text:class-names attribute (e.g. "Archaic Eponym"). Perhaps that sounds convoluted, but it’s the kind of thing I have to do in my work and I am sure I am not alone.
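The ordering rule quoted from the spec can be sketched in a few lines of Python. This is only my reading of the two quoted sentences: the function name application_order is made up, the conditional-style clause is not handled, and I am treating “Archaic Eponym” as two hypothetical styles, Archaic and Eponym, since the list is white space separated.

```python
def application_order(style_name=None, class_names=""):
    """Return style names in the order the quoted spec text says they are applied."""
    ordered = []
    if style_name:
        # text:style-name is applied before anything listed in text:class-names.
        ordered.append(style_name)
    # text:class-names is a white space separated list, applied in list order.
    ordered.extend(class_names.split())
    return ordered

print(application_order("Emphasis", "Archaic Eponym"))
```

Under that reading, a span carrying text:style-name="Emphasis" and text:class-names="Archaic Eponym" would get Emphasis first, then Archaic, then Eponym.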

Unfortunately, in my experience, LibO does not write text:class-names attributes in the ODF it produces. Perhaps I just haven’t experienced enough, though.

Let us try with user-defined styles now. Here is the second paragraph in my document:

<text:p text:style-name="P2">I like
<text:span text:style-name="Plant">
<text:span text:style-name="Drink">tea</text:span></text:span>
very much.</text:p>

I have defined two character styles in LibO for this document: Plant and Drink. Plant makes the text green and Drink makes it italicized. I have applied them both to the word tea because tea is the name for both the plant and the drink made from the plant ;-). Again, the effects are cumulative: the word tea in my document is now both green and italicized, but again we have two span elements instead of one element with both a text:style-name and a text:class-names attribute. Note that I applied Plant first and then Drink and, accordingly, the span with Plant encloses the one with Drink (the span with the Plant attribute comes first in the document’s hierarchy).
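The “cumulative” effect can be pictured as a simple merge of property sets. The sketch below is hedged in several ways: the property values for Plant and Drink are my guesses at what such styles would contain (the attribute names follow ODF’s fo: vocabulary), and the rule that an inner span wins on conflict is an assumption that nothing in this test confirms, since the two styles do not overlap.

```python
# Hypothetical property sets; the real style definitions are not shown in this answer.
STYLES = {
    "Plant": {"fo:color": "#008000"},
    "Drink": {"fo:font-style": "italic"},
}

def effective_properties(style_chain, styles=STYLES):
    """Merge properties outermost span first; later (inner) styles overwrite earlier ones."""
    merged = {}
    for name in style_chain:
        merged.update(styles.get(name, {}))
    return merged

print(effective_properties(["Plant", "Drink"]))
```

With non-overlapping styles the order does not matter, which is why tea ends up both green and italicized whichever span encloses the other.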

Remark on the paragraph element in this second example: it has a different style (P2), even though the first and the second paragraph have the same level, the same appearance and the same purpose. Makes me grumpy. For people who want to produce structured and semantic documents, it is a pain. But this will change again in a future LibreOffice version, won’t it? ;-)

Now, let us try to understand how LibreOffice represents this document to itself.

How do we inquire about the character styles applied to a text portion?

A distinction that must be made clearly here is that, much like most applications that consume some kind of XML, LibreOffice does not use XML data structures “natively”. It is still true that ODF is LibO’s native format in a certain sense, but any application that supports many document formats needs some kind of internal representation. In our case, we could call it a StarWriter document object or something like that (but that might summon some old ghosts, so I’m not going to insist on that).

In order to give a full account, I would have to define concepts which are specific to OpenOffice/LibreOffice, but this is beyond my means. Suffice it to say that in the application, the text document contains objects that have properties and methods and that the objects that “wrap” others together are called services. The OpenOffice Developer’s Guide (a resource of very uneven quality) has a page that distinguishes further in Appendix A. To put it relatively simply, the structure of text paragraphs is so:

  • Paragraphs are services that include a TextContent service.
  • The actual text is stored in TextPortion services. The description of the TextPortion service in the API documentation is very unclear about the actual status of the service and its relation to the Paragraph service.
  • In any case, the TextPortion is not the thing that has a style: the TextPortion service includes a TextRange service. This service is the one that has the actual CharacterProperties service.
  • In the CharacterProperties service, there are two properties that are relevant to our inquiry: CharStyleName and CharStyleNames. Both are optional. The data type of the CharStyleName property is “string”, while the data type of the CharStyleNames property is a sequence of strings.

Logically, the CharStyleName property should correspond to the text:style-name ODF attribute and the CharStyleNames property should correspond to text:class-names. This is more or less the case, except that the description of the CharStyleNames property says:

It is not guaranteed that the order in the sequence reflects the order of the evaluation of the character style attributes.

If the ODF spec says that “[t]he referenced styles are applied in the order they are contained in the list”, then why is that? I have no answer…

However, now we know what properties we must inquire about. Here is a very dumb subroutine that only works in a very specific situation. When selecting a single styled word in the document, the subroutine prints the values of the style properties. Actually, it generates an error if the selection only has one style applied. The only purpose of this lousy code is to print the styles of a portion which we know has two styles:

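sub print_char_styles_of_viewcursor()

' Prints the character style properties of the current selection.
' Assumes the selection carries at least two character styles; errors out otherwise.

dim CurrentDoc as object ' in fact the document's controller, not the document itself
dim FirstStyleofSelection as string
dim StyleListofSelection(2) as string

CurrentDoc = ThisComponent.CurrentController
' CharStyleName is a single string; CharStyleNames is a sequence of strings.
FirstStyleofSelection = CurrentDoc.getViewCursor().CharStyleName
StyleListofSelection() = CurrentDoc.getViewCursor().CharStyleNames()

' Only the first two entries of CharStyleNames are printed.
print "The values of the selected portion’s character style properties are: “"; _
FirstStyleofSelection; "” for CharStyleName; “"; _
StyleListofSelection(0); ", "; StyleListofSelection(1); "” for CharStyleNames."

end sub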

Remarks on the code:

  • Even though I said that character styles actually apply to text ranges, here I am using StarBasic to access properties, I am working on portions and I do get the character styles.
  • I mentioned that in the API (IDL reference), the CharStyleNames property is defined as a sequence of strings. Here, from the perspective of StarBasic, it is an array of strings. My assignment of StyleListofSelection to the value of CurrentDoc.getViewCursor().CharStyleNames() is basically the assignment of an array to another array. When you do this in real life, you need to be careful.

When I select the word tea and then run this subroutine, the following message is printed:

The values of the selected portion’s character style properties are: “Drink” for CharStyleName; “Plant, Drink” for CharStyleNames.

When I select the word code, this message is printed:

The values of the selected portion’s character style properties are: “Emphasis” for CharStyleName; “Source Text, Emphasis” for CharStyleNames.

In both cases, the value of the CharStyleName property is the style that was applied last, but the order of the styles in CharStyleNames corresponds to the order in which they were applied. This suggests that the order is indeed not respected when it comes to deciding which style has priority.
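Since the subroutine above breaks when fewer than two styles are applied, here is a rough sketch of a more forgiving formatter, written in Python so that it can be tried outside the office suite. It only assumes an object exposing CharStyleName and CharStyleNames; the stand-in cursor below is built from the values reported for the word tea. In a real macro the object would come from the controller’s getViewCursor(), and whether CharStyleNames comes back empty or void for unstyled text is something I have not checked, so both cases are tolerated.

```python
from types import SimpleNamespace

def describe_char_styles(cursor):
    """Format CharStyleName and CharStyleNames for any object exposing both properties."""
    single = getattr(cursor, "CharStyleName", "") or ""
    # Tolerate a missing, void or empty sequence instead of indexing blindly.
    names = getattr(cursor, "CharStyleNames", None) or ()
    return "CharStyleName: {}; CharStyleNames: {}".format(single, ", ".join(names))

# Stand-in for a view cursor, using the values observed for the word "tea".
fake_cursor = SimpleNamespace(CharStyleName="Drink", CharStyleNames=("Plant", "Drink"))
print(describe_char_styles(fake_cursor))
```

Joining the whole sequence rather than reading entries 0 and 1 means a portion with one style, or with three, is reported without an error.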

This is all very interesting, but the question remains: how do we accurately control the assignment of character properties? When writing the ODF, LibO does respect the order of assignment, but this is not reflected in the document’s data structures inside the application. I think a bug should be raised about the specific issue of the character styles’ priority. If the CharStyleNames property was mapped to the text:class-names attribute, that would be a big improvement.

@CyanCG: Congratulations on the depth of this answer

  1. Character styles are cumulative (“or” in my wording): then how do we remove an attribute, like contour? (I don’t take bold since font family may come with several bold values, e.g. Univers). Re: your remark about conditional emphasis.
  2. My usage of character style is close to semantic markup. I experiment afterwards with visual attributes in the style definition until the distinctions are “visible” (simultaneously trying to keep the traditional typographic usages).
  3. Ergonomics: from a user point of view, there should be some highlighting in the style navigator to show which character styles are active in the selection, not the last (?) one only (the StarBasic hack is not a solution). Presently, to be sure, I always first reset everything to Default before applying new styles.
  4. Ergonomics bis: I try to avoid double markup whenever possible; I think it is better to have a single style with a meaningful (and possibly mnemonic) name. Drawback: many styles are created.

I agree, and my solution for now is also to apply “Default” and then re-apply the style I really need. If a single attribute is removed with direct formatting (an automatic style in the spec terminology) then it takes precedence over any applied style (be it user-defined or application-defined). That complicates things further.

Terrific analysis with which I agree. A few small things (all near the beginning): “the string emphasis afterwards” should read “the string ‘emphasized’ afterwards”; near the bullet points perhaps “multiple character styles” rather than “two character styles” (twice); rather than “slanted” I think the correct term is “oblique”. The commentary about reverting to Default formatting and direct format overrides I also agree with (in despair). I too cannot obtain text:class-names from testing.

Answer amended. Slanted is also in common usage, but the article on Wikipedia suggests that oblique is indeed the preferred term :-).

From the user’s point of view this introduces some massive problems.
I like using LibreOffice for writing my books. In most respects it’s great.
However, one area which causes me big problems is producing the variant editions of a book.
To be specific, I have an A-format edition which uses 9pt text for the Chapter Body paragraph style; the same para style uses 10.5pt text for the B-format edition. Similarly for other para styles, like Chapter Heading.
However, when I copy the body of the MS from one file into the other (e.g. to create the A-format from the B-format) to create the other edition, it seems a random set of paragraphs fails to take the font size from the para style of the target document.
In addition, a small but (to the user) random amount of text is copied but loses the italic style.
Coupled with bugs in finding italic text, and bugs in comparing documents, the underlying problem of unexpected changes to the copied text’s format (font size, italics), is very difficult and time-consuming to fix.
I just thought I’d make a note of the issue here while I now go and look to see if there’s a bug report.

I initially came here from https://bugs.documentfoundation.org/show_bug.cgi?id=122215 in the hope of discovering the recipe that would avoid at least the problem of the italic property being lost on text apparently randomly. I assume it’s a problem that sometimes appears when using direct formatting.
I’m unclear what direct formatting is, but I have picked up hints that it’s bad and causes troubles.
But I don’t know what the preferred method of formatting is that avoids the troubles.
Since I always apply emphasis the same way (via selecting the text to be emphasised and then using Ctrl-I), I don’t understand why 90%+ of the text keeps the italic attribute, but some text doesn’t.
I want to be a good user, using paragraph and page and character styles correctly to support my workflows, but I haven’t found a solution to this problem yet.

As pointed out in the bug report and elsewhere, direct formatting is the usual cause of the problem. Direct formatting is any action aimed at changing text attributes without styles, such as keyboard shortcuts (for italics, bold, …) or toolbar buttons (same + lists, …). These seem “natural” because M$ Word does it this way, having no character style.

Direct formatting is “sticky” and “invisible”: in the layered styles model, it sits on top and has no hint in the various style panels and menus. It survives copy from area to area or document to document. The only way to get rid of it is to select a wide range of text and press Ctrl+M or choose Format>Clear Direct Formatting.
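A toy model of the layering may help to see why pasted text can ignore the target paragraph style. This is only a sketch in Python of the precedence described here (direct formatting over character style over paragraph style). The function name and the attribute dictionaries are invented for illustration, with the 9pt and 10.5pt sizes borrowed from the A-format and B-format editions mentioned earlier.

```python
from collections import ChainMap

def effective_format(paragraph_style, character_style, direct_formatting):
    """Resolve attributes with direct formatting on top, then character style, then paragraph style."""
    # ChainMap looks keys up in the first mapping that defines them.
    return dict(ChainMap(direct_formatting, character_style, paragraph_style))

paragraph = {"font-size": "10.5pt", "font-style": "normal"}  # target paragraph style
emphasis = {"font-style": "italic"}                          # character style
pasted_direct = {"font-size": "9pt"}  # hypothetical leftover direct formatting

print(effective_format(paragraph, emphasis, pasted_direct))
```

In this model the paragraph style asks for 10.5pt, but the leftover direct attribute wins and the text stays at 9pt until the direct formatting is cleared.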

I have a similar need to yours (though not exactly the same). My solution is to avoid direct formatting and exclusively use styles (para, character, page and list). This means a very strict discipline while writing and probably some discomfort because of character style application instead of keyboard shortcuts…

(continued 1) unless you reconfigure LO Writer in depth to map the usual Ctrl+I or Ctrl+B to your character styles.

However, you cannot fully forbid direct formatting because some actions have no style equivalent, e.g. resetting list numbering. These events are rare enough in my documents I accept the risk of living with it.

My workflow is to consider that I put (semantic) markup on the text with styles, i.e. I don’t request bold or italics but I mark a sequence as “important” or “outstanding” (Emphasis or Strong emphasis are two candidate built-in styles). Then, afterwards (in fact rather beforehand, when I designed my template), I decide whether such marked sequences should display bold or red. My goal is to separate the content and its semantics from the appearance or presentation. This imposes many constraints but eliminates problems when reviewing or preparing for another output medium.

In this respect, what I miss most is …

(continued 2) … the possibility to mark a sequence with more than one character style (as can be done in Quark XPress®), e.g. a sequence may be marked up as “comment” and I want to put “emphasis” on a word without losing the “comment” markup. Presently I solve the issue with another character style merging both original ones (not satisfactory).

Similarly, I’d like to be able to negate an attribute: Emphasis is usually coded “bold”, but if base style of paragraph style is already “bold”, typographic rules say this emphasis should revert to “Roman”. Can’t be done today in Writer apart from creating a “complementary” style.
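The “negation” I have in mind is nothing more than a toggle relative to the base style. As a sketch of the wished-for behaviour in Python (the function name is made up, and this is not something Writer does):

```python
def emphasis_weight(base_weight):
    """Return the weight an emphasis style would need in order to stand out from the base weight."""
    # If the surrounding text is already bold, emphasis reverts to normal (Roman) weight.
    return "normal" if base_weight == "bold" else "bold"

for base in ("normal", "bold"):
    print(base, "->", emphasis_weight(base))
```

A single Emphasis style defined this way would replace the pair of “complementary” styles needed today.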

Despite these shortcomings, I haven’t experienced the random non-updates, probably because I struggle to avoid direct formatting. As I wrote, it is not very user-friendly while typing but it is rewarding on editing.

That’s a very helpful answer, thank you.
It’s depressing however, as it means there’s a severe and ongoing usability problem, as well as a subtle and largely invisible trap for most users.
If the bug in being able to find text by attribute were fixed (Find & Replace, using Format, makes F&R unreliable), then a workaround might be to Find All italics, then simply choose one’s Character Emphasis style and apply that.
Does that sound practical?
Or does Writer’s “layered attribute” (?) model of text mean that finding text by the visible attribute the user is able to detect, can never be reliable?

I use Find & Replace sparingly, probably because of my careful use of styles. I just looked at the F&R dialog to refresh my memory. The Format button opens a font selection dialog. When you choose Italic in there, you’re in fact telling LO Writer to look for a font variant. If you applied your italics with direct formatting, i.e. Ctrl+I, toolbar button or menu equivalent, I’m not sure what gets recorded in the XML or internal representation.

Styling with a style involving a real italic font should not create problems and should be relatively reliable. Direct formatting works even for fonts without the specific variant (when rendering, the font engine “manufactures” a synthetic italic or bold version of the font). The XML encoding is probably not the same, meaning the search strategy doesn’t consider the same “keys” as in the previous case.

The erratic behaviour you encounter may be due to a change of strategy in the middle of a search …

(continued) when F&R sees direct formatting. Then, it does not revert to the initial “pure” strategy and begins to get confused. That is pure speculation on my part.

Maybe the layered style architecture is also a factor. That’s why I keep away from direct formatting, to get one layer out of the game. Consider that direct formatting is OK for experimenting but should never be used for production-quality documents.

Thanks. When the italic font exists for the regular font you’re using, I think it’s reasonable to expect that searching for the italic font will find occurrences of text that Writer shows (via text display and the Font toolbar), to be that italic font. I think that expectation is reasonable regardless of whether the italic font text was produced via a style or via direct formatting.

Given other bugs, I believe Writer does not properly “understand” or operate on its own representation. Hopefully these issues can be addressed. I can do my part by providing one or two more bug reports to help.

I understand what you’re saying about the problems of direct formatting, caused IMHO by Writer’s model and UI and documentation. I doubt I could convince the devs to redesign the model, but hopefully they can fix the bugs in its implementation, and I can do my small part to help with the documentation where I see it lacking, too.

Thanks for your ongoing comments and explanations - it helps!