Why are non-breaking spaces rendered with a gray background?

rhimbo · February 25, 2022, 12:14am

I’ll try to explain with an example. I’ve attached two screen grabs of some text from a real document I wrote in French. Originally I wrote the text using TextEdit on my Mac (no special reason why I didn’t start writing in Writer). I then copied and pasted the text into LO Writer. That’s when I noticed the non-breaking spaces throughout the document.

Here is one particular instance:

1st screen grab (file ‘with-non-break-space.png’): there is a non-break space after the ‘«’ character at the beginning of the 3rd line of text; this was the result upon pasting some text into my Writer document.

2nd screen grab (file ‘without-non-break-space.png’): there is a regular space after the ‘«’ character at the right end of the 2nd line of text. I manually changed the non-break space to a regular space in Writer. The result was that Writer placed the ‘«’ on the 2nd line, but placed the words ‘gagnant-gagnant’ on the 3rd line. This formatting is grammatically incorrect. I don’t know how and where Writer does auto-correct (correct on the fly), so I can’t comment on why this is happening.

So it appears to me that the Writer algorithm relies on a non-break space to determine where to break that sequence of characters. This shouldn’t be necessary. Even in the presence of a regular space, Writer should know to break the line so that the ‘« gagnant-gagnant’ stays together. I don’t see why the non-break spaces are necessary in these scenarios.

Finally, to respond to your comment about fonts…. Ultimately any algorithm for breaking must consider the font for each character. Each character could be in a different font, size or set of attributes such as bold or italic, all affecting the width needed to display the character. But font issues should never take precedence over grammar rules for proper breaking. So highest precedence should go to the language rules. Then, upon calculating a specific break, the algorithm should determine if there is enough space to display the portion of text after which the break will appear. If not, then the algorithm should back up to the previous grammatically correct break in the text.

By the way, earlier when I said that “a space is a space” I meant that a space is just a space grammatically in the context of written language, not in regard to fonts.

Hope I’m explaining it adequately….

rhimbo · February 25, 2022, 12:14am

Sorry, I could only attach one image in the previous message. Here is the second screen grab.

ajlittoz · February 25, 2022, 10:28am

Unicode potentially contains 17×65536 characters. Presently 3 of these 65536-character planes are defined, with another one under scrutiny. With such a cardinality, you must understand that generic properties have to be defined to ease parsing of Unicode text. Handling a handful of properties is preferable to handling around 1M individual characters. The Unicode Consortium has done tremendous work in classifying the characters.

One of the properties is “no-break”, i.e. a line break can’t be made on either side of a character with such a property. The ordinary U+0020 SPACE character is not tagged as “no-break” and is not even a real character: any application is free to replace it with spacing whose width is computed by a justification algorithm. Hence a space may end up with any width, either narrower or wider than what is specified in the font attributes.

Since some punctuation marks must never be separated from adjoining “words” (definition left to the application), some form of no-break space was invented very early in computer history. This was standardised as the ISO-8859-x 0xA0 NBSP character, carried over into Unicode as U+00A0 NO-BREAK SPACE, and given the no-break property. So, in any language, when NO-BREAK SPACE is encountered, the app knows it must not allow a line break on either side of it, and we get a uniformly correct grammatical result.
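To make the "property" idea concrete, here is a minimal Python sketch. It assumes the `<noBreak>` tag in a character's compatibility decomposition is a good enough stand-in for the no-break property; the real line-breaking classes live in a separate Unicode data file that Python's standard library does not expose. The helper name `is_no_break` is made up for this illustration.

```python
import unicodedata

def is_no_break(ch: str) -> bool:
    """True if the character's compatibility decomposition is tagged <noBreak>.

    This is only a proxy for the no-break property discussed above, not the
    full Unicode line-breaking classification.
    """
    return unicodedata.decomposition(ch).startswith("<noBreak>")

# Compare the ordinary space with two no-break spaces.
for ch in (" ", "\u00a0", "\u202f"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: no-break={is_no_break(ch)}")
```

The point is that the check needs no knowledge of the language the text is written in: it looks only at the character itself.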

Unfortunately, the choice of this character was not reconsidered when apps converted to Unicode. Existing documents can continue, for compatibility reasons, to use NO-BREAK SPACE, but since it is rather English-culture-related, it doesn’t follow other national typographical usages. Unicode offers many more fixed-width spaces, from HAIR SPACE to EM SPACE, to cope with very “polished” work (only a few of them, such as NARROW NO-BREAK SPACE, are themselves no-break). These spaces, like NO-BREAK SPACE, don’t incur contraction/expansion under justification. They are used at their font-declared width, resulting in exactly the expected effect.
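As a rough illustration of "no contraction/expansion under justification", here is a toy Python sketch. It assumes a monospace model where every character is one column wide (a real layout engine works in font units), and the function name `justify` is hypothetical. Only U+0020 gaps are widened; NO-BREAK SPACE keeps its width.

```python
def justify(line: str, width: int) -> str:
    """Pad `line` to `width` columns by widening only U+0020 gaps.

    Toy model: one character = one column. U+00A0 is treated as an
    ordinary, fixed-width part of a word.
    """
    words = line.split(" ")            # splits on U+0020 only, not on U+00A0
    gaps = len(words) - 1
    extra = width - len(line)
    if gaps == 0 or extra <= 0:
        return line                    # nothing to stretch
    base, remainder = divmod(extra, gaps)
    out = []
    for i, word in enumerate(words[:-1]):
        # Leftmost gaps absorb the remainder, one extra column each.
        out.append(word + " " * (1 + base + (1 if i < remainder else 0)))
    out.append(words[-1])
    return "".join(out)

line = "un accord «\u00a0gagnant-gagnant\u00a0» durable"
print(repr(justify(line, 40)))
```

The two no-break spaces inside the guillemets stay exactly one column wide, while the ordinary spaces between words grow.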

The advantage of relying on character properties is obvious: parsing text is faster and more efficient. It is also independent of language, making the line wrap algorithm universal. And I add, since people nearly always forget to tag their text with the language in use, line wrap and no-break are also available for them, without knowing the reference grammar.
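A small Python sketch of such a language-independent wrap, under simplifying assumptions: one column per character, and U+0020 as the only break opportunity (a real implementation follows the full Unicode line-breaking rules and measures glyph widths). The function name `wrap` and the sample sentence are invented for the example.

```python
def wrap(text: str, width: int) -> list[str]:
    """Greedy line wrap that breaks only at U+0020 SPACE.

    U+00A0 is never a break point, so whatever it joins stays on one line.
    """
    lines, current = [], ""
    for chunk in text.split(" "):      # U+00A0 does not split here
        if not chunk:
            continue                   # ignore runs of several spaces
        candidate = chunk if not current else current + " " + chunk
        if len(candidate) <= width or not current:
            current = candidate        # fits, or is an overlong first chunk
        else:
            lines.append(current)      # back up to the last allowed break
            current = chunk
    if current:
        lines.append(current)
    return lines

with_nbsp = "un partenariat «\u00a0gagnant-gagnant\u00a0» pour tous"
with_space = "un partenariat « gagnant-gagnant » pour tous"
print(wrap(with_nbsp, 20))
print(wrap(with_space, 20))
```

With no-break spaces the guillemets travel with ‘gagnant-gagnant’; with regular spaces the ‘«’ is left stranded at the end of the first line, which matches the behaviour described in the screen grabs.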

My main criticism of AutoCorrect, which has language-dependent customisations where a language’s grammar can be taken into account, is that it perpetuates the widespread use of NO-BREAK SPACE instead of a thinner space, which would be typographically correct while the former is not. And this is not a matter of grammar.

mikekaganski · February 25, 2022, 10:51am

First of all, please note that these specific non-breaking spaces didn’t come from Writer; from your description, it looks like they came from TextEdit, which likely uses them in a similar manner.
Then (and here I might be completely wrong, since I’m not a French speaker) I seem to recall that there are some grammar rules in French that prescribe NBSP when writing on electronic devices (again, I might be wrong about this).
But most important is that your idea, i.e., Writer using grammar rules to lay out text, is (a) really difficult to implement properly, and (b) means that the document will look different on different systems.

(a) It means that Writer will absolutely need a correct grammar checker. The word “correct” here is crucial, and it must mean something very specific: according to which rules is it “correct”? Does the engine fit these rules perfectly? Does it mean that you can’t write using the rules that were used in the 1960s, or in the XVII century? Or with rules that some scientific magazine requires from its authors, which happen to differ from the established rules? It is hard both to expect that a volunteer-driven project would be able to guarantee that all its languages have such perfect grammar engines (to be honest, I’m not aware of a single perfect engine of that kind, be it free or commercial, open or proprietary), and to put such constraints on the software’s users (“you can only do what the grammar tells you, no matter what you want”). I don’t even want to start talking about the importance that the proper language property of a specific text part would have: you’d have to make sure that spaces and punctuation have the correct language in order to be laid out properly. That interferes with users switching keyboard layouts at will, which affects the characters’ language; this is very convenient for people, but would become a burden if your proposal were implemented. Note that we already have a small piece that has the mentioned problems, namely hyphenation.

(b) But when you open the document on a system where LibreOffice is installed without the French grammar module, what will it show? How would Writer know where to break the line? Or maybe you open it in Writer at a later point, after you have upgraded and the grammar module has been changed to account for some (new) rules, and suddenly you see your document “broken” (images and text shifting to other pages, and your colleagues’ references like “see line 3 at page 4” no longer making sense). Again, this is what already happens today when you open a document with hyphenation and you don’t have the corresponding language installed, or when the module is different.

And I haven’t even mentioned possible performance issues here, or how the text could jump left and right while you type, depending on how much has already been typed and which rules happen to apply to this partial text …

It might look appealing to hope that the computer can be smart; but in fact, the current behavior is better in very many ways. When computers try to be too smart, they outsmart users in bad ways.

Wanderer · February 25, 2022, 8:50pm

Expecting perfect grammar for all languages LibreOffice supports, so Writer can do this by rules… I don’t think this will come in my lifetime. There is not even spell-checking for all languages.

If you really think this is possible, I suggest searching elsewhere. If you find an open-source routine, you may file a request for enhancement with the LO developers through Bugzilla.

mikekaganski · February 23, 2022, 10:16am

No. As you have already learned, a space is not always just a space.
Taking into account the question preamble

it seems that your question is about data pasted from other sources; let me guess it’s about copying from web pages. And use of &nbsp; or &#160; (the codes for non-breaking spaces) in HTML is widespread. So some spaces on web pages are non-breaking, which is shown with the background, telling you that this character may behave differently than you expect (besides being non-breaking, it also e.g. doesn’t change its width when you justify your text).

You may simply ignore it (it’s just a visual cue that helps editing, and is not present when you print or export to PDF); or you may disable field shading in the View menu; or you may replace it using Find & Replace: just copy the character from the document, paste it into the Find field in the dialog, and press Space in the replacement field.
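If you prefer to clean the text before it ever reaches Writer, the same find-and-replace idea can be sketched in a few lines of Python. This is only an illustration outside of LibreOffice: the helper names `find_nbsp` and `replace_nbsp` and the sample HTML fragment are made up, and it handles U+00A0 only, not other no-break characters.

```python
import html

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

def find_nbsp(text: str) -> list[int]:
    """Return the positions of every U+00A0 in the text."""
    return [i for i, ch in enumerate(text) if ch == NBSP]

def replace_nbsp(text: str) -> str:
    """Replace every U+00A0 with an ordinary U+0020 space."""
    return text.replace(NBSP, " ")

# HTML entities for the non-breaking space decode to the same character.
pasted = html.unescape("&laquo;&nbsp;gagnant-gagnant&#160;&raquo;")
print(find_nbsp(pasted))           # positions of the no-break spaces
print(repr(replace_nbsp(pasted)))  # same text with regular spaces
```

Keep in mind that after the replacement the line may break right after ‘«’, as discussed elsewhere in this thread.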

rhimbo · February 24, 2022, 4:23am

Right, a space is not a space.

Actually I started a new mail message (Apple Mail application) just to use it as a scratch pad before copying and pasting into my LO file. I was playing around as I was trying to write in grammatically correct French and just started in the mail window before I opened a new LO Writer file.

rhimbo · February 25, 2022, 5:33pm

In short, I am not criticizing (and don’t believe I have criticized) any work that anyone has done to build LO. I just gave a suggestion for a design improvement.