Why are non-breaking spaces rendered with a gray background?

ajlittoz · February 23, 2022, 9:21am

The gray background (field shading) is a non-printing clue that what you see is not the standard character and therefore has special properties.

French AutoCorrect options allow you to insert a non-breaking space before some punctuation to follow usage of traditional typography. French Code typographique dictates that some punctuation should be preceded by une espace fine (thin space).

Why a non-breaking space? It prevents a line wrap between the word and the punctuation that follows, because a punctuation mark must never appear alone at the start of a line (or at the end of a line, in the case of an opening quotation mark).
Unfortunately, due to the history of character set development, the space inserted to abide by the typographical rule is U+00A0 NO-BREAK SPACE, a carry-over from the limited 256-character ISO-8859-x sets, which offered only SPACE and NO-BREAK SPACE. NO-BREAK SPACE is much too wide to qualify as a thin space, giving an ugly result where the space before punctuation is frequently wider than the inter-word spacing (the exact opposite of the desired effect).
For a better, compliant result, you should disable the AutoCorrect option and manually insert U+202F NARROW NO-BREAK SPACE instead of the implemented NO-BREAK SPACE. This narrow space is available as Insert>Formatting Mark>Insert Narrow No-Break Space or Alt+Shift+Space under Linux (may differ under macOS).
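
For text that already contains U+00A0 before punctuation, the substitution can be sketched in a few lines of Python. This is only an illustration: the function name `narrow_french_spaces` and the punctuation set are assumptions made for this example, and it says nothing about how AutoCorrect itself is implemented.

```python
import re

NBSP = "\u00A0"    # NO-BREAK SPACE, what AutoCorrect inserts
NNBSP = "\u202F"   # NARROW NO-BREAK SPACE, closer to the "espace fine"

# Assumed punctuation set for this sketch; ':' is left out on purpose
# because usage differs on whether it takes a thin or a full-width space.
_BEFORE = re.compile(f"{NBSP}(?=[;!?»])")
_AFTER = re.compile(f"(?<=«){NBSP}")

def narrow_french_spaces(text: str) -> str:
    """Replace U+00A0 adjacent to French punctuation with U+202F."""
    text = _BEFORE.sub(NNBSP, text)
    return _AFTER.sub(NNBSP, text)

sample = "«\u00A0Gagnant-gagnant\u00A0»\u00A0? Oui\u00A0!"
print(ascii(narrow_french_spaces(sample)))
```

A NO-BREAK SPACE that is not next to one of the listed punctuation marks is left untouched, so spaces entered deliberately elsewhere survive.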

This is where developers could help by changing the AutoCorrect parameters. I’m afraid, though, that they will invoke compatibility with existing documents to wave away the request.

rhimbo · February 24, 2022, 4:18am

Thanks for the detail. LO should upgrade to UTF-8 at least or wide character (4-bytes per character) to avoid all these problems. Unicode fell victim to all the international politics from nations who wanted their own native code sets to be replicated in the Unicode standard. I had to deal with that garbage politics for years when I worked on the SunOS and Solaris utilities while at Sun Microsystems.

Another approach is to use parsing rules to determine grammatically correct line breaks. There is a pretty damn good Java library (developed by IBM if I remember correctly) that enables you to subclass the basic package framework classes to declaratively define break rules for any language.

Perhaps LO should go that route if there are recurring problems with Unicode and various code sets such as ISO-8859-x.

Wanderer · February 24, 2022, 6:43am

And I thought they were already using Unicode … Where in this thread did you get the impression that LO is not using Unicode?

The problem is that there is a lot of old material in storage that people want to access and copy from. Many of them will not be happy when the documents they created are converted by a “better” algorithm that removes or changes non-breaking spaces or narrow spaces. And as most text is not really tagged with meta-information such as language, or even “don’t touch, this was aligned with 4 hours of manual editing for my printer”, there is no general rule for changing this.

mikekaganski · February 24, 2022, 7:01am

It looks as if @rhimbo is under the impression that Unicode and UTF-8 are something very different (not that UTF-8 is one of the Unicode encodings):

So it looks as if @rhimbo thinks that LO uses Unicode, but instead should use UTF-8 “or wide character (4-bytes per character)”

rhimbo · February 24, 2022, 5:13pm

I don’t know where you got the idea that I think LO “uses Unicode.” ‘Unicode’ is a standard that happens to include many code sets.

It would be more accurate to say that LO should use a standard code set. In reality it would have to ‘use’ several as it would have to be able to read files whose contents were encoded using various code sets.

UTF-8 would be one option. I only said that because, even today where processors are fast and memory is cheap, someone always says “Unicode takes up too much space in memory.”

But I haven’t looked at the source code to know how LO encodes text in memory. I assumed it was not an adequately large code set with accommodation for enough code points as someone mentioned the problems with ISO8859-x…

ajlittoz · February 24, 2022, 5:33pm

LO Writer is indeed Unicode. Whether it uses UTF-8, UTF-16 or UTF-32 is irrelevant to the problem I mentioned about NO-BREAK SPACE. That problem is the result of computing history and of a reluctance to change things that are definitely wrong but no longer questioned, because they are so familiar that they are accepted as permanent truth, even though Unicode allows handling them correctly.

I wouldn’t say that. Unicode doesn’t concatenate code sets. The first block of 256 codepoints is made of ASCII and most of the common intersection of ISO-8859-x. For the rest, it is structured by “scripts”, i.e. writing systems. And the standardisation body forbids duplications (except those introduced at inception for pre-composed characters, again to retain compatibility with ISO-8859-x). It clearly states that every codepoint describes a character and not a precise shape (glyph) for the codepoint. It follows that any codepoint may be displayed in a variety of shapes (leading to a wealth of font faces) while still being conformant with Unicode.

mikekaganski · February 24, 2022, 5:57pm

Heh. Unicode defines exactly one code set.

rhimbo · February 24, 2022, 6:20pm

That’s not quite accurate.

Unicode defines one encoding per character. A code set is a collection of code points. The encoding differs with code set.

rhimbo · February 24, 2022, 6:22pm

You’re right; it’s irrelevant. The mistake the LO designers made is similar to the gross disaster of HTML.

A character encoding should not be used to determine formatting. LO should fix this.

mikekaganski · February 24, 2022, 6:41pm

Unicode is a single set of codes (codepoints), each uniquely defining a character. It is a single “code set”. It does not define “one encoding per character” (who could ever imagine such a strange thing!); an encoding is e.g. UTF-8 (one encoding that can encode the whole set of Unicode codepoints), or UTF-7, or GB18030. Not even every Unicode encoding is defined by the Unicode standard itself, or even by the organization.

You seem to be talking about things you don’t know and, in claiming strange things, you make one mistake after another. It’s better to just stop.

mikekaganski · February 24, 2022, 7:06pm

Bit pattern is a matter of encoding, it’s the encoding that defines that.

A codepoint is the numeric value that uniquely defines a character, and it is independent of encoding that may be used to represent it on the wire, or it may even be written using a pen, without use of “bits”. Unicode defines a set of codes (codepoints), and various encodings define bit patterns.
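
The distinction can be seen with a couple of lines of Python. This is just a sketch using the standard `str.encode` method: the codepoint of NO-BREAK SPACE stays the same number, while each encoding produces a different byte sequence for it.

```python
nbsp = "\u00A0"  # NO-BREAK SPACE

# The codepoint is a number, independent of any encoding.
print(f"U+{ord(nbsp):04X}")

# Different encodings turn the same codepoint into different bytes.
for encoding in ("utf-8", "utf-16-be", "utf-32-be", "iso-8859-1"):
    print(f"{encoding:>10}: {nbsp.encode(encoding).hex(' ')}")
```

The same U+00A0 comes out as `c2 a0` in UTF-8, `00 a0` in UTF-16 (big-endian), `00 00 00 a0` in UTF-32 (big-endian) and `a0` in ISO-8859-1.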

Unicode has a single code set (set of codepoints).
Unicode is not ISO 10646 (although they are synchronized constantly); and ISO 10646 was not discussed up to this point (so introducing it here out of the blue is … funny).

Unicode does not “accomodate” different code sets (other than 7-bit ASCII, or rather 8-bit ISO-8859-1, since it’s the only subset of codes (numeric values) that is shared in the Unicode and in another character set).

The intelligence is luckily available to everyone.

Wanderer · February 24, 2022, 8:49pm

Actually the character encoding does not decide formatting alone. Based on the encoding we select a graphical representation from a font. The designer of the font decided on the actual space the character needs.

The use of the non-breaking space is widespread. It is not necessarily a character, but could be a code; still, I don’t think this would solve your complaints, and you didn’t say how to decide formatting your way…

rhimbo · February 25, 2022, 12:14am

I’ll try to explain with an example. I’ve attached two screen grabs from some text in a real document I wrote in French. Originally I wrote the text using TextEdit on my Mac (no special reason why I didn’t start to write using Writer). I then copied and pasted the text into LO Writer. That’s when I noticed the non-breaking spaces throughout the document.

Here is one particular instance:

1st screen grab (file ‘with-non-break-space.png’): there is a non-break space after the ‘«’ character at the beginning of the 3rd line of text; this was the result upon pasting some text into my Writer document.

2nd screen grab (file ‘without-non-break-space.png’): there is a regular space after the ‘«’ character at the right end of the 2nd line of text. I manually changed the non-break space to a regular space in Writer. The result was that Writer placed the ‘« ’ on the 2nd line, but placed the words ‘gagnant-gagnant’ on the 3rd line. This formatting is grammatically incorrect. I don’t know how and where Writer does auto-correct (correct on the fly) so I can’t comment on why this is happening.

So it appears to me that the Writer algorithm employs a non-break space to determine where to break that sequence of characters. This shouldn’t be necessary. Even in the presence of a regular space Writer should know to break the line so that the ‘« gagnant-gagnant’ remains together. I don’t see why the non-break spaces are necessary in these scenarios.

Finally, to respond to your comment about fonts…. Ultimately any algorithm for breaking must consider the font for each character. Each character could be in a different font, size or set of attributes such as bold or italic, all affecting the width needed to display the character. But font issues should never take precedence over grammar rules for proper breaking. So highest precedence should go to the language rules. Then, upon calculating a specific break, the algorithm should determine if there is enough space to display the portion of text after which the break will appear. If not, then the algorithm should back up to the previous grammatically correct break in the text.

By the way, earlier when I said that “a space is a space” I meant that a space is just a space grammatically in the context of written language, not in regard to fonts.

Hope I’m explaining it adequately….

rhimbo · February 25, 2022, 12:14am

Sorry, I could only attach one image in the previous message. Here is the second screen grab.

ajlittoz · February 25, 2022, 10:28am

Unicode potentially contains 17×65536 code points (17 planes of 65536 each). Presently 3 of these planes are defined, with another one under scrutiny. You must then understand that, with such a cardinality, generic properties must be defined to ease parsing of Unicode text. Handling a handful of properties is preferable to handling nearly 1M individual characters. The Unicode Consortium did tremendous work in classifying the characters.

One of the properties is “no-break”, i.e. a line break is not allowed on either side of a character with this property. The ordinary U+0020 SPACE character is not tagged as “no-break” and is not even a real character. Any application is free to replace it with spacing whose width is computed by a justification algorithm. Hence a space may end up with any width, either narrower or wider than what is specified in the font attributes.

Since some punctuation marks must never be separated from adjoining “words” (definition left to the application), some form of no-break space was invented very early in computer history. This was standardised as the ISO-8859-x 0xA0 NBSP character, carried over into Unicode as U+00A0 NO-BREAK SPACE, and given the no-break property. So, in any language, when NO-BREAK SPACE is encountered, the app knows it must not allow a line break on either side of it, and we get a uniformly correct grammatical result.
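
One place where this tagging is visible is the character database shipped with Python. The sketch below uses the standard `unicodedata` module; the helper name `is_no_break` is made up for this example. It relies on the `<noBreak>` compatibility decomposition tag, which is only a stand-in here: a real layout engine would normally consult the Line_Break property from the Unicode line breaking algorithm, which the Python standard library does not expose.

```python
import unicodedata

def is_no_break(ch: str) -> bool:
    """True if the character's decomposition is tagged <noBreak>."""
    return unicodedata.decomposition(ch).startswith("<noBreak>")

# SPACE, NO-BREAK SPACE, THIN SPACE, NARROW NO-BREAK SPACE
for ch in ("\u0020", "\u00A0", "\u2009", "\u202F"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch):<22} "
          f"category={unicodedata.category(ch)} no_break={is_no_break(ch)}")
```

All four characters share the general category `Zs` (space separator); only U+00A0 and U+202F carry the no-break tag, while the ordinary SPACE and THIN SPACE do not.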

Unfortunately, the choice of this character has not been reconsidered when apps converted to Unicode. Existing documents can continue, for compatibility reasons, to use NO-BREAK SPACE, but since it is rather English-culture-related, it doesn’t follow other national typographical usages. Unicode offers many more fixed-width spaces, from HAIR SPACE to EM SPACE, to cope with very “polished” work, though only a few of them (such as U+202F NARROW NO-BREAK SPACE) are no-break. These spaces, like NO-BREAK SPACE, don’t incur contraction/expansion under justification. They are used for their font-declared width, resulting in the exact expected effect.

The advantage of relying on character properties is obvious: parsing text is faster and more efficient. It is also independent of language, making the line wrap algorithm universal. And I add, since people nearly always forget to tag their text with the language in use, line wrap and no-break are also available for them, without knowing the reference grammar.
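
To make the point concrete, here is a toy greedy line wrapper in Python that knows nothing about French. It is a sketch under loud assumptions: the function `wrap` is invented for this example, character count stands in for real font metrics, and the only rule is "break at U+0020 SPACE and nowhere else".

```python
def wrap(text: str, width: int) -> list[str]:
    """Greedy line wrap that may break only at U+0020 SPACE."""
    lines: list[str] = []
    current = ""
    for chunk in text.split(" "):   # U+00A0 / U+202F never split a chunk
        if not chunk:
            continue                # collapse runs of spaces
        candidate = chunk if not current else f"{current} {chunk}"
        if len(candidate) <= width or not current:
            current = candidate     # fits, or is an unbreakable overlong chunk
        else:
            lines.append(current)   # start a new line with this chunk
            current = chunk
    if current:
        lines.append(current)
    return lines

plain = "un accord « gagnant-gagnant » pour tous"
nobreak = "un accord «\u202Fgagnant-gagnant\u202F» pour tous"
print(wrap(plain, 12))
print(wrap(nobreak, 12))
```

With ordinary spaces the opening guillemet is stranded at the end of the first line and the closing one at the start of the third; with U+202F the whole quoted expression stays on one line, without the wrapper having any notion of language.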

My main criticism of AutoCorrect, which has language-dependent customisations where the language’s grammar can be taken into account in the configuration, is that it perpetuates the widespread use of NO-BREAK SPACE instead of a thinner space, which would be typographically correct while the former is not. And this is not a matter of grammar.

mikekaganski · February 25, 2022, 10:51am

First of all, please note that these specific non-breaking spaces didn’t come from Writer; from your description, it looks like they came from TextEdit, which likely uses them in a similar manner.
Then, and here I might be completely wrong since I’m not a French speaker, I seem to recall that there are some rules in French that prescribe NBSP when writing on electronic devices.
But most important is that your idea, i.e., Writer using grammar rules to lay out text, is (a) really difficult to implement properly, and (b) means that the document will look different on different systems.

(a) It means that Writer will absolutely need a correct grammar checker. “Correct” is crucial here, and it must mean something very specific: according to which rules is it “correct”? Does the engine fit these rules perfectly? Does it mean that you can’t write using the rules that were used in the 1960s, or in the 17th century? Or with rules that some scientific magazine requires of its authors, which happen to differ from the established ones? It is hard to expect that a volunteer-driven project would be able to guarantee such perfect grammar engines for all its languages (to be honest, I’m not aware of a single perfect engine of that kind, be it free or commercial, open or proprietary), and equally hard to put such constraints on the software’s users (“you can only do what the grammar tells you, no matter what you want”). I don’t even want to start on how important the proper language property of each part of the text would become: you’d have to make sure that spaces and punctuation have the correct language in order to be laid out properly. That interferes with users switching keyboard layouts at will, which affects the language of the characters typed; this is very convenient for people today, but would become a burden if your proposal were implemented. Note that we already have a small piece that has the mentioned problems, namely hyphenation.

(b) But when you open the document on a system where LibreOffice is installed without the French grammar module, what will it show? How would Writer know where to break the line? Or maybe you open it in Writer at a later point, after an upgrade in which the grammar module was changed to account for some (new) rules, and suddenly you see your document “broken” (images and text shift to other pages, and your colleagues’ references such as “see line 3 on page 4” no longer make sense). Again, this is what already happens today when you open a document with hyphenation and you don’t have the corresponding language installed, or when the module is different.

And I haven’t even mentioned possible performance issues, or how the text could jump left and right while you type, depending on how much has already been typed and which rules happen to apply to this partial text …

It might look appealing to hope that the computer can be smart; but in fact, the current behavior is better in very many ways. When computers try to be too smart, they outsmart users in bad ways.

Wanderer · February 25, 2022, 8:50pm

Expecting perfect grammar for all languages LibreOffice supports, so Writer can do this by rules… I don’t think this will come in my lifetime. There is not even spell-checking for all languages.

If you really think this is possible, I suggest searching elsewhere. If you find an open-source routine, you may file a request for enhancement with the LO developers through Bugzilla.

mikekaganski · February 23, 2022, 10:16am

No. As you have already learned, a space is not always a space.
Taking into account the question preamble, it seems that your question is about data pasted from other sources; let me guess it’s about copying from web pages. And use of &nbsp; or &#160; (the codes for non-breaking spaces) in HTML is widespread. So some spaces on web pages are non-breaking, which is shown with the background, telling you that this character may behave differently than you expect (besides being non-breaking, it also e.g. doesn’t change its width when you justify your text).

You may simply ignore it (it’s just a visual clue that helps editing, but not present when you print or export to PDF); or you may disable field shading in View menu, or you may replace it using Find & Replace - just copy the character from the document, and paste into Find field in the dialog, and press Space in replacement field.
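
If you prefer to clean the text before pasting, the same replacement can be sketched outside Writer. The Python snippet below is only an illustration with an invented sample string: it uses the standard `html.unescape` to show that the two usual HTML spellings, `&nbsp;` and `&#160;`, both decode to U+00A0, and then does the equivalent of the Find & Replace step.

```python
import html

NBSP = "\u00A0"  # NO-BREAK SPACE

# Both HTML spellings decode to the same character, U+00A0.
pasted = html.unescape("prix&nbsp;: 10&#160;€")
print(pasted.count(NBSP))

# Equivalent of the Find & Replace step: swap every NBSP for a plain space.
cleaned = pasted.replace(NBSP, " ")
print(cleaned)
```

Note that this removes the protection against line breaks at those positions, exactly as the manual replacement in Writer does.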

rhimbo · February 24, 2022, 4:23am

Right, a space is not a space.

Actually I started a new mail message (Apple Mail application) just to use it as a scratch pad before copying and pasting into my LO file. I was playing around as I was trying to write in grammatically correct French and just started in the mail window before I opened a new LO Writer file.

rhimbo · February 25, 2022, 5:33pm

In short, I am not criticizing (and don’t believe I have criticized) any work that anyone has done to build LO. I just gave a suggestion for a design improvement.