Why are non-breaking spaces rendered with a gray background?

LibreOffice 7.1.6.2 on macOS Monterey 12.2.1

This topic has appeared many times, but the threads are all old and closed, so I’m reposting in the hope of getting some answers from the actual development team…

When pasting text into a Writer document, there are quite a few gray spaces where the original text just had white space. In my experience, this happens even when there is clearly just a space, not a tab, not a break, not a font change, etc. Why would Writer need to render these with a gray background? A space is a space.

All the other threads on this topic seem to surmise that Writer feels the need to render the space with a gray background to inform the user that it is non-breaking white space.

I did some experimenting. There is one instance in a document I’m currently editing where I pasted a French translation of some original English text. One phrase, ‘win-win’, translated as « gagnant-gagnant », had a gray space rendered between the angle quotation marks and the letter ‘g’.

This could be a case where the non-breaking space was needed for French because the word ‘gagnant’ is longer than the English ‘win’ and the text was wrapped in the French language version of the doc whereas the English version fit ‘win-win’ all on the current line.

That being said, there are many instances in the same document where a gray background was rendered for a space in the middle of a line. In most cases the gray space preceded a punctuation mark such as a colon ‘:’ or semicolon ‘;’.

That is, the English text with no space between the end of a word and a colon was converted to the same word followed by a space before the colon. Even if this is correct grammatically, why does there need to be a gray rendering where the space character appears?

Could someone from the development team comment please?

I am not from the development team; they don’t live here, just us users.

Click Tools > AutoCorrect > AutoCorrect Options > Localised Options and you will see that there are settings to Add non-breaking space before specific punctuation marks in French text.

Why? I assume because that is what the French language authorities in those countries have determined should be done.

Press Alt+X immediately after the grey space to see what character it is; it might be:

  • U+2060 WORD JOINER (abbreviated WJ), encoded in Unicode since version 3.2. The word joiner does not produce any space and prohibits a line break at its position.

  • No-break thin space, known in Unicode as “NARROW NO-BREAK SPACE” (U+202F). This is required for French punctuation before ?, ! or ;

The field backgrounds don’t print or export to PDF; they are there so you can identify the characters. If you don’t want to see them, click View > Field Shadings (Ctrl+F8). Cheers, Al

Thank you. Yours is a good explanation. And I am now introduced to the auto correct options to which you directed me…

Alt+X does not seem to work on macOS (at least not for me). Is there another way to view the Unicode character code (or a human-comprehensible mnemonic) for a character? I can’t find anything relevant under the Format menu, the View menu or the Tools menu when placing my cursor before or after the character in question.

A quick run around the internet suggests that Mac doesn’t do the Alt+X trick; I’m pretty sure it worked in Linux Mint. Try an online viewer such as “View non-printable unicode characters”. I don’t know anything about the site except that it came up in the search.
It might show up as U+00A0, because in my test U+202F doesn’t get a grey background.
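If you have Python handy, the standard library’s `unicodedata` module gives an offline alternative to Alt+X: paste the mystery character into a tiny script and it reports the codepoint and official Unicode name (a generic sketch, nothing LibreOffice-specific):

```python
import unicodedata

def describe(text):
    """Return 'U+XXXX NAME' for each character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

# The three suspects discussed in this thread:
for line in describe("\u00a0\u202f\u2060"):
    print(line)
# → U+00A0 NO-BREAK SPACE
# → U+202F NARROW NO-BREAK SPACE
# → U+2060 WORD JOINER
```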

Yes, thanks. I also did a quick search but found nothing that worked.

Alt (Option) + X works in other apps on macOS. I guess the LibreOffice folks did not map it. But I think there is a way to map key combinations to functions. I’ll play with this when I have a moment. If I figure something out I’ll post on the forum…

Thanks again for your help…

I don’t think it is LO specific, Word for Mac can’t do it either. It is probably for Mac apps only.

According to one article I skimmed, the system character map works only for native Mac apps, not for LO or Word for Mac, so it seems to be an Apple decision.

Shortcut keys: in Tools > Customize > Keyboard, search for the function Toggle Unicode Notation (assign it in the LibreOffice scope, not Writer only).

Thanks, I’ll look into this. I thought there must be a way. That being said, perhaps Apple, deep in the bowels of its UI libraries, checks for authentic Apple-approved signed applications? I don’t know; I’m just guessing, as they could do that technically.

The gray background (field shading) is a non-printing clue that what you see is not the standard character and therefore has special properties.

French AutoCorrect options allow you to insert a non-breaking space before some punctuation, following traditional typographic usage. The French Code typographique dictates that some punctuation marks be preceded by une espace fine (a thin space).

  • Why a non-breaking space? To avoid a line wrap between the word and the subsequent punctuation, because punctuation must never appear alone at the start of a line (or at the end of a line, in the case of an opening quotation mark).

  • Unfortunately, due to the history of character set development, the space inserted to abide by the typographical rule is U+00A0 NO-BREAK SPACE, a carry-over from the limited 256-character ISO-8859-x sets, which offered only SPACE and NO-BREAK SPACE. And NO-BREAK SPACE is much too wide to qualify as a thin space, giving an ugly result where, frequently, the space before punctuation is wider than the inter-word spacing (the exact opposite of the desired effect).
    For a better, compliant result, you should manually insert U+202F NARROW NO-BREAK SPACE instead of the implemented NO-BREAK SPACE (after disabling the AutoCorrect option). This narrow space is available as Insert > Formatting Mark > Insert Narrow No-break Space, or Alt+Shift+Space under Linux (it may differ under macOS).
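For cleaning up a lot of text outside Writer, the manual substitution described above can be sketched in Python; this assumes you only want the wide NO-BREAK SPACE narrowed where it sits next to French “high” punctuation or guillemets (adjust the character class to your house style):

```python
import re

NBSP = "\u00a0"   # NO-BREAK SPACE (what AutoCorrect inserts)
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE (what the Code typographique wants)

def narrow_french_spaces(text):
    """Narrow NBSP where it precedes : ; ! ? » or follows «."""
    text = re.sub(f"{NBSP}(?=[:;!?»])", NNBSP, text)
    text = re.sub(f"(?<=«){NBSP}", NNBSP, text)
    return text

print(repr(narrow_french_spaces("gagnant-gagnant\u00a0: un succès")))
# → 'gagnant-gagnant\u202f: un succès'
```

NBSPs that sit elsewhere (e.g. between a number and its unit) are deliberately left alone.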

This is where developers could be of some help, by changing the AutoCorrect parameters. I’m afraid, though, that they will invoke compatibility with existing documents to wave away the request.

Thanks for the detail. LO should upgrade to UTF-8 at least or wide character (4-bytes per character) to avoid all these problems. Unicode fell victim to all the international politics from nations who wanted their own native code sets to be replicated in the Unicode standard. I had to deal with that garbage politics for years when I worked on the SunOS and Solaris utilities while at Sun Microsystems.

Another approach is to use parsing rules to determine grammatically correct line breaks. There is a pretty damn good Java library (developed by IBM, if I remember correctly) that enables you to subclass the basic framework classes to declaratively define break rules for any language.

Perhaps LO should go that route if there are recurring problems with Unicode and various code sets such as ISO-8859-x.

And I thought they were already using Unicode… Where in this thread did you get the impression that LO is not using Unicode?

The problem is, there is a lot of old material in storage that people want to access and copy from. And many of them will not be happy when their documents are converted by a “better” algorithm that removes or changes non-breaking spaces and narrow spaces. And since most text carries no meta-info like language, or even “don’t touch, this was aligned with 4 hours of manual editing for my printer”, there is no general rule for changing this.

It looks as if @rhimbo is under the impression that Unicode and UTF-8 are two very different things (rather than UTF-8 being one of the Unicode encodings):

So it looks as if @rhimbo thinks that LO uses Unicode, but instead should use UTF-8 “or wide character (4-bytes per character)” :slight_smile:

I don’t know where you got the idea that I think LO “uses Unicode.” ‘Unicode’ is a standard that happens to include many code sets.

It would be more accurate to say that LO should use a standard code set. In reality it would have to ‘use’ several as it would have to be able to read files whose contents were encoded using various code sets.

UTF-8 would be one option. I only said that because, even today where processors are fast and memory is cheap, someone always says “Unicode takes up too much space in memory.”

But I haven’t looked at the source code to know how LO encodes text in memory. I assumed it was not a code set large enough to accommodate all the needed code points, since someone mentioned the problems with ISO-8859-x…

LO Writer is indeed Unicode. Whether it uses UTF-8, UTF-16 or UTF-32 is irrelevant to the problem I mentioned about NO-BREAK SPACE. That problem is a result of computing history and of the reluctance to change things which are definitely wrong but no longer questioned, being so familiar that they are accepted as permanent truth, even though Unicode allows them to be managed correctly.

I wouldn’t say that. Unicode doesn’t concatenate code sets. The first block of 256 codepoints is made of ASCII and most of the common intersection of ISO-8859-x. For the rest, it is structured around “scripts”, i.e. writing systems, and the standardisation body forbids duplications (except those introduced at inception, related to pre-composed glyphs, again to retain compatibility with ISO-8859-x). The standard clearly states that a codepoint identifies an abstract character, not a precise shape. It follows that any codepoint may be displayed in a variety of shapes (leading to a wealth of font faces) while still being conformant with Unicode.

Heh. Unicode defines exactly one code set.

That’s not quite accurate.

Unicode defines one encoding per character. A code set is a collection of code points. The encoding differs with code set.

You’re right; it’s irrelevant. The mistake the LO designers made is similar to the gross disaster of HTML.

A character encoding should not be used to determine formatting. LO should fix this.

Unicode is a single set of codes (codepoints), each uniquely defining a character. It is a single “code set”. It does not define “one encoding per character” (who could ever imagine such a strange thing!) - encoding is e.g. UTF-8 (one encoding that can encode the whole set of Unicode codepoints), or UTF-7, or GB18030. Not even every Unicode encoding is defined by Unicode standard itself, or even by the organization.
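The point is easy to demonstrate concretely: one and the same codepoint, U+00A0 NO-BREAK SPACE, is represented by different byte sequences under different encodings (a Python sketch, using only the standard codecs):

```python
cp = "\u00a0"  # one Unicode codepoint: NO-BREAK SPACE

# The codepoint never changes; only the byte pattern on the wire does.
print(cp.encode("utf-8"))      # b'\xc2\xa0'           (2 bytes)
print(cp.encode("utf-16-be"))  # b'\x00\xa0'           (2 bytes)
print(cp.encode("utf-32-be"))  # b'\x00\x00\x00\xa0'   (4 bytes)
print(cp.encode("latin-1"))    # b'\xa0'               (the old ISO-8859-1 byte)
```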

You seem to be talking about things you don’t know, and by claiming strange things you make one mistake after another. It’s better to just stop.

Bit pattern is a matter of encoding, it’s the encoding that defines that.

A codepoint is the numeric value that uniquely defines a character, and it is independent of encoding that may be used to represent it on the wire, or it may even be written using a pen, without use of “bits”. Unicode defines a set of codes (codepoints), and various encodings define bit patterns.

Unicode has a single code set (set of codepoints).
Unicode is not ISO 10646 (although they are synchronized constantly); and ISO 10646 was not discussed up to this point (so introducing it here out of the blue is … funny).

Unicode does not “accommodate” different code sets (other than 7-bit ASCII, or rather 8-bit ISO-8859-1, since that is the only subset of codes (numeric values) shared between Unicode and another character set).

The intelligence is luckily available to everyone.

Actually, the character encoding does not decide formatting on its own. Based on the code, we select a graphical representation from a font, and the designer of the font decided on the actual space the character needs.

The use of the non-breaking space is widespread. It is not necessarily a character (it could be a code), but I don’t think this would solve your complaints, and you didn’t say how formatting should be decided your way…