symbola + zero-width-joiner + linewrap

When using the Symbola font and inserting the zero-width joiner (ZWJ) character (U+200D) between the characters of a word, Writer correctly does not split the word.

However, line-wrap is occasionally incorrect, in that it is “too aggressive”: Writer moves an intact Symbola word (with ZWJ) to the next line even though there is room for that word on the current line.

In the image below, the word underlined in orange should not have been moved to the next line, as there was room on the previous line (in green) for that word:

image

Is there a setting or configuration that controls the way Writer line-wraps text in a Unicode font with ZWJ in it?

Version: 25.8.4.2 (X86_64)
OS: Windows 11 X86_64
UI render: Skia/Raster

In cases like this, please upload your test file (especially to check for more “unusual” spaces).

Imho the first place to look is not font, but language, where hyphenation-rules exist. But for your symbols language should be “none”.

Formatting Aids

Attached is the sample file. Here are the steps:

  • I compose a paragraph in Writer by pasting in ascii text; that looks fine.
  • I change the font to Symbola; that looks fine
  • I select the text and process it with a Python macro that selects a random set of (other) Symbola symbols and replaces each original symbol with its replacement (this is where the word-wrap and line-wrap cause problems; using the ZWJ keeps each word together)
  • the page is double-spaced, printed, and given to others to ‘solve’ the puzzle
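
The substitution step above can be sketched roughly as follows. This is not the actual macro from the attached file: the function names, the `seed` parameter, and the symbol pool are hypothetical, and the sketch works on a plain Python string rather than on a Writer selection.

```python
import random
import re

ZWJ = "\u200d"  # ZERO WIDTH JOINER, the character used in this thread


def make_mapping(alphabet, symbols, seed=None):
    """Pair each character of `alphabet` with a distinct, randomly chosen symbol.

    `symbols` is a hypothetical pool of replacement characters.
    """
    rng = random.Random(seed)
    chosen = rng.sample(symbols, len(alphabet))
    return dict(zip(alphabet, chosen))


def encode_word(word, mapping, joiner=ZWJ):
    """Replace each character and place exactly one joiner between symbols."""
    return joiner.join(mapping.get(ch, ch) for ch in word)


def encode_text(text, mapping, joiner=ZWJ):
    """Encode every word; whitespace runs are kept exactly as they were."""
    parts = re.split(r"(\s+)", text)  # the capture group keeps the whitespace
    return "".join(p if p.isspace() else encode_word(p, mapping, joiner)
                   for p in parts)
```

Because `str.join` only places the joiner between neighbouring symbols, a word never starts or ends with a joiner and never contains two in a row.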

line_wrap_problem.odt (24.8 KB)

Are you sure your Python macro does what you expect? I suppose this macro inserts the ZWJ. When I analyse your sample text, I see that ZWJ always appear in pairs. It is possible this duplication disturbs the layout algorithm.
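
One way to check for such duplication is to scan the text for runs of two or more consecutive ZWJ. The following is a minimal sketch on a plain Python string; the helper names are made up for illustration and the code does not touch the Writer document itself.

```python
import re

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def find_joiner_runs(text, joiner=ZWJ):
    """Return (offset, count) for every run of two or more consecutive joiners."""
    pattern = re.compile("(?:" + re.escape(joiner) + "){2,}")
    return [(m.start(), len(m.group()) // len(joiner))
            for m in pattern.finditer(text)]


def collapse_joiner_runs(text, joiner=ZWJ):
    """Reduce every run of joiners to a single joiner."""
    return re.sub("(?:" + re.escape(joiner) + ")+", joiner, text)
```

An empty list from `find_joiner_runs` means no joiner is doubled in the text that was checked.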

Since your “symbols” are not in a script-related Unicode block, are ZWJ really necessary?

Thank you for the response & suggestions. I tried to make sure that the Python macro does what was intended, but at this point everything is suspect (including the script).

I noticed that sometimes a word was getting split, so I thought it would be a good idea to insert ZWJ. Doing that seemed to solve the word-split problem, which is good, but now I noticed the line-wrap problem.

Is there a better alternative? Would some other marker be better?

 '\u200d' : zero-width joiner
 '\u2060' : word joiner
 '\ufeff' : zero-width no-break space
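
For reference, Python's standard `unicodedata` module can confirm the official names and general categories of these three candidates. This small sketch only reports Unicode metadata; it says nothing about how Writer treats each character at line breaks.

```python
import unicodedata

CANDIDATES = ["\u200d", "\u2060", "\ufeff"]


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in CANDIDATES:
    print(describe(ch))
```

All three are format characters (category `Cf`), so none of them has a visible glyph of its own.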

Also, the macro translation encodes some ASCII characters into UTF-16 and others into UTF-32. Might there be an issue with mixing UTF-16 and UTF-32 characters in the same document?
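
A short sketch may help frame the question. In Python 3 a `str` is a sequence of code points, so an individual character is not "UTF-16" or "UTF-32" in itself; the difference only appears when text is serialized to bytes. The helper name below is hypothetical, and how Writer stores text internally is a separate matter that this sketch does not examine.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character occupies in each encoding."""
    return {enc: len(ch.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


print(encoded_sizes("A"))           # a Basic Latin letter
print(encoded_sizes("\u200d"))      # ZWJ, in the Basic Multilingual Plane
print(encoded_sizes("\U0001F0A1"))  # a Plane 1 symbol
```

A Plane 1 symbol needs two 16-bit units (a surrogate pair) in UTF-16, which may be why some characters look like "UTF-32" and others like "UTF-16" when inspected.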

This file should have exactly one ZWJ between each Symbola character in each word.

image

line_wrap_7.odt (27.7 KB)

Why not do the randomisation of the text first, then change the font to Symbola? Maybe there are some emoji-type compounds among the Symbola characters. If so, your macro doesn’t need to deal with them if the text is just plain text at that point.

Your glyphs are taken from Unicode Supplementary Multilingual Plane (Plane 1) and lie in a block of emojis. Unicode Technical Standard #51 (UTS #51) Unicode Emoji specifies rules relative to emojis. Some rules define how emoji can be combined to display a single glyph from several “base components”. The emojis are “glued” with ZWJ.

Some sequences are clearly declared as invalid. But interpretation of most of them is left to the implementation of the rendering engine.

From this perspective, I’d say that most of your text does not make sense and we’re in the vast terra incognita of implementation choice. I have no idea how the underlying font renderer behaves (Harfbuzz – in Linux at least, different on W$?).

Since ZWJ has been assigned a specific role in emoji sequences in UTS #51, I suggest you experiment with ZERO WIDTH NO-BREAK SPACE. Also, I recommend you don’t mix various UTF encodings in the same document. ODF uses UTF-8. You didn’t explain the step where your macro inserts code. If you work directly on the ODF, inserting UTF-16 or UTF-32 inside a UTF-8 flow will lead to misinterpretation (because UTF-16 and UTF-32 will be seen as 2 and 4 bytes). However, I don’t think this is the problem, because we would then see “strange” glyphs displayed.

Thanks again for the suggestions.

The problem is much more challenging than I expected.

The Python macro is my own getString() inside of the Capitalize.py example.

I will try making all of the text the same encoding (UTF-16 or UTF-32), but the File->Save->Character set only allows UTF-16 but not UTF-32.

It seems that the ‘zero width no-break space’ is deprecated, and is only used as a Byte Order Mark. I’ll experiment with ‘word joiner’.
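
For that experiment, swapping the joiner in already-encoded text is a one-line change. The sketch below is a hypothetical helper on a plain Python string; whether Writer then wraps the lines as hoped is exactly what remains to be tested.

```python
ZWJ = "\u200d"          # ZERO WIDTH JOINER
WORD_JOINER = "\u2060"  # WORD JOINER


def swap_joiner(text, old=ZWJ, new=WORD_JOINER):
    """Replace every occurrence of one joiner character with another."""
    return text.replace(old, new)
```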