Bidirectional text and closing bracket bug

When writing a sentence that contains both LTR and RTL languages, there is a problem with the closing bracket. It occurs if the closing bracket belongs to text in the opposite direction to the paragraph direction and the text immediately after the bracket is in the paragraph direction.

EDITED:
Here is an image produced on Windows 7 with LO 4.0.3.3 using “Lucida Sans Unicode” TTF font

With the “Linux Biolinum G” font things go wild.
(Here I specifically used the Hebrew characters “דהו” and the English “def”, which are supposed to be inside the parentheses.)

And here is how it should look (using MS Word):

BiDi text (English/Hebrew for testing):

LTR paragraph

abc אבג (דהו)

abc אבג (דהו) abc

abc (אבג)

abc אבג (דהו) אבג

RTL paragraph

אבג abc (def)

אבג abc (def) אבג

אבג (abc)

אבג abc (def) abc

Can you please indicate your platform, LO version, and particular font you are experiencing this issue with? Thanks.

I’m using Windows 7 and LO 4.0.3.3. The issue occurs with all Hebrew-supporting TTF fonts, such as Aharoni, Arial, David, FrankRuehl, Gisha, Levenim MT, Lucida Sans Unicode, Miriam, Narkisim, Rod, Tahoma, and Times New Roman, and there is a particular problem with the Linux Biolinum G and Linux Libertine G fonts, as you can see in the screenshots that I added to the question.

I’ll also try it on Linux Mint when I’m back home.

Thank you for expanding on your question and for providing the graphics. Much clearer. Based on the fonts you are using I imagine the experience under Linux Mint will be similar, but I will be interested in your findings either way.

This is a problem with a difficult history. It appears to depend on several factors, including platform, locale, the font (and font technology), and the entry sequence of Unicode directional formatting codes in particular. I have almost completely rewritten my original pair of answers both for better accuracy and to provide a better understanding of how Complex Text Layout (CTL) is “handled” within LO/ODF and what your particular issue may be.

While there are associated bugs, I am no longer certain that this is a bug. It would seem to me more likely that the particular issue raised by this question relates to how characters are input and the correct sequence for this. This is however a technical matter with scope for great variance and probable improvement, at least in terms of User Experience (UX). I refer here in particular to the comments in fdo#61795, which indicate that a reliance on Unicode directional formatting codes, while technically correct, is not necessarily the most user-friendly approach.

Bugs

This is a summary, and probably not a comprehensive one, of related bugs at the present time. They appear to be inter-related as well as varied in their relation to the font and Unicode directional formatting code aspects. To what degree each is a genuine bug is IMO a matter for the developers.

  • fdo#33302, RTL text: parentheses and brackets “(…) […]” inverted to “)…( ]…[” with some fonts. MacOSX only. Appears to be the original bug and contains lots of detail. Refer comments #15, #18, #19, #22-23, and #35. Verified as fixed (as a result of patches to other bugs e.g., fdo#59892) for v4.1beta1.
  • fdo#56408, Brackets are not handled correctly with mixed English/Latin and Hebrew/Arabic texts. Still open with no clear resolution.
  • fdo#60533, Brackets (…),{…},[…] inverted )…(,}…{,]…[ when switch to RTL text direction with all fonts. Still open and present status is unclear. A patch was submitted for v4.0.3 but it caused other bugs so was reverted.
  • fdo#60534, Brackets (…),{…},[…] inverted to )…(,}…{,]…[ when switch to RTL text direction with Graphite fonts only. Companion bug to fdo#60533 but unlike fdo#33302 this one applies to all platforms. Still open and present status indicates this appears to be a difficult bug that apparently requires a rework of the LO Graphite integration code.[1]
  • fdo#61795, Weak Characters (like brackets) are mispositioned with mixed RTL and LTR. Still open and bug is marked as relating to fdo#56408 but it is unclear to what degree it is a duplicate of the other bugs listed here. It is worth noting that brackets are not classified as “weak” by Unicode, but rather as “neutral” so the title of this bug at least is wrong. Comment #1 indicates that there may be a shortcoming in the ODF specification for handling CTL at the span element level i.e., “Put simply, HTML has <div dir="rtl"> and <span dir="rtl">, and OpenDocument only has something like <div dir="rtl">, but not <span dir="rtl">. Using [Unicode] directionality marks like RLM, RLE and PDF is not a robust way to resolve this, although if they are used internally and the user doesn’t have to use them directly, it’s probably OK.”[2] This is an important comment as it essentially states that the way in which LO handles embedded bi-directional strings is to rely entirely on Unicode directional formatting codes. Without these codes results are likely to be unpredictable as indicated in the question.
  • fdo#65508, RTL: RTL Bracket. Marked as a duplicate of fdo#33302.

[1] Is related to fdo#59892 which deals with mirroring of text in Graphite fonts. How Graphite fonts handle CTL differs from other font formats i.e., in a Graphite font directionality is handled in the font and not the rendering engine. This is a distinct advantage of Graphite (over OpenType) but it does demand that the font author appropriately encode the handling of Weak and Neutral Unicode characters. A fix to this related bug was made for v4.0.3 which evidently partially fixed fdo#33302.

[2] A comment by Caolán McNamara about this in the linked OASIS commentary thread is worth footnoting:

[…] all text is divided into four categories, Latin, Asian, Complex and Weak. Problems arise when encountering Weak characters, e.g. spaces, punctuation and mathematical symbols. They generally get assigned to one of the other three categories depending on context of surrounding text. […] one example scenario is a document comprising of a paragraph that consists of only weak characters, something like .:?". There isn’t a way to state that these weak characters should be biased towards one script category or another. If you open that in a version of LibreOffice/OpenOffice.org then the final fallback is to bias towards the locale the user is in […]

It is important to note that the term “weak” here is probably not the same as the Unicode bi-directional category. From my understanding the term here is a catch-all bucket that means the character is neither Latin, Asian, nor Complex (i.e., broadly equivalent to Strong in Unicode), thus all Weak and Neutral Unicode characters would appear to be included.

Unicode directional formatting codes

The comments in fdo#61795 and the ensuing discussion on the OASIS mailing list indicate that the preferred method for handling the embedding of bi-directional text (e.g., as a RTL string in a LTR paragraph or vice-versa) is the appropriate use of Unicode directional formatting codes. The detail for these codes can be found in Unicode Technical Report 9. A summary understanding of the categories and codes follows.

Under 3.2 in the report is a table classifying characters as having a directionality attribute that is either Strong (most text), Weak (numbers, punctuation), or Neutral (carriage return, newline, tab, spaces, brackets). Note that there are particular exceptions in certain scripts. The directional formatting codes are:

  • Left-to-right embedding (LRE / U+202a): Start a LTR string in a RTL paragraph.
  • Right-to-left embedding (RLE / U+202b): Start a RTL string in a LTR paragraph.
  • Left-to-right override (LRO / U+202d): Force following characters to be treated as LTR (i.e., Strong).
  • Right-to-left override (RLO / U+202e): Force following characters to be treated as RTL (i.e., Strong).
  • Pop directional formatting (PDF / U+202c): End the previous LRE, RLE, RLO, or LRO.
  • Left-to-right mark (LRM / U+200e): Invisible zero-width character that acts as LTR (i.e., Strong).
  • Right-to-left mark (RLM / U+200f): Invisible zero-width character that acts as RTL (i.e., Strong).
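As a cross-check of the list above, here is a minimal Python sketch that looks up each code in the Unicode character database through the standard `unicodedata` module. The dictionary name `BIDI_CODES` and the helper `describe_codes` are my own labels for illustration, not anything defined by LO or by the report.

```python
import unicodedata

# Abbreviation -> code point for the seven directional formatting codes.
# The dictionary name and layout are illustrative only.
BIDI_CODES = {
    "LRE": 0x202A,
    "RLE": 0x202B,
    "PDF": 0x202C,
    "LRO": 0x202D,
    "RLO": 0x202E,
    "LRM": 0x200E,
    "RLM": 0x200F,
}


def describe_codes():
    """Return (abbreviation, code point, Unicode name, bidi class) rows."""
    rows = []
    for abbr, cp in BIDI_CODES.items():
        ch = chr(cp)
        rows.append(
            (abbr, f"U+{cp:04x}", unicodedata.name(ch), unicodedata.bidirectional(ch))
        )
    return rows


for row in describe_codes():
    print(*row, sep="  ")
```

Note that the two marks report the ordinary Strong classes `L` and `R`, whereas the embedding, override, and pop codes each have a class of their own.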

My original example was not as accurate as it could have been, due to the use of LRM/RLM in place of the other five codes, so I have re-done the example for greater clarity.

Example

Let’s take a basic Latin character string “abc” and a basic Hebrew character string “אבג”. These consist of Latin small letter a (U+0061), Latin small letter b (U+0062), Latin small letter c (U+0063), and in the second instance Hebrew aleph (U+05d0), Hebrew bet (U+05d1), and Hebrew gimel (U+05d2), although in RTL order as displayed. The Hebrew string can be set in a Latin paragraph e.g., “abc אבג abc”. Alternatively the Latin string can be set in a Hebrew paragraph e.g., “אבג abc אבג”.

The question specifically relates to setting characters that Unicode classifies as either bi-directionally Weak or Neutral either adjacent to the alternate-direction string in the paragraph or as a Weak/Neutral-multiple within the string. In particular the left parenthesis (U+0028) and right parenthesis (U+0029) characters are cited, sometimes in combination with a normal space (U+0020). In situations where a bracket is wholly set between Strong characters there does not appear to be an issue e.g., “abc א(ב)ג abc” or “אבג a(b)c אבג”. In these cases the brackets assume the directionality of the surrounding Strong characters.
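To see how each character in such a string is classified, here is a small Python sketch using the standard `unicodedata` module. The grouping of bidi classes into Strong, Weak, and Neutral follows my reading of the table in TR9; the set names and the helper `bidi_category` are hypothetical names of my own.

```python
import unicodedata

# Bidi classes grouped into the three broad categories described in TR9.
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}


def bidi_category(ch):
    """Return (bidi class, broad category) for a single character."""
    cls = unicodedata.bidirectional(ch)
    if cls in STRONG:
        return cls, "Strong"
    if cls in WEAK:
        return cls, "Weak"
    if cls in NEUTRAL:
        return cls, "Neutral"
    return cls, "Other"  # explicit formatting codes and anything else


# "abc", space, aleph, "(", bet, ")", gimel, space, "abc" in logical order.
sample = "abc \u05d0(\u05d1)\u05d2 abc"
for ch in sample:
    cls, category = bidi_category(ch)
    print(f"U+{ord(ch):04x}  {cls:<3} {category}")
```

The letters come out as Strong (`L` or `R`), while the space and both parentheses come out as Neutral, which is why their direction has to be resolved from the surrounding text.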

When the brackets are required to surround the entire alternate-direction string there may be a problem because a normal space (U+0020), like a bracket, has Neutral directionality. Usually though this can be rectified through the use of an appropriate pair of Unicode directional formatting codes. I will use the Unicode points for reference to be clear. If I type only these characters (no Unicode directional formatting codes) in the order shown:

U+0061, U+0062, U+0063, U+0020, U+0028, U+05d0, U+05d1, U+05d2, U+0029, U+0020, U+0061, U+0062, U+0063
U+05d0, U+05d1, U+05d2, U+0020, U+0028, U+0061, U+0062, U+0063, U+0029, U+0020, U+05d0, U+05d1, U+05d2

… there is no issue with the brackets being rendered correctly e.g., “abc (אבג) abc” and “אבג (abc) אבג”. Similarly, if I type the same characters, but surround the entire bracketed string (as I enter it), with a RLE+PDF pair (Latin string) or LRE+PDF pair (Hebrew string), as shown:

U+0061, U+0062, U+0063, U+0020, U+202b, U+0028, U+05d0, U+05d1, U+05d2, U+0029, U+202c, U+0020, U+0061, U+0062, U+0063
U+05d0, U+05d1, U+05d2, U+0020, U+202a, U+0028, U+0061, U+0062, U+0063, U+0029, U+202c, U+0020, U+05d0, U+05d1, U+05d2

… again, there is no issue with the brackets being rendered correctly e.g., “abc (אבג) abc” and “אבג (abc) אבג”. I can even type the same characters, but surround only the string within the brackets (as I enter it), with a RLE+PDF pair (Latin string) or LRE+PDF pair (Hebrew string), as shown:

U+0061, U+0062, U+0063, U+0020, U+0028, U+202b, U+05d0, U+05d1, U+05d2, U+202c, U+0029, U+0020, U+0061, U+0062, U+0063
U+05d0, U+05d1, U+05d2, U+0020, U+0028, U+202a, U+0061, U+0062, U+0063, U+202c, U+0029, U+0020, U+05d0, U+05d1, U+05d2

… and once again, there is no issue with the brackets being rendered correctly e.g., “abc (אבג) abc” and “אבג (abc) אבג”.
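For anyone who wants to reproduce these sequences without typing the codes by hand, here is a minimal Python sketch that builds the Latin-paragraph variants and prints their code points. The helpers `embed` and `codepoints` are hypothetical names of my own; the sketch only constructs the strings and says nothing about how a given application will render them.

```python
RLE = "\u202b"  # right-to-left embedding
LRE = "\u202a"  # left-to-right embedding
PDF = "\u202c"  # pop directional formatting


def embed(text, direction):
    """Wrap text in an RLE+PDF ("rtl") or LRE+PDF ("ltr") pair."""
    opener = {"rtl": RLE, "ltr": LRE}[direction]
    return opener + text + PDF


def codepoints(text):
    """List the code points of text in logical (typing) order."""
    return ", ".join(f"U+{ord(ch):04x}" for ch in text)


hebrew = "\u05d0\u05d1\u05d2"  # aleph, bet, gimel

# No formatting codes at all.
plain = "abc (" + hebrew + ") abc"
# The pair surrounds the whole bracketed string.
around = "abc " + embed("(" + hebrew + ")", "rtl") + " abc"
# The pair surrounds only the string inside the brackets.
inside = "abc (" + embed(hebrew, "rtl") + ") abc"

for label, text in (("plain", plain), ("around", around), ("inside", inside)):
    print(f"{label}: {codepoints(text)}")
```

The Hebrew-paragraph variants follow the same pattern with `embed(..., "ltr")` around the Latin string.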

Conclusion

I have tested all these chains of characters under GNU/Linux TDF/LO v4.0.3.3 using several different fonts e.g., Linux Libertine G, Arial, Times New Roman, FreeSerif, Taamey Ashkenaz, and Taamey David CLM. So long as a RLE+PDF or LRE+PDF pair surrounds the required alternate-direction string there should be no issue under the current version of LO (at least under GNU/Linux). For more complex alternate-direction strings involving Weak or Neutral Unicode classified characters it may be necessary to employ RLO/LRO/RLM/LRM codes to help “force” or “glue” the directionality of the character in question, although I would avoid this unless absolutely necessary. Going back and editing content that employs Unicode directional formatting codes is challenging to say the least.
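Because the formatting codes are invisible, one way to make later editing less painful is to inspect the text outside the document first. The following Python sketch is only an illustration under my own assumptions (the bracketed labels and the function names `reveal`, `strip_codes`, and `unclosed_embeddings` are hypothetical); it is not a feature of LO.

```python
# Invisible directional formatting characters and visible labels for them.
LABELS = {
    "\u202a": "[LRE]",
    "\u202b": "[RLE]",
    "\u202c": "[PDF]",
    "\u202d": "[LRO]",
    "\u202e": "[RLO]",
    "\u200e": "[LRM]",
    "\u200f": "[RLM]",
}
OPENERS = "\u202a\u202b\u202d\u202e"  # codes that a later PDF must close
PDF = "\u202c"


def reveal(text):
    """Replace each invisible formatting code with a visible label."""
    return "".join(LABELS.get(ch, ch) for ch in text)


def strip_codes(text):
    """Remove all directional formatting codes, keeping the other characters."""
    return "".join(ch for ch in text if ch not in LABELS)


def unclosed_embeddings(text):
    """Count embeddings/overrides that are never closed by a PDF."""
    depth = 0
    for ch in text:
        if ch in OPENERS:
            depth += 1
        elif ch == PDF and depth > 0:
            depth -= 1
    return depth


sample = "abc \u202b(\u05d0\u05d1\u05d2)\u202c abc"
print(reveal(sample))
print(unclosed_embeddings(sample))
```

A non-zero result from `unclosed_embeddings` suggests that a PDF was lost during editing.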

Thanks for such a comprehensive answer. I guess you are right: my case is the last one.

That’s an awesome answer, @oweng. It won’t add much but, FWIW, here’s my experience with Linux Biolinum G, and also with Taamey David which is available from the Culmus Project. This is OS Linux Mint 13/Xfce and LibO 4.0.3.3. Btw, in Windows, just put a number character at the bracket boundary with LTR, and brackets go whacky, too. (Or did, in Word 2010 on XP.)

TBH my answer is not that “awesome” as CTL is “complex” for a reason (and I wish I understood it better). Numbers and most punctuation marks are classified as Weak, while space and brackets are Neutral (TR9). The RLM/LRM I indicated are for use with individual characters while RLE (U+202b) and LRE (U+202a) are used to surround strings. I have tested this and the results are far worse. I will re-edit my answer nonetheless.

Not that “awesome”? Maybe, but it shows a level of engagement with OP’s issue that is exemplary, and you did a real service to a wider readership in the thoroughness of your reply. That’s how I see it, anyway. :) Btw, for Windows, a Unicode editor called Babelpad shows character-at-cursor in the status bar (see mid-bottom of screenshot at that link). Very helpful feature for sorting bidi issues. Would be useful in Writer: or does it already?

@dajare, thanks for your encouragement. I loved Babelpad when I ran Windows. These days I mainly start Win7 for testing only. I have re-written my answer and the conclusion is now different. If you and @jenka1980 get the time, please verify the use of the RLE+PDF / LRE+PDF code pairs as I now indicate. I think this will solve the issue.
