What string > "Z"? (Or, string comparison defined?)

JimDeLaHunt · August 6, 2022, 11:15pm

I am making a table to use with the Lookup() function in Calc. The values I use in my search criterion are made up of letters “A-Z”, “a-z”, “ä”, “é”, etc. I want to put a value in the final row of my search vector which is guaranteed to be greater than any of my search criterion values.

I tried using “}” and “~~~~” as the final value. To my surprise, they compare as less than letters. ="{"<"A" is TRUE. ="~~~~"<"A" is also TRUE.

Using some arcane knowledge of the Unicode text encoding standard, I was able to construct a string which reliably compares greater than letters: =UNICHAR(HEX2DEC("10ffff")). This string >"Z" is TRUE. But it seems there ought to be a simpler string than this.

What is a simple string which Calc will reliably consider to be greater than any string made up of letters?

And, what are the rules which Calc uses for string comparison, e.g. in the LOOKUP() function or in the < and > operators in expressions?

Zizi64 · August 7, 2022, 4:24am

Please upload a real, ODF type sample file here.

Zizi64 · August 7, 2022, 5:09am

Check the old ASCII table and check the = UNICODE() values of the characters what you want to use.

ASCII-Unicode.ods (12.7 KB)

erAck · August 7, 2022, 6:21pm

Unicode collation is not by code values, it even depends on the locale, i.e. looking at an ASCII table is not helpful at all, as already the comparison of { (123) before Z (90) indicates (btw your “ASCII” are not ASCII as ASCII is defined only <=127, everything else is either 8-bit codepages or some other encoding). Sorting your example data by the Character: column still leaves the lower case z at the end of the list (might be different in your (Hungarian?) locale) and the resulting order for me in an en-CA locale is

Character:	Unicode:
0	48
9	57
	32
-	45
,	44
;	59
:	58
?	63
.	46
@	64
*	42
/	47
+	43
<	60
=	61
>	62
A	65
a	97
á	225
ä	228
é	233
í	237
ö	246
ő	337
ü	252
ű	369
Z	90
z	122

mikekaganski · August 7, 2022, 5:10am

Ordering is defined by collation; LibreOffice uses Unicode Collation Algorithm; this means, that it depends on locale. For instance, this formula

="LWCUS">"LLONG"

would be FALSE for Welsh locale.
Comparison depends also on Options|Calc|Calculate|Case sensitive. This setting is enabled by default, but in Excel comparisons are case-insensitive, so it would be disabled for Excel file formats.

JimDeLaHunt · August 7, 2022, 6:55pm

Thank you. This takes me a step forward, but does not give me an answer.

It turns out that I am a little bit familiar with the Unicode Collation Algorithm. It is more than a framework that a strict ordering. You combine it with some parameters, which are probably drawn up differently for each locale, and the result is the strict ordering.

So: where can I find out the parameters which LibreOffice feeds to the Unicode Collation Algorithm for various locales? I will look up my locale, which is English as used in Canada.

And, for English as used in Canada, what is a simple string which Calc will reliably consider to be greater than any string made up of letters?

mikekaganski · August 7, 2022, 8:11pm

It is exactly the string corresponding to the largest natural number.

But honestly, I smell an XY problem.

erAck · August 7, 2022, 8:44pm

LibreOffice generally uses ICU collation with default settings, which is UCA with tailored collation for some locales as defined by the Common Locale Data Repository (CLDR), see also their Locale Explorer. Additionally LibreOffice has a few languages’ collations tailored to what users expect: i18npool/source/collator/data/ (path) in projects: core - OpenGrok search results .

You’ll find the full en-CA collation rules and a demo in the en-CA locale explorer but I doubt it will be helpful for your case.

For your “largest character” problem probably 0x10FFFF (decimal 1114111) is the best bet as that also is of Supplementary Private Use Area-B so won’t get a character assigned later that would sort between some others. However, if you only need something that compares greater than Latin script letters then you could use another script’s character that is greater like U+0370 Ͱ GREEK CAPITAL LETTER HETA. (unless you work in a Greek locale maybe :p) Note this might still interfere with characters of other Unicode Latin blocks than Basic Latin, Latin-1 Supplement, Latin Extended-A and Latin Extended-B (e.g. Latin Extended Additional and Latin Extended-C to Latin Extended-G), I didn’t try.

mikekaganski · August 8, 2022, 4:20am

So will the string consisting of one 0x10FFFF be compared greater than two 0x10FFFFs?

JimDeLaHunt · August 8, 2022, 7:24am

I’m sorry, I don’t follow you. What are you referring to by “it”? What does the “largest natural number” have to do with comparing a string of letters to some reference string? Could you please restate your comment, with reference to: “What is a simple string which Calc will reliably consider to be greater than any string made up of letters?”

Sorry about the smell. I tried to be clear about my problem to avoid an XY problem. Here is what I am trying to do:

Thank you for the specifics. Apropos of documentation, is this aspect of LibreOffice collation documented in the LibreOffice documentation — of the comparison operators, of sort, of lookup, of other such features? If not, I’m glad I asked the question. It is documented now, in this thread.

I presume by that you mean the expression UNICHAR(HEX2DEC("10ffff")), right? I want a string, not a number. And I’m not looking for a “largest” character, just a string which will reliably sort after any string of letters. I know that my search criterion consists only of letters, so it’s OK if my desired string sorts before emoji and Chinese characters.

mikekaganski · August 8, 2022, 7:59am

Heh, sorry for a bad joke that was intended to describe that having a string that is reliably compared “greater than” any other string is the same task as finding the largest number (i.e., impossible). That was based on wide range of meaning of “letter” in a spoken language; and was emphasized in my “So will the string consisting of one 0x10FFFF be compared greater than two 0x10FFFFs?” after @erAck’s comment.

However, seeing that you specifically refer to the letter’s precise definition of a symbol representing a phoneme, and the explicit definition of 0x10FFFF as noncharacter, it is highly likely that a string consisting of that codepoint will be always compared greater than any string of letters. So now I consider the answer by @erAck as precisely correct. (Note that since e.g. 0x10FFFD is part of PUA, you can’t be sure that this private-use codepoint does not represent a letter in a specific case.)

Yes, I read that - but that doesn’t explain why you need a row in your data used in lookup, that will never match in the lookups. The LOOKUP spreadsheet function requires its lookup array be sorted ascending, and it will return the data that matches the lookup value not greater than the criterion. That was why I was mentioning the XY problem - it really looks like you try to do something that might make no sense at all.

erAck · August 8, 2022, 8:57am

Erm… of course not.

Yes, and you can simplify that to UNICHAR(1114111), or copy-paste-special the result as character if you don’t want a function, or just copy-paste this 􏿿 .

Try the Greek script U+0370 Ͱ I mentioned if whyever you don’t want to use the U+10FFFF character. And as Mike already mentioned, you may be trying to find a solution to a problem that does not exist.

JimDeLaHunt · August 8, 2022, 7:45pm

I am using a technique of software development craft, to put a sentinel item at the end of my lookup list which should never be found. If my formulas do find it, then it is a clue that I have a bug in my formulas. I have already found one bug, from a formula returning the final sentinel row instead of the usable row it should have found.

Once I figure out how to write formulas without mistakes in them, then I probably will no longer need the final sentinel row. But as long as I use the final sentinel, I need a value for it, which will reliably appear at the end of the search vector.

mikekaganski · August 8, 2022, 8:20pm

A valid idea. But then, you would need not just “a string that is greater than any string of letters”, but “a string that is greater than any string of letters, but less than other possible strings” - otherwise, you likely would miss the majority of possible bugs…

JimDeLaHunt · August 8, 2022, 8:40pm

Option 1: =UNICHAR(HEX2DEC("10ffff")), or =UNICHAR(1114111). This is a string consisting of the largest character value which LibreOffice allows. The formula is ugly, but it is likely to work in most language contests. The HEX2DEC(“10ffff”) alternative is just another way of writing decimal number 1114111. It alludes to the underlying Unicode character encoding, which may make the intention a little more clear to some readers, and more mystifying to others.

Option 2: Ω, which is a Greek script character, U+03A9 GREEK CAPITAL LETTER OMEGA. English-language collations are likely to put Greek after all Latin letters. Omega is reasonably recognisable to English-readers, and clearly not Latin script. Some English-language keyboards can type Omega directly (e.g. on macOS 12 Monterey). In any case, Omega is a simple character to copy, paste, and use in cells. (Thanks to @erAck for suggesting use of Greek.)

LibreOffice Calc’s user-oriented documentation has surprisingly little to say about this. The definition of the < and > operators appeal to the reader’s intuitive understanding of what “less than” and “greater than” mean. The developer documentation has more.

String comparison is related to collation. “In the context of LibreOffice, [collation is] ordering of textual information according to ordering rules local to a cultural community. The rules for alphabetical ordering in Latin writing differ from country to country, and there are completely different ordering rules in other cultural communities. LibreOffice supports localized collation through its I18N API.” That ordering can control both the comparative operators like “less than”, and large-scale operations like Calc’s sort. The options for Calc’s sort include language, “for the sorting rules”, and numerous options.

Thus, LibreOffice uses carefully developed rules for string comparison (and collation generally). But, they are complex and not easily summarised. They certainly depend on what language LibreOffice is using in the context of your formula.

Here is some more information about collation in the LibreOffice technology internals.

Character:	Unicode:
0	48
9	57
	32
-	45
,	44
;	59
:	58
?	63
.	46
@	64
*	42
/	47
+	43
<	60
=	61
>	62
A	65
a	97
á	225
ä	228
é	233
í	237
ö	246
ő	337
ü	252
ű	369
Z	90
z	122

Character:	Unicode:
0	48
9	57
	32
-	45
,	44
;	59
:	58
?	63
.	46
@	64
*	42
/	47
+	43
<	60
=	61
>	62
A	65
a	97
á	225
ä	228
é	233
í	237
ö	246
ő	337
ü	252
ű	369
Z	90
z	122

Character:	Unicode:
0	48
9	57
	32
-	45
,	44
;	59
:	58
?	63
.	46
@	64
*	42
/	47
+	43
<	60
=	61
>	62
A	65
a	97
á	225
ä	228
é	233
í	237
ö	246
ő	337
ü	252
ű	369
Z	90
z	122