Strings, lists and sets in Libre Calc

Joolz · June 19, 2025, 6:25am

I have a spreadsheet that creates a text string from cells in a row using CONCATENATE. Duplicate entries in the list are common and are used in the spreadsheet for data integrity checking. Blanks are also common. I am looking for a function that would convert the string :

;;;mammal;dog;Rover;;;;black;Rover;;;29/01/2020;Rover;;;

into

mammal;dog;Rover;black;29/01/2020;

or

;mammal;dog;Rover;black;29/01/2020;

(i.e., the leading “;” indicating an empty set member)

To date this was done by search and replace for empty sets and manual deletion for complex fields. As the material has got more extensive I am looking / hoping for a SET function that can be called in a cell, perhaps like

=SET(Range,"delimiter’,IncludeEmptySet)

The Range could be the original cells or the concatenated string. If it’s the string, in my mind the process might be to convert the string to a list, identify a set of list items (optionally with or without the empty set as a list member) and return the set as a similarly delimited string. For my application order is not important, however I can imagine applications where ordering (such as alphabetically, or by length of list item, or by position in the original string) is useful. In Python the set() function offers a (roughly) equivalent functionality.

I have not been able to find a Calc function offering the set() functionality. Can anyone offer a way forward please?

JohnSUN · June 19, 2025, 6:36am

Welcome!
TEXTJOIN() ?
Or =REGEX(A1;";+";";";"g") to result of your CONCATENATE() ?

Upd. Oh, also see UNIQUE()

Joolz · June 19, 2025, 12:01pm

Thank you JohnSUN! TEXTJOIN() will skip blanks but not duplicates (in my input string in the question “Rover” appears three times, but only once in the desired output). The REGEX() will drop the multiple “;;”, but also misses the duplicate "Rover"s. UNIQUE() has possibilities - I’ll have to play with it, and it will force a spreadsheet restructure - and almost a doubling of the spreadsheet area - but that’s a small price. The UNIQUE() function is not available in the Libre version in the Ubuntu 24.04 repo, so had to install the Libre ppa and update as well. Many thanks for giving a direction - I’ll tick it as a solution after I’ve played with it to confirm it solves the challenge.

JohnSUN · June 19, 2025, 12:30pm

No, you seem to have misunderstood my comment. I meant something like =TEXTJOIN(";";0;UNIQUE(2:2;1;0)) instead of your CONCATENATE() - it wouldn’t require much change of sheet

Joolz · June 20, 2025, 5:01am

Fabulous - THANK YOU. No internet examples show UNIQUE() applied other than in an array; this is a really neat solution and helps reduce a lot of the CONCATENATE() rubbish I had assembled, simplifying other parts of the spreadsheet as well.

mikekaganski · June 20, 2025, 5:52am

Independent of the great solution by @JohnSUN, just a remark.
Please never overestimate the cost of spreadsheet area: it is often cheap, compared to the cost of shorter “all-calculations-in-one-cell” solutions. When you have a repeated calculation over a dynamically created range, that dynamic creation may be the bottleneck, which can easily be eliminated by an auxiliary range (maybe on a dedicated sheet, maybe even hidden), which would only require to be calculated once, and reused in thousands of dependent calculations, instead of re-creating it again and again in each formula.

Which may not apply to your case: I have no idea, how many times do you use that formula

Joolz · June 21, 2025, 6:52am

Mike Kaganski - indeed! In fact this question and the demo I posted are an abstraction; the actual application is a “database” sourcing data from multiple locations (lists of parks and people and stuff) and integrating to a set of hierarchical keywords that consistently allocates a set of “keywords” for a trigger word in a photo caption. The spreadsheet is 124 columns by 12000 rows - so not huge. It started small and simple, I was trying to be consistent on tagging a few hundred photos. As it grew, checks for consistency and completeness were built in. I’m now at edition 602 and over 200 hours of editing time. On demand I pull a CSV of trigger words and consequent tags and run it against a CSV of photoname and comment using a Python script, returning photoname, missed comments and proposed tags. With thousands of photos now in a batch, this saves hours of hunt and peck for a comprehensive and consistent set of keywords, highlights misses and can even issuing warnings where disambiguation of a comment might be needed. Sure, I could do it all again in a database, but with a spreadsheet it was simple to envisage, was readily scalable and didn’t require extensive (re)programming when specifications emerged.

koyotak · June 19, 2025, 12:19pm

its easier to seperate manually than this complex uprooting, play with this as no easy result:
000_TEXTstring isolate_v0000_021557.ods (22.9 KB)
recognize the many auxiliary cells

Joolz · June 20, 2025, 5:04am

JohnSUN’s solution fits the spreadsheet situation better, but I like what you’ve managed with regex and this has inspired me to combine this approach with the TEXTJOIN(UNIQUE()) proposeed above in another part of the spreadsheet - Many thanks for putting so much into a solution

koyotak · June 20, 2025, 8:38am

I look forward to your solution and am excited about it because …
=TEXTJOIN(";";0;UNIQUE(2:2;0;0))
or
=TEXTJOIN(";";0;UNIQUE(2:2;1;0))
… does not lead to the desired result.

JohnSUN · June 20, 2025, 8:47am

Without a sample of the data, it’s hard to say what exactly went wrong. What’s in your second row? Mine was the original ;;;mammal;dog;Rover;;;;black;Rover;;;29/01/2020;Rover;;; split into separate cells at the semicolon with Text to Columns. What about yours?

koyotak · June 22, 2025, 9:48am

here is my sample of study
00_LO-CALC_TEXT string to fractionate_114600.ods (37.4 KB)

please check it

JohnSUN · June 25, 2025, 8:19am

Thanks @koyotak, interesting solution. It took me a while to follow the sequence of actions, but eventually I figured out how the result in cell B35 is obtained. A little confusing is that the result removes empty rows and rows of “spaces only” - I think @Joolz wanted to keep them.
I think it would be a bit easier to put in B35 a formula like

=TEXTJOIN(";";1;IFERROR(FILTERXML("<t><s>"&SUBSTITUTE(B8;";";"</s><s>")&"</s></t>";"//s[not(preceding::*=.)]["&SEQUENCE(1;LEN(B8)/2+1)&"]");""))

Yes, after @mikekaganski’s warning that individual cells should not be overloaded with calculations, but rather divided into several successive steps, such a formula seems a bit brazen.
But it seemed to me that @Joolz is not going to use these calculations for constant recalculation, these formulas will be used only once for the initial cleaning of raw data. So I think this solution has a right to exist. Especially since it really works.
UniqJoin1.ods (13.3 KB)

koyotak · June 25, 2025, 11:47am

… it’s too complex for me to understand or change any datum to another format like 2025-06-25 or 25.6.25 . i gave up. i prefer definitely auxiliary cells within short formulas with one or two functions. Therefore i refrain from unraveling them. Nevertheless your rat tail formula appeals to me, i can’t resist … 🫣:smiling_face_with_three_hearts:

JohnSUN · June 25, 2025, 12:12pm

I prefer short solutions - to press keys as little as possible. For example, such code has much fewer symbols than many formulas, but the result is the same

Function getUniq(sSource As String, Optional sDelimiter As String) As String
Dim aTemp As Variant
Dim iNext As Long, j As Long, k As Long
Dim bUnique As Boolean
	If IsMissing(sDelimiter) Then sDelimiter = ";"
	aTemp = Split(sSource, sDelimiter)
Rem Move unique values to the beginning of the array
	iNext = LBound(aTemp)
	For j = iNext + 1 To UBound(aTemp)
		bUnique = True
		For k = LBound(aTemp) To iNext
			If aTemp(k) = aTemp(j) Then
				bUnique = False
				Exit For
			EndIf
		Next k
		If bUnique Then
			iNext = iNext + 1
			aTemp(iNext) = aTemp(j)
		EndIf
	Next j
Rem Trimming unnecessary values
	ReDim Preserve aTemp(LBound(aTemp) To iNext)
	getUniq = Join(aTemp, sDelimiter)
End Function

If desired, you can add date recognition and conversion to another format or some other additional processing.

The scary FILTERXML() formula is actually quite simple when you break it down into its component parts:

This complex Calc formula used to extract unique values from cell B8, where these values are separated by semicolons, and then merge them back into a single string, also separated by semicolons.

Here’s how it works, step by step:

SUBSTITUTE(B8,";","</s><s>"): It’s assumed that cell B8 contains text where different values are separated by semicolons (e.g., “apple;banana;apple;orange”). This function replaces each semicolon with the string </s><s>. The result will be something like “apple</s><s>banana</s><s>apple</s><s>orange”.

"<t><s>"& ... &"</s></t>": The result from the previous step is wrapped with XML tags <t><s> at the beginning and </s></t> at the end. This creates a properly formed XML string, for example: <t><s>apple</s><s>banana</s><s>apple</s><s>orange</s></t>. This is done so that FILTERXML can process the data as an XML document, where each value is inside an <s> tag.

FILTERXML(..., "//s[not(preceding::*=.)]"): This is the most interesting part.

FILTERXML is an Calc function that allows you to extract data from an XML string using XPath expressions.

//s means select all <s> elements in the document.

[not(preceding::*=.)] is an XPath condition. It selects only those <s> elements that do not have preceding <s> elements with the same content. Essentially, this is a way to get unique values.

["&SEQUENCE(1,LEN(B8)/2+1)&"]": This part ensures that FILTERXML can process all potential unique values.

LEN(B8)/2+1 is an approximate calculation of the maximum number of values that could be in string B8. Dividing by 2 is because ; takes 1 character, and if we have N values, we have N-1 delimiters, and we want to cover all possible values. Adding +1 ensures we get enough indices for all possible unique elements.

SEQUENCE(1, ...) generates a sequence of numbers (e.g., 1, 2, 3, …) which are used as indices for FILTERXML to “iterate” through all potential unique elements. This is necessary because FILTERXML can only return one value at a time unless an explicit index is provided.

IFERROR(..., ""): The IFERROR function catches any errors that might occur if FILTERXML doesn’t find a corresponding element (e.g., if SEQUENCE generates more indices than unique values). In case of an error, it returns an empty string "".

TEXTJOIN(";",1,...): Finally, TEXTJOIN takes all the unique values extracted by FILTERXML (and any potential empty strings from IFERROR), combines them back into a single string, using ; as a delimiter. The 1 means that empty values should be ignored.

In summary: This formula is a clever way to get a list of unique, semicolon-separated values from a single cell, using XML transformations and XPath queries.

koyotak · June 26, 2025, 9:15am

That’s great!

One intended or unintended effect of your rat-tail formula is that it recognizes a DATE “June 25, 2025,” caches it as a NUMBER “45833,” and inserts its digits into the string as TEXT. Perhaps just a gimmick? The constant conflict between TEXT “abc” and DECIMAL “123.456” and DATE “2025-06-25” and SIZE VALUE/currency “15.50 €” is challenging and prompted me to package them in separate cells, which are much easier to handle. On the plus side, it differentiates uppercase from lowercase letters, which ‘UNiQUE()’ doesn’t.

Even the perennial problem of a seemingly empty, optically empty cell containing only some space or non-printable character like “\n” isn’t recognized and would have to be removed as a first step.
«Am I shitting myself because of this?».

Joolz · June 21, 2025, 6:47am

Thank you All. There’s some discussion on the use of the spreadsheet and the wider data model. I attach a model of the spreadsheet as per the question:

DemoStringSet.ods (13.7 KB)

The solution from JohnSUN fits this application perfectly.