Strings, lists and sets in Libre Calc

I have a spreadsheet that creates a text string from cells in a row using CONCATENATE. Duplicate entries in the list are common and are used in the spreadsheet for data integrity checking. Blanks are also common. I am looking for a function that would convert the string :

;;;mammal;dog;Rover;;;;black;Rover;;;29/01/2020;Rover;;;

into

mammal;dog;Rover;black;29/01/2020;

or

;mammal;dog;Rover;black;29/01/2020;

(i.e., the leading “;” indicating an empty set member)

To date this was done by search and replace for empty sets and manual deletion for complex fields. As the material has got more extensive I am looking / hoping for a SET function that can be called in a cell, perhaps like

=SET(Range,"delimiter’,IncludeEmptySet)

The Range could be the original cells or the concatenated string. If it’s the string, in my mind the process might be to convert the string to a list, identify a set of list items (optionally with or without the empty set as a list member) and return the set as a similarly delimited string. For my application order is not important, however I can imagine applications where ordering (such as alphabetically, or by length of list item, or by position in the original string) is useful. In Python the set() function offers a (roughly) equivalent functionality.

I have not been able to find a Calc function offering the set() functionality. Can anyone offer a way forward please?

Welcome!
TEXTJOIN() ?
Or =REGEX(A1;";+";";";"g") to result of your CONCATENATE() ?

Upd. Oh, also see UNIQUE()

3 Likes

Thank you JohnSUN! TEXTJOIN() will skip blanks but not duplicates (in my input string in the question “Rover” appears three times, but only once in the desired output). The REGEX() will drop the multiple “;;”, but also misses the duplicate "Rover"s. UNIQUE() has possibilities - I’ll have to play with it, and it will force a spreadsheet restructure - and almost a doubling of the spreadsheet area - but that’s a small price. The UNIQUE() function is not available in the Libre version in the Ubuntu 24.04 repo, so had to install the Libre ppa and update as well. Many thanks for giving a direction - I’ll tick it as a solution after I’ve played with it to confirm it solves the challenge.

No, you seem to have misunderstood my comment. I meant something like =TEXTJOIN(";";0;UNIQUE(2:2;1;0)) instead of your CONCATENATE() - it wouldn’t require much change of sheet

1 Like

Fabulous - THANK YOU. No internet examples show UNIQUE() applied other than in an array; this is a really neat solution and helps reduce a lot of the CONCATENATE() rubbish I had assembled, simplifying other parts of the spreadsheet as well.

Independent of the great solution by @JohnSUN, just a remark.
Please never overestimate the cost of spreadsheet area: it is often cheap, compared to the cost of shorter “all-calculations-in-one-cell” solutions. When you have a repeated calculation over a dynamically created range, that dynamic creation may be the bottleneck, which can easily be eliminated by an auxiliary range (maybe on a dedicated sheet, maybe even hidden), which would only require to be calculated once, and reused in thousands of dependent calculations, instead of re-creating it again and again in each formula.

Which may not apply to your case: I have no idea, how many times do you use that formula :slight_smile:

1 Like

Mike Kaganski - indeed! In fact this question and the demo I posted are an abstraction; the actual application is a “database” sourcing data from multiple locations (lists of parks and people and stuff) and integrating to a set of hierarchical keywords that consistently allocates a set of “keywords” for a trigger word in a photo caption. The spreadsheet is 124 columns by 12000 rows - so not huge. It started small and simple, I was trying to be consistent on tagging a few hundred photos. As it grew, checks for consistency and completeness were built in. I’m now at edition 602 and over 200 hours of editing time. On demand I pull a CSV of trigger words and consequent tags and run it against a CSV of photoname and comment using a Python script, returning photoname, missed comments and proposed tags. With thousands of photos now in a batch, this saves hours of hunt and peck for a comprehensive and consistent set of keywords, highlights misses and can even issuing warnings where disambiguation of a comment might be needed. Sure, I could do it all again in a database, but with a spreadsheet it was simple to envisage, was readily scalable and didn’t require extensive (re)programming when specifications emerged.

its easier to seperate manually than this complex uprooting, play with this as no easy result:
000_TEXTstring isolate_v0000_021557.ods (22.9 KB)
recognize the many auxiliary cells

JohnSUN’s solution fits the spreadsheet situation better, but I like what you’ve managed with regex and this has inspired me to combine this approach with the TEXTJOIN(UNIQUE()) proposeed above in another part of the spreadsheet - Many thanks for putting so much into a solution

1 Like

I look forward to your solution and am excited about it because …
=TEXTJOIN(";";0;UNIQUE(2:2;0;0))
or
=TEXTJOIN(";";0;UNIQUE(2:2;1;0))
… does not lead to the desired result.

Without a sample of the data, it’s hard to say what exactly went wrong. What’s in your second row? Mine was the original ;;;mammal;dog;Rover;;;;black;Rover;;;29/01/2020;Rover;;; split into separate cells at the semicolon with Text to Columns. What about yours?

1 Like

here is my sample of study
00_LO-CALC_TEXT string to fractionate_114600.ods (37.4 KB)

please check it

Thanks @koyotak, interesting solution. It took me a while to follow the sequence of actions, but eventually I figured out how the result in cell B35 is obtained. A little confusing is that the result removes empty rows and rows of “spaces only” - I think @Joolz wanted to keep them.
I think it would be a bit easier to put in B35 a formula like

=TEXTJOIN(";";1;IFERROR(FILTERXML("<t><s>"&SUBSTITUTE(B8;";";"</s><s>")&"</s></t>";"//s[not(preceding::*=.)]["&SEQUENCE(1;LEN(B8)/2+1)&"]");""))

Yes, after @mikekaganski’s warning that individual cells should not be overloaded with calculations, but rather divided into several successive steps, such a formula seems a bit brazen.
But it seemed to me that @Joolz is not going to use these calculations for constant recalculation, these formulas will be used only once for the initial cleaning of raw data. So I think this solution has a right to exist. Especially since it really works.
UniqJoin1.ods (13.3 KB)

1 Like

… it’s too complex for me to understand or change any datum to another format like 2025-06-25 or 25.6.25 :face_with_monocle::woozy_face:. i gave up. i prefer definitely auxiliary cells within short formulas with one or two functions. Therefore i refrain from unraveling them. Nevertheless your rat tail formula appeals to me, i can’t resist … 🫣:smiling_face_with_three_hearts:

I prefer short solutions - to press keys as little as possible. For example, such code has much fewer symbols than many formulas, but the result is the same

Function getUniq(sSource As String, Optional sDelimiter As String) As String
Dim aTemp As Variant
Dim iNext As Long, j As Long, k As Long
Dim bUnique As Boolean
	If IsMissing(sDelimiter) Then sDelimiter = ";"
	aTemp = Split(sSource, sDelimiter)
Rem Move unique values ​​to the beginning of the array
	iNext = LBound(aTemp)
	For j = iNext + 1 To UBound(aTemp)
		bUnique = True
		For k = LBound(aTemp) To iNext
			If aTemp(k) = aTemp(j) Then
				bUnique = False
				Exit For
			EndIf
		Next k
		If bUnique Then
			iNext = iNext + 1
			aTemp(iNext) = aTemp(j)
		EndIf
	Next j
Rem Trimming unnecessary values
	ReDim Preserve aTemp(LBound(aTemp) To iNext)
	getUniq = Join(aTemp, sDelimiter)
End Function

If desired, you can add date recognition and conversion to another format or some other additional processing.

The scary FILTERXML() formula is actually quite simple when you break it down into its component parts:

1 Like

That’s great!


One intended or unintended effect of your rat-tail formula is that it recognizes a DATE “June 25, 2025,” caches it as a NUMBER “45833,” and inserts its digits into the string as TEXT. Perhaps just a gimmick? The constant conflict between TEXT “abc” and DECIMAL “123.456” and DATE “2025-06-25” and SIZE VALUE/currency “15.50 €” is challenging and prompted me to package them in separate cells, which are much easier to handle. On the plus side, it differentiates uppercase from lowercase letters, which ‘UNiQUE()’ doesn’t.


Even the perennial problem of a seemingly empty, optically empty cell containing only some space or non-printable character like “\n” isn’t recognized and would have to be removed as a first step.
«Am I shitting myself because of this?».

Thank you All. There’s some discussion on the use of the spreadsheet and the wider data model. I attach a model of the spreadsheet as per the question:

DemoStringSet.ods (13.7 KB)

The solution from JohnSUN fits this application perfectly.