Strings, lists and sets in Libre Calc

I have a spreadsheet that creates a text string from cells in a row using CONCATENATE. Duplicate entries in the list are common and are used in the spreadsheet for data integrity checking. Blanks are also common. I am looking for a function that would convert the string :

;;;mammal;dog;Rover;;;;black;Rover;;;29/01/2020;Rover;;;

into

mammal;dog;Rover;black;29/01/2020;

or

;mammal;dog;Rover;black;29/01/2020;

(i.e., the leading “;” indicating an empty set member)

To date this was done by search and replace for empty sets and manual deletion for complex fields. As the material has got more extensive I am looking / hoping for a SET function that can be called in a cell, perhaps like

=SET(Range,"delimiter’,IncludeEmptySet)

The Range could be the original cells or the concatenated string. If it’s the string, in my mind the process might be to convert the string to a list, identify a set of list items (optionally with or without the empty set as a list member) and return the set as a similarly delimited string. For my application order is not important, however I can imagine applications where ordering (such as alphabetically, or by length of list item, or by position in the original string) is useful. In Python the set() function offers a (roughly) equivalent functionality.

I have not been able to find a Calc function offering the set() functionality. Can anyone offer a way forward please?

Welcome!
TEXTJOIN() ?
Or =REGEX(A1;";+";";";"g") to result of your CONCATENATE() ?

Upd. Oh, also see UNIQUE()

2 Likes

Thank you JohnSUN! TEXTJOIN() will skip blanks but not duplicates (in my input string in the question “Rover” appears three times, but only once in the desired output). The REGEX() will drop the multiple “;;”, but also misses the duplicate "Rover"s. UNIQUE() has possibilities - I’ll have to play with it, and it will force a spreadsheet restructure - and almost a doubling of the spreadsheet area - but that’s a small price. The UNIQUE() function is not available in the Libre version in the Ubuntu 24.04 repo, so had to install the Libre ppa and update as well. Many thanks for giving a direction - I’ll tick it as a solution after I’ve played with it to confirm it solves the challenge.

No, you seem to have misunderstood my comment. I meant something like =TEXTJOIN(";";0;UNIQUE(2:2;1;0)) instead of your CONCATENATE() - it wouldn’t require much change of sheet

1 Like

Fabulous - THANK YOU. No internet examples show UNIQUE() applied other than in an array; this is a really neat solution and helps reduce a lot of the CONCATENATE() rubbish I had assembled, simplifying other parts of the spreadsheet as well.

Independent of the great solution by @JohnSUN, just a remark.
Please never overestimate the cost of spreadsheet area: it is often cheap, compared to the cost of shorter “all-calculations-in-one-cell” solutions. When you have a repeated calculation over a dynamically created range, that dynamic creation may be the bottleneck, which can easily be eliminated by an auxiliary range (maybe on a dedicated sheet, maybe even hidden), which would only require to be calculated once, and reused in thousands of dependent calculations, instead of re-creating it again and again in each formula.

Which may not apply to your case: I have no idea, how many times do you use that formula :slight_smile:

1 Like

Mike Kaganski - indeed! In fact this question and the demo I posted are an abstraction; the actual application is a “database” sourcing data from multiple locations (lists of parks and people and stuff) and integrating to a set of hierarchical keywords that consistently allocates a set of “keywords” for a trigger word in a photo caption. The spreadsheet is 124 columns by 12000 rows - so not huge. It started small and simple, I was trying to be consistent on tagging a few hundred photos. As it grew, checks for consistency and completeness were built in. I’m now at edition 602 and over 200 hours of editing time. On demand I pull a CSV of trigger words and consequent tags and run it against a CSV of photoname and comment using a Python script, returning photoname, missed comments and proposed tags. With thousands of photos now in a batch, this saves hours of hunt and peck for a comprehensive and consistent set of keywords, highlights misses and can even issuing warnings where disambiguation of a comment might be needed. Sure, I could do it all again in a database, but with a spreadsheet it was simple to envisage, was readily scalable and didn’t require extensive (re)programming when specifications emerged.

its easier to seperate manually than this complex uprooting, play with this as no easy result:
000_TEXTstring isolate_v0000_021557.ods (22.9 KB)
recognize the many auxiliary cells

JohnSUN’s solution fits the spreadsheet situation better, but I like what you’ve managed with regex and this has inspired me to combine this approach with the TEXTJOIN(UNIQUE()) proposeed above in another part of the spreadsheet - Many thanks for putting so much into a solution

1 Like

I look forward to your solution and am excited about it because …
=TEXTJOIN(";";0;UNIQUE(2:2;0;0))
or
=TEXTJOIN(";";0;UNIQUE(2:2;1;0))
… does not lead to the desired result.

Without a sample of the data, it’s hard to say what exactly went wrong. What’s in your second row? Mine was the original ;;;mammal;dog;Rover;;;;black;Rover;;;29/01/2020;Rover;;; split into separate cells at the semicolon with Text to Columns. What about yours?

1 Like

Thank you All. There’s some discussion on the use of the spreadsheet and the wider data model. I attach a model of the spreadsheet as per the question:

DemoStringSet.ods (13.7 KB)

The solution from JohnSUN fits this application perfectly.