How do I get Calc `REGEX` to return nothing when there is no match?

I’m using a Calc spreadsheet and REGEX to parse some dictated data input that includes some basic data and various optional fields.

I’m stumbling over the default behavior of the REGEX function. I’m relatively new to LibreOffice but I’ve been using Regular Expressions in various flavors for over 40 years, so I have some deeply embedded presumptions about how they “ought” to behave that influence how I wrap my mind around my intended RE constructions. In particular, based on decades of using VI/VIM/MacVIM, SED, PERL, etc., I have an assumption that if your RE does not match the input string, you get nothing back.

But LO Calc makes the opposite assumption. (Since Excel does not support REs, this is clearly not an Excel-compatibility decision.) In Calc, if nothing in the RE matches the Text, you get all of the Text.

The REGEX doc https://help.libreoffice.org/latest/en-US/text/scalc/01/func_regex.html says, about the Replacement part of the REGEX function:

Replacement: Optional. The replacement text and references to capture groups. If there is no match, Text is returned unmodified.

Now I can’t be the first person who has found this behavior problematical. Someone has to have come up with a standard practice or an idom for how to make this work simply, reliably, and most importantly, understandably (when reading the formula). But what is it?

So far, all I’ve come up with is to run the whole REGEX function twice, first with just the Text and Expression, without the Replacement argument, wrap that in an IF, and if that does not produce a #N/A, then run the REGEX again, this time with the Replacement argument present.

For me, that’s an absurd complication — the REGEX Expression argument is long, complex, difficult to read and this method means I have to edit two of them, identically, every time I make a change while I’m working out the RE. (Never mind computational efficiency of running a complex formula twice, the machine is so fast that’s not the issue it was years ago, but I still think about it, even though my sheet will only have hundreds of lines, not millions.)

I have recently found some examples of using COUNTIF with a Regular Expression to text if there is a match. That might produced a slightly more readable formula than running REGEX twice, but that still requires spelling out the full expression twice, with the inescapable, error-prone double editing that demands.

Is there some structure of Calc functions, an idom or standard practice, that lets a non-match have a non-result, while still being able to do field extraction/substitution on the Text, which doesn’t require having the RE spelled out twice?

In any program, when in some text you replace some search string with another (using regex or otherwise), and there was no match in the original text, the original text is not modified - it’s not replaced with an empty string as a whole, no?

You can only expect to get an empty result without a match, when your operation is not a replacement, but a search - i.e., you wouldn’t use the “replacement” optional argument. Then the “no match” would be reported as #N/A - i.e., the common spreadsheet convention for “no result”.

What would be the use case, where you indeed need to replace, but at the same time, expect the replacement result to be completely empty without a match?

My task is not so much one of Replacement but rather a task of Extraction. RE “capture groups” work particuarly well for this. If you think of the problem as one of extracting an optional delimited field from the input string, I think the use case of returning nothing if there is no match may be clearer. If there is a pattern match delimited by the optional field separator, I want to identify that with a capture group and if there is no such field in the data, then the result should be empty.
.

In my project, as I mentioned in the OP, data is entered via dictation, and I have yet to find a way to easily dictate directly into a spreadsheet using standard, free, dictation systems. (Dragon Naturally Speaking is the only system I’ve tried that will actually transmit the keystrokes to move the insertion point to the next field, and it’s prohibitively pricey for a personal project.) So I am dictating all the input data as a single string with field separators. I say “dash” and I variously get a hyphen, minus, en-dash, or em-dash, but with REs it’s a simple matter to search for one character of that set and use that same set ([-–—]) for every instance of a delimiter in the RE.
.

Some of the fields are optional, they are not explicitly dictated as being empty, they are just not used. Optional fields use a different separator, They start with a dictated “slash slash” which pretty reliably gets transcribed as “//” and they terminate with a period, comma, close parenthesis, or end of line.
.

So the input is not syntactically well formed if there are not two “dash” delimiters in it. Within the last field I can have an optional “slashes” field, and I’m allowing the syntax to have up to three. Mostly my data has one, sometimes two, and three has not yet occurred in real life, but I’m allowing for it ahead of time.
.

My formulas so far have used REGEX to replace the whole input with the extracted fields in different columns, using capture groups. I originally explored the idea of using SEARCH, but rejected it because it only gives me the starting position of the matched string and does not support capture groups (that I can find), so it seems to make the job a whole lot harder.
.

If you have a way of using capture groups within SEARCH I’m all ears…

I also have a problem to see a realistic use-case.
However, I see a solution:

=LEFT(REGEX($A1; $J1; $K1; "g"); IF(CURRENT()=$A1; 0; 999999))

Where columns J and K contain the search string and the replace string respctively.
Unfortunately this only works for results up to 999999 characters.
See
disask117007_StrangeReplacement.ods (25.6 KB)

Or maybe a bit cleaner solution using LET:

=LET(result; REGEX($A1; $J1; $K1; "g"); IF(result=$A1; ""; result))

Note though, that these are not strictly the same as “return empty for no match” - they will return empty also for “replacement equal to match”.

1 Like

Thanks @Lupp for this suggestion.
.

First, the idea of defining the RE in a cell and referencing is a great technique that I had not seen before. I can have the RE for the entire column at the top of each column, which will save a lot of copying while I’m debugging and fine tuning the RE. It would also solve one of the issues in my OP, I could use the same RE twice in a nested formuala and only have to edit one RE string.
.

In most cases my Replacement String is "$1", simply extracting the (first) capture group. However, it may be useful to have all three REGEX arguments, after the Input Text, at the top of their corresponding results columns for simplicity in editing. Thanks for the idea.
.

Second, the idea of using LEFT to extract my data, instead of using an RE capture group, may turn out to be handy to use with SEARCH. For one thing, as of now I have nested IFs to handle either a terminating character or the EOL. Using LEFT could unnest that construction.

Thanks @mikekaganski, LET is new in Calc in this version and this is a great example of using it for organization and clarity.

That is not an issue for me. Because any well formed line will have at least a couple of dash delimiters separating the three required fields, any field pattern match necessarily cannot match the whole input string.
.

I hope I’ve answered the questions about “use case”. You all have defintely given me directions to explore. Thanks.

@Lupp — When I commented above, I misunderstood how your formula worked. I didn’t understand what CURRENT() was in that formula, how at that point it is the output of REGEX so you don’t need a second REGEX to do the comparison, to choose between all or nothing using LEFT.

Now that I’m seeing what this actually does, I can’t recall what I meant by LEFT and SEARCH.

Thanks again.

I have not yet figured out how this forum handles threads of conversation, particularly how it orders things. I’ve looked and all I’ve found has been notes on how to quote and use fonts.

I wrote a longish reply/comment that I though would simply appear in chronological order, like the forums I’ve been on a lot in the past few years. But it turns out (I think) that I should have put it in three different places: replying to a comment on my original question, replying to an answer to my question, and replying to a comment on that answer. So I’ve broken up my longish answer that way.

Originally I had all three pieces in a single reply. When I saw that that was not as I expected I tried moving it “later in the thread” but I used the Answer format – but I was not answering the original question.

My apologies for the confusion and the juggling. I’ll get the hang of it. I’m sure I will.

I would not mind simply deleting this “Answer” after all contributors have seen it since it’s off topic and of no lasting value.

This place in the thread is where all three of my long, merged reply was after I moved it, which is what Mike is commenting on below.

You initially wrote “moved later in thread”, but in fact, re-posted this as an independent “solution” - which may move up and down, and be out of order. If you want to be “in thread”, then you should use proper comments to the other solutions / comments…

I must say, that the use case is still unclear to me, but well.