I’m using a Calc spreadsheet and REGEX
to parse some dictated data input that includes some basic data and various optional fields.
I’m stumbling over the default behavior of the REGEX
function. I’m relatively new to LibreOffice but I’ve been using Regular Expressions in various flavors for over 40 years, so I have some deeply embedded presumptions about how they “ought” to behave that influence how I wrap my mind around my intended RE constructions. In particular, based on decades of using VI/VIM/MacVIM, SED, PERL, etc., I have an assumption that if your RE does not match the input string, you get nothing back.
But LO Calc makes the opposite assumption. (Since Excel does not support REs, this is clearly not an Excel-compatibility decision.) In Calc, if nothing in the RE matches the Text, you get all of the Text.
The REGEX
doc https://help.libreoffice.org/latest/en-US/text/scalc/01/func_regex.html says, about the Replacement part of the REGEX
function:
Replacement: Optional. The replacement text and references to capture groups. If there is no match, Text is returned unmodified.
Now I can’t be the first person who has found this behavior problematical. Someone has to have come up with a standard practice or an idom for how to make this work simply, reliably, and most importantly, understandably (when reading the formula). But what is it?
So far, all I’ve come up with is to run the whole REGEX
function twice, first with just the Text and Expression, without the Replacement argument, wrap that in an IF
, and if that does not produce a #N/A
, then run the REGEX
again, this time with the Replacement argument present.
For me, that’s an absurd complication — the REGEX
Expression argument is long, complex, difficult to read and this method means I have to edit two of them, identically, every time I make a change while I’m working out the RE. (Never mind computational efficiency of running a complex formula twice, the machine is so fast that’s not the issue it was years ago, but I still think about it, even though my sheet will only have hundreds of lines, not millions.)
I have recently found some examples of using COUNTIF
with a Regular Expression to text if there is a match. That might produced a slightly more readable formula than running REGEX
twice, but that still requires spelling out the full expression twice, with the inescapable, error-prone double editing that demands.
Is there some structure of Calc functions, an idom or standard practice, that lets a non-match have a non-result, while still being able to do field extraction/substitution on the Text, which doesn’t require having the RE spelled out twice?