Lookup matching regular expression in Calc

I want to implement classification based on regular expressions.
In sheet1 I have

A       B       C
Item    Class   Wanted Class
apples1 =???    apples
apples2         apples
apples3         apples
Pear A          pears
Pear B          pears
Melon           other

In sheet2 I have:

A           B
Regex       Class
.*apples.*  apples
.*Pear.*    pears
.*          other

What formula to put in sheet1.B so they have the same values like C?
Does Calc have a function that can be used in formula that makes regular expression replace?

Do you mean this?

Yes, I want to use Calc exactly for the same purpose :slight_smile:
However the solution there is not what is needed as far as I understand. It matches a pattern made of all words of the item against the class. When the item and class contain common words there will be wrong classifications. What is better the class to be a pattern that can be tuned to be unique even containing common words. I managed to get what i want using the array formula:

=INDEX(classes.$B$2:$B$10000,MATCH(0,ISERROR(SEARCH(regex,C406)), 0),1). The idea is that we have an array of regular expressions tested against the current row and then the row of the array element that produced a match is returned as a single result. regex is a named range defined as $classes.$A$2:$A$17, the class names are in the B column

Hi @nb, Could you please post the same content into an Answer? That would help us to resolve this question, and make it easier for any user who reads your question to find your Answer.

Thanks!

Does the INDEX method proposed in comment #3 above work? I couldn’t get it to work here with the named ranges indicated.

I do not know how @nb solved the problem, but I offer a solution to the problem as described with COUNTIF. There are only a handful of functions that interpret strings as regular expressions, and COUNTIF is one of them. So, treating 1 as a TRUE value, and 0 as a FALSE value:

  A           B
Item        Class
apples1  =IF(COUNTIF(A2, ".*apple.*"), "apples", IF(COUNTIF(A2, ".*pear.*"), "pears", "other")
apples2  =IF(COUNTIF(A3, ".*apple.*"), "apples", IF(COUNTIF(A3, ".*pear.*"), "pears", "other")
apples3  =IF(COUNTIF(A4, ".*apple.*"), "apples", IF(COUNTIF(A4, ".*pear.*"), "pears", "other")
Pear A   =IF(COUNTIF(A5, ".*apple.*"), "apples", IF(COUNTIF(A5, ".*pear.*"), "pears", "other")
Pear B   =IF(COUNTIF(A6, ".*apple.*"), "apples", IF(COUNTIF(A6, ".*pear.*"), "pears", "other")
Melon    =IF(COUNTIF(A7, ".*apple.*"), "apples", IF(COUNTIF(A7, ".*pear.*"), "pears", "other")

The basic IF statement format is TEST, TRUE_ACTION, FALSE_ACTION. Since Calc doesn’t care if we use multiple lines in the formula, we can break it up into a more manageable version with nested IF functions. Consider:

=IF(TEST1, TRUE_TEST1,
 IF(TEST2, TRUE_TEST2,
 IF(TEST3, TRUE_TEST3, ... )))

This makes writing the above formula more approachable, in, say, a text editor (Notepad, Gedit, TextWrangler, etc.). For example:

=IF(COUNTIF(A2, ".*apple.*"), "ANSWER IF CONTAINS APPLES",
 IF(COUNTIF(A2, ".*pear.*"), "ANSWER IF CONTAINS PEARS",
 "UNKNOWN FALLBACK"))

Then, once the formula is written, just fill down.

You are missing parenthesis in the first example.

1 Like