Counting rows where specific combination of values from two columns occur

Hi!

I will try to explain what I’m looking for with an example.

Given the table below

 |  A       B       C
-|---------------------
1| AA-     XYZ      a
2| AD-     ZYX      b
3| AA-     AB-      c
4| XYZ     ZYX      d
5| ZYX     DB_      e
6| AC-     DA_      f
7| DA_     AB-      g

(see CSV at the end)

I need to count all the rows where the first (A) and second (B) column both contain a value that belongs to the set {AA-, AB-, AC-, AD-, DA_, DB_, DC_, DD_}. For the example above only row 3, 6 and 7 will be taken into consideration because both of the “important” columns contain values that are from that set.

Using COUNTIFS I tried to come up with a regular expression. I’ve used the ? since the values are limited and there is not chance of having something like AX- or DG_. Even the values that are not allowed are a part of a set on their own but since it’s a larger set I have decided to use inclusion instead of exclusion:

=COUNTIFS(A1:A7, "[A?-|D?_]", B1:B7, "[A?-|D?_]")

Since I’m new to LibreOffice and Calc in particular I was hoping to get an explanation why this expression always returns and how to fix it.

I need this in order to verify some results from a software I’m working on which uses a similar table to extract the respective rows and check some of their values. Manual verification is really tedious. Just one table only contains ~90K rows and the set of allowed values is much larger which leads to a huge pool of possible combinations. That is why I decided to check some of the stats my software prints out and compare those with the results from LibreCalc.


The example table from above as CSV (delimiter is space):

AA- XYZ a
AD- ZYX b
AA- AB- c
XYZ ZYX d
ZYX DB_ e
AC- DA_ f
DA_ AB- g

E.g., your reference values AA-, AB-, AC-, AD-, DA_, DB_, DC_, DD_ are in D1-D8. Then, formula

=SUMPRODUCT(NOT(ISNA(VLOOKUP(A1:A7;$D$1:$D$8;1;0)));NOT(ISNA(VLOOKUP(B1:B7;$D$1:$D$8;1;0))))

does what you need.

If you enable regular expressions in formulas, then

=SUMPRODUCT(ISNUMBER(SEARCH("^(A.-|D._)$";A1:A7));ISNUMBER(SEARCH("^(A.-|D._)$";B1:B7)))

will also work. As well as this:

=COUNTIFS(A1:A7;"A.-|D._";B1:B7;"A.-|D._")

The explanation why the [A?-|D?_] expression is wrong:

  1. The [] brackets in regex define a set - i.e., “in this place any of the following is allowed”. So, in your example (=COUNTIFS(A1:A7, "[A?-|D?_]", B1:B7, "[A?-|D?_]")), you looked for cells containing one-char values in the set defined in the square brackets. Aside: the expression in square brackets has special treatment of -, as @PYS noted: it means range. So, your charset included A, all chars between ? and | inclusive, D, and _.
  2. The intended usage of ? is also wrong. You implied it to mean “any character in this position; mandatory present”. If used in regexes (outside of sets), the meaning of unescaped ? is to mean “zero or one of previous expression”.

See here the full list of regular expressions in ICU (used in LO).

Is it possible to really use A?-? As I wrote this is just a simple example but in reality the number of allowed values is rather big.

I have enabled reg ex in the preference of my LibreCalc but the given pattern you have added in your answer returns 0 matches. I have used both reg ex versions - with SUMPRODUCT and with COUNTIFS.

Please attach a spreadsheet with the formula giving 0 (here is mine: regex.ods)

Ah, I’ve just noticed that I’m using \- instead of -. I guess I took something from Pierre’s answer after all. :smiley: I will now adapt your answer to my specific needs. :stuck_out_tongue:

Hi

[EDIT]

I delete my answer that does not correspond to the need of the OP

You have to escape the character - (“between” operator, see help)

=COUNTIFS(A1:A7; "[A?\-|D?_]"; B1:B7; "[A?\-|D?_]")

Regards

Hmm, doesn’t work. I even removed the --- and ___ from a couple of rows and still nothing. Then I went further and created 5 rows with only AAA and BBB in each column (of course I adapted the reg ex you’ve kindly provided to fit the new pattern) - nothing.

This regex is incorrect wrt what OP wants: it returns true only if the cell contain only one character in the set A, ?, -, |, D, _

I’m looking for matching the complete AA-, AB- etc. and not just a chunk of it.

@mikekaganski - You’re right of course, I read the question too quickly :frowning:

@rbaleksandar - Sorry for the loss of time.

No problem. :smiley: