Ask Your Question
2

Counting rows where specific combination of values from two columns occur

asked 2017-11-21 09:44:10 +0100

rbaleksandar gravatar image

updated 2017-11-21 09:50:24 +0100

Hi!

I will try to explain what I'm looking for with an example.

Given the table below

 |  A       B       C
-|---------------------
1| AA-     XYZ      a
2| AD-     ZYX      b
3| AA-     AB-      c
4| XYZ     ZYX      d
5| ZYX     DB_      e
6| AC-     DA_      f
7| DA_     AB-      g

(see CSV at the end)

I need to count all the rows where the first (A) and second (B) column both contain a value that belongs to the set {AA-, AB-, AC-, AD-, DA_, DB_, DC_, DD_}. For the example above only row 3, 6 and 7 will be taken into consideration because both of the "important" columns contain values that are from that set.

Using COUNTIFS I tried to come up with a regular expression. I've used the ? since the values are limited and there is not chance of having something like AX- or DG_. Even the values that are not allowed are a part of a set on their own but since it's a larger set I have decided to use inclusion instead of exclusion:

=COUNTIFS(A1:A7, "[A?-|D?_]", B1:B7, "[A?-|D?_]")

Since I'm new to LibreOffice and Calc in particular I was hoping to get an explanation why this expression always returns and how to fix it.

I need this in order to verify some results from a software I'm working on which uses a similar table to extract the respective rows and check some of their values. Manual verification is really tedious. Just one table only contains ~90K rows and the set of allowed values is much larger which leads to a huge pool of possible combinations. That is why I decided to check some of the stats my software prints out and compare those with the results from LibreCalc.


The example table from above as CSV (delimiter is space):

AA- XYZ a
AD- ZYX b
AA- AB- c
XYZ ZYX d
ZYX DB_ e
AC- DA_ f
DA_ AB- g
edit retag flag offensive close merge delete

2 Answers

Sort by » oldest newest most voted
2

answered 2017-11-21 09:58:18 +0100

Mike Kaganski gravatar image

updated 2017-11-21 11:00:36 +0100

E.g., your reference values AA-, AB-, AC-, AD-, DA_, DB_, DC_, DD_ are in D1-D8. Then, formula

=SUMPRODUCT(NOT(ISNA(VLOOKUP(A1:A7;$D$1:$D$8;1;0)));NOT(ISNA(VLOOKUP(B1:B7;$D$1:$D$8;1;0))))

does what you need.

If you enable regular expressions in formulas, then

=SUMPRODUCT(ISNUMBER(SEARCH("^(A.-|D._)$";A1:A7));ISNUMBER(SEARCH("^(A.-|D._)$";B1:B7)))

will also work. As well as this:

=COUNTIFS(A1:A7;"A.-|D._";B1:B7;"A.-|D._")

The explanation why the [A?-|D?_] expression is wrong:

  1. The [] brackets in regex define a set - i.e., "in this place any of the following is allowed". So, in your example (=COUNTIFS(A1:A7, "[A?-|D?_]", B1:B7, "[A?-|D?_]")), you looked for cells containing one-char values in the set defined in the square brackets. Aside: the expression in square brackets has special treatment of -, as @pierre-yves samyn noted: it means range. So, your charset included A, all chars between ? and | inclusive, D, and _.
  2. The intended usage of ? is also wrong. You implied it to mean "any character in this position; mandatory present". If used in regexes (outside of sets), the meaning of unescaped ? is to mean "zero or one of previous expression".

See here the full list of regular expressions in ICU (used in LO).

edit flag offensive delete link more

Comments

Is it possible to really use A?-? As I wrote this is just a simple example but in reality the number of allowed values is rather big.

rbaleksandar gravatar imagerbaleksandar ( 2017-11-21 10:00:54 +0100 )edit

I have enabled reg ex in the preference of my LibreCalc but the given pattern you have added in your answer returns 0 matches. I have used both reg ex versions - with SUMPRODUCT and with COUNTIFS.

rbaleksandar gravatar imagerbaleksandar ( 2017-11-21 10:28:41 +0100 )edit

Please attach a spreadsheet with the formula giving 0 (here is mine: regex.ods)

Mike Kaganski gravatar imageMike Kaganski ( 2017-11-21 10:32:48 +0100 )edit

Ah, I've just noticed that I'm using \- instead of -. I guess I took something from Pierre's answer after all. :D I will now adapt your answer to my specific needs. :P

rbaleksandar gravatar imagerbaleksandar ( 2017-11-21 10:51:44 +0100 )edit
0

answered 2017-11-21 10:04:15 +0100

pierre-yves samyn gravatar image

updated 2017-11-21 10:33:02 +0100

Hi

[EDIT]

I delete my answer that does not correspond to the need of the OP

You have to escape the character - ("between" operator, see help)

=COUNTIFS(A1:A7; "[A?\-|D?_]"; B1:B7; "[A?\-|D?_]")

Regards

edit flag offensive delete link more

Comments

Hmm, doesn't work. I even removed the --- and ___ from a couple of rows and still nothing. Then I went further and created 5 rows with only AAA and BBB in each column (of course I adapted the reg ex you've kindly provided to fit the new pattern) - nothing.

rbaleksandar gravatar imagerbaleksandar ( 2017-11-21 10:19:00 +0100 )edit

This regex is incorrect wrt what OP wants: it returns true only if the cell contain only one character in the set A, ?, -, |, D, _

Mike Kaganski gravatar imageMike Kaganski ( 2017-11-21 10:20:14 +0100 )edit

I'm looking for matching the complete AA-, AB- etc. and not just a chunk of it.

rbaleksandar gravatar imagerbaleksandar ( 2017-11-21 10:23:26 +0100 )edit

@Mike Kaganski - You're right of course, I read the question too quickly :(

@rbaleksandar - Sorry for the loss of time.

pierre-yves samyn gravatar imagepierre-yves samyn ( 2017-11-21 10:31:05 +0100 )edit

No problem. :D

rbaleksandar gravatar imagerbaleksandar ( 2017-11-21 10:33:12 +0100 )edit
Login/Signup to Answer

Question Tools

1 follower

Stats

Asked: 2017-11-21 09:44:10 +0100

Seen: 79 times

Last updated: Nov 21 '17