How does this formula work? {=SUM(1/COUNTIF(B2:B400,B2:B400))} Why can it fail to return the correct "countdistinct" result?

Tass · June 8, 2021, 10:44am

Hi Hivemind,

I have been using this array formula before to determine the number of items in a list: {=SUM(1/COUNTIF(B2:B400,B2:B400))}

However, I don’t actually understand really what it does, could someone explain how it works, please?

The reason this has become important to me (apart from simple curiosity) is that I have now long item lists of several tens of thousands of lines and the result I get is not an integer but a decimal (which for counting distinct items obviously doesn’t make sense). So, I assume this formula is a workaround and an approximation to counting items and I would like to know how accurate it is.

Thanks a lot.

PS: Here is a test file. It is an extract of the original data I am working with. I check for empty cells first then do the counting of items, which are project titles in this case. Sheet 1 shows a small deviation, Sheet 2 a significant one, and sheet 3 shows a div/0 error even though there are no empty cells in that column.

So, I don’t know if I do something fundamentally wrong or if there is an issue with the data or…

Thanks a lot for looking into this. Best regards.
Test sheet.ods

Lupp · June 8, 2021, 3:06pm

You have blank cells and also 0 (zero) values in your range B2:B400. The trick you are using can’t work in this case. It’s surely pure fun to find out why.
The mentioned trick is rather old. The dangerous automatisms causing the isssue also are.
How many users will have trusted in wrong results having rounded the decimals away (probably suppressing the display by formatting) and the propagation by Precision as shown e.g.?
Spredsheets are very dangerous. They once were usful. Till Excel came?
Please also consider what I wrote in

Tass · June 8, 2021, 3:31pm

No, Lupp, I actually don’t, The cells are not integers but text. They are project titles so never zeros. And I filled all empty cells, at least that is what it tells me when I use =COUNTIF(B2:B7451,""). The reason must be a different one. Maybe the titles are too long or something like that?

Tass · June 8, 2021, 3:46pm

I updated my question with a file to clarify the issue. It sometimes seems to load quite slowly (i.e. several minutes), sorry for that. It seems to recalculate all formulas and that sucks.

Lupp · June 8, 2021, 4:15pm

Well, spreadsheets know many tricks to baffle users.
Hope I’ll find out.

PKG · June 8, 2021, 11:10am

Untitled.ods

Tass · June 8, 2021, 11:26am

Thanks! That makes it pretty clear. But how can I get decimals as a result? Rounding error?

karolus · June 8, 2021, 11:35am

Exactly, because Calc does only Floating-point-math, but the Formula returns by Design something very close to an Integer, Round it and you will be fine.

erAck · June 8, 2021, 11:47am

While that illustrates nicely how the formula works, maybe a little further explanation that 1/COUNTIF(A1:A9;A1:A9) in array context would create an array of the values visible here in B1:B9. The COUNTIF(A1:A9;A1:A9) for each value in A1 to A9 counts the number of occurrences in A1:A9, counting them multiple times; taking the reciprocal of those counts and summing them results in a count of unique items.

erAck · June 8, 2021, 11:48am

The result not being integer for a large set of items may happen because of inaccuracies with IEEE 754 double binary floating point values that can not represent all decimal values exactly, see here for explanations. You may have to round the final result.

Lupp · June 8, 2021, 3:34pm

Don’t round! Read my comment on the question.
Rounding can only be an optioin in this case if the distances to integer values are tiny. (This at least in any case where the formula would return a result in finite time.)

Tass · June 8, 2021, 3:37pm

Ok erAck, that get’s far more technical now than I can grasp without sitting down and reading up basic concepts in programming that I am not familiar with. I posted a file below to clarify the issue. But from what you say it sounds to me that it is quite likely that the problem is because the item names are too long? That could easily be a problem as those are project titles, often of the length of a full sentence.

karolus · June 8, 2021, 11:20am

for Example:

a	2	1/2   # a is count twice  
b	2	1/2   # b also
c	1	1/1   # c occurs only 1 time
d	4	1/4   # d occurs 4times
a	2	1/2   # and so on
b	2	1/2
d	4	1/4
d	4	1/4
d	4	1/4

the countif-part counts how many times an entry occurs,
the 1/… returns the inverted counts,
The =SUM(…) returns the Sum of all the Fractions, that is by math the count of unique entries,
which is … obviously 4

Tass · June 8, 2021, 11:29am

ok, but how could I get decimals then? The sum should always be 1. I get results like 6593.4 .

karolus · June 8, 2021, 11:42am

that should not happen, can you provide an Example-document to have a look on this

mikekaganski · June 8, 2021, 12:45pm

I get results like 6593.4

The “.4” is impossible in case of correct formula. The inaccuracy could be around 0.00000…01, because of inaccuracies of calculation of N*(1/N), and even for N = 10^6 (a whole column of identical numbers in a sheet), the error would be no greater than ~~10^6*10^6*10^(-16), i.e. no greater than 10^-4 (i.e., no greater than 0.0001)~~ - and that is an exaggerated estimation.

I am unsure how could I come with 0.0001 estimation: it is as if the error of 1/N would be 10^6*10^-16, when in fact it would be respectively smaller than 10^-16 - so no, in no case it would be that great. It just can’t come even close to 0.0001, let alone 0.4.

mikekaganski · June 8, 2021, 3:49pm

Opening the file attached to the question (thanks!), I see 1013.000000000030 in B7455. So indeed 0.000000000030 a very small deviation, much smaller than 0.0001 that I wrongly mentioned above, and I don’t see anything similar to 6593.4. It is normal, and rounding is fine here.

Tass · June 9, 2021, 3:02am

Second sheet, Mike. But I think Lupp found the answer already. Thanks, everybody. Really appreciated!

eeigor · June 8, 2021, 4:30pm

In addition try smth else… Maybe #3 is easier than {SUM(1/COUNTIF(…))}. SUMPRODUCT doesn’t require {}.

#1-2
A5:=SUMPRODUCT(1/COUNTIF(A1:A4;A1:A4))
A5:=SUMPRODUCT(FREQUENCY(MATCH(A1:A4;A1:A4;0);ROW(A1:A4)-ROW(A1)+1)>0)

Even though the FREQUENCY formula (2nd) is much longer and contains more functions than the COUNTIF formula (1st), the formula calculation time is astonishingly faster!

#3
A5:=SUMPRODUCT((A1:A4<>"")/COUNTIF(A1:A4;A1:A4**&""**))

NOTE: By itself, the COUNTIF function converts an empty cell to a zero in the criteria agrument. This results in a division by 0 error: #DIV/0! So pay attention to &"".
Empty cells will also be counted, but in the first condition, empty cells will give FALSE, and when dividing FALSE by the total number of empty cells, we will get zero. As a result, empty cells are not counted. If we don’t want to treat the empty cells as one of the possible data states.

Attention! If there are empty cells in the data in formulas #1 and #2, it is necessary to handle the error #DIV/0! and #N/A, respectively, using the IFERROR function.

=SUMPRODUCT(IFERROR(1/COUNTIF(A1:A4;A1:A4);""))

Tass · June 9, 2021, 4:17am

Thank you, that is quite useful!