We will be migrating from Ask to Discourse on the first week of August, read the details here

Ask Your Question
0

How does this formula work? {=SUM(1/COUNTIF(B2:B400,B2:B400))} Why can it fail to return the correct "countdistinct" result?

asked 2021-06-08 12:44:52 +0200

Tass gravatar image

updated 2021-06-09 00:25:43 +0200

Lupp gravatar image

Hi Hivemind,

I have been using this array formula before to determine the number of items in a list: {=SUM(1/COUNTIF(B2:B400,B2:B400))}

However, I don't actually understand really what it does, could someone explain how it works, please?

The reason this has become important to me (apart from simple curiosity) is that I have now long item lists of several tens of thousands of lines and the result I get is not an integer but a decimal (which for counting distinct items obviously doesn't make sense). So, I assume this formula is a workaround and an approximation to counting items and I would like to know how accurate it is.

Thanks a lot.

PS: Here is a test file. It is an extract of the original data I am working with. I check for empty cells first then do the counting of items, which are project titles in this case. Sheet 1 shows a small deviation, Sheet 2 a significant one, and sheet 3 shows a div/0 error even though there are no empty cells in that column.

So, I don't know if I do something fundamentally wrong or if there is an issue with the data or...

Thanks a lot for looking into this. Best regards. C:\fakepath\Test sheet.ods

edit retag flag offensive close merge delete

Comments

You have blank cells and also 0 (zero) values in your range B2:B400. The trick you are using can't work in this case. It's surely pure fun to find out why.
The mentioned trick is rather old. The dangerous automatisms causing the isssue also are.
How many users will have trusted in wrong results having rounded the decimals away (probably suppressing the display by formatting) and the propagation by Precision as shown e.g.?
Spredsheets are very dangerous. They once were usful. Till Excel came?
Please also consider what I wrote in
https://ask.libreoffice.org/en/questi...

Lupp gravatar imageLupp ( 2021-06-08 17:06:22 +0200 )edit

No, Lupp, I actually don't, The cells are not integers but text. They are project titles so never zeros. And I filled all empty cells, at least that is what it tells me when I use =COUNTIF(B2:B7451,""). The reason must be a different one. Maybe the titles are too long or something like that?

Tass gravatar imageTass ( 2021-06-08 17:31:44 +0200 )edit

I updated my question with a file to clarify the issue. It sometimes seems to load quite slowly (i.e. several minutes), sorry for that. It seems to recalculate all formulas and that sucks.

Tass gravatar imageTass ( 2021-06-08 17:46:43 +0200 )edit

Well, spreadsheets know many tricks to baffle users.
Hope I'll find out.

Lupp gravatar imageLupp ( 2021-06-08 18:15:46 +0200 )edit

4 Answers

Sort by » oldest newest most voted
2

answered 2021-06-08 13:20:11 +0200

karolus gravatar image

updated 2021-06-08 13:23:55 +0200

for Example:

a   2   1/2   # a is count twice  
b   2   1/2   # b also
c   1   1/1   # c occurs only 1 time
d   4   1/4   # d occurs 4times
a   2   1/2   # and so on
b   2   1/2
d   4   1/4
d   4   1/4
d   4   1/4

the countif-part counts how many times an entry occurs,
the 1/… returns the inverted counts,
The =SUM(…) returns the Sum of all the Fractions, that is by math the count of unique entries,
which is … obviously 4

edit flag offensive delete link more

Comments

ok, but how could I get decimals then? The sum should always be 1. I get results like 6593.4 .

Tass gravatar imageTass ( 2021-06-08 13:29:02 +0200 )edit

that should not happen, can you provide an Example-document to have a look on this

karolus gravatar imagekarolus ( 2021-06-08 13:42:10 +0200 )edit

I get results like 6593.4

The ".4" is impossible in case of correct formula. The inaccuracy could be around 0.00000...01, because of inaccuracies of calculation of N*(1/N), and even for N = 10^6 (a whole column of identical numbers in a sheet), the error would be no greater than 10^6*10^6*10^(-16), i.e. no greater than 10^-4 (i.e., no greater than 0.0001) - and that is an exaggerated estimation.

I am unsure how could I come with 0.0001 estimation: it is as if the error of 1/N would be 10^6*10^-16, when in fact it would be respectively smaller than 10^-16 - so no, in no case it would be that great. It just can't come even close to 0.0001, let alone 0.4.

Mike Kaganski gravatar imageMike Kaganski ( 2021-06-08 14:45:43 +0200 )edit

Opening the file attached to the question (thanks!), I see 1013.000000000030 in B7455. So indeed 0.000000000030 a very small deviation, much smaller than 0.0001 that I wrongly mentioned above, and I don't see anything similar to 6593.4. It is normal, and rounding is fine here.

Mike Kaganski gravatar imageMike Kaganski ( 2021-06-08 17:49:30 +0200 )edit

Second sheet, Mike. But I think Lupp found the answer already. Thanks, everybody. Really appreciated!

Tass gravatar imageTass ( 2021-06-09 05:02:33 +0200 )edit
2

answered 2021-06-08 13:10:20 +0200

PKG gravatar image

updated 2021-06-08 13:23:28 +0200

edit flag offensive delete link more

Comments

Thanks! That makes it pretty clear. But how can I get decimals as a result? Rounding error?

Tass gravatar imageTass ( 2021-06-08 13:26:51 +0200 )edit

Exactly, because Calc does only Floating-point-math, but the Formula returns by Design something very close to an Integer, Round it and you will be fine.

karolus gravatar imagekarolus ( 2021-06-08 13:35:00 +0200 )edit

While that illustrates nicely how the formula works, maybe a little further explanation that 1/COUNTIF(A1:A9;A1:A9) in array context would create an array of the values visible here in B1:B9. The COUNTIF(A1:A9;A1:A9) for each value in A1 to A9 counts the number of occurrences in A1:A9, counting them multiple times; taking the reciprocal of those counts and summing them results in a count of unique items.

erAck gravatar imageerAck ( 2021-06-08 13:47:13 +0200 )edit

The result not being integer for a large set of items may happen because of inaccuracies with IEEE 754 double binary floating point values that can not represent all decimal values exactly, see here for explanations. You may have to round the final result.

erAck gravatar imageerAck ( 2021-06-08 13:48:55 +0200 )edit

Don't round! Read my comment on the question.
Rounding can only be an optioin in this case if the distances to integer values are tiny. (This at least in any case where the formula would return a result in finite time.)

Lupp gravatar imageLupp ( 2021-06-08 17:34:12 +0200 )edit

Ok erAck, that get's far more technical now than I can grasp without sitting down and reading up basic concepts in programming that I am not familiar with. I posted a file below to clarify the issue. But from what you say it sounds to me that it is quite likely that the problem is because the item names are too long? That could easily be a problem as those are project titles, often of the length of a full sentence.

Tass gravatar imageTass ( 2021-06-08 17:37:04 +0200 )edit
1

answered 2021-06-08 18:30:26 +0200

eeigor gravatar image

updated 2021-06-08 18:51:58 +0200

In addition try smth else... Maybe #3 is easier than {SUM(1/COUNTIF(...))}. SUMPRODUCT doesn't require {}.

1-2

A5:=SUMPRODUCT(1/COUNTIF(A1:A4;A1:A4)) A5:=SUMPRODUCT(FREQUENCY(MATCH(A1:A4;A1:A4;0);ROW(A1:A4)-ROW(A1)+1)>0)

Even though the FREQUENCY formula (2nd) is much longer and contains more functions than the COUNTIF formula (1st), the formula calculation time is astonishingly faster!

3

A5:=SUMPRODUCT((A1:A4<>"")/COUNTIF(A1:A4;A1:A4&""))

NOTE: By itself, the COUNTIF function converts an empty cell to a zero in the criteria agrument. This results in a division by 0 error: #DIV/0! So pay attention to &"". Empty cells will also be counted, but in the first condition, empty cells will give FALSE, and when dividing FALSE by the total number of empty cells, we will get zero. As a result, empty cells are not counted. If we don't want to treat the empty cells as one of the possible data states.

Attention! If there are empty cells in the data in formulas #1 and #2, it is necessary to handle the error #DIV/0! and #N/A, respectively, using the IFERROR function.

=SUMPRODUCT(IFERROR(1/COUNTIF(A1:A4;A1:A4);""))

edit flag offensive delete link more

Comments

Thank you, that is quite useful!

Tass gravatar imageTass ( 2021-06-09 06:17:40 +0200 )edit

Note that ignores any and all errors, not just a #DIV/0! because of an empty cell.

erAck gravatar imageerAck ( 2021-06-09 12:37:41 +0200 )edit
1

answered 2021-06-09 00:18:51 +0200

Lupp gravatar image

updated 2021-06-09 13:13:51 +0200

This does not answer the question "How does it work?", which already was answered by others. This explains the reasons for getting wrong (and fractional) resulst under certain conditions. After all, the errors were the reason to ask the question. I will edit the question slightly to make this clear.

The fractional parts are due to unsymmetric cases where a COUNTIF(Cell;Cell) with twice the same cell would return 0, or where an empty string is accepted as a match for blank but not reversely. (Similar with 0-values in some cases).
The first variant won't occur with exact matching for strimgs, but if the Option
>Tools>Options>LibreOffice Calc>Calculation>>Formulas Wildcards>>>"Enable regular expressions in formulas"
is enabled strings containing special characters of RegEX can as well spoil the recognition of the string itself as a match, as also count a not exact match as if it was exact. To a reduced extent such an effect can also occur with Wildcards (MS style) enabled.
A different setting may also cause related issues: Search criteria....whole cells not checked.

In the given case the "unescaped parentheses" were "the culprit".

Conclusion: Always consider thoroughly the options you enable/disable.
Bad: RegEx (or Wildcad) can only be enabled/disabled per documnent. Actually needed is a way to decide it per formula where functions are capable of RegEx (Wildcard) sensitivity.
Also: Try to work without tricks. Counting should be done by counting (1 by 1), not by adding fractions.

[Edit 2021-06-09 about 11:15 GMT: The announced user function]

Option VBAsupport 1

Function countDistinct1D(pRange)
countDistinct = -1
rg = pRange.CellRange
da = rg.getDataArray()
uR = Ubound(da)
Dim args(0) As New com.sun.star.beans.PropertyValue
args(0).Name = "Hidden"
args(0).Value = True
hDoc = StarDesktop.loadComponentFromURL("private:factory/scalc", "_blank", 0, args)
hSheet = hDoc.Sheets(0)
hRg = hSheet.getCellRangeByPosition(0, 0, 0, uR)
hRg.setDataArray(da)
fd = hRg.CreateFilterDescriptor(True)
Dim ffs(0) As New com.sun.star.sheet.TableFilterField
ffs(0).IsNumeric = False
With fd
  .SkipDuplicates = True
  .CopyOutputData = True
  tRa = .OutputPosition
  tRa.Sheet = 0
  tRa.Column = 1
  tRa.Row = 0
  .OutputPosition = tRa
  .FilterFields = ffs
  .FilterFields(0).IsNumeric = False
End With
hRg.Filter(fd)
hRg.clearContents(7)
cc = hSheet.createCursor()
cc.gotoStartOfUsedArea(False)
cc.gotoEndOfUsedArea(True)
countDistinct1D = cc.RangeAddress.EndRow + 1
hDoc.Close(True)
fail:
End Function

(It's actually tested!)
[/Edit]

edit flag offensive delete link more

Comments

I just checked, I actually had "enable wildcards in formulas" checked, which seems to be the default, because I surely never changed these settings, not understanding their consequences

Tass gravatar imageTass ( 2021-06-09 05:23:18 +0200 )edit

"Also: Try to work without tricks. Counting should be done by counting (1 by 1), not by adding fractions."

How can I do this?

Tass gravatar imageTass ( 2021-06-09 06:20:27 +0200 )edit

I gave all the options above. Including speed. Thus, I answered a question that was not asked, but "was in the air"

eeigor gravatar imageeeigor ( 2021-06-09 07:44:25 +0200 )edit

Actually, the Frequency/Match approach is pretty good as it doesn't come up with decimals but integers. It came up with a different result (1 item less) on a list of 66,000 entries. I use this now. In terms of speed, I notice no difference though, I can go and make coffee in both cases before the process is finished. ;)

Tass gravatar imageTass ( 2021-06-09 08:13:15 +0200 )edit

Be cautious:
The working of MATCH() also basically depends on the setting concerning "Wildcards" or "Regular expressions" or "Neither nor" (radio biuttons).
All the solutions suggested by @eeigor use a function of that famiily.
If somebody worries about the question if a result of any "countdistinct" formula is correct, it's necesary to use the standard filter without any criterion, but wit No duplicates enabled under the options. There also is an option to output the filtered result elsewhere (given by the top-left cell of a range assured to be empty in advance - otherwise overwriting will occur). The results there are all unique now, and can simply be counted using COUNTA().
For specifically interesd users I will append to my answer a user Function based on ideas probably looking strange, but providing te fastest solution I can offer.

Lupp gravatar imageLupp ( 2021-06-09 13:11:14 +0200 )edit

Indeed, with regular expressions switched on and items containing operators or metacharacters items can be counted multiple times, resulting in unexpected fractional values, or not be counted at all. For this the search criteria can be forced to a quoted literal string by prepending \Q and appending \E metacharacters, so that would be

=SUM(1/COUNTIF(B2:B400,"\Q"&B2:B400&"\E"))

(not tested with your document but in general works).

Note that if Excel compatible Wildcards are enabled instead of regular expressions, there is no such simple remedy. And the formula does also not work with no wildcards or regular expressions enabled.

erAck gravatar imageerAck ( 2021-06-09 13:24:13 +0200 )edit

"use the standard filter without any criterion, but wit No duplicates enabled under the options. There also is an option to output the filtered result elsewhere (given by the top-left cell of a range assured to be empty in advance - otherwise overwriting will occur). The results there are all unique now, and can simply be counted using COUNTA()"

I have tried this before and now again, but that doesn't work for me. I get only the first row copied to a new sheet, nothing else. With advanced filter, I get an error message. :(

Tass gravatar imageTass ( 2021-06-09 15:12:26 +0200 )edit

"For this the search criteria can be forced to a quoted literal string by prepending \Q and appending \E metacharacter"

Gosh, there is so much to know for such a simple operation. How can a standard user who does something more than just basic operations only once a year or so, ever be proficient in that?! I mean without you guys I would have had no way to know or ever find out. That's kind of disturbing...

Tass gravatar imageTass ( 2021-06-09 15:16:33 +0200 )edit

@Tass wrote: “I get only the first row copied to a new sheet, nothing else”.

Filters are fine. The range must be contiguous, that is, no blank lines.

eeigor gravatar imageeeigor ( 2021-06-09 15:43:54 +0200 )edit

I have tried this {stanard filter with No duplicates} before and now again, but that doesn't work for me.

It works for me, and it works for the user function I posted. We are using the same software. You must have missed the "No criterion at all" or you misinterpreted it. Select "-none-" also for the very first field.

As already posted in different places: The choice concerning wildcards and regular expressions must be offered per function call. A trailing optional parameter would do. As long as Excel doesn't start to misuse that position, it would even not reduce compatility beyond what's unavoidable anyway due to the shortcomings of Excel.

Lupp gravatar imageLupp ( 2021-06-09 15:45:15 +0200 )edit
Login/Signup to Answer

Question Tools

1 follower

Stats

Asked: 2021-06-08 12:44:52 +0200

Seen: 128 times

Last updated: Jun 09