How does this formula work? {=SUM(1/COUNTIF(B2:B400,B2:B400))} Why can it fail to return the correct "countdistinct" result?

karolus · June 8, 2021, 11:35am

Exactly, because Calc does only Floating-point-math, but the Formula returns by Design something very close to an Integer, Round it and you will be fine.

erAck · June 8, 2021, 11:47am

While that illustrates nicely how the formula works, maybe a little further explanation that 1/COUNTIF(A1:A9;A1:A9) in array context would create an array of the values visible here in B1:B9. The COUNTIF(A1:A9;A1:A9) for each value in A1 to A9 counts the number of occurrences in A1:A9, counting them multiple times; taking the reciprocal of those counts and summing them results in a count of unique items.

erAck · June 8, 2021, 11:48am

The result not being integer for a large set of items may happen because of inaccuracies with IEEE 754 double binary floating point values that can not represent all decimal values exactly, see here for explanations. You may have to round the final result.

Lupp · June 8, 2021, 3:34pm

Don’t round! Read my comment on the question.
Rounding can only be an optioin in this case if the distances to integer values are tiny. (This at least in any case where the formula would return a result in finite time.)

Tass · June 8, 2021, 3:37pm

Ok erAck, that get’s far more technical now than I can grasp without sitting down and reading up basic concepts in programming that I am not familiar with. I posted a file below to clarify the issue. But from what you say it sounds to me that it is quite likely that the problem is because the item names are too long? That could easily be a problem as those are project titles, often of the length of a full sentence.

karolus · June 8, 2021, 11:20am

for Example:

a	2	1/2   # a is count twice  
b	2	1/2   # b also
c	1	1/1   # c occurs only 1 time
d	4	1/4   # d occurs 4times
a	2	1/2   # and so on
b	2	1/2
d	4	1/4
d	4	1/4
d	4	1/4

the countif-part counts how many times an entry occurs,
the 1/… returns the inverted counts,
The =SUM(…) returns the Sum of all the Fractions, that is by math the count of unique entries,
which is … obviously 4

Tass · June 8, 2021, 11:29am

ok, but how could I get decimals then? The sum should always be 1. I get results like 6593.4 .

karolus · June 8, 2021, 11:42am

that should not happen, can you provide an Example-document to have a look on this

mikekaganski · June 8, 2021, 12:45pm

I get results like 6593.4

The “.4” is impossible in case of correct formula. The inaccuracy could be around 0.00000…01, because of inaccuracies of calculation of N*(1/N), and even for N = 10^6 (a whole column of identical numbers in a sheet), the error would be no greater than ~~10^6*10^6*10^(-16), i.e. no greater than 10^-4 (i.e., no greater than 0.0001)~~ - and that is an exaggerated estimation.

I am unsure how could I come with 0.0001 estimation: it is as if the error of 1/N would be 10^6*10^-16, when in fact it would be respectively smaller than 10^-16 - so no, in no case it would be that great. It just can’t come even close to 0.0001, let alone 0.4.

mikekaganski · June 8, 2021, 3:49pm

Opening the file attached to the question (thanks!), I see 1013.000000000030 in B7455. So indeed 0.000000000030 a very small deviation, much smaller than 0.0001 that I wrongly mentioned above, and I don’t see anything similar to 6593.4. It is normal, and rounding is fine here.

Tass · June 9, 2021, 3:02am

Second sheet, Mike. But I think Lupp found the answer already. Thanks, everybody. Really appreciated!

eeigor · June 8, 2021, 4:30pm

In addition try smth else… Maybe #3 is easier than {SUM(1/COUNTIF(…))}. SUMPRODUCT doesn’t require {}.

#1-2
A5:=SUMPRODUCT(1/COUNTIF(A1:A4;A1:A4))
A5:=SUMPRODUCT(FREQUENCY(MATCH(A1:A4;A1:A4;0);ROW(A1:A4)-ROW(A1)+1)>0)

Even though the FREQUENCY formula (2nd) is much longer and contains more functions than the COUNTIF formula (1st), the formula calculation time is astonishingly faster!

#3
A5:=SUMPRODUCT((A1:A4<>"")/COUNTIF(A1:A4;A1:A4**&""**))

NOTE: By itself, the COUNTIF function converts an empty cell to a zero in the criteria agrument. This results in a division by 0 error: #DIV/0! So pay attention to &"".
Empty cells will also be counted, but in the first condition, empty cells will give FALSE, and when dividing FALSE by the total number of empty cells, we will get zero. As a result, empty cells are not counted. If we don’t want to treat the empty cells as one of the possible data states.

Attention! If there are empty cells in the data in formulas #1 and #2, it is necessary to handle the error #DIV/0! and #N/A, respectively, using the IFERROR function.

=SUMPRODUCT(IFERROR(1/COUNTIF(A1:A4;A1:A4);""))

Tass · June 9, 2021, 4:17am

Thank you, that is quite useful!

erAck · June 9, 2021, 10:37am

Note that ignores any and all errors, not just a #DIV/0! because of an empty cell.

Lupp · June 8, 2021, 10:18pm

This does not answer the question “How does it work?”, which already was answered by others. This explains the reasons for getting wrong (and fractional) resulst under certain conditions. After all, the errors were the reason to ask the question. I will edit the question slightly to make this clear.

The fractional parts are due to unsymmetric cases where a COUNTIF(Cell;Cell) with twice the same cell would return 0, or where an empty string is accepted as a match for blank but not reversely. (Similar with 0-values in some cases).
The first variant won’t occur with exact matching for strimgs, but if the Option
>Tools>Options>LibreOffice Calc>Calculation>>Formulas Wildcards>>>"Enable regular expressions in formulas"
is enabled strings containing special characters of RegEX can as well spoil the recognition of the string itself as a match, as also count a not exact match as if it was exact.
To a reduced extent such an effect can also occur with Wildcards (MS style) enabled.
A different setting may also cause related issues: Search criteria....whole cells not checked.

In the given case the “unescaped parentheses” were “the culprit”.

Conclusion: Always consider thoroughly the options you enable/disable.
Bad: RegEx (or Wildcad) can only be enabled/disabled per documnent. Actually needed is a way to decide it per formula where functions are capable of RegEx (Wildcard) sensitivity.
Also: Try to work without tricks. Counting should be done by counting (1 by 1), not by adding fractions.

[Edit 2021-06-09 about 11:15 GMT: The announced user function]

Option VBAsupport 1

Function countDistinct1D(pRange)
countDistinct = -1
rg = pRange.CellRange
da = rg.getDataArray()
uR = Ubound(da)
Dim args(0) As New com.sun.star.beans.PropertyValue
args(0).Name = "Hidden"
args(0).Value = True
hDoc = StarDesktop.loadComponentFromURL("private:factory/scalc", "_blank", 0, args)
hSheet = hDoc.Sheets(0)
hRg = hSheet.getCellRangeByPosition(0, 0, 0, uR)
hRg.setDataArray(da)
fd = hRg.CreateFilterDescriptor(True)
Dim ffs(0) As New com.sun.star.sheet.TableFilterField
ffs(0).IsNumeric = False
With fd
  .SkipDuplicates = True
  .CopyOutputData = True
  tRa = .OutputPosition
  tRa.Sheet = 0
  tRa.Column = 1
  tRa.Row = 0
  .OutputPosition = tRa
  .FilterFields = ffs
  .FilterFields(0).IsNumeric = False
End With
hRg.Filter(fd)
hRg.clearContents(7)
cc = hSheet.createCursor()
cc.gotoStartOfUsedArea(False)
cc.gotoEndOfUsedArea(True)
countDistinct1D = cc.RangeAddress.EndRow + 1
hDoc.Close(True)
fail:
End Function

(It’s actually tested!)
[/Edit]

Tass · June 9, 2021, 3:23am

I just checked, I actually had “enable wildcards in formulas” checked, which seems to be the default, because I surely never changed these settings, not understanding their consequences

Tass · June 9, 2021, 4:20am

“Also: Try to work without tricks. Counting should be done by counting (1 by 1), not by adding fractions.”

How can I do this?

eeigor · June 9, 2021, 5:44am

I gave all the options above.
Including speed. Thus, I answered a question that was not asked, but “was in the air”

Tass · June 9, 2021, 6:13am

Actually, the Frequency/Match approach is pretty good as it doesn’t come up with decimals but integers. It came up with a different result (1 item less) on a list of 66,000 entries. I use this now. In terms of speed, I notice no difference though, I can go and make coffee in both cases before the process is finished.

Lupp · June 9, 2021, 11:11am

Be cautious:
The working of MATCH() also basically depends on the setting concerning “Wildcards” or “Regular expressions” or “Neither nor” (radio biuttons).
All the solutions suggested by @eeigor use a function of that famiily.
If somebody worries about the question if a result of any “countdistinct” formula is correct, it’s necesary to use the standard filter without any criterion, but wit No duplicates enabled under the options. There also is an option to output the filtered result elsewhere (given by the top-left cell of a range assured to be empty in advance - otherwise overwriting will occur). The results there are all unique now, and can simply be counted using COUNTA().
For specifically interesd users I will append to my answer a user Function based on ideas probably looking strange, but providing te fastest solution I can offer.