Fast way to count duplicates in 30000 rows (LibreOffice Calc)

CharlesDayan · November 3, 2018, 3:58pm

Actually, I already have a partial answer!!! Conditional formatting with “Cell value is” → “duplicate” !!! This way a check is performed for each user’s new entry in “real time”.

I need to check if duplicate entries exist in 30000 rows of a column (any value, but not blanks!) . I would like to keep track of how many duplicates during the filling process.

Ok, conditional formatting is a very effective visual indication and fast anough for my needs, but as I am not able to perform a loop to check the color of the cells (found some people against this approach!! Would be so easy! ) I need to find an alternative way to count the duplicates (as a whole, no need to identify how many for each case!).

I tryed the formula:

=SUMPRODUCT((COUNTIF(F2:F30001;$F$2:$F$30001)>1))

It works, but it takes two minutes to finish!!!

If you want to replicate my case. My 30000 entries are formatted as: letter “A” and numbers between 100000 and 999999, e.g., A354125, A214547, etc. Copy as text the result of “=CONCATENATE(“A”;RANDBETWEEN(100000;999999))” to save time.

Thanks!

PS: Does anybody know the algorithm used to find the duplicates in conditional formatting (it is fast)?

A macro solution is not the best, but is acceptable!

EasyTrieve · November 3, 2018, 6:55pm

Humm, I think here’s the rough algo. that will fix this: Convert your spreadsheet into a data base. Someone please correct me if I’m wrong, but I think 30k lines in a spreadsheet is not what spreadsheets were ever designed for. 30k lines in a database is medium fry, as tables can go from 1 to billions of lines. With databases you get fast indexes which can answer the kinds of questions you are seeking to answer. And you get various tools better designed to deal with large datasets.

CharlesDayan · November 3, 2018, 10:26pm

Thanks for your point, but it’s a true spreadsheet! Most of it is dedicated to perform calculations!

The “question” was already answered by the “conditional formatting”, but I don’t know how to access the properties to count the modified cells!

PS: About the 30k “limit”… My spreadsheets with 30k lines are the small ones! And we have 1048576 lines available for each of the 1024 columns, per sheet!!

Remember that for some time 256 columns was the limit for Excel! But these days are gone!

Lupp · November 6, 2018, 11:41am

@EasyTrieve: There are miracles.
I experimented a bit with 30000 randomly generated three-letter-words.
The creation of a pivot table (returning the di8stinct words, sorting them on the fly, and counting the multiplicities) needed less than 1 s. Similar time for a refresh.
II then added right of the original data column in the first data row a VLOOKUP() into the pivot-table results for the multiplicity of the first word. Having done so, I filled down by doubleclick. Time needed: About 2 s

Lupp · November 6, 2018, 11:44am

@EasyTrieve continued:
(The time given for filling down included the calculation and the delivering of the results, of course.)
Having done as described above, I ordered the pivot table to refresh: Needed > 20 s.
I also ordered a recalculation for the columnn of looked-up multiplicities: Needed > 20 s.
What little devile intervened?

Lupp · November 6, 2018, 11:50am

@CharlesDayan: please read my comment concerning CF on the answer by @erAck .

EasyTrieve · November 7, 2018, 9:30pm

@Lupp Thanks. Very interesting. Guess so long as it fits in available memory and never hits the disk it’s pretty fast. Also not too much time spent on garbage collection either, thought that could make it more unpredictable. Very interesting. Thank you.

CharlesDayan · November 8, 2018, 6:32pm

I really appreciate all the effort around this problem .The examples and the clever ideas ("rows-above comparison!) were very useful.
I implemented some of then and I think that the future user will need to deal with some delay or I will perfom the duplicates checking before saving the file. Up to 10000 entries the solutions were ok. But sometimes the program kind of updates the calculations and hangs a bit.

Thanks again …

PS: inform if you find a way to access styles changed by CF!

erAck · November 3, 2018, 6:39pm

You are not by chance calculating the formula =SUMPRODUCT((COUNTIF(F2:F30001;$F$2:$F$30001)>1)) in each row from row 2 to row 30001? That wouldn’t make sense but could explain why it would take 2 minutes because for each row the entire range is to be evaluated against the entire range.

Either calculate that formula only once to get the overall duplicates count, or if you want the duplicates count of each value in a row then use =COUNTIF(F$2:F$30001;F2) in row 2 and pull or copy&paste that formula cell down to row 30001 (let’s say in column G). You can then calculate the overall duplicates in one cell with =SUMPRODUCT(G2:G30001>1)

mikekaganski · November 3, 2018, 6:45pm

Well - I confirm that the single =SUMPRODUCT((COUNTIF(F2:F30001;$F$2:$F$30001)>1)) for the data described by OP in the question does take long (~55 s on my system)…

erAck · November 3, 2018, 7:01pm

It does 30000*30000=900000000 cell comparisons (plus stores them in an array of 30000 and then compares each element for >1 and then sums the 30000 results of that, which should be neglectable compared to the 900000000). It should take some time, yes.

mikekaganski · November 3, 2018, 7:26pm

Just wanted to note that it doesn’t need to be in each row - by the way, the variant with separate COUNTIFs in each row + one SUMPRODUCT also takes ~that time.

CharlesDayan · November 3, 2018, 10:06pm

Hi! Just to clarify, I calculated that formula only once. But it is recalculated after each new added entry!! What is very annoying!

CharlesDayan · November 3, 2018, 11:21pm

erAck and Mike Kaganski, I agree that there is some computational effort envolved. But the “conditional formatting” updates the cells almost instantaneously when new entries are added (cells are identified and formatted!!).

The underlying question is how to replicate this performance!

I regret to insist in the same solution, but, do you know how to access the background color property of a cell modified via conditional formating?

Lupp · November 6, 2018, 10:01am

CF only is applied to cells in the view.
It does not write anything to the cells’ properties. That’s why it can be fast.
(50 * 30000 = 1 500 000 only.)
No use in trying to retrieve a colour applied to little areas in the visible cell-grid.

erAck · November 9, 2018, 5:08pm

The initial calculation time creating =COUNTIF(F$2:F$30001;F2) in each row of column G is high as well, but further changes in column F should be slightly faster, and if the cell doing the SUMPRODUCT over all is not in the view even a little faster. Tried and using the original SUMPRODUCT approach changing a cell takes ~50s, using separate COUNTIF ~13s, with displaying the SUMPRODUCT along ~24s as that has to recalculate every row’s COUNTIF.

SM_Riga · November 3, 2018, 7:33pm

Just a tip, if it is allowed to sort the column where you want to count duplicates. Assume the data is in A2:A30000 range , first row is column label. You can sort data column ascendingly, then in the next B column in B2 enter 0, in B3 enter formula =IF(A3=A2;1;0) and fill down this formula till end of data. In C2 enter formula =SUM(B:B) and you get duplicates count. Took about 2-3 seconds to calculate.

CharlesDayan · November 3, 2018, 10:37pm

I did that in the beginning of my “search-for-an-answer-saga”, but your first statment, about sorting the column, is not an option!
BUT… I would consider a macro based solution that receives in a variable the sorted array (without changing the original range). So I could loop searching for contiguous/adjacent duplicates.

CharlesDayan · November 6, 2018, 5:07am

Trying a solution via BASIC. A sorting algorithm was used (quick sort) and I looped searching for duplicates.

Works, but this is not a good solution, though! The whole process is still far from the response time required by “conditional formatting”!

SM_Riga · November 6, 2018, 8:50am

I doubt that Conditional Formatting applies rules to the whole range at once. Using 100k rows range and trying to fill down helper column with COUNTIF function led to about 3 minutes long calculation. Same range with COUNTIF function applied as Conditional Formatting completes immediately, but navigation get smuch slower. Can it be that Calc applies Conditional Formatting only for currently visible cells?