# Fast way to count duplicates in 30000 rows (LibreOffice Calc)

Actually, I already have a partial answer!!! Conditional formatting with "Cell value is" -> "duplicate" !!! This way a check is performed for each user's new entry in "real time".

I need to check if duplicate entries exist in 30000 rows of a column (any value, but not blanks!) . I would like to keep track of how many duplicates during the filling process.

Ok, conditional formatting is a very effective visual indication and fast anough for my needs, but as I am not able to perform a loop to check the color of the cells (found some people against this approach!! Would be so easy! ) I need to find an alternative way to count the duplicates (as a whole, no need to identify how many for each case!).

I tryed the formula:

=SUMPRODUCT((COUNTIF(F2:F30001;$F$2:$F$30001)>1))

It works, but it takes two minutes to finish!!!

If you want to replicate my case. My 30000 entries are formatted as: letter "A" and numbers between 100000 and 999999, e.g., A354125, A214547, etc. Copy as text the result of "=CONCATENATE("A";RANDBETWEEN(100000;999999))" to save time.

Thanks!

PS: Does anybody know the algorithm used to find the duplicates in conditional formatting (it is fast)?

A macro solution is not the best, but is acceptable! ;)

Humm, I think here's the rough algo. that will fix this: Convert your spreadsheet into a data base. Someone please correct me if I'm wrong, but I think 30k lines in a spreadsheet is not what spreadsheets were ever designed for. 30k lines in a database is medium fry, as tables can go from 1 to billions of lines. With databases you get fast indexes which can answer the kinds of questions you are seeking to answer. And you get various tools better designed to deal with large datasets.

Thanks for your point, but it's a true spreadsheet! Most of it is dedicated to perform calculations!

The "question" was already answered by the "conditional formatting", but I don't know how to access the properties to count the modified cells!

PS: About the 30k "limit"... My spreadsheets with 30k lines are the small ones! And we have 1048576 lines available for each of the 1024 columns, per sheet!! ;)

Remember that for some time 256 columns was the limit for Excel! But these days are gone!

@EasyTrieve: There are miracles.

I experimented a bit with 30000 randomly generated three-letter-words.

The creation of a pivot table (returning the di8stinct words, sorting them on the fly, and counting the multiplicities) needed less than 1 s. Similar time for a refresh.

II then added right of the original data column in the first data row a VLOOKUP() into the pivot-table results for the multiplicity of the first word. Having done so, I filled down by doubleclick. Time needed: About 2 s

@EasyTrieve continued:

(The time given for filling down included the calculation and the delivering of the results, of course.)

Having done as described above, I ordered the pivot table to refresh: Needed > 20 s.

I also ordered a recalculation for the columnn of looked-up multiplicities: Needed > 20 s.

What little devile intervened?@CharlesDayan: please read my comment concerning CF on the answer by @erAck .

@Lupp Thanks. Very interesting. Guess so long as it fits in available memory and never hits the disk it's pretty fast. Also not too much time spent on garbage collection either, thought that could make it more unpredictable. Very interesting. Thank you.

I really appreciate all the effort around this problem .The examples and the clever ideas ("rows-above comparison!) were very useful. I implemented some of then and I think that the future user will need to deal with some delay or I will perfom the duplicates checking before saving the file. Up to 10000 entries the solutions were ok. But sometimes the program kind of updates the calculations and hangs a bit.

Thanks again ... :)

PS: inform if you find a way to access styles changed by CF! ;)