Ask Your Question
0

Fast way to count duplicates in 30000 rows (LibreOffice Calc)

asked 2018-11-03 16:58:50 +0100

CharlesDayan gravatar image

Actually, I already have a partial answer!!! Conditional formatting with "Cell value is" -> "duplicate" !!! This way a check is performed for each user's new entry in "real time".

I need to check if duplicate entries exist in 30000 rows of a column (any value, but not blanks!) . I would like to keep track of how many duplicates during the filling process.

Ok, conditional formatting is a very effective visual indication and fast anough for my needs, but as I am not able to perform a loop to check the color of the cells (found some people against this approach!! Would be so easy! ) I need to find an alternative way to count the duplicates (as a whole, no need to identify how many for each case!).

I tryed the formula:

=SUMPRODUCT((COUNTIF(F2:F30001;$F$2:$F$30001)>1))

It works, but it takes two minutes to finish!!!

If you want to replicate my case. My 30000 entries are formatted as: letter "A" and numbers between 100000 and 999999, e.g., A354125, A214547, etc. Copy as text the result of "=CONCATENATE("A";RANDBETWEEN(100000;999999))" to save time.

Thanks!

PS: Does anybody know the algorithm used to find the duplicates in conditional formatting (it is fast)?

A macro solution is not the best, but is acceptable! ;)

edit retag flag offensive close merge delete

Comments

1

Humm, I think here's the rough algo. that will fix this: Convert your spreadsheet into a data base. Someone please correct me if I'm wrong, but I think 30k lines in a spreadsheet is not what spreadsheets were ever designed for. 30k lines in a database is medium fry, as tables can go from 1 to billions of lines. With databases you get fast indexes which can answer the kinds of questions you are seeking to answer. And you get various tools better designed to deal with large datasets.

EasyTrieve gravatar imageEasyTrieve ( 2018-11-03 19:55:23 +0100 )edit

Thanks for your point, but it's a true spreadsheet! Most of it is dedicated to perform calculations!

The "question" was already answered by the "conditional formatting", but I don't know how to access the properties to count the modified cells!

PS: About the 30k "limit"... My spreadsheets with 30k lines are the small ones! And we have 1048576 lines available for each of the 1024 columns, per sheet!! ;)

Remember that for some time 256 columns was the limit for Excel! But these days are gone!

CharlesDayan gravatar imageCharlesDayan ( 2018-11-03 23:26:16 +0100 )edit

@EasyTrieve: There are miracles.
I experimented a bit with 30000 randomly generated three-letter-words.
The creation of a pivot table (returning the di8stinct words, sorting them on the fly, and counting the multiplicities) needed less than 1 s. Similar time for a refresh.
II then added right of the original data column in the first data row a VLOOKUP() into the pivot-table results for the multiplicity of the first word. Having done so, I filled down by doubleclick. Time needed: About 2 s

Lupp gravatar imageLupp ( 2018-11-06 12:41:49 +0100 )edit

@EasyTrieve continued:
(The time given for filling down included the calculation and the delivering of the results, of course.)
Having done as described above, I ordered the pivot table to refresh: Needed > 20 s.
I also ordered a recalculation for the columnn of looked-up multiplicities: Needed > 20 s.
What little devile intervened?

Lupp gravatar imageLupp ( 2018-11-06 12:44:51 +0100 )edit

@CharlesDayan: please read my comment concerning CF on the answer by @erAck .

Lupp gravatar imageLupp ( 2018-11-06 12:50:03 +0100 )edit

@Lupp Thanks. Very interesting. Guess so long as it fits in available memory and never hits the disk it's pretty fast. Also not too much time spent on garbage collection either, thought that could make it more unpredictable. Very interesting. Thank you.

EasyTrieve gravatar imageEasyTrieve ( 2018-11-07 22:30:02 +0100 )edit

I really appreciate all the effort around this problem .The examples and the clever ideas ("rows-above comparison!) were very useful. I implemented some of then and I think that the future user will need to deal with some delay or I will perfom the duplicates checking before saving the file. Up to 10000 entries the solutions were ok. But sometimes the program kind of updates the calculations and hangs a bit.

Thanks again ... :)

PS: inform if you find a way to access styles changed by CF! ;)

CharlesDayan gravatar imageCharlesDayan ( 2018-11-08 19:32:05 +0100 )edit

6 Answers

Sort by » oldest newest most voted
0

answered 2018-11-06 13:29:09 +0100

I do not think I am understanding the question clearly, but I believe this will help. Formulas to quickly show if any match below then row number will appear, or if any match above, then an "X" will appear! C:\fakepath\Check if same.ods

edit flag offensive delete link more

Comments

The OP wanted to count the overall multipliciy for each item of 30000 (if I understood correctly).
This requires 900 000 000 comparisons (and some additional work) as @erAck stated correctly.
Reducing the comparisons to items in rows above can only give the multiplicity "up to here" ("down to here", literally spoken). This would require roughly 450 000 000 comparisons, nonetheless.

Lupp gravatar imageLupp ( 2018-11-06 14:29:02 +0100 )edit
0

answered 2018-11-06 11:58:47 +0100

updated 2018-11-06 12:01:36 +0100

Just to throw all the options into the mix for people with similar but perhaps not identical questions...

I don't think you will like the following methods because they sort the rows, and you mentioned that's not an option for you.

Without going into details, both Pivot Tables and Subtotals under the Data menu can each serve to identify duplicates. Subtotals in particular is quite speedy. I created a table with 8 columns and over 100,000 rows, all with randomly generated content. Selected everything and did Data -> Subtotals. In subtotals dialog, Group by and Calculate subtotals for checkbox on the same column name. Under Use functions I chose Count.

The results of the Subtotals tool can be a little strange to maneuver, but it's not hard to get used to. It visually puts a total line under each repeating section of values. If Count instead of Sum, it will give the quantity of duplicates. If you click on the wee "2" near the top left of the cells, it will compact all duplicate groups and show you the counts of each. You can click the + left of each row group to expand/contract what duplicate value is contained in each group.

Again, these suggestions are probably not solutions for your particular situation, but all the answers given by everyone will probably help others with slightly varying scenarios.

Pivot tables act in a similar way by grouping repeated items, but they are a bit more complicated to explain.

Honestly though, I agree with those who say a database is better suited for large data sets. SQL statements can do all sorts of calculations and queries to present data in many presentations without changing the underlying data. Finding duplicates in huge data is quite fast and easy with queries. But I totally get it when you say spreadsheets are your paradigm. No one can know what your use case is without stepping into your office and being shown what you are up to. Spreadsheets definitely have a place. However, they are also risky. Data loss and crashing software are more likely to happen with spreadsheets than with a decent database.

With all these experiments, I'm sure you know to experiment on a copy of your master file, not the master. If you kill your master spreadsheet or manipulate it in an unintentional way, your toast. Since SELECT SQL statements don't manipulate database data, it's possible to tons of querying without risking data loss. Plus, data can be written by numerous users simultaneously, while other users can view it, etc. Pretty handy.

edit flag offensive delete link more
0

answered 2018-11-03 19:39:00 +0100

erAck gravatar image

updated 2018-11-03 19:39:43 +0100

You are not by chance calculating the formula =SUMPRODUCT((COUNTIF(F2:F30001;$F$2:$F$30001)>1)) in each row from row 2 to row 30001? That wouldn't make sense but could explain why it would take 2 minutes because for each row the entire range is to be evaluated against the entire range.

Either calculate that formula only once to get the overall duplicates count, or if you want the duplicates count of each value in a row then use =COUNTIF(F$2:F$30001;F2) in row 2 and pull or copy&paste that formula cell down to row 30001 (let's say in column G). You can then calculate the overall duplicates in one cell with =SUMPRODUCT(G2:G30001>1)

edit flag offensive delete link more

Comments

Well - I confirm that the single =SUMPRODUCT((COUNTIF(F2:F30001;$F$2:$F$30001)>1)) for the data described by OP in the question does take long (~55 s on my system)...

Mike Kaganski gravatar imageMike Kaganski ( 2018-11-03 19:45:26 +0100 )edit

It does 30000*30000=900000000 cell comparisons (plus stores them in an array of 30000 and then compares each element for >1 and then sums the 30000 results of that, which should be neglectable compared to the 900000000). It should take some time, yes.

erAck gravatar imageerAck ( 2018-11-03 20:01:13 +0100 )edit

Just wanted to note that it doesn't need to be in each row :-) - by the way, the variant with separate COUNTIFs in each row + one SUMPRODUCT also takes ~that time.

Mike Kaganski gravatar imageMike Kaganski ( 2018-11-03 20:26:35 +0100 )edit

Hi! Just to clarify, I calculated that formula only once. But it is recalculated after each new added entry!! What is very annoying!

CharlesDayan gravatar imageCharlesDayan ( 2018-11-03 23:06:20 +0100 )edit

erAck and Mike Kaganski, I agree that there is some computational effort envolved. But the "conditional formatting" updates the cells almost instantaneously when new entries are added (cells are identified and formatted!!).

The underlying question is how to replicate this performance!

I regret to insist in the same solution, but, do you know how to access the background color property of a cell modified via conditional formating?

CharlesDayan gravatar imageCharlesDayan ( 2018-11-04 00:21:23 +0100 )edit

CF only is applied to cells in the view.
It does not write anything to the cells' properties. That's why it can be fast.
(50 * 30000 = 1 500 000 only.) No use in trying to retrieve a colour applied to little areas in the visible cell-grid.

Lupp gravatar imageLupp ( 2018-11-06 11:01:41 +0100 )edit

The initial calculation time creating =COUNTIF(F$2:F$30001;F2) in each row of column G is high as well, but further changes in column F should be slightly faster, and if the cell doing the SUMPRODUCT over all is not in the view even a little faster. Tried and using the original SUMPRODUCT approach changing a cell takes ~50s, using separate COUNTIF ~13s, with displaying the SUMPRODUCT along ~24s as that has to recalculate every row's COUNTIF.

erAck gravatar imageerAck ( 2018-11-09 18:08:12 +0100 )edit
0

answered 2018-11-06 06:07:16 +0100

CharlesDayan gravatar image

Trying a solution via BASIC. A sorting algorithm was used (quick sort) and I looped searching for duplicates.

Works, but this is not a good solution, though! The whole process is still far from the response time required by "conditional formatting"!

edit flag offensive delete link more

Comments

I doubt that Conditional Formatting applies rules to the whole range at once. Using 100k rows range and trying to fill down helper column with COUNTIF function led to about 3 minutes long calculation. Same range with COUNTIF function applied as Conditional Formatting completes immediately, but navigation get smuch slower. Can it be that Calc applies Conditional Formatting only for currently visible cells?

SM_Riga gravatar imageSM_Riga ( 2018-11-06 09:50:12 +0100 )edit

Also, Conditional Formatting does not change CellStyle property of the Cell object. In my test file it remained Default, no matter was Conditional Formatting applied or not. Looks like Conditional Formatting is like an additional layer for viewing, not the tool to perform modifications.

SM_Riga gravatar imageSM_Riga ( 2018-11-06 10:02:26 +0100 )edit

Thank you for taking the time and effort in dealing with this problem.

In order to verify how fast was the "conditional formatting" response, I divided horizontally the screen. So the ranges "A2:A15" and "A29985:A30001" were visible at the same time. I changed the cell A7 and A29990 to the same string and immediately the duplicated input was detected and the cells formatted. I confirm that the style remained "Default" ! No way to access the applied style!

CharlesDayan gravatar imageCharlesDayan ( 2018-11-06 12:14:16 +0100 )edit

Yes, conditional formatting is calculated for the current view portion(s) only. That's why it appears to be faster on first sight, but may slow down scrolling or changing view.

And yes, it is an additional layer. Currently there are three cell style layers, from low to high 1) the applied style, 2) conditional formatting, 3) hard format attribution; the highest in priority set wins.

erAck gravatar imageerAck ( 2018-11-07 13:37:45 +0100 )edit
0

answered 2018-11-03 20:33:46 +0100

Just a tip, if it is allowed to sort the column where you want to count duplicates. Assume the data is in A2:A30000 range , first row is column label. You can sort data column ascendingly, then in the next B column in B2 enter 0, in B3 enter formula =IF(A3=A2;1;0) and fill down this formula till end of data. In C2 enter formula =SUM(B:B) and you get duplicates count. Took about 2-3 seconds to calculate.

edit flag offensive delete link more

Comments

I did that in the beginning of my "search-for-an-answer-saga", but your first statment, about sorting the column, is not an option! BUT... I would consider a macro based solution that receives in a variable the sorted array (without changing the original range). So I could loop searching for contiguous/adjacent duplicates.

CharlesDayan gravatar imageCharlesDayan ( 2018-11-03 23:37:48 +0100 )edit
0

answered 2018-11-06 11:50:29 +0100

Lupp gravatar image

updated 2018-11-08 12:19:52 +0100

Few words here. Also Calc has some DB features.
Look into this demo.

=== Edit1 2018-11-06 13:30 +1:00 === Coming back after some experiments:
It's a pity!
The creation of my PT for 30000 words needed abou 1 s, and originaly a refresh the same time.
After some enhancments in the document (realistic ones) the time went up to 25 s ins some cases, and to 52 s in others.
I did not yet research what time a "macro" would need to delete the old PT and to create a new one in its place. I will probably try that if I find the time.

=== Edit1 2018-11-08 12:20 +1:00 ===
The actually interesting question in this topic was just spoken aside by the OP:
Can we use the programisms of CF to get the result faster by looking for the "applied" conditional attributes.?
My answer to this sub-question was given in a comment on the anwer by @erAck first. It has three parts:
-1- The attributes resulting from CF are not applied to the cells but overlaid to their areas in the view.
-2- To do so they are only evaluated for cells lying in the view.
-3- This is the reason for which CF can be fast even if the formulae are rather time-complex.
To help you to retrace these facts I made this demo.

edit flag offensive delete link more

Comments

Yesterday I made a test file for this topic: 50k rows (single column), macro to create new sheet to work with data copy, macro to sort (sorting is not needed if we are sure there are no empty cells in the data range), macro to filter with no duplicates and count difference between sorted and filtered ranges - result is 3-5 seconds on pretty old machine.

SM_Riga gravatar imageSM_Riga ( 2018-11-06 13:56:29 +0100 )edit

Well, I can understand that only cells in the "visible" area are formated, but if the formatting rule is about duplicates, if you change any cell all the others (even if they are not visible) need to be compared in order to find the duplicates. And the same is true if you use some functions over the whole range (e.g., sum ). It is a pity that all this performance is not available!

CharlesDayan gravatar imageCharlesDayan ( 2018-11-10 03:24:50 +0100 )edit

You may need to reconsider the topic.
The formulae used in CF for a cell are evaluated exactly the same way as formulae being content of the cell, and, of course, all the references are evaluated whether or not the referenced cells are in the current view.
The difference is that the formulae used in CF only are evaluated for cells currently in the view.
The "all this performance" you dream of does not exist.

Lupp gravatar imageLupp ( 2018-11-10 11:13:41 +0100 )edit
Login/Signup to Answer

Question Tools

1 follower

Stats

Asked: 2018-11-03 16:58:50 +0100

Seen: 112 times

Last updated: Nov 08