Fast way to count duplicates in 30000 rows (LibreOffice Calc)

It does 30000 × 30000 = 900,000,000 cell comparisons (plus it stores the counts in an array of 30000 elements, compares each element against >1, and then sums those 30000 results, which should be negligible compared to the 900,000,000). It should take some time, yes.
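
The exact formula is not quoted in this thread, but from that description it is presumably of a shape like =SUMPRODUCT(COUNTIF(F2:F30001;F2:F30001)>1). A minimal Basic sketch of the equivalent brute-force work (the range name is an assumption) makes the cost visible:

    Sub CountDuplicatesBruteForce
        ' Brute-force sketch of what the SUMPRODUCT/COUNTIF combination computes.
        ' Range F2:F30001 is an assumption carried over from the posts above.
        Dim oRange As Object, aData As Variant
        Dim i As Long, j As Long, nCount As Long, nDups As Long
        oRange = ThisComponent.Sheets.getByIndex(0).getCellRangeByName("F2:F30001")
        aData = oRange.getDataArray()
        For i = 0 To UBound(aData)         ' 30000 outer iterations ...
            nCount = 0
            For j = 0 To UBound(aData)     ' ... times 30000 inner = 900 000 000
                If aData(i)(0) = aData(j)(0) Then nCount = nCount + 1
            Next j
            If nCount > 1 Then nDups = nDups + 1   ' value occurs more than once
        Next i
        MsgBox nDups & " cells hold a value that occurs more than once."
    End Sub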

Just wanted to note that it doesn’t need to be in each row :) - by the way, the variant with separate COUNTIFs in each row plus one SUMPRODUCT takes about the same time.

Hi! Just to clarify, I calculated that formula only once. But it is recalculated after each newly added entry, which is very annoying!

erAck and Mike Kaganski, I agree that there is some computational effort involved. But the “conditional formatting” updates the cells almost instantaneously when new entries are added (the cells are identified and formatted!).

The underlying question is how to replicate this performance!

I regret to insist on the same solution, but do you know how to access the background color property of a cell modified via conditional formatting?

CF is only applied to cells in the view.
It does not write anything to the cells’ properties. That’s why it can be fast.
(Roughly 50 visible rows × 30000 comparisons = 1,500,000 only.)
There is no use in trying to retrieve a colour that is only overlaid on small areas of the visible cell grid.

The initial calculation time for creating =COUNTIF(F$2:F$30001;F2) in each row of column G is high as well, but further changes in column F should be slightly faster, and if the cell doing the SUMPRODUCT over all of them is not in the view, even a little faster. I tried it: with the original SUMPRODUCT approach, changing a cell takes ~50 s; with separate COUNTIFs, ~13 s; with the SUMPRODUCT displayed as well, ~24 s, as that has to recalculate every row’s COUNTIF.

Just a tip, if it is allowed to sort the column where you want to count duplicates. Assume the data is in the A2:A30000 range and the first row is the column label. You can sort the data column in ascending order, then in the next column B enter 0 in B2, enter the formula =IF(A3=A2;1;0) in B3 and fill this formula down to the end of the data. In C2 enter the formula =SUM(B:B) and you get the duplicate count. It took about 2-3 seconds to calculate.
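
A quick worked example with sample data of my own: if the sorted A2:A6 contains apple, apple, banana, cherry, cherry, then column B becomes 0, 1, 0, 0, 1 and =SUM(B:B) returns 2. Note that this counts repeated occurrences (each cell equal to the one above it), not distinct duplicated values.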

I did that at the beginning of my “search-for-an-answer saga”, but your first statement, about sorting the column, is not an option!
BUT… I would consider a macro-based solution that receives the sorted array in a variable (without changing the original range), so I could loop through it searching for contiguous/adjacent duplicates.

Trying a solution via BASIC: a sorting algorithm (quicksort) was used, and I looped through the sorted array searching for duplicates.
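
For reference, a minimal Basic sketch of that approach; the range name is an assumption carried over from earlier posts, and a short shell sort stands in for the quicksort to keep the listing compact:

    Sub CountAdjacentDuplicates
        ' Copy the column into an array, sort the copy (the sheet is untouched),
        ' then count entries equal to their predecessor. Values assumed to be text.
        Dim oRange As Object, aData As Variant, a() As String
        Dim i As Long, j As Long, nGap As Long, nDups As Long, sTmp As String
        oRange = ThisComponent.Sheets.getByIndex(0).getCellRangeByName("F2:F30001")
        aData = oRange.getDataArray()
        ReDim a(UBound(aData))
        For i = 0 To UBound(aData)
            a(i) = aData(i)(0)
        Next i
        nGap = (UBound(a) + 1) \ 2          ' shell sort the copy
        Do While nGap > 0
            For i = nGap To UBound(a)
                sTmp = a(i)
                j = i
                Do While j >= nGap
                    If a(j - nGap) > sTmp Then
                        a(j) = a(j - nGap)
                        j = j - nGap
                    Else
                        Exit Do
                    End If
                Loop
                a(j) = sTmp
            Next i
            nGap = nGap \ 2
        Loop
        For i = 1 To UBound(a)              ' adjacent equal entries are duplicates
            If a(i) = a(i - 1) Then nDups = nDups + 1
        Next i
        MsgBox nDups & " duplicate entries found."
    End Sub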

It works, but it is still not a good solution! The whole process is far from the response time achieved by “conditional formatting”!

I doubt that Conditional Formatting applies rules to the whole range at once. Using a 100k-row range, trying to fill down a helper column with a COUNTIF function led to a calculation about 3 minutes long. The same range with the COUNTIF function applied as Conditional Formatting completes immediately, but navigation gets much slower. Can it be that Calc applies Conditional Formatting only to the currently visible cells?

Also, Conditional Formatting does not change the CellStyle property of the Cell object. In my test file it remained “Default”, no matter whether Conditional Formatting was applied or not. It looks like Conditional Formatting is an additional layer for viewing, not a tool that performs modifications.
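
This is easy to verify from Basic (the cell address is an assumption): reading the CellStyle property of a cell inside the CF range returns only the applied named style, never anything the CF overlay shows:

    Sub ShowCellStyle
        Dim oCell As Object
        oCell = ThisComponent.Sheets.getByIndex(0).getCellByPosition(0, 1)  ' A2
        ' Reports "Default" even while conditional formatting highlights the cell
        MsgBox "CellStyle: " & oCell.CellStyle
    End Sub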

Thank you for taking the time and effort to deal with this problem.

In order to verify how fast the “conditional formatting” response was, I split the screen horizontally, so the ranges A2:A15 and A29985:A30001 were visible at the same time.
I changed cells A7 and A29990 to the same string, and immediately the duplicated input was detected and the cells were formatted. I confirm that the style remained “Default”! There is no way to access the applied style!

Yes, conditional formatting is calculated for the current view portion(s) only. That’s why it appears faster at first sight, but it may slow down scrolling or changing the view.

And yes, it is an additional layer. Currently there are three cell style layers, from lowest to highest priority: 1) the applied cell style, 2) conditional formatting, 3) hard formatting attributes; the highest-priority layer that sets an attribute wins.
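
A small Basic illustration of layers 1 and 3 (the cell address and the “Good” style name are assumptions; layer 2, CF, sits between them in the view only and leaves both properties untouched):

    Sub ShowStyleLayers
        Dim oCell As Object
        oCell = ThisComponent.Sheets.getByIndex(0).getCellByPosition(0, 0)  ' A1
        oCell.CellStyle = "Good"                ' layer 1: applied cell style
        oCell.CellBackColor = RGB(255, 255, 0)  ' layer 3: hard attribute, wins
        ' Both coexist; the hard attribute overrides the style's background
        MsgBox oCell.CellStyle & " / " & Hex(oCell.CellBackColor)
    End Sub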

A few words from me here: Calc also has some DB features, e.g. the Pivot Table.
Look into this demo.

=== Edit1 2018-11-06 13:30 +1:00 ===
Coming back after some experiments:
It’s a pity!
The creation of my PT (Pivot Table) for 30000 words needed about 1 s, and originally a refresh took about the same time.
After some enhancements to the document (realistic ones), the time went up to 25 s in some cases, and to 52 s in others.
I have not yet researched how much time a “macro” would need to delete the old PT and create a new one in its place. I will probably try that if I find the time.

=== Edit2 2018-11-08 12:20 +1:00 ===
The actually interesting question in this topic was only mentioned in passing by the OP:
Can we use the mechanisms of CF to get the result faster, by looking for the “applied” conditional attributes?
My answer to this sub-question was first given in a comment on the answer by @erAck. It has three parts:
-1- The attributes resulting from CF are not applied to the cells but overlaid on their areas in the view.
-2- Accordingly, they are only evaluated for cells lying in the view.
-3- This is the reason why CF can be fast even if the formulae are rather time-complex.
To help you retrace these facts I made this demo.

Yesterday I made a test file for this topic: 50k rows (a single column), a macro to create a new sheet to work with a copy of the data, a macro to sort it (sorting is not needed if we are sure there are no empty cells in the data range), a macro to filter with no duplicates, and a count of the difference between the sorted and filtered ranges - the result is 3-5 seconds on a pretty old machine.
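
A rough Basic sketch of that filter-based counting, for orientation only: the sheet index, the 50k range, and the assumption that an otherwise empty filter descriptor with SkipDuplicates set is enough to hide the repeated rows are all mine, not the poster’s:

    Sub CountDuplicatesViaFilter
        Dim oRange As Object, oDesc As Object, oVisible As Object
        Dim aAddr As Variant, k As Long, nTotal As Long, nUnique As Long
        oRange = ThisComponent.Sheets.getByIndex(0).getCellRangeByName("A2:A50001")
        nTotal = oRange.Rows.Count
        oDesc = oRange.createFilterDescriptor(True)   ' fresh, empty descriptor
        oDesc.SkipDuplicates = True                   ' hide rows repeating a value
        oRange.filter(oDesc)
        oVisible = oRange.queryVisibleCells()         ' what the filter left visible
        aAddr = oVisible.RangeAddresses
        For k = 0 To UBound(aAddr)
            nUnique = nUnique + aAddr(k).EndRow - aAddr(k).StartRow + 1
        Next k
        oRange.filter(oRange.createFilterDescriptor(True))  ' remove the filter again
        MsgBox (nTotal - nUnique) & " duplicate rows (total minus unique)."
    End Sub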

Well, I can understand that only cells in the “visible” area are formatted, but if the formatting rule is about duplicates, then when you change any cell, all the others (even those not visible) need to be compared in order to find the duplicates. And the same is true if you use some functions over the whole range (e.g., SUM). It is a pity that all this performance is not available!

You may need to reconsider the topic.
The formulae used in CF for a cell are evaluated in exactly the same way as formulae contained in cells, and, of course, all the references are evaluated whether or not the referenced cells are in the current view.
The difference is that the formulae used in CF are only evaluated for cells currently in the view.
The “all this performance” you dream of does not exist.

Just to throw all the options into the mix for people with similar but perhaps not identical questions…

I don’t think you will like the following methods because they sort the rows, and you mentioned that’s not an option for you.

Without going into details, both Pivot Tables and Subtotals under the Data menu can each serve to identify duplicates. Subtotals in particular is quite speedy. I created a table with 8 columns and over 100,000 rows, all with randomly generated content, selected everything and did Data -> Subtotals. In the Subtotals dialog, I set Group by and the Calculate subtotals for checkbox to the same column name. Under Use function I chose Count.

The results of the Subtotals tool can be a little strange to maneuver, but it’s not hard to get used to. It visually puts a total line under each repeating section of values. With Count instead of Sum, it gives the quantity of duplicates. If you click the small “2” near the top left of the cells, it will collapse all duplicate groups and show you the counts of each. You can click the + to the left of each row group to expand/collapse the duplicate value contained in each group.
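
For the macro-inclined, the same Subtotals setup can be scripted. A hedged sketch in Basic, where the sheet index, the range, and the range-relative column indexing are my assumptions:

    Sub ApplySubtotalsCount
        Dim oRange As Object, oDesc As Object
        Dim aCols(0) As New com.sun.star.sheet.SubTotalColumn
        ' Header row included; 8 columns x 100000 data rows as described above
        oRange = ThisComponent.Sheets.getByIndex(0).getCellRangeByName("A1:H100001")
        oDesc = oRange.createSubTotalDescriptor(True)
        aCols(0).Column = 0                  ' subtotal over the first range column
        aCols(0).Function = com.sun.star.sheet.GeneralFunction.COUNT
        oDesc.addNew(aCols(), 0)             ' group by the first range column
        oRange.applySubTotals(oDesc, True)   ' True: replace any previous subtotals
    End Sub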

Again, these suggestions are probably not solutions for your particular situation, but all the answers given by everyone will probably help others with slightly varying scenarios.

Pivot tables act in a similar way by grouping repeated items, but they are a bit more complicated to explain.

Honestly though, I agree with those who say a database is better suited for large data sets. SQL statements can do all sorts of calculations and queries to present data in many forms without changing the underlying data. Finding duplicates in huge data sets is quite fast and easy with queries. But I totally get it when you say spreadsheets are your paradigm. No one can know what your use case is without stepping into your office and being shown what you are up to. Spreadsheets definitely have a place. However, they are also risky. Data loss and crashing software are more likely to happen with spreadsheets than with a decent database.

With all these experiments, I’m sure you know to experiment on a copy of your master file, not the master. If you kill your master spreadsheet or manipulate it in an unintentional way, you’re toast. Since SELECT SQL statements don’t manipulate database data, it’s possible to do tons of querying without risking data loss. Plus, data can be written by numerous users simultaneously, while other users can view it, etc. Pretty handy.

I am not sure I understand the question clearly, but I believe this will help: formulas that quickly show matches. If any match below, the row number will appear; if any match above, an “X” will appear! Check if same.ods

The OP wanted to count the overall multiplicity for each of the 30000 items (if I understood correctly).
This requires 900,000,000 comparisons (and some additional work), as @erAck correctly stated.
Reducing the comparisons to items in rows above can only give the multiplicity “up to here” (“down to here”, literally speaking). This would nonetheless require roughly 450,000,000 comparisons (30000 × 29999 / 2).