Also, Conditional Formatting does not change the CellStyle property of the Cell object. In my test file it remained “Default”, no matter whether Conditional Formatting was applied or not. It looks like Conditional Formatting is an additional layer for viewing, not a tool that performs modifications.
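For anyone who wants to repeat that check from a macro, here is a minimal sketch (a Python macro run inside LibreOffice via the UNO scripting provider; the cell addresses A7/B7 and the first-sheet assumption are mine, not from the post above). It reads the CellStyle property and writes the reported style name into a neighbouring cell, which stays the assigned style (e.g. “Default”) regardless of any Conditional Formatting overlay:

```python
# Minimal sketch: Python macro run inside LibreOffice via the UNO
# scripting provider (XSCRIPTCONTEXT is injected by that environment).
# The addresses A7 / B7 are only examples.
def show_cell_style(*args):
    doc = XSCRIPTCONTEXT.getDocument()        # the current Calc document
    sheet = doc.Sheets.getByIndex(0)          # first sheet
    cell = sheet.getCellByPosition(0, 6)      # column A, row 7 (0-based indices)
    # CellStyle only reports the *assigned* cell style; the attributes
    # overlaid by Conditional Formatting never show up here.
    sheet.getCellByPosition(1, 6).setString(cell.CellStyle)   # write the name into B7
```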
Thank you for taking the time and effort to deal with this problem.
In order to verify how fast the “conditional formatting” response was, I split the screen horizontally so that the ranges “A2:A15” and “A29985:A30001” were visible at the same time.
I changed cells A7 and A29990 to the same string, and the duplicated input was detected and the cells formatted immediately. I confirm that the style remained “Default”! There is no way to access the applied style!
Yes, conditional formatting is calculated for the current view portion(s) only. That’s why it appears to be faster at first sight, but it may slow down scrolling or changing the view.
And yes, it is an additional layer. Currently there are three cell style layers, from lowest to highest priority: 1) the applied style, 2) conditional formatting, 3) hard formatting attributes; the highest-priority layer that is set wins.
A few words here. Calc also has some DB features.
Look into this demo.
=== Edit1 2018-11-06 13:30 +1:00 ===
Coming back after some experiments:
It’s a pity!
The creation of my PT for 30000 words needed about 1 s, and originally a refresh took about the same time.
After some enhancements in the document (realistic ones) the time went up to 25 s in some cases, and to 52 s in others.
I have not yet researched how much time a “macro” would need to delete the old PT and create a new one in its place. I will probably try that if I find the time.
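For reference, just timing a refresh of the existing pivot table(s) from a macro only takes a few lines. The following is a sketch under my own assumptions (Python via UNO, first sheet, result written into an arbitrary spare cell), not the macros discussed above:

```python
import time

def refresh_pivot_tables(*args):
    doc = XSCRIPTCONTEXT.getDocument()
    sheet = doc.Sheets.getByIndex(0)          # sheet holding the pivot table(s)
    tables = sheet.DataPilotTables            # the sheet's DataPilot tables
    start = time.time()
    for name in tables.getElementNames():
        tables.getByName(name).refresh()      # recalculate this pivot table
    # report the elapsed seconds in cell F1 (an arbitrary spare cell)
    sheet.getCellByPosition(5, 0).setValue(time.time() - start)
```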
=== Edit2 2018-11-08 12:20 +1:00 ===
The actually interesting question in this topic was only mentioned in passing by the OP:
Can we use the internal mechanisms of CF to get the result faster by looking for the “applied” conditional attributes?
My answer to this sub-question was first given in a comment on the answer by @erAck. It has three parts:
-1- The attributes resulting from CF are not applied to the cells but overlaid on their areas in the view.
-2- For this purpose they are only evaluated for cells lying in the view.
-3- This is the reason why CF can be fast even if the formulae are rather time-complex.
To help you retrace these facts, I made this demo.
Yesterday I made a test file for this topic: 50k rows (single column), a macro to create a new sheet to work with a copy of the data, a macro to sort (sorting is not needed if we are sure there are no empty cells in the data range), and a macro to filter with no duplicates and count the difference between the sorted and filtered ranges. The result is 3-5 seconds on a pretty old machine.
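As an illustration of the same idea (not the exact macros from that test file): the sketch below pulls the column into the macro with getDataArray() and counts duplicates with a Python set, which yields the same number as comparing the sorted range against the duplicate-filtered one. The range, sheet index and output cell are assumptions for the example:

```python
def count_duplicates(*args):
    doc = XSCRIPTCONTEXT.getDocument()
    sheet = doc.Sheets.getByIndex(0)
    # Example range: 50k single-column entries starting at A2.
    data = sheet.getCellRangeByName("A2:A50001").getDataArray()
    values = [row[0] for row in data if row[0] != ""]    # skip empty cells
    duplicates = len(values) - len(set(values))          # total minus distinct
    sheet.getCellByPosition(2, 0).setValue(duplicates)   # report the count in C1
```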
Well, I can understand that only cells in the “visible” area are formatted, but if the formatting rule is about duplicates, then changing any cell means all the others (even those not visible) need to be compared in order to find the duplicates. The same is true if you use some functions over the whole range (e.g., SUM). It is a pity that all this performance is not available!
You may need to reconsider the topic.
The formulae used in CF for a cell are evaluated exactly the same way as formulae that are the content of a cell, and, of course, all the references are evaluated whether or not the referenced cells are in the current view.
The difference is that the formulae used in CF are only evaluated for cells currently in the view.
The “all this performance” you dream of does not exist.
Just to throw all the options into the mix for people with similar but perhaps not identical questions…
I don’t think you will like the following methods because they sort the rows, and you mentioned that’s not an option for you.
Without going into details, both Pivot Tables and Subtotals under the Data menu can each serve to identify duplicates. Subtotals in particular is quite speedy. I created a table with 8 columns and over 100,000 rows, all with randomly generated content, selected everything and did Data -> Subtotals. In the Subtotals dialog, set Group by and the Calculate subtotals for checkbox to the same column name. Under Use function I chose Count.
The results of the Subtotals tool can be a little strange to maneuver, but it’s not hard to get used to. It visually puts a total line under each repeating section of values. If you use Count instead of Sum, it will give the quantity of duplicates. If you click on the wee “2” near the top left of the cells, it will compact all duplicate groups and show you the counts of each. You can click the + to the left of each row group to expand/collapse the duplicate value contained in each group.
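The same Subtotals operation can also be driven from a macro. The following is only a sketch (Python via UNO; the range, the column choice and the first-sheet assumption are mine, not from the answer above) that groups by the first column and counts occurrences, which is essentially what the dialog steps described here do:

```python
import uno

def apply_count_subtotals(*args):
    doc = XSCRIPTCONTEXT.getDocument()
    sheet = doc.Sheets.getByIndex(0)
    rng = sheet.getCellRangeByName("A1:H100001")   # data range including the header row
    desc = rng.createSubTotalDescriptor(True)      # start from an empty descriptor
    col = uno.createUnoStruct("com.sun.star.sheet.SubTotalColumn")
    col.Column = 0                                 # subtotal on column A
    col.Function = uno.Enum("com.sun.star.sheet.GeneralFunction", "COUNT")
    desc.addNew((col,), 0)                         # group by column A as well
    rng.applySubTotals(desc, True)                 # apply, replacing existing subtotals
```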
Again, these suggestions are probably not solutions for your particular situation, but all the answers given by everyone will probably help others with slightly varying scenarios.
Pivot tables act in a similar way by grouping repeated items, but they are a bit more complicated to explain.
Honestly though, I agree with those who say a database is better suited for large data sets. SQL statements can do all sorts of calculations and queries to present data in many ways without changing the underlying data. Finding duplicates in huge data is quite fast and easy with queries. But I totally get it when you say spreadsheets are your paradigm. No one can know what your use case is without stepping into your office and being shown what you are up to. Spreadsheets definitely have a place. However, they are also risky. Data loss and crashing software are more likely to happen with spreadsheets than with a decent database.
With all these experiments, I’m sure you know to experiment on a copy of your master file, not the master. If you kill your master spreadsheet or manipulate it in an unintentional way, you’re toast. Since SELECT SQL statements don’t manipulate database data, it’s possible to do tons of querying without risking data loss. Plus, data can be written by numerous users simultaneously, while other users can view it, etc. Pretty handy.
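To illustrate how short the duplicate search is on the database side, here is a small self-contained sketch using Python’s built-in sqlite3 module (the table and column names are invented for the example; any SQL database would accept the same GROUP BY / HAVING query):

```python
import sqlite3

# Build a throwaway in-memory table just to demonstrate the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (word TEXT)")
conn.executemany("INSERT INTO items VALUES (?)",
                 [("alpha",), ("beta",), ("alpha",), ("gamma",), ("beta",), ("alpha",)])

# One statement returns every duplicated value together with its multiplicity,
# without touching the stored data at all.
rows = conn.execute("""
    SELECT word, COUNT(*) AS multiplicity
    FROM items
    GROUP BY word
    HAVING COUNT(*) > 1
    ORDER BY multiplicity DESC
""").fetchall()

for word, multiplicity in rows:
    print(word, multiplicity)   # e.g. alpha 3, beta 2
```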
I am not sure I understand the question clearly, but I believe this will help: formulas that quickly show matches. If any cell below matches, the row number will appear; if any cell above matches, an “X” will appear! Check if same.ods
The OP wanted to count the overall multiplicity for each item of 30000 (if I understood correctly).
This requires 900 000 000 comparisons (30000 × 30000) and some additional work, as @erAck stated correctly.
Reducing the comparisons to items in rows above can only give the multiplicity “up to here” (“down to here”, literally speaking). This would nonetheless require roughly 450 000 000 comparisons (30000 × 29999 / 2).