Principles for building a large spreadsheet

eteb3 · October 14, 2024, 9:27pm

I plan to build a large spreadsheet that will test one set of datasets against another for matching entries.

It seems many lines on one sheet slow Calc down (at least on my machine).

If I want to avoid this, is it enough to break the datasets down and put smaller parts on several sheets within one workbook?

Or will this also be slow, and I should put each part in a different workbook?

Thanks.

schiavinatto · October 14, 2024, 10:50pm

Ola @eteb3 , o que é uma planilha grande?
de números…

O que torna lenta uma planilha, são grande quantidade de cálculos.

Hi @eteb3, what is a large spreadsheet?
of numbers…

What makes a spreadsheet slow is a large amount of calculations.

Wanderer · October 15, 2024, 6:35am

If you need to do this “tests” for the complete dataset, splitting will not help.
A search in dataset A, then B , then C is even more complex than search in combined ABC.
.
If it is not necessary to compare all, than splitting helps “if condition then search in A else B”
.
In my case I needed to analyse and convert a bunch dBase-archives. Up to 10.000 rows I didn’t have problems, then files started to become unusable. Actual figures depend on your hardware and the used files.
.
One last hint: databases do this much more eficient. My database of Photos is at 80.000 files and even without special optimizing simply finding files by matching exif-data is blazingly quick

Villeroy · October 15, 2024, 6:35am

What you describe is a database, not a spreadsheet.

mikekaganski · October 15, 2024, 6:49am

Note that one of the things that greatly slows Calc down is, when your column has non-uniform data. Say, A1:A10 have numbers, A11 has text, A12:A100 numbers, A101 blank, A102:A250 numbers, A251:A253 formulas … and so on (note that “text looking like a number” is still text). When the total number of rows is great, this data setup gives a huge blow to the performance. Compared to this, a simple A1:A1000000 having only numbers would work maybe hundred of times faster.

eteb3 · October 15, 2024, 9:03am

Ok, but I’m not a technical person, and I know how to use Calc, so that’s what I’m using.
EDIT and correction: I know how to use Calc better than I know how to use anything else!

eteb3 · October 15, 2024, 11:31am

Based on one of your previous answers to me: am I right that, in a column of text, forcing ="" into a cell means it is no longer blank (for the purposes of your point here)?

mikekaganski · October 15, 2024, 11:34am

Of course. It is by definition: a cell with a formula is not blank. Additionally, "" is not “nothing”, it is a zero-length string.

Villeroy · October 15, 2024, 12:33pm

A database would be a lot easier to use in the long run.

Contrary to a spreadsheet, a database can enforce referential integrity. For instance, the database can be set up to not store any items in table A having no matching entry in table B. You can set it up to not store incomplete records. You never have to check for duplicates. You can simple prohibit storing duplicates. You don’t need any validation rules because the whole database is a validation engine. At the same time you have seamless integration with Calc and Writer. Everybody can fill a database form. Entering correct numbers, dates, times or numeric strings into a spreadsheet is a never ending source of error.

eteb3 · October 15, 2024, 12:35pm

Thank you: that and the rest of your answer is very helpful.

You’ve already half-persuaded me: I’ve started another question to see whether Base is suitable (and suitable for novices)

Wanderer · October 15, 2024, 9:09pm

Your other thread shows you may not profit much from using a database. To me this seems to be a job of consolidating three lists once to compile one complete list. So it doesn’t matter, if you need 5 or 50 seconds to find all adresses in Southampton.
.
And your hint on differing and incomplete data calls for human supervision of the task…

eteb3 · October 17, 2024, 8:08am

Intuitively I would guess the same is true if I’m using a non-uniform range in a formula. Is that true? If so, is it therefore better to leave the header row out of a named range? Or is that one non-uniform row forgettable?

EDIT Follow-up question aside, I have now filled all the blank cells in the spreadsheet. This has transformed the performance. So I think this is the solution.

eteb3 · October 17, 2024, 11:33am

Another tip:
It seems it is the current sheet that is important.
If your source data has blank cells, you can put this in its own sheet, then bring the values into your working sheet using =T(). This will force a "" value where there is a blank in the source, and it speeds things up a lot.

erAck · October 17, 2024, 1:19pm

It depends entirely on the formulas what functions they use and how those functions access an array of data. In general, if a cell range is to be converted to array then different cell types result in fragmented typed arrays (what Mike mentioned as non-uniform data), whereas a contiguous row range of numeric values is just a vector of double values. However, not all functions even convert a cell range to an array.

Your example of =T(...) (I presume you actually meant the function T()) returning an "" empty string for a blank cell may make sense if all other values also are of type text string and if the function used on the range converts a cell range to array. If you do that on every single value however then the indirection to access data from another location may actually also slow down initial calculation. YMMV…

eteb3 · October 24, 2024, 8:25am

Would this be the case if I use =T() to bring in every datum from Sheet 1 to Sheet 2, then run my calculations in Sheet 2 from the data that are now in Sheet 2? It feels like they wouldn’t have to update unless Sheet 1 changes, and therefore the indirection is less. But this must depend on the architecture of Calc under the hood, and I don’t know what that is. Or—again—maybe it varies?

erAck · October 24, 2024, 10:45am

It depends on your document content. If Sheet1 contains constant data that is brought over to Sheet2 using only T() then it probably doesn’t matter much until everything needs to be recalculated. But if every data value needs to be treated with T() just to eliminate empty cells and convert them to empty string then the question remains if the initial data isn’t of a wrong type or wrongly structured…

eteb3 · October 24, 2024, 3:01pm

Thank you. A quick off-topic question, only if you have time:

When should I put things in backticks? I’d understood that everything technical goes in backticks, but you called the function T() (just like that) and didn’t seem to understand me 100% when I called it =T()

Wanderer · October 24, 2024, 9:36pm

One problem is: This site alters your input. So a person who copies from your text here may not receive the same text you entered. This is a source of error.
.
As an example an easy SQL-statement with backticks:
SELECT "col1", "col2" FROM "table";
If you omit the backtick the quotes are altered and the statement will throw an error:
SELECT “col1”, “col2” FROM “table”;
.
Another is in the following thread, where a ** is interpreted as a marker for bold text:

erAck · October 27, 2024, 6:50pm

That looked correct though…
Anyhow, as Wanderer said, to get straight ASCII double quote characters you need to enclose an expression in backticks, else they are replaced by typographic quotes. Also anything that might get interpreted by markdown like asterisks (* for italics or ** for bold). See also This is the guide - How to use the Ask site? - #6 by erAck