Principles for building a large spreadsheet

I plan to build a large spreadsheet that will test one set of datasets against another for matching entries.

It seems many lines on one sheet slow Calc down (at least on my machine).

If I want to avoid this, is it enough to break the datasets down and put smaller parts on several sheets within one workbook?

Or will this also be slow, and I should put each part in a different workbook?

Thanks.

Ola @eteb3 , o que é uma planilha grande?
de números…

O que torna lenta uma planilha, são grande quantidade de cálculos.


Hi @eteb3, what is a large spreadsheet?
of numbers…

What makes a spreadsheet slow is a large amount of calculations.

1 Like

If you need to do this “tests” for the complete dataset, splitting will not help.
A search in dataset A, then B , then C is even more complex than search in combined ABC.
.
If it is not necessary to compare all, than splitting helps “if condition then search in A else B”
.
In my case I needed to analyse and convert a bunch dBase-archives. Up to 10.000 rows I didn’t have problems, then files started to become unusable. Actual figures depend on your hardware and the used files.
.
One last hint: databases do this much more eficient. My database of Photos is at 80.000 files and even without special optimizing simply finding files by matching exif-data is blazingly quick

2 Likes

What you describe is a database, not a spreadsheet.

1 Like

Note that one of the things that greatly slows Calc down is, when your column has non-uniform data. Say, A1:A10 have numbers, A11 has text, A12:A100 numbers, A101 blank, A102:A250 numbers, A251:A253 formulas … and so on (note that “text looking like a number” is still text). When the total number of rows is great, this data setup gives a huge blow to the performance. Compared to this, a simple A1:A1000000 having only numbers would work maybe hundred of times faster.

2 Likes

Ok, but I’m not a technical person, and I know how to use Calc, so that’s what I’m using.
EDIT and correction: I know how to use Calc better than I know how to use anything else!

Based on one of your previous answers to me: am I right that, in a column of text, forcing ="" into a cell means it is no longer blank (for the purposes of your point here)?

Of course. It is by definition: a cell with a formula is not blank. Additionally, "" is not “nothing”, it is a zero-length string.

1 Like

A database would be a lot easier to use in the long run.

Contrary to a spreadsheet, a database can enforce referential integrity. For instance, the database can be set up to not store any items in table A having no matching entry in table B. You can set it up to not store incomplete records. You never have to check for duplicates. You can simple prohibit storing duplicates. You don’t need any validation rules because the whole database is a validation engine. At the same time you have seamless integration with Calc and Writer. Everybody can fill a database form. Entering correct numbers, dates, times or numeric strings into a spreadsheet is a never ending source of error.

2 Likes

Thank you: that and the rest of your answer is very helpful.

You’ve already half-persuaded me: I’ve started another question to see whether Base is suitable (and suitable for novices)

Your other thread shows you may not profit much from using a database. To me this seems to be a job of consolidating three lists once to compile one complete list. So it doesn’t matter, if you need 5 or 50 seconds to find all adresses in Southampton.
.
And your hint on differing and incomplete data calls for human supervision of the task…

2 Likes

Intuitively I would guess the same is true if I’m using a non-uniform range in a formula. Is that true? If so, is it therefore better to leave the header row out of a named range? Or is that one non-uniform row forgettable?

EDIT Follow-up question aside, I have now filled all the blank cells in the spreadsheet. This has transformed the performance. So I think this is the solution.

Another tip:
It seems it is the current sheet that is important.
If your source data has blank cells, you can put this in its own sheet, then bring the values into your working sheet using =T(). This will force a "" value where there is a blank in the source, and it speeds things up a lot.

It depends entirely on the formulas what functions they use and how those functions access an array of data. In general, if a cell range is to be converted to array then different cell types result in fragmented typed arrays (what Mike mentioned as non-uniform data), whereas a contiguous row range of numeric values is just a vector of double values. However, not all functions even convert a cell range to an array.


Your example of =T(...) (I presume you actually meant the function T()) returning an "" empty string for a blank cell may make sense if all other values also are of type text string and if the function used on the range converts a cell range to array. If you do that on every single value however then the indirection to access data from another location may actually also slow down initial calculation. YMMV…

1 Like