Why does LibreOffice Calc take so long to save changes to a file?

I downloaded this data: https://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_Livestock_E_All_Data.zip from FAOSTAT and imported it into LibreOffice Calc as a CSV.

I then saved the file as ODF and this took a long time. The Manjaro operating system kept asking me to Force Quit or Wait. I am guessing that the program could have provided a progress bar with an option to cancel (if the estimated time was too long).

I then made one small change to a row of data (changed the background colour) and then saved the file.

Saving the file then again took a very long time, with the ‘Force Quit or Wait’ dialog popping up.

Why should this be the case?

Why is it not possible to write only the changes from the in-memory data structure to the file?

Kind regards,
David.

Saving the large table as Open Document Spreadsheet (ods) takes 10 seconds on my lame notebook.
A spreadsheet is not a database. CSV is a database exchange format. The CSV tables in that zip file are supposed to be related to each other.

What exactly do you want to do with these data?

Wouldn’t say it is impossible, but it can be complicated.
Let’s assume abc is changed to adc - we could overwrite in place, because the length stays the same.
Now take abc and change it to abbc - we need more space, so we either save abbc by rewriting everything after it, or we write a marker a@c with a list @2:bb telling the reader to replace the marker with the new text.
The same applies to adding attributes like bold - you need additional space.
As an additional obstacle, encode all of this in XML and compress the file. Now try to make a change at a defined position quickly.
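
To make the point concrete, here is a minimal sketch in plain Python (file contents and helper names invented): overwriting a byte in place is cheap, but inserting one means rewriting everything after the insertion point, because filesystems have no "insert" primitive.

```python
# Sketch: overwrite-in-place vs. insert. Filesystems let you
# overwrite bytes at an offset or append at the end, but there is
# no "insert bytes in the middle" operation.

def overwrite_byte(path: str, offset: int, new_byte: bytes) -> None:
    # 'abc' -> 'adc': same length, cheap in-place overwrite.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(new_byte)

def insert_bytes(path: str, offset: int, new_bytes: bytes) -> None:
    # 'abc' -> 'abbc': the file grows, so the whole tail shifts
    # and must be written out again.
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(data[:offset] + new_bytes + data[offset:])
```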


Thank you for realising that I am not criticising or taking a dig at LibreOffice Calc, and that I am genuinely interested in why it is difficult to save in-memory changes to storage quickly (especially tiny changes).

Is it fair to say that tiny changes to in-memory data can be very fast and efficient, but when these tiny changes are saved to an XML file on the filesystem things can be slow, because the file has to be rewritten from start to finish?

There is no clever way in which the data on the filesystem is broken up into smaller updatable pieces that are then joined together.

For example, from my limited knowledge: if the file were a linked list that mirrored the in-memory data, would the program be able to collect all changes into a data structure and then pass these changes on to the file-updating process?

That way it would know ‘at these points in the linked list, insert, update or delete data’ - with a fast way to access those points.
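
Roughly, I imagine something like this toy sketch (all names invented; presumably not how LibreOffice actually works): edits are collected in memory, and only the collected edits are handed to the file-updating step.

```python
# Hypothetical change journal: record edits as they happen, then
# hand only the recorded edits to the storage layer on save.

from dataclasses import dataclass

@dataclass
class CellEdit:
    sheet: str
    row: int
    col: int
    value: str          # new cell content (or a formatting change)

journal: list[CellEdit] = []

def edit_cell(sheet: str, row: int, col: int, value: str) -> None:
    journal.append(CellEdit(sheet, row, col, value))  # O(1) in memory

def save(apply_edit) -> None:
    # Only the recorded edits are passed on; the hard part is the
    # 'apply_edit' callback finding "these points" on disk quickly.
    for e in journal:
        apply_edit(e)
    journal.clear()
```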

Does it boil down to the limitations of the filesystem - how data is written, retrieved, updated and deleted from storage?

Anyway, thank you for piquing my interest.

Livestock (170 MB) is something to play with.
I wrote a little program to normalize and repair the CSV data on a sheet, created an HSQL database, copied the normalized rows million by million into the database, and added some queries, a form, and an installation macro which connects the Base document to the database after you extract the package on your computer.
Then I created a spreadsheet as an additional front end. The first sheet has the same form as the database, where you define meaningful filter criteria. The second sheet has a pivot table and a pivot chart as a kind of report. The macro in the spreadsheet refreshes the pivot when you select its sheet.
The filterable form allows you to find and edit data record by record, and it allows for reasonably comparable reports.
In this simplistic draft you can only choose one geographic region or all of them, and one crop or all of them. With some extra effort it would be possible to compare multiple selected regions and crops.
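
If you want to reproduce the batch-copy step, here is a rough sketch with Python and SQLite standing in for the Basic macro and HSQL database (the file, table and column names are made up, and the real CSV has more columns than this):

```python
# Rough illustration of copying CSV rows into a database in large
# batches, one transaction per batch ("million by million").

import csv
import sqlite3

BATCH = 1_000_000

conn = sqlite3.connect("production.db")
conn.execute("""CREATE TABLE IF NOT EXISTS production
                (area TEXT, item TEXT, year INTEGER, value REAL)""")

with open("Production_Crops_Livestock_E_All_Data.csv",
          newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)                      # skip the header row
    batch = []
    for row in reader:
        batch.append(row[:4])         # keep only the columns modelled here
        if len(batch) == BATCH:
            conn.executemany("INSERT INTO production VALUES (?,?,?,?)", batch)
            conn.commit()             # one transaction per batch
            batch.clear()
    if batch:
        conn.executemany("INSERT INTO production VALUES (?,?,?,?)", batch)
        conn.commit()
```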

P.S. When playing with this large record set, auto-updating the pivot on sheet activation may be a bad idea. To disable it: right-click the “Pivot” sheet > Events and remove the activation event. To refresh manually: right-click the pivot area and choose “Refresh”.


Because LibreOffice does not keep the ODF file in memory; the XML data is generated from the in-memory data each time, and then ZIPped. If the data is large, that takes time; but using the XML in memory instead (even without taking the packaging into ZIP into account) would result in much slower operation at every user edit, not compensated by a slightly faster save.
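
To illustrate the packaging part with plain Python (zipfile only; replace_member is an invented helper, not LibreOffice code): even replacing a single member of the ZIP means copying all the other members into a new archive, because ZIP has no in-place replace.

```python
# Sketch: "just update the changed part" doesn't work for a
# ZIP-based package. (A real ODF writer must also keep the
# 'mimetype' member first and uncompressed.)

import os
import zipfile

def replace_member(ods_path: str, member: str, new_data: bytes) -> None:
    tmp_path = ods_path + ".tmp"
    with zipfile.ZipFile(ods_path) as src, \
         zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == member:
                dst.writestr(member, new_data)      # the one real change
            else:
                dst.writestr(item, src.read(item))  # everything else copied
    os.replace(tmp_path, ods_path)
```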

Indeed, improvements are possible; for that, bug reports are needed, with the problematic data attached and the perf keyword set.


Note that the file format has its own effect on this. E.g., ODS has all the data (i.e., all sheets) in a single content.xml stream, while XLSX has a separate XML per sheet. That makes reading and writing XLSX faster when several sheets are used; in the case of your data, it would show up if you imported each CSV from the package into a separate sheet. LibreOffice would work faster with the external, non-native file format in this case than with its own native ODF (that doesn’t mean, of course, that the external format has better support - data/formatting losses are likely - but it could still look strange).
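
You can see the difference yourself by listing the package members of two saved files (the paths below are placeholders for any .ods and .xlsx you have):

```python
# Both formats are ZIP packages; compare their internal layout.

import zipfile

for path in ("data.ods", "data.xlsx"):
    with zipfile.ZipFile(path) as z:
        print(path)
        for name in z.namelist():
            print("   ", name)

# Typical output (abridged):
#   data.ods  -> content.xml, styles.xml, meta.xml, ...
#   data.xlsx -> xl/workbook.xml, xl/worksheets/sheet1.xml,
#                xl/worksheets/sheet2.xml, ...
```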


There is no structure in a spreadsheet. Each cell can take any text or number. There are no fields or records. A spreadsheet does not even have tables: a “table” is just a rectangle of cells that can be interpreted as a record set without being one.
Editing a database, you edit one record at a time.


This is not really relevant. The “unstructured” nature of spreadsheets doesn’t map to the storage format, which does have a structure. You could even invent a storage format for spreadsheets that is based on a database - say, store each row as a record of BLOBs … and then you could have atomic transactions on save. Or even easier, without any new format: use the XML DOM as the internal document model of the program, with per-node modified state (something that LibreOffice does not do, for many reasons), and your changes would be very targeted, with much improved store times. My point is: the two facts (the LibreOffice implementation not allowing small, limited file modifications instead of full file regeneration, and spreadsheets being “unstructured”) are unrelated.
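
A toy sketch of such an invented format, using Python and SQLite purely for illustration (no real spreadsheet application stores files this way, and all names are made up): one database record per row, per-row modified state, and an atomic transaction on save that writes only the dirty rows.

```python
# Invented row-as-records format: only modified rows are written back.

import pickle
import sqlite3

conn = sqlite3.connect("sheet.db")
conn.execute("CREATE TABLE IF NOT EXISTS rows (n INTEGER PRIMARY KEY, cells BLOB)")

rows: dict[int, list] = {}      # in-memory model: row number -> cell values
dirty: set[int] = set()         # per-row modified state

def set_cell(row: int, col: int, value) -> None:
    cells = rows.setdefault(row, [])
    cells.extend([None] * (col + 1 - len(cells)))
    cells[col] = value
    dirty.add(row)              # remember which rows changed

def save() -> None:
    with conn:                  # atomic: all dirty rows or none
        for n in dirty:
            conn.execute("INSERT OR REPLACE INTO rows VALUES (?, ?)",
                         (n, pickle.dumps(rows[n])))
    dirty.clear()
```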
