0

How can I uncompress a LibreOffice document to get its XML internals, then make a new document from them?

asked 2020-12-16 01:52:15 +0100

Jim DeLaHunt

updated 2020-12-16 01:54:53 +0100

I understand that LibreOffice document files (.odt, .ods, etc) are compressed archives containing multiple XML files internally. I want to open up a LO document so that I can get at the XML internal files. Then I want to be able to generate a new LO document file from those XML internal files.

I might want to modify the XML internal files. For instance, I might want to repair a document as in the question "Table changed multiple cells to 0", which is also bug 131025. This means I would be editing some of the XML files, but those edits are not part of this question.

If there is some documentation on how to uncompress and unpack LibreOffice documents into their XML internal files, and how to reverse that process, then I'd appreciate a link to that documentation. Otherwise, an answer here would be a helpful reference.

I suspect that the same process applies to all LibreOffice document types: .odt for Writer, .ods for Calc, etc. If there are differences, please include that in the answer also.


Comments

They are all just zip files - depending on your OS you may be able to use right click and choose Unzip or equivalent, or use a command line tool such as unzip in Linux.

Or you could rename them as .zip to allow use of whatever tool your OS associates with zip files.

robleyd (2020-12-16 03:13:31 +0100)

@robleyd That sounds like an answer. Maybe submit it again, as an answer?

Jim DeLaHunt (2020-12-16 04:51:27 +0100)

3 Answers

1

answered 2020-12-16 11:54:59 +0100

Regina

You do not really need to unpack the archive completely. I use 7-Zip, which has not only "Extract" but also "Open Archive". "Open Archive" gives you a view similar to a file manager. If you right-click on a file there, you get the items "View" and "Edit". In the properties of 7-Zip you can determine which application is used for each. I have bound "Edit" to "Microsoft XML Notepad" and "View" to "Notepad++". Despite its name, "View" allows changes too.

You can modify a file in the editor, then save it and close the editor. 7-Zip then asks whether you want to update the file in the archive. Agree. When you close 7-Zip afterwards, the package is still valid and you can use it in LibreOffice immediately.

For manual editing of the XML files I find this more comfortable than using the flat format. The flat format has the disadvantage that all images are base64-encoded and blow up the file, whereas in the package they are in a separate folder. In any case, you should enable "pretty printing" in LibreOffice if you want to change the XML files manually.

If you add or remove files from the archive, you have to update the file manifest.xml in folder META-INF.
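To see whether the manifest and the archive still agree after adding or removing files, a small Python sketch like the following may help. It assumes the usual manifest namespace and the `manifest:full-path` attribute; the function name `manifest_mismatches` is made up for this illustration, and the check ignores `mimetype` and everything under META-INF, which as far as I know are not expected in the manifest.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/manifest.xml in ODF packages.
MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"


def _is_checked(path):
    """True for plain files that the manifest is expected to list."""
    return (
        bool(path)
        and not path.endswith("/")             # "/" and folder entries
        and path != "mimetype"                 # assumed not to be listed
        and not path.startswith("META-INF/")   # assumed not to be listed
    )


def manifest_mismatches(package):
    """Compare manifest entries with the actual archive members.

    `package` is a file path or a binary file-like object.
    Returns (files_not_in_manifest, manifest_entries_without_file).
    """
    with zipfile.ZipFile(package) as zf:
        members = {name for name in zf.namelist() if _is_checked(name)}
        root = ET.fromstring(zf.read("META-INF/manifest.xml"))

    listed = set()
    for entry in root.iter(f"{{{MANIFEST_NS}}}file-entry"):
        path = entry.get(f"{{{MANIFEST_NS}}}full-path")
        if _is_checked(path):
            listed.add(path)

    return sorted(members - listed), sorted(listed - members)
```

Calling `manifest_mismatches("edited.odt")` (a hypothetical file name) returns two lists: files present in the archive but missing from the manifest, and manifest entries that no longer have a file. Both should be empty for a consistent package.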

You can find the specification of the package format in part 2 of the ODF standard: http://docs.oasis-open.org/office/Ope... There are some restrictions on file order and compression that you need to consider if you unpack the archive completely and then pack it again. LibreOffice is tolerant about that and will 'repair' the package when saving.

BTW, the most common error when people first try to work with the package is that they zip the whole folder they got from extracting. Instead, only the contents of the folder have to be zipped.
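A minimal Python sketch of the repacking step, using only the standard `zipfile` module. It assumes (this is my reading of the ODF package rules, not something spelled out above) that the file-order and compression restriction means the `mimetype` file is written first and stored without compression. The names `repack_odf`, `src_dir` and `out_path` are made up for this illustration.

```python
import os
import zipfile


def repack_odf(src_dir, out_path):
    """Zip the *contents* of src_dir (not the folder itself) into out_path."""
    mimetype_path = os.path.join(src_dir, "mimetype")
    with zipfile.ZipFile(out_path, "w") as zf:
        # Assumption: 'mimetype' goes first and is stored uncompressed.
        if os.path.isfile(mimetype_path):
            zf.write(mimetype_path, "mimetype",
                     compress_type=zipfile.ZIP_STORED)
        for folder, _subdirs, files in os.walk(src_dir):
            for name in sorted(files):
                full = os.path.join(folder, name)
                # Archive names are relative to src_dir, so the top-level
                # folder never ends up inside the package.
                arcname = os.path.relpath(full, src_dir).replace(os.sep, "/")
                if arcname == "mimetype":
                    continue  # already written above
                zf.write(full, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

Write the output file somewhere outside `src_dir`, otherwise the walk may pick up the half-written archive itself.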


Comments

Thank you for the link to the ODF standard and the hint about the common error.

Jim DeLaHunt (2020-12-17 00:50:55 +0100)
2

answered 2020-12-16 07:56:01 +0100

ajlittoz

Opening a LO document as a zip file leaves you with a directory, and you'll have to manage the relationships between the contained files and directories yourself. Manipulating the various files exposes you to the risk of introducing discrepancies or breaking implicit relationships.

Moreover, the raw LO document is not intended to be read by humans. Consequently, XML lines may be very long and overflow the buffers of common applications (this is the case for standard text editors).

A better solution is to save the document as a flat XML file (.fodt, .fods, .fodp, ...). This flat file does not require any auxiliary files; everything needed is contained in it.
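If the package has to be inspected anyway, the long-lines problem can be sidestepped for reading by producing an indented copy of one internal file. The following is only a sketch using the Python standard library; `indented_member` is a name made up for this illustration, and the default member `content.xml` is where the document body normally lives.

```python
import zipfile
import xml.dom.minidom


def indented_member(package, member="content.xml"):
    """Return an indented text copy of one XML member, for reading only.

    `package` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(package) as zf:
        raw = zf.read(member)
    # Re-indenting inserts whitespace that may be significant inside
    # text content, so the result should not be written back.
    return xml.dom.minidom.parseString(raw).toprettyxml(indent="  ")
```

The returned string is meant for viewing or searching only. Because the added whitespace can alter text content, it is safer not to put this copy back into the package.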



Comments

This is a clever alternate approach. However, I will need to test to see if it changes the file content in a way that makes it impossible for me to make the modification I want. For example, if a document is corrupted by bug 131025, I wonder if saving it as a flat XML file will make the corruption irrevocable, or if it will still allow recovery.

Jim DeLaHunt (2020-12-16 08:48:29 +0100)
1

Saving as a flat file does not change document content. It is just an alternate representation. tdf#131025 is not related to some structure corruption but to incorrect formatting. Therefore, the save as flat XML has no impact on your goal.

ajlittoz (2020-12-16 09:21:34 +0100)

Sorry, I just checked my document which is affected by tdf#131025. When I save it as a flat XML file (.fodt), the non-numeric table contents are replaced by "0" values. Thus I believe I can't do the needed repair on the flat XML file. Instead, I have to open up the .odt package. Flat XML format may be a good alternative for some cases, but not for tdf#131025 repair.

Jim DeLaHunt (2020-12-17 09:28:14 +0100)

Thanks for the info. I didn't know that the various ODF output formats could change file content.

ajlittoz (2020-12-17 09:47:52 +0100)
1

@ajlittoz: although it's true that FODF has its deficiencies (e.g. tdf#63642), this specific case is not the same, and is not about differences in the data saved to different ODF flavors.

The cited bug is about Writer saving the wrong cell data type for textual cells (a first-save problem). That makes Writer change the value from the textual string (which is still kept in the file after the first save) to zero on the next read (a next-read problem). This results in permanent data loss on the second save (regardless of the format).

OP was trying to repair a file they have after first save; it's expected that opening it, and then saving would introduce the "second open and save" problems. So any workflow that involves opening and re-saving the data in Writer is not applicable to that specific bug (at least until it's fixed somehow).

Mike Kaganski (2020-12-17 11:03:16 +0100)
2

answered 2020-12-16 05:14:17 +0100

robleyd

They are all just zip files - depending on your OS you may be able to use right click and choose Unzip or equivalent, or use a command line tool such as unzip in Linux.

Or you could rename them as .zip to allow use of whatever tool your OS associates with zip files.

The same tool should allow you to reconstruct a zip file from the elements you obtain when you unzip the file. Note that there are a number of files and sub-directories in the unzipped package.
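The same unzip step can also be scripted. Here is a minimal sketch with Python's standard `zipfile` module; `unpack_odf` and its parameters are names made up for this illustration. No renaming to .zip is needed, since the module looks at the file content rather than the extension.

```python
import zipfile


def unpack_odf(package, dest_dir):
    """Extract every member of an ODF package into dest_dir.

    Returns the member names in archive order.
    """
    with zipfile.ZipFile(package) as zf:
        names = zf.namelist()
        zf.extractall(dest_dir)  # creates dest_dir if it does not exist
    return names
```

For a Writer document, the returned list typically includes entries such as mimetype, content.xml, styles.xml, meta.xml, settings.xml and META-INF/manifest.xml, plus a Pictures/ folder if the document contains images.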


