How to repair a corrupted Writer document

I have a document that I worked on for many hours, and all of a sudden LO stopped responding during an automatic intermediate save (autosave). It got stuck and didn’t respond at all, so after 30 minutes or so I decided to kill it. When I restarted LO, that document couldn’t be opened; a copy of the original (without all my edits) was saved, while the corrupted file was left in place. I tried to open the corrupted file with LO, but LO gets stuck every time I try. The progress bar goes up to about 3/8 of the total width and then stops; LO doesn’t respond to any input (I waited quite a while), so I had to kill it again.

I looked at the file (or rather, an uncompressed copy) and checked content.xml, which seems to contain all the information. I then used xmlstarlet to verify the file, and it seems to be OK. I noticed, though, that the file is about 30 times bigger than the original I started working on. I then opened it in LibreWolf, which displays the XML tags properly indented, and found that from a certain point onwards there are hundreds of the following tags:

<text:alphabetical-index-mark-start text:id="IMark93853661952544"/>
<text:alphabetical-index-mark-start text:id="IMark93853660957056"/>
<text:alphabetical-index-mark-start text:id="IMark93853694318768"/>
<text:alphabetical-index-mark-start text:id="IMark93853768405744"/>
<text:alphabetical-index-mark-start text:id="IMark93853742750128"/>
<text:alphabetical-index-mark-start text:id="IMark93853657390256"/>
<text:alphabetical-index-mark-start text:id="IMark93853696152832"/>
<text:alphabetical-index-mark-start text:id="IMark93853626718400"/>
<text:alphabetical-index-mark-start text:id="IMark93853639761136"/>
<text:alphabetical-index-mark-start text:id="IMark93853683126080"/>
<text:alphabetical-index-mark-start text:id="IMark93853667192912"/>
<text:alphabetical-index-mark-start text:id="IMark93853664268320"/>
<text:span text:style-name="T590">xxxxxxxxxxxxxxxxxxxxx</text:span>
<text:span text:style-name="T590">xxxxxxxxxxxxxxxxxxxxx</text:span>
<text:span text:style-name="T590">xxxxxxxxxxxxxxxxxxxxx</text:span>
<text:alphabetical-index-mark-end text:id="IMark93853664268320"/>
<text:alphabetical-index-mark-end text:id="IMark93853667192912"/>
<text:alphabetical-index-mark-end text:id="IMark93853683126080"/>
<text:alphabetical-index-mark-end text:id="IMark93853639761136"/>
<text:alphabetical-index-mark-end text:id="IMark93853626718400"/>
<text:alphabetical-index-mark-end text:id="IMark93853696152832"/>
<text:alphabetical-index-mark-end text:id="IMark93853657390256"/>
<text:alphabetical-index-mark-end text:id="IMark93853742750128"/>
<text:alphabetical-index-mark-end text:id="IMark93853768405744"/>
<text:alphabetical-index-mark-end text:id="IMark93853694318768"/>
<text:alphabetical-index-mark-end text:id="IMark93853660957056"/>
<text:alphabetical-index-mark-end text:id="IMark93853661952544"/>

To be precise: there are hundreds each of the index-mark-start tags and of the corresponding end tags, then comes one of the entries, and then again hundreds of index-mark-start tags, and so on. Where it says “xxxxxxxxxxxxxxxxxxxxx”, there is text that I have worked on. Sometimes it’s just a single letter, sometimes a bit more. I had the option to record changes enabled, but I don’t think that this would cause so many entries.
I thought I could recover from the backup file that is created when saving, but the backup file is empty. The file contains a lot of confidential data, hence I cannot upload it. Is there a way to repair it? E.g. by just removing all but one of the tags? Or are these tags all needed?
And one more question: if I edit the file, can I also insert line breaks? If I try to edit it with a text editor (kwrite), I get a warning that lines are longer than the limit of 1 million characters. I can raise this limit temporarily, but it is still impossible to work on the file, because all tags are on a single line. The file is now >50 MB; before I started working on it, it was 1.7 MB. I’m sure I haven’t added that much data (no images, only text and formatting changes, and not even on every word).
I run Manjaro testing (up to date), LO 7.6.7.

unfortunately recurring theme :frowning:
Corrupted file, usual methods not working

needed maybe, but maybe not mandatory to recover your work.
doesn’t hurt to experiment on a copy of your file.

yep. but easier/safer if you use an XML editor.

Thank you for your reply!

Well, that’s not really similar, because no USB stick is involved here; the file was always saved on a local HDD. And that thread has no solution. Is there a description somewhere of the ODF tags that Writer uses, and of how Writer documents are usually organized?
Do you know a useful OSS XML editor that runs on Linux? I’ve been searching for a while now without much luck; I wanted to use VSCodium with an XML plugin.

I was wondering because the file has become 30 times larger than before, and there must be a reason for this, such as unnecessary duplication or multiplication of tags that serve no purpose. The IDs (this I have checked already) usually appear only twice, i.e. nothing else within the document refers to them, which makes me believe that they are unnecessary.
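The "each ID appears only twice" check can be scripted instead of done by eye. This is only a rough sketch: the function names are made up for illustration, it uses a regular expression rather than a real XML parser, and it assumes the marks are serialized the way they are quoted in this thread (with a `text:id` attribute on each start and end tag).

```python
import re
from collections import Counter

# Matches alphabetical index mark start/end tags and captures kind and text:id.
# Assumption: tags look like the ones quoted in this thread.
MARK_RE = re.compile(
    r'<text:alphabetical-index-mark-(start|end)\b[^>]*?\btext:id="([^"]+)"'
)


def count_mark_ids(xml_text):
    """Return a Counter mapping each index-mark ID to its number of occurrences."""
    return Counter(m.group(2) for m in MARK_RE.finditer(xml_text))


def ids_referenced_elsewhere(xml_text):
    """Return the mark IDs that also occur outside of the mark tags themselves."""
    # Cut the mark tags (up to and including the ID) out of the text,
    # then look for any ID that still shows up in what remains.
    remainder = MARK_RE.sub("", xml_text)
    return {mark_id for mark_id in count_mark_ids(xml_text) if mark_id in remainder}
```

If `ids_referenced_elsewhere` returns an empty set for the text of content.xml, that supports the observation that nothing else in that file points at these IDs. It says nothing about the other files in the package.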
Well, yes, I will experiment on a copy, but I also hope that maybe there is someone around here who knows the structure of odt-documents well and can give some insight.

Hmm, how could that (a HW failure during saving, causing a corrupt package without content.xml) be in any way related to this case (the first one seen, as far as I can tell) of unexpected creation of hundreds of index marks? The latter would be a bug, needing a sample to reproduce it, i.e. a document without those marks that generates them on save. Maybe the failing document could be cleaned up and used to reconstruct such a reproducer - who knows…

let’s call it instability in general :wink:

definitely a good move with Preventing data disaster - The Document Foundation Wiki,
interestingly, the word bug is never mentioned :innocent: :

and e.g. if we can state :
[image not preserved]
there should be a BIG warning popup for random users, when saving directly on removable or networked FS.

Removables and other unreliable storage are off-topic here.

from emacs, eclipse ?

not sure just the structure would help on such a bug.

would definitely be interesting to have a simple tool to anonymize your odt, so you can submit it in bugzilla.

I experimented a lot and then had to work hard on that document, hence this post only now:
The problem seems to be related to recording changes in the document, because that is what those tags refer to, as far as I could tell. There are entries about when and by whom a change was made, followed by these pairs of index marks, with a short piece of text or something else that was modified in between.
Those lines

<text:alphabetical-index-mark-start text:id="IMark93853934879264"/>

have always a corresponding entry:

<text:alphabetical-index-mark-end text:id="IMark93853934879264"/>

Both sit before and after a piece of text (often a very short one) or some other content, such as a style entry:

<text:alphabetical-index-mark-start text:id="IMark123456"/>
some text
<text:alphabetical-index-mark-end text:id="IMark123456"/>

After a number of such entries there was an entry like this:

<text:alphabetical-index-mark-end text:id="IMark123456"/>
some text
<text:alphabetical-index-mark-start text:id="IMark123456"/>

Note the difference: the end tag comes before the start tag. I guess that this was the reason LO was not able to repair the document. It’s important to say that these index marks increased exponentially in number. While the first such entries had only one such line before and after, after some time there were suddenly 5, 30, 90, 260 and so on; at the maximum there were 16,000 (sixteen thousand) lines with “index-mark-start” and the same number of lines with “index-mark-end”. It could be that the number of lines doubled with every new entry, but I can’t say that for sure.
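Finding the places where an end tag comes before its start tag is tedious in a 50 MB file, so here is a small sketch of how it could be scripted. The function name is hypothetical, the matching is regex-based rather than a real XML parse, and it assumes the tags look like the ones quoted above.

```python
import re

# Captures the kind (start/end) and the text:id of each alphabetical index mark.
# Assumption: tags are serialized as quoted in this thread.
MARK_RE = re.compile(
    r'<text:alphabetical-index-mark-(start|end)\b[^>]*?\btext:id="([^"]+)"'
)


def find_unbalanced_marks(xml_text):
    """Walk the marks in document order.

    Returns a tuple (end_before_start, never_closed):
    - end_before_start: IDs whose end tag appeared while no start was open
    - never_closed: IDs that were started but not closed afterwards
    """
    open_ids = set()
    end_before_start = []
    for match in MARK_RE.finditer(xml_text):
        kind, mark_id = match.group(1), match.group(2)
        if kind == "start":
            open_ids.add(mark_id)
        elif mark_id in open_ids:
            open_ids.remove(mark_id)
        else:
            end_before_start.append(mark_id)
    return end_before_start, sorted(open_ids)
```

A reversed pair like the one shown above is reported in both lists: once because its end tag arrives first, and once because its start tag is never closed.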
Since I was working on a copy, I removed all these lines except for one pair, which I left around the text it seemed to belong to. After saving, I could open the document, but now found numerous spaces inserted in the text, seemingly at random, sometimes in the middle of a word, often around soft spaces (where hard spaces make no sense). Well, I removed them manually, hundreds of them, because I didn’t find a way to do it by search and replace without doing more harm than good.
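For anyone who has to do the same cleanup, here is a sketch of how "keep one pair, drop the rest" could be automated. It is not what I actually used (I edited by hand), the function name is made up, and it assumes every run of piled-up start tags can be reduced to its first tag. Work on a copy only, since genuine index entries that happen to sit directly next to each other would be collapsed too.

```python
import re

# Assumption: the marks are empty elements serialized as quoted in this thread.
START = r'<text:alphabetical-index-mark-start\b[^>]*/>'
RUN_RE = re.compile(rf'{START}(?:\s*{START})+')  # two or more adjacent starts
END_RE = re.compile(r'<text:alphabetical-index-mark-end\b[^>]*/>')
ID_RE = re.compile(r'\btext:id="([^"]+)"')


def collapse_mark_runs(xml_text):
    """Keep the first start tag of each run and drop the rest with their end tags."""
    dropped = set()

    def keep_first(match):
        tags = re.findall(START, match.group(0))
        for tag in tags[1:]:
            id_match = ID_RE.search(tag)
            if id_match:
                dropped.add(id_match.group(1))
        return tags[0]

    collapsed = RUN_RE.sub(keep_first, xml_text)

    def drop_end(match):
        id_match = ID_RE.search(match.group(0))
        if id_match and id_match.group(1) in dropped:
            return ""  # the matching start tag was removed above
        return match.group(0)

    return END_RE.sub(drop_end, collapsed)
```

Only white space between the removed start tags is deleted; white space elsewhere is left alone, because inside a paragraph it can be part of the text.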
I continued to work on the document with “record changes” enabled, and autosave as well. After a while the autosave time started to increase (like before), and eventually I waited for 8 hours (actually, I went to bed) until the document was saved (by then about 25 times as big as it was before). So I went into content.xml again, and the same things were there: multiple lines of index marks with no apparent meaning. I removed them again, but then I copied the text out of the document and pasted it into a new text document, where I disabled “record changes” and also autosave, just to be sure. I then saved manually, always making a new copy, which wouldn’t have been necessary, because now it worked. Evidently, the problem lies somewhere in the recording of changes.
Some more observations: when I opened content.xml the first time, all the data was on one line. I used an ordinary editor to replace “><” with “>linebreak<”, where “linebreak” has to be substituted with the editor’s code for a line break. It worked: each XML tag was now on its own line, so I could work with the file. LO had no problems with the file (except that maybe the line breaks were translated into spaces, which then appeared in the document…?). Maybe I should have put it all back on one line afterwards. Well, I finished the work on the document.
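The line-break trick, and undoing it before putting content.xml back into the package, can be sketched like this. As far as I understand, ODF collapses white space inside paragraph text, so a line break between two inline tags within a paragraph can surface as a space, which would match the stray spaces I saw. The function names are made up, and the code assumes content.xml is UTF-8 and that the `mimetype` entry should come first and uncompressed, as the ODF package format expects.

```python
import re
import zipfile


def split_tags(xml_text):
    """Put each tag on its own line by breaking at every '><' boundary."""
    return xml_text.replace("><", ">\n<")


def join_tags(xml_text):
    """Undo split_tags: remove line breaks that sit directly between two tags."""
    return re.sub(r">\r?\n<", "><", xml_text)


def replace_content_xml(src_odt, dst_odt, new_content):
    """Copy an ODT package to dst_odt, using new_content (str) as content.xml."""
    with zipfile.ZipFile(src_odt) as src, zipfile.ZipFile(dst_odt, "w") as dst:
        # Stable sort: 'mimetype' moves to the front, the rest keeps its order.
        names = sorted(src.namelist(), key=lambda name: name != "mimetype")
        for name in names:
            if name == "content.xml":
                data = new_content.encode("utf-8")  # assumption: UTF-8 content
            else:
                data = src.read(name)
            # Store 'mimetype' uncompressed, deflate everything else.
            method = zipfile.ZIP_STORED if name == "mimetype" else zipfile.ZIP_DEFLATED
            dst.writestr(name, data, compress_type=method)
```

The idea would be to edit the output of `split_tags`, run the result through `join_tags`, and only then write it into a new package with `replace_content_xml`, always keeping the original file untouched.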

Filing a bug report seems pointless, since I cannot provide the document: it contains a lot of private data which I am not allowed to share. But I thought I would report what I did, in case someone else stumbles over a similar problem.