Corrupted data in heavily commented document.

Arpista · February 23, 2020, 6:49am

(Windows 10 Home, LibreOffice 6.3.5, currently supposedly-stable version)

When I try to open my document, I get an error:

“SAXException: No input source”

It asks if I would like it to open the document, and warns of danger that the file will be further corrupted if I do. Refusing gets me more information:

File format error found at
SAXParseException: “No input source”
SAXParseException: ‘[word/footnotes.xml line 2]: unknown error’, Stream ‘word/footnotes.xml’, Line 2, Column 28112
SAXParseException: ‘[word/document.xml line 2]: unknown error’, Stream ‘word/document.xml’, Line 2, Column 31535(row,col).

I am a professional editor, and I would really like to know whether my last several weeks of work have just been slagged. The document worked fine last time I closed it. Opening it, I see that many of my comments have had their content, author and date information deleted (most or all of which apply to footnotes; all comments applying to footnotes seem to have been affected); most seem to still be fine, but the program crashes fairly shortly after I load the document, and I can’t copy-paste out (initially it copy-pasted with no comments, which doesn’t help, since I have a backup of the plain text; now it won’t do so at all). Advice? I have been using Open- or LibreOffice for this purpose for the past several years, and this is the first time I’ve had this problem.

mikekaganski · February 23, 2020, 6:59am

The file was corrupted (i.e., some invalid information was written into it) the last time you worked with it. It is not a user error: this is definitely a bug in the program. Now the internal structure of the file is damaged, and the program detects that and warns you that it may proceed, at the cost of discarding parts of the document information after the invalid chunk. It’s impossible to guess what is corrupted now, or whether it’s recoverable, without seeing the file itself; often the corruption is recoverable by manually editing the document structure (the XML files inside the ZIP). So if you publish your file and post a link in your question, others could check it and try to repair it.

It would be most helpful if you could remember what you did last time, try to reproduce the creation of the invalid file, and file a bug report so that the bug can be fixed.

Arpista · February 23, 2020, 8:39am

Thank you very much. I obviously cannot post the document without consent of the author, since it’s not mine; this is a document I was editing. Would it be possible for me to access the document structure myself (what program do you open it in?), and if so, is the “file format error” message that I got a detailed account of where in said document structure to find the corruption? That would be very convenient if so.

The comments that had data deleted were all applying to footnotes, but I’ve commented on footnotes before with no problem. “What I did last time” covers probably fifty or a hundred pages of work, so while I will certainly remember to file a bug report if this happens again and I can figure out what did it, I am dubious I’ll be able to replicate it this time.

Thank you very much again for the reply, and for the link in case I do figure it out.

mikekaganski · February 23, 2020, 9:12am

what program do you open it in?

As I mentioned, the file is just a ZIP archive with an XML file structure inside. The error message is the best pointer available to where the problem is. You will likely need to examine the structure to find the corruption, and if it is not invalid XML (which a strict parser would catch), you would need to check whether it is an ODF corruption by consulting the ODF standard.

EDIT: Oh, you are using a DOCX! That’s a bad practice. Of course, saving an invalid DOCX is still a bug in the program; but using an external format for ongoing work is much more likely to create problems…

And of course, in that case you would need to consult another standard: specifically, ECMA-376.
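
A minimal Python sketch of the strict-parser check described above. It assumes the two streams named in the error message (word/document.xml and word/footnotes.xml) and uses the standard-library expat parser; the function names `check_member` and `check_package` are hypothetical. The line and column it reports may not match LibreOffice's numbers exactly, since a different parser is involved, so treat the printed context as a starting point for finding the damage.

```python
import zipfile
import xml.parsers.expat as expat


def check_member(data, context=60):
    """Return None if data is well-formed XML, else details of the first error."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except expat.ExpatError as err:
        offset = max(parser.ErrorByteIndex, 0)
        start = max(0, offset - context)
        return {
            "line": err.lineno,
            "column": err.offset,
            "message": expat.ErrorString(err.code),
            # A few bytes on either side of the failure point.
            "context": data[start:offset + context].decode("utf-8", errors="replace"),
        }
    return None


def check_package(path, members=("word/document.xml", "word/footnotes.xml")):
    """Check selected XML streams inside a DOCX (ZIP) package."""
    results = {}
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
        for name in members:
            if name not in names:
                results[name] = {"message": "missing from archive"}
            else:
                results[name] = check_member(zf.read(name))
    return results
```

Calling `check_package("copy-of-document.docx")` on a copy of the file (the file name is a placeholder) returns a dictionary with `None` for each stream that parses cleanly and the error details for each stream that does not. A stream that parses cleanly may still violate ECMA-376, which this check does not test.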

Lupp · February 23, 2020, 1:41pm

An aside, and a guess: for years now, many cases have been reported here and also to forum.openoffice.org where documents coming from the MS universe (specifically documents containing change records) failed, even if they were last saved to ODF. (In cases where the posters submitted the files, I repaired some of them.)
Sorry. You cannot collaborate across the limits of the MS realm. As soon as LibO manages to be “perfect” in this, MS would change their “standards”. Incompatibility is their raison d’être. If I were MS, I wouldn’t have a choice in that respect either.

Arpista · February 24, 2020, 1:04am

Thank you very much. I was using .docx because LibreOffice is supposed to be able to work with it, and my original document is a .docx (and therefore that’s the form I need to return it in); half my clients are using Word, and need their documents back in a form their word processors recognize. Are you saying it would be better to work in .odt even if I needed to convert it back at the end of the process? I have been avoiding that for fear of introducing errors in the ending conversion.

(Also, as of about two years ago, .doc handled comment slowdown better than .odt, though that may have changed.)

Arpista · February 24, 2020, 1:12am

OK, on second thought, the unzipped documents will allow me to reconstruct my comments; the data is all here, and even in order, it just needs some reassembly. I’m going to go for the lower-tech solution, and just copy-paste them back in, since I’m running into many “stupid questions” I’d have to ask in order to learn how to repair it properly. Thank you ever so incredibly much for the help, and if I manage to reproduce the bug I will absolutely submit it.
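
For anyone in the same position, a minimal Python sketch of pulling the comment data out of the unzipped package. It assumes the comments live in word/comments.xml (the usual part name in a DOCX) and that this part is itself still well-formed; the function name `read_comments` is hypothetical.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by DOCX parts.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_comments(xml_bytes):
    """Return a list of dicts (id, author, date, text) from a comments part."""
    root = ET.fromstring(xml_bytes)
    comments = []
    for comment in root.iter(f"{{{W}}}comment"):
        # Join the text runs of each paragraph, one paragraph per line.
        paragraphs = [
            "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
            for p in comment.iter(f"{{{W}}}p")
        ]
        comments.append({
            "id": comment.get(f"{{{W}}}id"),
            "author": comment.get(f"{{{W}}}author"),
            "date": comment.get(f"{{{W}}}date"),
            "text": "\n".join(paragraphs),
        })
    return comments
```

With the package unzipped, `read_comments(open("word/comments.xml", "rb").read())` gives the comments in file order, ready to be pasted back by hand. If the part is malformed, `ET.fromstring` raises `ParseError` with the position of the problem.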

mikekaganski · February 24, 2020, 7:59am

Are you saying it would be better to work in .odt even if I needed to convert it back at the end of the process? I have been avoiding that for fear of introducing errors in the ending conversion.

Yes, I suggest using ODF as much as possible in your workflow. It’s not a matter of “ending conversion”; it’s actually the opposite.

When you open a DOCX, you are not working with DOCX; you are working with what has been imported. ODT is the closest representation of what has been imported. Saving to ODT avoids a second conversion at the end of each editing session, and a repeated conversion at the start of the next one. So instead of convert from DOCX - edit - convert to DOCX - convert from DOCX - edit - convert to DOCX - …, you would have convert from DOCX - edit - save to ODT - edit - edit - edit - … - convert to final DOCX.
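
The final step of that workflow can also be scripted instead of using Save As. A minimal Python sketch, assuming the `soffice` executable is on the PATH and using LibreOffice's headless `--convert-to` option; the helper names `final_docx_command` and `export_final_docx` are hypothetical, and the result should still be opened and checked before it goes back to a client.

```python
import subprocess
from pathlib import Path


def final_docx_command(odt_path, out_dir, soffice="soffice"):
    """Build the command line for a one-off ODT to DOCX conversion."""
    return [
        soffice, "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(odt_path),
    ]


def export_final_docx(odt_path, out_dir):
    """Run the conversion and return the expected path of the DOCX."""
    subprocess.run(final_docx_command(odt_path, out_dir), check=True)
    return Path(out_dir) / (Path(odt_path).stem + ".docx")
```

The working ODT stays untouched, so the conversion happens once, at the end, rather than at every save.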

Arpista · February 24, 2020, 11:13pm

Thank you very much, that’s extremely helpful to know.

JohnHa · February 23, 2020, 11:35am

Search the forum for saxparse, or see [Tutorial] How to fix SAXParse error in LibreOff .docx files

The file has probably been edited by MS Word, which may have caused the corruption.

I am a professional editor, and I would really like to know whether my last several weeks of work have just been slagged.

As a professional you will, of course, have backed up your work so just get it back from your daily backup.

Arpista · February 24, 2020, 1:15am

Unfortunately, my daily non-cloud backup was briefly not operational, and I don’t do cloud backups on my clients’ documents. I do realize how serious an error that was now, yes. Thank you very much for the tutorial; it helped immensely, and I appreciate you taking the time to help me.