[Tutorial] Fixing .docx files with SAXParse error

JohnHa · December 14, 2016, 3:08pm

Three self-help methods to fix LibreOffice .docx files with SAXParse errors.

1 AOO seems to be able to open these files …

… so download Apache OpenOffice from http://www.openoffice.org/download/index.html. Create a new user on your PC and install AOO for that user only. AOO and LO seem to interact, in that LO grabs some of the AOO properties; installing AOO under a separate user isolates it completely from LO. Open the .docx file with AOO. Save it as a .odt file. Uninstall AOO and delete the added user.

2 Follow the directions given in the AOO forum at …

… “[Solved] LibreOffice File format error found at SAXParse” on the Apache OpenOffice Community Forum. This requires you to unzip the .docx file, extract the \word\document.xml file, and remove all the occurrences of the repeated attribute specified in the error message you get when you open the .docx file.

Note that there may be more than one attribute which is repeated throughout the file so you may have to do this for each of the other repeated attribute(s). Repeated attributes reported include w:themeShade, w:themeColor and w:cstheme. Files have had many (30+?) repeats.
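
Before editing by hand it can help to see which attributes are actually repeated. The following is a minimal Python sketch, not part of the AOO forum directions: it assumes document.xml has already been extracted, that attribute values are double-quoted, and that no attribute value contains a > character. The name repeated_attributes is made up for this illustration.

```python
import re
from collections import Counter

# Assumption: attribute values are double-quoted and never contain ">".
TAG_RE = re.compile(r"<[^>]+>")
ATTR_RE = re.compile(r'\s([\w.-]+(?::[\w.-]+)?)\s*=\s*"[^"]*"')


def repeated_attributes(xml_text):
    """Map each repeated attribute name to the number of tags that repeat it."""
    affected = Counter()
    for tag in TAG_RE.findall(xml_text):
        # Count how often each attribute name occurs inside this one tag.
        names = Counter(m.group(1) for m in ATTR_RE.finditer(tag))
        for name, count in names.items():
            if count > 1:
                affected[name] += 1
    return affected
```

Calling repeated_attributes(open("document.xml", encoding="utf-8").read()) should list every attribute that needs attention, so you know in advance how many rounds of clean-up to expect.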

3 Extract \word\document.xml from the .docx file and strip off all the XML tags to leave just the text

This is a “brute force and ignorance” method that strips all the formatting and leaves just the unformatted text, but it should work when everything else fails.

Windows:

Rename the file from fred.docx to fred.ZIP.

Double click fred.ZIP. Navigate to the \word folder.
Drag document.xml onto the desktop.
Install Notepad++ and the XML Tools plug-in.
Open document.xml with Notepad++. Go Plugins > XML Tools > Pretty print XML with line breaks.
Delete the XML tags leaving just the text.

Alternatively, Google pretty print and upload document.xml to a pretty print web site which will format it. Delete the XML tags.

Linux:

Rename the file from fred.docx to fred.ZIP.

Unzip fred.ZIP - you may need to install a ZIP utility on Linux.
Navigate to the word folder.
Extract document.xml.
Install an XML editor. Open document.xml with the XML editor and format it “pretty print”. Delete the XML tags leaving just the text.

Alternatively, Google pretty print and upload document.xml to a pretty print web site which will format it. Delete the XML tags.
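
The extraction can also be done without renaming anything, since a .docx file is already a ZIP archive. Here is a minimal Python sketch, assuming Python 3 is installed; the function names and the default output file name are made up for this illustration. Because the damaged file will not parse as XML, the line breaks are added with a plain text substitution rather than an XML pretty printer. The result is meant for reading only, not for putting back into the .docx.

```python
import re
import zipfile


def extract_document_xml(docx_path, out_path="document.xml"):
    """Copy word/document.xml out of the .docx (a ZIP archive) to out_path."""
    with zipfile.ZipFile(docx_path) as zf:
        data = zf.read("word/document.xml")
    with open(out_path, "wb") as fh:
        fh.write(data)
    return out_path


def one_tag_per_line(xml_text):
    """Put a line break between directly adjacent tags, for easier reading."""
    return re.sub(r"><", ">\n<", xml_text)
```

This works the same way on Windows and Linux, for example extract_document_xml("fred.docx").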

EDIT

The easiest way to delete all the XML tags is with a Regular Expression Find and Replace. It should work in LO as long as you do not exceed the character limit for a paragraph (64k in AOO).

It works fine in Notepad++ (https://notepad-plus-plus.org/). Open document.xml. Go Search > Replace …, with search argument <[^>]+> and replace argument blank. Tick Regular Expressions. Click Replace All.

All XML tags are deleted and you are left with just the text.
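
The same Find and Replace can be scripted. Here is a minimal Python sketch, assuming document.xml has been extracted and is UTF-8 encoded; strip_tags is a name made up here. It goes slightly beyond the Notepad++ recipe in two ways that are additions to it: the end of each paragraph element (</w:p>) becomes a line break so that paragraphs do not run together, and XML entities such as &amp; are turned back into plain characters.

```python
import html
import re


def strip_tags(xml_text):
    """Return the text of a WordprocessingML document with every tag removed."""
    # Addition to the recipe: end of a paragraph element becomes a line break.
    xml_text = re.sub(r"</w:p>", "\n", xml_text)
    # The recipe itself: delete everything that looks like a tag.
    text = re.sub(r"<[^>]+>", "", xml_text)
    # Addition to the recipe: turn &amp;, &lt; and similar back into characters.
    return html.unescape(text)
```

A possible use is print(strip_tags(open("document.xml", encoding="utf-8").read())), with the output pasted or redirected into a new text file.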

mikekaganski · January 23, 2017, 7:54am

To fix the most frequently encountered problem, redefined attributes, it’s better to replace the regex

(\<[^>]+)([\w]+:[\w]+="[^"]+")([^>]+)\2

with

$1$2$3

in document.xml (works with Notepad++). The modified file then needs to be added back to the OOXML package (using a ZIP application such as 7-Zip), and then it will most probably open OK.
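
The replacement and the repacking can be combined in a short script. Here is a minimal Python sketch, assuming document.xml is UTF-8 encoded; the function names are made up for this illustration, and the regular expression is the one given above, copied unchanged. One pass removes at most one repeat per tag, so the sketch repeats the replacement until nothing changes. Note that the expression only matches a repeat whose value is identical to the first occurrence.

```python
import re
import zipfile

# The expression from the post above, unchanged.
DUP_ATTR_RE = re.compile(r'(\<[^>]+)([\w]+:[\w]+="[^"]+")([^>]+)\2')


def drop_repeated_attributes(xml_text):
    """Apply the replacement until no tag contains an exact repeat any more."""
    while True:
        xml_text, replaced = DUP_ATTR_RE.subn(r"\1\2\3", xml_text)
        if replaced == 0:
            return xml_text


def repair_docx(src_path, dst_path, member="word/document.xml"):
    """Write a copy of src_path to dst_path with the cleaned-up member."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == member:
                fixed = drop_repeated_attributes(data.decode("utf-8"))
                data = fixed.encode("utf-8")
            dst.writestr(info.filename, data)
```

For example, repair_docx("fred.docx", "fred-fixed.docx") leaves the original untouched and writes the repaired copy next to it.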

Justarandomjoe · May 25, 2018, 7:58am

Just registered to say thank you for this post, especially the part about open office being able to open up the files! Came across this after hours of trying to recover a large document, thank you

mikekaganski · May 25, 2018, 8:43am

LibreOffice opens the documents just as fine. I.e., it opens them, but loses whatever goes after the invalid place, exactly like AOO. If your version does not, then you must be using some very outdated version.

JohnHa · June 26, 2018, 7:40pm

“LibreOffice opens the documents just as fine. I.e., it opens them, but loses whatever goes after the invalid place, exactly like AOO”

Not so.

LibreOffice opens the file up to the place of the repeated definition and loses everything after the error.

AOO seems to have a better error handling system which ignores the repeated definition. AOO opens the file fully without data loss.

mikekaganski · June 26, 2018, 9:51pm

Please provide an example. I know the code that handles this; and I know what had changed in LibreOffice. Your claim is unlikely.

For example, you might test bug 113790. It has both the original document and one with duplicated attributes. When opened with either LO or AOO 4.1.5, the corrupt file is missing the last 2 paragraphs (“ABCD” with bullet and “Title3”).

mikekaganski · June 26, 2018, 11:58pm

The “better error handling system” of AOO is only better at hiding the fact of corruption, so that the data loss goes silently unnoticed by the user. This fact was the driving force behind the change in LO, where we first introduced the error message (failing the opening), and later made it possible to continue opening (just like previously in OOo), but only after notifying the user of possible data loss.

mikekaganski · February 24, 2020, 2:39pm

@JohnHa: excuse me for replying here: the topic at the forum is locked now.

I only wrote the criticism there because I told you about the facts long ago, right here. And yet the false statements stayed in the tutorials both here and there: you didn’t try to state the correct facts in an otherwise worthy text, thus creating a false impression about the products.

And yes, I fixed many; you may find some examples here in various SAXParseException-related topics. I also fixed many bugs related to it, and made LibreOffice show more such errors which were previously ignored in LibreOffice (and are still ignored in AOO).

Please don’t try to defend wrong things that were pointed out to you; it would be better to just fix the wrong part and keep the correct ones (and I’d simply have removed my comment then, which I can’t now).

senator_LD · May 5, 2020, 7:30am

Thanks for this - I have encountered this issue a few times and am posting a short outline of how to fix it that seems a bit easier than the one outlined above.

Make copy of example.docx
Rename copied file as examplecopy.zip (or whatever filename, obviously)
Unzip it
Navigate to word folder and copy document.xml to another location as backup
Take any XML validator (there are online ones too, for example Validate XML files) and check the file document.xml
It will show you the tags that are corrupt. Fix these tags: typically the tag indicated by the validator has a definition twice, or a definition that makes no sense (check what the error message says). Delete this part of the tag.
Often you have to repeat this several times to find all the corrupt tags (if you use an online tool, make sure to reload the page each time)
Save the file and rezip - for example in a shell: zip -r test.docx *
Note that rezipping with the tools built into my OS (OS X14) would not work.
Check docx file
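
The validator step can also be run locally. Here is a minimal Python sketch using the expat parser from the standard library, which stops at the first well-formedness error and reports its position; first_xml_error and check_docx are names made up for this illustration. As with an online validator, it has to be re-run after each fix until it reports nothing.

```python
import xml.parsers.expat
import zipfile


def first_xml_error(xml_bytes):
    """Return None if well-formed, else (line, column offset, message)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_bytes, True)
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        return err.lineno, err.offset, message
    return None


def check_docx(docx_path, member="word/document.xml"):
    """Check one XML part of a .docx without unzipping it to disk."""
    with zipfile.ZipFile(docx_path) as zf:
        return first_xml_error(zf.read(member))
```

A result of None from check_docx("examplecopy.zip") would mean document.xml is well-formed; otherwise the returned line and column point at the tag to fix.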

AlexKemp · September 20, 2020, 9:35am