[Tutorial] Fixing .docx files with SAXParse error

asked 2016-12-14 16:08:17 +0200

Three self-help methods to fix LibreOffice .docx files with SAXParse errors.

1 AOO seems to be able to open these files ...

... so download Apache OpenOffice from http://www.openoffice.org/download/in.... Create a new user account on your PC and install AOO for that user only. AOO and LO seem to interact, in that LO picks up some of the AOO settings, so installing under a separate user keeps AOO completely isolated from LO. Open the .docx file with AOO and save it as an .odt file. Then uninstall AOO and delete the added user.

2 Follow the directions given in the AOO forum at ...

... https://forum.openoffice.org/en/forum.... This requires you to unzip the .docx file, extract the \word\document.xml file, and remove all the occurrences of the repeated attribute specified in the error message you get when you open the .docx file.

Note that there may be more than one attribute which is repeated throughout the file so you may have to do this for each of the other repeated attribute(s). Repeated attributes reported include w:themeShade, w:themeColor and w:cstheme. Files have had many (30+?) repeats.
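For anyone who would rather script method 2 than edit by hand, here is a rough Python sketch of the attribute clean-up. It is only an illustration, not the method from the AOO forum thread: the function name drop_repeated_attributes is made up for this example, it works on the text of document.xml with regular expressions rather than a real XML parser (a parser would reject the file for the same reason LO does), and it assumes that keeping the first copy of a repeated attribute is what you want.

```python
import re

# One attribute, e.g.  w:themeColor="text2"  (double- or single-quoted value).
ATTR_RE = re.compile(r'\s+([\w:.-]+)\s*=\s*("[^"]*"|\'[^\']*\')')

# A start tag or empty-element tag together with all of its attributes.
TAG_RE = re.compile(
    r'<([A-Za-z_][\w:.-]*)'
    r'((?:\s+[\w:.-]+\s*=\s*(?:"[^"]*"|\'[^\']*\'))*)'
    r'\s*(/?)>'
)


def drop_repeated_attributes(xml_text):
    """Keep only the first occurrence of each attribute name in every tag.

    Hypothetical helper for illustration; closing tags, text and the
    <?xml ...?> declaration are left untouched.
    """
    def fix_tag(match):
        name, attrs, slash = match.groups()
        seen = set()
        kept = []
        for attr in ATTR_RE.finditer(attrs):
            if attr.group(1) not in seen:
                seen.add(attr.group(1))
                kept.append(attr.group(0))
        return "<" + name + "".join(kept) + slash + ">"

    return TAG_RE.sub(fix_tag, xml_text)


sample = '<w:color w:val="1F497D" w:themeColor="text2" w:themeColor="text2"/>'
print(drop_repeated_attributes(sample))
# prints: <w:color w:val="1F497D" w:themeColor="text2"/>
```

Because every tag is checked for every attribute name, one pass covers w:themeShade, w:themeColor, w:cstheme and any other repeated attribute at the same time.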

3 Extract \word\document.xml from the .docx file and strip off all the XML tags to leave just the text

This is a "brute force and ignorance" method that strips all the formatting and leaves just the unformatted text, but it should work when everything else fails.

Windows:

Rename the file from fred.docx to fred.ZIP.

  • Double click fred.ZIP. Navigate to the \word folder.
  • Drag document.xml onto the desktop.
  • Install Notepad++ and the XML Tools plug-in.
  • Open document.xml with Notepad++. Go to Plugins > XML Tools > Pretty print XML with line breaks.
  • Delete the XML tags leaving just the text.

Alternatively, search the web for an XML pretty-print site and upload document.xml to it for formatting. Then delete the XML tags.

Linux:

Rename the file from fred.docx to fred.ZIP.

  • Unzip fred.ZIP - you may need to install a ZIP utility on Linux.
  • Navigate to the word folder.
  • Extract document.xml.
  • Install an XML editor. Open document.xml with the XML editor and format it "pretty print". Delete the XML tags leaving just the text.

Alternatively, search the web for an XML pretty-print site and upload document.xml to it for formatting. Then delete the XML tags.
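On either system the rename and unzip steps can be skipped with a few lines of Python, since a .docx file is an ordinary ZIP archive. This is a minimal sketch, assuming Python 3 is installed and that document.xml is stored as UTF-8 (usual for Word files, but an assumption here); read_document_xml is just a name chosen for the example.

```python
import zipfile


def read_document_xml(docx_path):
    """Return the text of word/document.xml from a .docx file.

    The .docx is opened directly as a ZIP archive, so it does not
    have to be renamed to .ZIP first.
    """
    with zipfile.ZipFile(docx_path) as archive:
        # Member names inside the archive use forward slashes on every OS.
        return archive.read("word/document.xml").decode("utf-8")
```

For example, read_document_xml("fred.docx") would return the XML as one string, ready to be saved to a file or cleaned up further.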

EDIT

The easiest way to delete all the XML tags is with a Regular Expression Find and Replace. It should work in LO as long as you do not exceed the character limit for a paragraph (64k in AOO).

It works fine in Notepad++ (https://notepad-plus-plus.org/). Open document.xml. Go to Search > Replace ..., with search argument <[^>]+> and the replace argument left blank. Tick Regular Expressions. Click Replace All.

All XML tags are deleted and you are left with just the text.
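The same Find and Replace can be done in Python if no suitable editor is at hand. A minimal sketch follows; the regular expression is the one given above, while the function name xml_to_plain_text and the two extra steps (a line break at the end of each Word paragraph, and turning entities such as &amp; back into plain characters) are my own additions, not part of the original recipe.

```python
import html
import re


def xml_to_plain_text(xml_text):
    """Strip every XML tag from document.xml and return the bare text."""
    # Extra step (assumption): end each Word paragraph with a newline,
    # otherwise all paragraphs run together on one line.
    text = xml_text.replace("</w:p>", "\n")
    # Same pattern as the Notepad++ search argument: delete every tag.
    text = re.sub(r"<[^>]+>", "", text)
    # Extra step (assumption): decode entities such as &amp; and &lt;.
    return html.unescape(text)


sample = (
    "<w:p><w:r><w:t>Tom &amp; Jerry</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
)
print(xml_to_plain_text(sample))
```

As with the manual method, all formatting is lost; only the text survives.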

Comments

To remove the most-often encountered problem: redefined attributes - it's better to replace regex

(\<[^>]+)([\w]+:[\w]+="[^"]+")([^>]+)\2

with

$1$2$3

in document.xml (works with Notepad++). The modified file then needs to be added back to the OOXML (using a ZIP application like e.g. 7-zip), and then it will most probably open OK.

Mike Kaganski ( 2017-01-23 08:54:39 +0200 )
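The last step in the comment above, putting the modified document.xml back into the OOXML file, can also be scripted instead of using a ZIP application. The following is only a sketch under assumptions: rewrite_document_xml and fix_xml are hypothetical names, fix_xml stands for whatever clean-up you apply to the XML text, and document.xml is assumed to be UTF-8. It writes a new file rather than changing the original, so the damaged .docx stays available.

```python
import zipfile


def rewrite_document_xml(src_docx, dst_docx, fix_xml):
    """Copy src_docx to dst_docx, passing word/document.xml through fix_xml.

    fix_xml is any function that takes the XML as a string and returns
    the repaired string. All other archive members are copied unchanged.
    """
    with zipfile.ZipFile(src_docx) as src, \
         zipfile.ZipFile(dst_docx, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                data = fix_xml(data.decode("utf-8")).encode("utf-8")
            # Reusing the original entry keeps member order and metadata.
            dst.writestr(item, data)
```

A call might look like rewrite_document_xml("fred.docx", "fred-fixed.docx", my_fix), where my_fix is your own function. If you port the Notepad++ replacement to Python's re.sub, note that the replace argument is written \1\2\3 rather than $1$2$3.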

2 Answers

answered 2018-06-26 21:40:57 +0200

JohnHa

"LibreOffice opens the documents just as fine. I.e., it opens them, but loses whatever goes after the invalid place, exactly like AOO"

Not so.

LibreOffice opens the file up to the place of the repeated definition and loses everything after the error.

AOO seems to have a better error handling system which ignores the repeated definition. AOO opens the file fully without data loss.

Comments

Please provide an example. I know the code that handles this; and I know what had changed in LibreOffice. Your claim is unlikely.

For example, you might test bug 113790. It has both the original document and the one with duplicated attributes. Opened with either LO or AOO 4.1.5, the corrupt file is missing the last 2 paragraphs ("ABCD" with a bullet and "Title3").

Mike Kaganski ( 2018-06-26 23:51:38 +0200 )

The "better error handling system" of AOO is only better at hiding the corruption, so the data loss goes unnoticed by the user. This was the driving force behind the change in LO, where we first introduced the error message (failing the opening), and later made it possible to continue opening (just like previously in OOo), but only after notifying the user of possible data loss.

Mike Kaganski ( 2018-06-27 01:58:27 +0200 )

answered 2018-05-25 09:58:43 +0200

Just registered to say thank you for this post, especially the part about OpenOffice being able to open the files! Came across this after hours of trying to recover a large document, thank you :)

Comments

LibreOffice opens the documents just as fine. I.e., it opens them, but loses whatever goes after the invalid place, exactly like AOO. If your version does not, then you must be using a very outdated version.

Mike Kaganski ( 2018-05-25 10:43:40 +0200 )