[Tutorial] Fixing .docx files with SAXParse error

asked 2016-12-14 16:08:17 +0200

Three self-help methods to fix LibreOffice .docx files with SAXParse errors.

1 AOO seems to be able to open these files ...

... so download Apache OpenOffice from http://www.openoffice.org/download/in.... Create a new user account on your PC and install AOO for that user only. AOO and LO seem to interact, in that LO picks up some of the AOO properties; installing under a separate user isolates AOO from LO completely. Open the .docx file with AOO and save it as an .odt file. Then uninstall AOO and delete the added user.

2 Follow the directions given in the AOO forum at ...

... https://forum.openoffice.org/en/forum.... This requires you to unzip the .docx file, extract the \word\document.xml file, and remove all the occurrences of the repeated attribute specified in the error message you get when you open the .docx file.

Note that more than one attribute may be repeated throughout the file, so you may have to do this once for each repeated attribute. Repeated attributes reported so far include w:themeShade, w:themeColor and w:cstheme. Some files have had many repeats (possibly 30 or more).
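Before editing by hand, it can help to know which attributes are repeated and in how many tags. The following is only a minimal Python sketch of that check, not part of the AOO forum procedure. The function names are made up for illustration, and it assumes document.xml is UTF-8 and that attribute values are written in double quotes. It uses pattern matching rather than an XML parser, because a parser rejects the file for the same reason LO does.

```python
import re
import zipfile
from collections import Counter

# One XML tag, and one name="value" attribute inside a tag.
TAG_RE = re.compile(r"<[^>]+>")
ATTR_RE = re.compile(r'\s([\w:.-]+)\s*=\s*"[^"]*"')


def read_document_xml(docx_path):
    """Return word/document.xml from a .docx file as text (UTF-8 assumed)."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml").decode("utf-8")


def repeated_attributes(xml_text):
    """Map each attribute name to the number of tags that carry it twice or more."""
    counts = Counter()
    for tag in TAG_RE.findall(xml_text):
        names_in_tag = Counter(ATTR_RE.findall(tag))
        for name, occurrences in names_in_tag.items():
            if occurrences > 1:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # "fred.docx" is the example file name used in this post.
    for name, tags in repeated_attributes(read_document_xml("fred.docx")).items():
        print(f"{name}: repeated in {tags} tag(s)")
```

A .docx file is a ZIP archive, so the sketch reads document.xml straight out of it without renaming or unzipping anything.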

3 Extract \word\document.xml from the .docx file and strip off all the XML tags to leave just the text

This is a "brute force and ignorance" method that strips all the formatting and leaves just the unformatted text, but it should work when everything else fails.

Windows:

Rename the file from fred.docx to fred.ZIP.

  • Double click fred.ZIP. Navigate to the \word folder.
  • Drag document.xml onto the desktop.
  • Install Notepad++ and the XML Tools plug-in.
  • Open document.xml with Notepad++. Go to Plugins > XML Tools > Pretty print XML with line breaks.
  • Delete the XML tags leaving just the text.

Alternatively, search the web for "XML pretty print" and upload document.xml to one of the pretty-print web sites, which will format it. Then delete the XML tags.

Linux:

Rename the file from fred.docx to fred.ZIP.

  • Unzip fred.ZIP - you may need to install a ZIP utility on Linux.
  • Navigate to the word folder.
  • Extract document.xml.
  • Install an XML editor. Open document.xml with the XML editor and format it "pretty print". Delete the XML tags leaving just the text.

Alternatively, search the web for "XML pretty print" and upload document.xml to one of the pretty-print web sites, which will format it. Then delete the XML tags.

EDIT

The easiest way to delete all the XML tags is with a regular expression Find and Replace. It should work in LO as long as you do not exceed the character limit for a paragraph (64k in AOO).

It works fine in Notepad++ (https://notepad-plus-plus.org/). Open document.xml. Go to Search > Replace ..., enter <[^>]+> as the search argument and leave the replace argument blank. Select the Regular expression search mode. Click Replace All.

All XML tags are deleted and you are left with just the text.
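The same find-and-replace can be sketched in Python for anyone who prefers not to install an editor. This is only an illustration under some assumptions: the function names are hypothetical, document.xml is taken to be UTF-8, and two steps are added that the Notepad++ recipe does not have, namely a line break at the end of each paragraph (</w:p>) and decoding of entities such as &amp;amp; back to plain characters.

```python
import html
import re
import zipfile


def xml_to_plain_text(xml_text):
    """Strip every XML tag, keeping one line per Word paragraph."""
    # Extra step (assumption): end each paragraph with a newline so that
    # paragraphs do not run together once the tags are gone.
    text = re.sub(r"</w:p>", "\n", xml_text)
    # The regular expression from the post: delete anything between < and >.
    text = re.sub(r"<[^>]+>", "", text)
    # Extra step (assumption): turn &amp;, &lt; and similar back into characters.
    return html.unescape(text)


def docx_to_plain_text(docx_path):
    """Read word/document.xml out of the .docx (a ZIP archive) and strip it."""
    with zipfile.ZipFile(docx_path) as archive:
        xml_text = archive.read("word/document.xml").decode("utf-8")
    return xml_to_plain_text(xml_text)


if __name__ == "__main__":
    # "fred.docx" is the example file name used in this post.
    with open("fred.txt", "w", encoding="utf-8") as out:
        out.write(docx_to_plain_text("fred.docx"))
```

Because the tags are removed by pattern matching and the file is never parsed as XML, the repeated attributes that trigger the SAXParse error do not get in the way.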

Comments

To fix the most often encountered problem, redefined attributes, it is better to replace the regex

(\<[^>]+)([\w]+:[\w]+="[^"]+")([^>]+)\2

with

$1$2$3

in document.xml (works with Notepad++). The modified file then needs to be added back to the OOXML file (using a ZIP application such as 7-zip), after which it will most probably open OK.

Mike Kaganski (2017-01-23 08:54:39 +0200)
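The replacement in the comment above, together with the step of adding document.xml back into the archive, might be scripted roughly as follows. This is a hedged sketch and not a tested repair tool: the regex is transcribed from the comment into Python syntax, the function names and the output file name are hypothetical, and document.xml is assumed to be UTF-8. The regex only removes a repeat whose name and value are both identical to the earlier occurrence, and each pass removes at most one repeat per tag, so the sketch loops until nothing changes.

```python
import re
import zipfile

# The regex from the comment: a tag containing the same prefix:name="value" twice.
DUPLICATE_ATTR = re.compile(r'(<[^>]+)(\w+:\w+="[^"]+")([^>]+)\2')


def drop_repeated_attributes(xml_text):
    """Apply the replacement repeatedly until no repeated attribute is left."""
    while True:
        cleaned = DUPLICATE_ATTR.sub(r"\1\2\3", xml_text)
        if cleaned == xml_text:
            return cleaned
        xml_text = cleaned


def repair_docx(src_path, dst_path):
    """Copy every archive member to a new .docx, cleaning word/document.xml."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                text = data.decode("utf-8")  # assumption: UTF-8 encoded
                data = drop_repeated_attributes(text).encode("utf-8")
            dst.writestr(item, data)


if __name__ == "__main__":
    # File names are examples only; the original file is left untouched.
    repair_docx("fred.docx", "fred-fixed.docx")
```

Writing to a new file keeps the damaged original available in case the result still does not open.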