Extract Track changes with command line

sriram · February 8, 2014, 11:33am

Hello,

I want to extract the text from both doc and docx files with tracked changes in batch mode (preferably from the command line) in a Linux (Ubuntu) environment.

There are some options as below

soffice --headless --convert-to txt:Text input-file.doc[x]

But it creates a txt file containing the text from both before and after the tracked changes.

Another option is:

abiword -t txt inputfile.doc[x]

But it gives only the text after the tracked changes (final).

But I want two text files from each docx file:

Text before the tracked changes
Text after the tracked changes

oweng · February 8, 2014, 9:48pm

All I can say is, good luck. You are going to need an XML parser because using regular expressions to parse (X)HTML / XML can lead to a zalgo situation. It is easy enough to extract the content and format it:

unzip -p "$f_docx" word/document.xml | xmllint --format - > "$f_formatted_xml"

… but after that it becomes an exercise in:

Finding all the paragraphs (all content between <w:p> and </w:p>) and extracting these.
Dividing the paragraphs into those without any recorded changes and those with some kind of change (refer to section 17.13.5 in the OOXML specification for a list of all possible tags; there are 37 in total).
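
The two steps above can be roughly sketched in Python with only the standard library. This is a minimal sketch, not a complete solution: it assumes the only tracked changes that matter are run-level insertions, deletions and moves (w:ins, w:del, w:moveTo, w:moveFrom), and it ignores the other revision tags such as formatting changes and inserted or deleted paragraph marks. The function names and the .before.txt / .after.txt output suffixes are made up for illustration.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Wrapper elements whose content exists in only one version of the text.
# Assumption: only these four revision tags are handled here.
ONLY_IN_ORIGINAL = {W + "del", W + "moveFrom"}
ONLY_IN_FINAL = {W + "ins", W + "moveTo"}


def _collect(node, skip, out):
    """Append the text found under node to out, ignoring subtrees in skip."""
    for child in node:
        if child.tag in skip or child.tag == W + "p":
            # Skipped revision wrapper, or a nested paragraph (for example in
            # a text box) that extract_text visits separately.
            continue
        if child.tag in (W + "t", W + "delText"):
            out.append(child.text or "")
        elif child.tag == W + "tab":
            out.append("\t")
        elif child.tag in (W + "br", W + "cr"):
            out.append("\n")
        else:
            _collect(child, skip, out)


def extract_text(document_xml, final):
    """Return the text before (final=False) or after (final=True) the
    tracked changes, with one line per <w:p> paragraph."""
    skip = ONLY_IN_ORIGINAL if final else ONLY_IN_FINAL
    root = ET.fromstring(document_xml)
    lines = []
    for para in root.iter(W + "p"):
        parts = []
        _collect(para, skip, parts)
        lines.append("".join(parts))
    return "\n".join(lines)


def main(path):
    """Write <path>.before.txt and <path>.after.txt next to the docx file."""
    with zipfile.ZipFile(path) as docx:
        xml = docx.read("word/document.xml")
    for suffix, final in ((".before.txt", False), (".after.txt", True)):
        with open(path + suffix, "w", encoding="utf-8") as fh:
            fh.write(extract_text(xml, final) + "\n")


if __name__ == "__main__":
    for arg in sys.argv[1:]:
        main(arg)
```

Saved under a name such as docx_versions.py (hypothetical), it could be run in batch mode as python3 docx_versions.py *.docx. Headers, footers, footnotes and comments live in other parts of the archive and are not read here.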

sriram · February 10, 2014, 6:16am

Thanks for the solution. But I have to deal with both the doc and docx formats, so I need simple command-line tools with parameters to extract both the original text and the final text.

oweng · February 11, 2014, 2:57am

For DOC you have an even more significant challenge as it is a binary format. Please report back if you manage to find any such tool.

AlexKemp · February 19, 2016, 8:32am