Extract tracked changes from the command line


I want to extract the text from both doc and docx files that contain tracked changes, in batch mode (preferably from the command line), in a Linux (Ubuntu) environment.

I have found a couple of options. The first is:

soffice --headless --convert-to txt:Text input-file.doc[x]

But the resulting txt file mixes the text from before and after the tracked changes, so deleted and inserted text both appear.

Another option is:

abiword -t txt inputfile.doc[x]

But this gives only the text after the tracked changes (the final version).

What I want is two text files from each docx file:

  1. The text before the tracked changes (original)
  2. The text after the tracked changes (final)

All I can say is, good luck. You are going to need an XML parser, because parsing (X)HTML / XML with regular expressions can lead to a zalgo situation. Extracting the content and formatting it is easy enough:

unzip -p "$f_docx" word/document.xml | xmllint --format - > "$f_formatted_xml"

… but after that it becomes an exercise in:

  1. Finding all the paragraphs (all content between <w:p> and </w:p>) and extracting them.
  2. Dividing the paragraphs into those without any recorded changes and those with some kind of change (see section 17.13.5 of the OOXML specification for a list of all possible tags; there are 37 in total).
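
The two steps above can be sketched in Python with the standard-library XML parser. This is only a minimal sketch, not a complete solution: it assumes that ignoring <w:ins> and <w:moveTo> gives the "before" text and that ignoring <w:del> and <w:moveFrom> gives the "after" text. The other revision tags (formatting changes, inserted or deleted paragraph marks, table changes) are not interpreted, and headers, footers, footnotes and text boxes are not handled. The names `extract`, `collect_text` and the output file naming are made up for illustration.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used in word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Revision containers to ignore for each version (assumption: only these
# four are handled; every other revision tag is left uninterpreted).
SKIP = {
    "before": {W + "ins", W + "moveTo"},
    "after": {W + "del", W + "moveFrom"},
}


def collect_text(node, skip, parts):
    """Append the text found under node to parts, ignoring skipped subtrees."""
    for child in node:
        # Nested paragraphs are visited separately by extract().
        if child.tag in skip or child.tag == W + "p":
            continue
        if child.tag in (W + "t", W + "delText"):
            parts.append(child.text or "")
        elif child.tag == W + "tab":
            parts.append("\t")
        elif child.tag in (W + "br", W + "cr"):
            parts.append("\n")
        else:
            collect_text(child, skip, parts)


def extract(document_xml, version):
    """Return the text of document.xml, one line per <w:p>.

    version is "before" (original text) or "after" (final text).
    """
    skip = SKIP[version]
    root = ET.fromstring(document_xml)
    lines = []
    for para in root.iter(W + "p"):
        parts = []
        collect_text(para, skip, parts)
        lines.append("".join(parts))
    return "\n".join(lines)


def read_document_xml(docx_path):
    """Read word/document.xml out of the docx zip container."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


if __name__ == "__main__":
    # Writes <name>.before.txt and <name>.after.txt next to each input file.
    for docx_path in sys.argv[1:]:
        xml_bytes = read_document_xml(docx_path)
        for version in ("before", "after"):
            out_path = f"{docx_path}.{version}.txt"
            with open(out_path, "w", encoding="utf-8") as out:
                out.write(extract(xml_bytes, version) + "\n")
```

Saved under a hypothetical name such as split_revisions.py, it could be run in batch as `python3 split_revisions.py *.docx`. Check the output against a few documents you know well before trusting it.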

Thanks for the solution. But I have to deal with both the doc and docx formats, so I need simple command-line tools with parameters for extracting either the original text or the final text.

For DOC you have an even more significant challenge as it is a binary format. Please report back if you manage to find any such tool.
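
One route that may be worth trying, untested here, is to convert each DOC file to DOCX first with the same soffice tool mentioned in the question, and then parse the resulting word/document.xml. Whether the tracked changes survive that conversion intact is an assumption you would need to verify on your own files. The helper names below are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir):
    """Build the soffice command line that converts one file to docx."""
    return [
        "soffice", "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_docx(doc_path, out_dir):
    """Run the conversion and return the expected path of the new docx.

    Assumption: soffice is on PATH and names the output <stem>.docx
    inside out_dir.
    """
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".docx")
```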