Extract Track changes with command line [closed]

Hello,

I want to extract the text from both doc and docx files that contain tracked changes, in batch mode (preferably from the command line), in a Linux (Ubuntu) environment.

There are some options as below

But it creates a txt file containing the text from both before and after the tracked changes.

Another option is :

abiword -t txt inputfile.doc[x]

But it gives only the text after the tracked changes (the final version).

What I want is two text files from each docx file:

1. The text before the tracked changes (original)
2. The text after the tracked changes (final)

Closed for the following reason: the question is answered, right answer was accepted. Closed by Alex Kemp, close date 2016-02-19 09:32:48.484561


All I can say is, good luck. You are going to need an XML parser because using regular expressions to parse (X)HTML / XML can lead to a zalgo situation. It is easy enough to extract the content and format it:

unzip -p "$f_docx" word/document.xml | xmllint --format - > "$f_formatted_xml"


... but after that it becomes an exercise in:

1. Finding all the paragraphs (all content between <w:p> and </w:p>) and extracting these.
2. Dividing the paragraphs into those without any recorded changes and those with some kind of change (refer to section 17.13.5 of the OOXML specification for a list of all possible tags; there are 37 in total).
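
As a rough illustration of the two steps above, here is a minimal Python sketch that uses only the standard library. It is not a complete solution: it assumes that run-level insertions, deletions and moves (w:ins, w:del, w:moveTo, w:moveFrom) are the only revisions that matter, so it covers just four of the revision tags. It ignores tabs, line breaks, paragraph-mark and table-row changes, and anything outside word/document.xml (headers, footnotes, comments). It does not handle DOC at all. The function names and the output file suffixes are made up for this example.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Wrappers whose content exists only in the final / only in the original text.
# Assumption: these four are enough for simple documents; the spec has more.
FINAL_ONLY = {W + "ins", W + "moveTo"}
ORIGINAL_ONLY = {W + "del", W + "moveFrom"}


def _collect(node, want_final, inserted, deleted, out):
    """Append the text under `node` that belongs to the requested version."""
    inserted = inserted or node.tag in FINAL_ONLY
    deleted = deleted or node.tag in ORIGINAL_ONLY
    if node.tag in (W + "t", W + "delText"):
        # w:delText only ever holds deleted text, so treat it as deleted.
        is_deleted = deleted or node.tag == W + "delText"
        keep = (not is_deleted) if want_final else (not inserted)
        if keep:
            out.append(node.text or "")
    for child in node:
        _collect(child, want_final, inserted, deleted, out)


def split_revisions(document_xml):
    """Return (original_text, final_text) for the bytes of word/document.xml."""
    root = ET.fromstring(document_xml)
    versions = []
    for want_final in (False, True):
        paragraphs = []
        # One output line per w:p element, in document order.
        for p in root.iter(W + "p"):
            parts = []
            _collect(p, want_final, False, False, parts)
            paragraphs.append("".join(parts))
        versions.append("\n".join(paragraphs))
    return versions[0], versions[1]


def docx_to_text_files(docx_path):
    """Write <docx_path>.original.txt and <docx_path>.final.txt (arbitrary names)."""
    with zipfile.ZipFile(docx_path) as archive:
        original, final = split_revisions(archive.read("word/document.xml"))
    for suffix, text in ((".original.txt", original), (".final.txt", final)):
        with open(docx_path + suffix, "w", encoding="utf-8") as handle:
            handle.write(text)
```

Calling docx_to_text_files("report.docx") (a hypothetical file name) would write report.docx.original.txt and report.docx.final.txt next to the input; for batch use, call it in a loop over the files.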

Thanks for the solution. But I have to deal with both doc and docx formats, so I need simple command-line tools with parameters for extracting both the original text and the final text.

( 2014-02-10 07:16:11 +0200 )

For DOC you have an even more significant challenge as it is a binary format. Please report back if you manage to find any such tool.

( 2014-02-11 03:57:36 +0200 )