Extract Track changes with command line [closed]

asked 2014-02-08 12:33:47 +0200

sriram

updated 2015-09-12 21:57:41 +0200

Alex Kemp

Hello,

I want to extract the text from both .doc and .docx files that contain tracked changes, in batch mode (preferably from the command line), in a Linux (Ubuntu) environment.

There are some options, as below:

soffice --headless --convert-to txt:Text input-file.doc[x]

But it creates a txt file that mixes the text from before and after the tracked changes.

Another option is:

abiword -t txt inputfile.doc[x]

But it gives only the text after the tracked changes (final).

What I want is two text files from each docx file:

  1. The text before the tracked changes (original)
  2. The text after the tracked changes (final)

Closed for the following reason: the question is answered, right answer was accepted by Alex Kemp
close date 2016-02-19 09:32:48.484561

1 Answer


answered 2014-02-08 22:48:00 +0200

oweng

All I can say is, good luck. You are going to need an XML parser because using regular expressions to parse (X)HTML / XML can lead to a zalgo situation. It is easy enough to extract the content and format it:

unzip -p "$f_docx" word/document.xml | xmllint --format - > "$f_formatted_xml"

... but after that it becomes an exercise in:

  1. Finding all the paragraphs (all content between <w:p> and </w:p>) and extracting these.
  2. Dividing the paragraphs into those without any recorded changes and those with some kind of change (refer to section 17.13.5 of the OOXML specification for a list of all possible tags; there are 37 in total).
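
As a rough illustration of the two steps above, here is a minimal Python sketch that uses only the standard library. It is not a complete solution: it assumes that only the four most common revision containers matter (w:ins, w:del, w:moveFrom, w:moveTo), it ignores the other revision tags in section 17.13.5 (formatting changes, inserted or deleted paragraph marks, table changes), and it does not treat paragraphs nested in text boxes specially. The function names paragraphs and extract are made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(tag):
    """Return the namespace-qualified name ElementTree uses for w:<tag>."""
    return f"{{{W}}}{tag}"


# Assumption: only these four revision containers are handled.
ONLY_IN_ORIGINAL = {q("del"), q("moveFrom")}  # content removed by the changes
ONLY_IN_FINAL = {q("ins"), q("moveTo")}       # content added by the changes
TEXT_TAGS = {q("t"), q("delText")}


def _collect(node, skip, parts):
    """Walk the children of node, skipping subtrees whose tag is in skip."""
    for child in node:
        if child.tag in skip:
            continue
        if child.tag in TEXT_TAGS:
            parts.append(child.text or "")
        else:
            _collect(child, skip, parts)


def paragraphs(document_xml, version):
    """Return one string per <w:p>; version is 'original' or 'final'."""
    if version not in ("original", "final"):
        raise ValueError("version must be 'original' or 'final'")
    skip = ONLY_IN_FINAL if version == "original" else ONLY_IN_ORIGINAL
    root = ET.fromstring(document_xml)
    result = []
    for p in root.iter(q("p")):
        parts = []
        _collect(p, skip, parts)
        result.append("".join(parts))
    return result


def extract(docx, version):
    """Read word/document.xml from a .docx (path or file object)."""
    with zipfile.ZipFile(docx) as archive:
        xml_bytes = archive.read("word/document.xml")
    return "\n".join(paragraphs(xml_bytes, version))
```

Calling extract("input-file.docx", "original") and extract("input-file.docx", "final") should then give the two texts asked for, and looping over a list of file names covers the batch case. Deleted text lives in w:delText rather than w:t, which is why both tags are collected and the choice between versions is made by skipping whole container elements.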

Comments

Thanks for the solution. But I have to deal with both doc and docx formats, so I need simple command-line tools with parameters for extracting either the original or the final text.

sriram ( 2014-02-10 07:16:11 +0200 )

For DOC you have an even more significant challenge as it is a binary format. Please report back if you manage to find any such tool.

oweng ( 2014-02-11 03:57:36 +0200 )
