I want to write a script that extracts some text from a DOC file provided by my university. My first thought was to use LibreOffice Writer to convert the DOC file to TXT. However, the resulting text file does not contain all the text from the original document, and in particular it is missing the text that I need. This may be because the text I want sits inside a framed box in the original document, but I am not sure.
Is there a way to ensure that the conversion to TXT doesn’t throw away any textual data?
Is there some other way to extract textual data from a document?
By the way, I already tried converting the original DOC file to ODT, DOCX and RTF first, but in each case the subsequent conversion to TXT also discards the text inside the framed box.
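
One possible approach is to skip the TXT export and read the text straight out of the ODT file. This is only a sketch, and it assumes the ODT produced by LibreOffice still contains the framed text, which is not confirmed here. An ODT file is a ZIP archive whose document body lives in content.xml, and text boxes are stored there as paragraphs nested inside draw:frame / draw:text-box elements. The function names below (extract_paragraphs, extract_text) are made up for illustration, and the code makes no attempt to handle tabs, line breaks or repeated spaces encoded as separate elements.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import List

# Namespace used by ODF for text elements such as <text:p> and <text:h>.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
PARA_TAGS = {"{%s}p" % TEXT_NS, "{%s}h" % TEXT_NS}


def _own_text(elem, nested):
    """Return the text of elem, excluding nested paragraphs.

    Paragraphs found inside elem (for example inside a text box)
    are appended to `nested` so the caller can emit them separately.
    """
    parts = [elem.text or ""]
    for child in elem:
        if child.tag in PARA_TAGS:
            nested.append(child)
        else:
            parts.append(_own_text(child, nested))
        parts.append(child.tail or "")
    return "".join(parts)


def extract_paragraphs(content_xml: bytes) -> List[str]:
    """Return every non-empty paragraph and heading in content.xml,
    including those nested inside frames and text boxes."""
    root = ET.fromstring(content_xml)
    out = []

    def visit(elem):
        if elem.tag in PARA_TAGS:
            nested = []
            text = _own_text(elem, nested)
            if text.strip():
                out.append(text)
            # Emit paragraphs that were nested in this one (e.g. frame text).
            for inner in nested:
                visit(inner)
        else:
            for child in elem:
                visit(child)

    visit(root)
    return out


def extract_text(odt_path) -> str:
    """Read content.xml from an ODT file and return its text, one paragraph per line."""
    with zipfile.ZipFile(odt_path) as archive:
        return "\n".join(extract_paragraphs(archive.read("content.xml")))
```

Calling extract_text("document.odt") (a hypothetical file name) would return the text as a single string, with the text of a frame placed right after the paragraph it is anchored in.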