[Tutorial] Differences between MS Word (eg .doc and .docx) and LO files (eg .odt)

Note: This post is based on AOO but 99% should be the same for LO

LO Writer, AOO Writer and MS Word (and other word processing programs) are all similar, but none are identical. If you open a Writer .odt file in Word, or a Word .doc, .rtf or .docx file in Writer, you will often notice differences.

The programs do many of the same things, including text, styles, tables, images, bold, italics, headings, page numbers, headers, footers etc. However, Writer does some things which Word does not do, and Word does some things which Writer does not do. Each program can store its own data in its own file format, but obviously cannot store this extra data in the other program’s file format as there is nowhere for it to go and/or nothing in the other program to see it.

An example: Writer and Word are based on different schools of typography which can be slightly confusing. Word considers the page header/footer areas to be part of “print matter” while Writer considers them to be “marginalia”. You may need to change the top and/or bottom margin widths by the height of ‘one line + header/footer spacing’ if you have page headers/footers and you are trying to replicate a .doc layout in a .odt file.
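As a rough worked example of that margin adjustment, here is a small Python sketch. The function name and the sample measurements are made up for illustration, and it assumes the Writer margin has to be made smaller, because Writer places the header/footer and its spacing between the page margin and the body text. Check the result against your own document rather than trusting the numbers.

```python
def writer_margin_cm(word_margin_cm, line_height_cm, spacing_cm):
    """Estimate the Writer page margin that reproduces a Word margin.

    Assumption: Word counts the header/footer as part of the margin,
    while Writer adds the header/footer height and its spacing on top
    of the page margin, so the Writer margin has to be smaller.
    """
    adjusted = word_margin_cm - (line_height_cm + spacing_cm)
    if adjusted < 0:
        raise ValueError("header/footer is taller than the Word margin")
    return round(adjusted, 2)


# Hypothetical numbers: a 2.54 cm Word top margin, a one-line header
# about 0.5 cm high, and 0.5 cm of header spacing.
print(writer_margin_cm(2.54, 0.5, 0.5))
```

With those sample values the Writer top margin comes out at about 1.54 cm, with the header and its spacing making up the remaining 1 cm.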

When you open a .doc, .docx or .rtf file, what you see may not be exactly how the author wrote it - formatting in particular is often changed. .rtf files are particularly limited in what they can store and suffer the most changes or deletions. .txt files can store only the text characters - they cannot store any formatting or font information.

When using Writer you are strongly encouraged always to save all documents as .odt files.

That way you know that all of your document’s content and formatting will be saved. If someone irrationally asks you to send them a .doc file, question the request, and offer to send them a .odt file instead, as all versions of Microsoft Office later than 2007 claim to be able both to read and to write .odt files. If MS Word corrupts the .odt file, get the recipient to complain to Microsoft. If the requester insists on a .doc file, then create a .doc file from the .odt file, and delete the .doc after sending it.

Always work in, and save all documents as, .odt files.

Don’t forget that Google Docs can open and export .odt files and Microsoft is now feeling a lot of pressure from the .odt format.

If you save your work in any format other than .odt (eg .doc, .rtf etc) you are almost certain to lose something. In general, it is the more complex things which get lost or mangled, such as Edit > Changes, bullet shapes, colours etc.

1. Bullets, list items and numbered items in .doc files often display incorrectly

Bullets, list items and numbered items in MS Word .doc files often display incorrectly when the file is opened with Writer and the corruption persists when the file is saved as a .odt file.

2. Textboxes in .docx files are NOT part of the Microsoft OOXML standard

Later versions of MS Word which write .docx files often use Textboxes. Textboxes are NOT part of the OOXML International Standard - they are a Microsoft add-on which is proprietary. See OOXML/Markup Compatibility and Extensibility which says:
“Although the OOXML spec defines a specific set of allowed elements, Microsoft sometimes extend this with additional proprietary elements that are specific to new versions of Office. For example, if you insert a shape into a document in Word 2013, it will be defined in terms of a “word processing shape” element structure, which is not part of the OOXML spec. For the purposes of compatibility with older versions of Word however, they include a second version of the shape which uses an element structure that is defined in the spec, albeit using the legacy VML drawing format.”

LibreOffice recognises Textboxes.
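If you want to check whether a particular .docx file carries this kind of markup, a minimal Python sketch follows. It relies on a .docx file being a zip archive with the main text in word/document.xml. The function name is my own, and the three element names it counts (the compatibility wrapper, the newer "word processing shape" textbox and the legacy VML textbox) are the ones I would expect to see, not a complete list.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs as they appear in word/document.xml.
MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"
WPS = "http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
VML = "urn:schemas-microsoft-com:vml"


def count_textbox_markup(docx_path):
    """Count the elements that typically carry a Word textbox."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    def count(namespace, name):
        return sum(1 for _ in root.iter(f"{{{namespace}}}{name}"))

    return {
        "alternate_content": count(MC, "AlternateContent"),  # wrapper
        "wps_textbox": count(WPS, "txbx"),                   # newer shape
        "vml_textbox": count(VML, "textbox"),                # VML fallback
    }


# Example call (hypothetical file name):
# print(count_textbox_markup("report.docx"))
```

A file where the first two counts are non-zero is using the extension mechanism described in the quote above, with the VML copy as the fallback.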

3. Saving as .doc files is not recommended but …

… if you are forced to save as a .doc file, be sure to select Word 97 / 2000 / XP as it is the most recent format. Word 95 and Word 6.0 .doc formats are very old and obsolete and less comprehensive than Word 97 / 2000 / XP .doc format.
If you attempt to save a document in any format other than .odt, Writer warns you in a pop-up window that you may lose data. Unfortunately, many users switch off this warning. If you do not get this warning message, you can switch it back on with Tools > Options > Load/Save > General …
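If you only need a throw-away .doc copy, you can also leave the .odt file as the master and export from the command line. The Python sketch below assumes LibreOffice, with its soffice program on your PATH. AOO may need different options, and the two function names are just for illustration.

```python
import subprocess
from pathlib import Path


def build_convert_command(odt_path, out_dir, soffice="soffice"):
    """Build a LibreOffice command line that exports an .odt file as .doc."""
    return [
        soffice,
        "--headless",                      # run without opening a window
        "--convert-to", "doc:MS Word 97",  # the Word 97 / 2000 / XP filter
        "--outdir", str(out_dir),
        str(odt_path),
    ]


def export_doc_copy(odt_path, out_dir):
    """Write a .doc copy into out_dir; the .odt original stays untouched."""
    subprocess.run(build_convert_command(odt_path, out_dir), check=True)
    return Path(out_dir) / (Path(odt_path).stem + ".doc")


# Example call (hypothetical file and folder names):
# export_doc_copy("letter.odt", "to_send")
```

The same data-loss caveats apply to the exported copy as to File > Save As, so check the .doc before sending it.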

4. Microsoft Word Viewer

If you regularly receive .doc or .docx files, you will find it very useful to download the free Microsoft Word Viewer from Microsoft Word Viewer. You can then open the .doc or .docx file, and check to see if any content is missing and, if necessary, copy the content into Writer.

5. MS Word can read and write .odt files

All versions of MS Word later than Word 2007 claim to be able both to read and write .odt files and Microsoft lists its partial support of .odt files in Differences between the OpenDocument Text odt format and the Word docx format.

So, if someone sends you a .doc or .docx file you cannot read, ask them to send you a .odt file instead. If MS Word does not create a proper .odt file, ask the sender to complain vigorously to Microsoft. Similarly, if you send someone who uses MS Word a .odt file, and MS Word does not present it correctly, ask the person who received it to complain vigorously to Microsoft.

Note that LO has some Microsoft compatibility options available under Tools > Options > Load/Save > VBA Properties…, and Tools > Options > Load/Save > Microsoft Office …, which may need changing.

6. Academic study of Interoperability Issues

For an academic study of the problems see the University of Illinois’ 2008 paper Lost in Translation: Interoperability Issues for Open Standards. I felt the paper did not cover very well the fact that the key benefit of an Open Standard is that it guarantees that anyone can extract all the information from the data file without needing to have the application because the file structure is not a commercial secret.

Similarly, I felt the paper only briefly mentioned that applications must support all the “items” coded in the file. Interoperability only exists across those functions which are implemented in both programs and in the file format being used to store the document.
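To illustrate the point about getting information out without the application: a .odt file is a zip archive and the document text lives in content.xml inside it, so a few lines of Python with only the standard library can read it. This is a simplified sketch with a function name of my own choosing. It ignores things like tabs, repeated spaces and text nested inside frames, which are stored as separate elements.

```python
import zipfile
import xml.etree.ElementTree as ET

# The "text" namespace defined by the OpenDocument standard.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odt_paragraphs(odt_path):
    """Return the text of every heading and paragraph in an .odt file."""
    with zipfile.ZipFile(odt_path) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    wanted = {f"{{{TEXT_NS}}}h", f"{{{TEXT_NS}}}p"}
    return ["".join(el.itertext()) for el in root.iter() if el.tag in wanted]


# Example call (hypothetical file name):
# for line in odt_paragraphs("notes.odt"):
#     print(line)
```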

Further information on the history of the .doc format can be found in the wiki article at Doc (computing) which includes:

Specification

*Because the DOC file format was a closed specification for many years, inconsistent handling of the format persists and may cause some loss of formatting information when handling the same file with multiple word processing programs. Some specifications for Microsoft Office 97 binary file formats were published in 1997 under a restrictive license, but these specifications were removed from online download in 1999. Specifications of later versions of Microsoft Office binary file formats were not publicly available.

The DOC format specification was available from Microsoft on request since 2006 under restrictive RAND-Z terms until February 2008. Sun Microsystems and OpenOffice.org reverse engineered the file format. On February 15, 2008, Microsoft released a .DOC format specification under the Microsoft Open Specification Promise. However, this specification does not describe all of the features used by DOC format and reverse engineered work remains necessary.

Since 2008 the specification has been updated several times; the last change was made in September 2015.*

7. Microsoft’s OOXML “pseudo-standard” format (.docx etc)

See Why you should never use Microsoft’s OOXML pseudo-standard format where Italo Vignoli of The Document Foundation, the organization responsible for developing LibreOffice, talks about “the dirty tricks Microsoft uses to break interoperability and keep users locked into their platform”.

It includes “ … each version of MS Office since 2007 has a different and non standard implementation of OOXML, which is defined as “transitional” because it contains elements which are supposed to be deprecated at standard level, but are still there for compatibility reasons. Although LibreOffice manages to read and write OOXML in a fairly appropriate way, it will be impossible to achieve a perfect interoperability because of these different non standard versions.”

See MS Office 2007 OOXML file format (docx, xslx, pptx, ppsx) for a discussion of OOXML and why many consider OOXML is a deliberate attempt by Microsoft to make it almost impossible for other vendors to read or write fully compliant OOXML files. The “standard” is 6,000 pages long and it is estimated a full import or export filter would take 50 to 500 person-years to write.

And after you have done all that work, all it takes is for Microsoft to make another not-part-of-the-standard change or addition to the so called “standard” … and your interface no longer works.

8. AOO Help has a section About Converting Microsoft Office Documents …

… which discusses some differences. LO probably has one too - check it as it will be more up to date than this.

About Converting Microsoft Office Documents

OpenOffice can automatically open Microsoft Office 97/2000/XP .doc document files. However, some layout features and formatting attributes in more complex Microsoft Office documents are handled differently in OpenOffice or are unsupported. As a result, converted files require some degree of manual reformatting. The amount of reformatting that can be expected is proportional to the complexity of the structure and formatting of the source document. OpenOffice cannot run Visual Basic Scripts, but can load them for you to analyse.

The most recent versions of OpenOffice can load, but not save, the Microsoft Office Open XML document formats with the extensions .docx, .xlsx, and .pptx. The same versions can also run some Microsoft Excel Visual Basic scripts, if you enable this feature at Tools - Options - Load/Save - VBA Properties.

The following lists provide a general overview of Microsoft Office features that may cause conversion challenges. These will not affect your ability to use or work with the content of the document once the MS file has been saved as a .odt etc file.

Microsoft Word

  1. AutoShapes

  2. Revision marks

  3. OLE objects

  4. Certain controls and Microsoft Office form fields

  5. Indexes

  6. Tables, frames and multi-column formatting

  7. Hyperlinks and bookmarks

  8. Microsoft WordArt graphics

  9. Animated characters/text

Microsoft PowerPoint

  1. AutoShapes

  2. Tab, line and paragraph spacing

  3. Master background graphics

  4. Grouped objects

  5. Certain multimedia effects

Microsoft Excel

  1. AutoShapes

  2. OLE objects

  3. Certain controls and Microsoft Office form fields

  4. Pivot tables

  5. New chart types

  6. Conditional formatting

  7. Some functions/formulae (see below)

One example of differences between Calc and Microsoft Excel is the handling of boolean values. Enter TRUE to cells A1 and A2.

In Calc, the formula =A1+A2 returns the value 2, and the formula =SUM(A1;A2) returns 2.

In Excel, the formula =A1+A2 returns 2, but the formula =SUM(A1,A2) returns 0.
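The difference can be modelled in a few lines of Python. This is only a toy model of the behaviour described above, not how either program is implemented, and the function names are mine.

```python
def plus(a, b):
    """Model of the + operator: both programs treat TRUE as 1."""
    return int(a) + int(b)


def calc_sum(*cells):
    """Model of Calc's SUM: boolean cells count as 1 or 0."""
    return sum(int(value) for value in cells)


def excel_sum(*cells):
    """Model of Excel's SUM over cell references: boolean cells are skipped."""
    return sum(value for value in cells if not isinstance(value, bool))


a1, a2 = True, True
print(plus(a1, a2))       # 2 in both programs
print(calc_sum(a1, a2))   # 2
print(excel_sum(a1, a2))  # 0
```

So a spreadsheet which sums a column of TRUE/FALSE cells can give a different total after conversion even though every cell still shows the same value.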

For a detailed overview about converting documents to and from Microsoft Office format, see the Migration Guide.

Opening Microsoft Office Documents That Are Protected With a Password

OpenOffice can open the following Microsoft Office document types that are protected by a password.

Note: If you cannot open an encrypted file, ask someone with MS Word to open it for you, and save it without the password.

Microsoft Office format >>>>> Supported encryption method

Word 6.0, Word 95 >>>>> Weak XOR encryption

Word 97, Word 2000, Word XP, Word 2003 >>>>> Office 97/2000 compatible encryption

Word XP, Word 2003 >>>>> Weak XOR encryption from older Word versions

Excel 2.1, Excel 3.0, Excel 4.0, Excel 5.0, Excel 95 >>>>> Weak XOR encryption

Excel 97, Excel 2000, Excel XP, Excel 2003 >>>>> Office 97/2000 compatible encryption

Excel XP, Excel 2003 >>>>> Weak XOR encryption from older Excel versions

Starting from OpenOffice.org 3.2 or StarOffice 9.2, Microsoft Office files that are encrypted by AES128 can be opened. Other encryption methods are not supported.

Disclaimer: Everything in this post is opinion. Please let me know of any errors so they can be corrected.

There seems to be one shortcoming with the .odt format. I just asked the following question:
How would I search for text across a directory of documents?

In Windows Start > type the word in the search box > press enter > choose More options > sort by directory. Or use zipgrep to search the .odt files.
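If you do not have zipgrep, the same idea can be sketched in Python: open each .odt file as a zip archive, read content.xml, strip the XML tags and look for the word. The function name is made up, the tag stripping is deliberately crude, and it only looks at the main text, not at styles or metadata.

```python
import re
import zipfile
from pathlib import Path


def search_odt_files(directory, needle):
    """Yield the .odt files under directory whose text contains needle."""
    needle = needle.lower()
    for path in sorted(Path(directory).rglob("*.odt")):
        try:
            with zipfile.ZipFile(path) as archive:
                xml = archive.read("content.xml").decode("utf-8")
        except (zipfile.BadZipFile, KeyError, UnicodeDecodeError):
            continue  # not a readable .odt file (eg damaged or encrypted)
        text = re.sub(r"<[^>]+>", "", xml)  # crude removal of XML tags
        if needle in text.lower():
            yield path


# Example call (hypothetical folder and search word):
# for hit in search_odt_files("Documents", "invoice"):
#     print(hit)
```

The search is case-insensitive and also finds words that are split across formatting changes, because the tags are removed before matching.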

Good information about saving to ODT to preserve formatting. Writing from LO (5.x, 6.x) to .docx will not overcome incompatibilities (page break handling, cell addressing differences). The best solution is to upload the document to Google Drive, open it there and use File > Download as to get it in the new format. Trying to pressure a colleague to save or read in a novel format could hamper future communication and will most likely still not get you or them a functional document.