Why are LO documents so large compared with MS Word?

I have noticed that when I write a document and save it in both LO (.odt) and Word (.doc or .docx), the LO document is typically twice as large as the Word document. Why is this? I thought the open document format was supposed to be more compact than the Office format.

Here are some statistics based on tests with a text file version of The Histories by Herodotus. The UTF-8 text file is here (download, rename from JPG to ZIP and extract the TXT file) for others to test various versions of different products in a comparable manner.

Test parameters

The text file has these dimensions:

  • File size (bytes): 1540211 (on an ext3 volume).
  • Characters (including spaces): 1535760.
  • Words: 270607.
  • Lines: 33494 (it contains many CRLF characters to try and replicate the page layout of the original work).

I have tested opening this text file using:

  • LibreOffice Writer v4.1.4.2 Build ID: 0a0440ccc0227ad9829de5f46be37cfb6edcf72
  • MS Office 2011 Word for Mac v14.3.9

… and saving to these formats:

  • ODF v1.2 Extended i.e., ODT.
  • MS Office 97/2000/XP/2003 (.doc) i.e., the old binary format.
  • MS Office 2007/2010 XML (.docx) i.e., OOXML.

I have also tested re-saving the various DOC and DOCX files back to ODT.

Results

For the versions tested and for the same process when using plain text, ODT produces a noticeably smaller file when compared with the DOC and DOCX equivalents. Note that while file sizes can be directly compared on a per-equivalent-format basis (e.g., DOC as saved by LO with DOC as saved by MSO), it is inaccurate to compare percentage gains across formats where the original files differ in size. For this reason the percentages shown in the tables are all in relation to the size of the original text file. Cross-comparisons are more easily made in this way.

For the DOC and DOCX formats LO produces a:

  • Larger DOC file than MSO 2011 e.g., ~3830000 vs 2616320 bytes / ~250% vs ~170% of the TXT file.
  • Smaller DOCX file than MSO 2011 e.g., ~696000 vs 1043855 bytes / ~45% vs ~68% of the TXT file.
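As a rough check of those percentages, here is a minimal Python sketch that expresses each size as a share of the original text file. The byte counts are the ones quoted above (the two LO figures are approximate); the function name percent_of_txt is just an illustrative label.

```python
# Size of the original UTF-8 text file in bytes, as listed in the test parameters.
TXT_SIZE = 1540211

# Saved file sizes in bytes, copied from the figures above.
sizes = {
    "DOC (LO)": 3830000,         # approximate
    "DOC (MSO 2011)": 2616320,
    "DOCX (LO)": 696000,         # approximate
    "DOCX (MSO 2011)": 1043855,
}


def percent_of_txt(size_bytes, txt_size=TXT_SIZE):
    """Return size_bytes as a percentage of the original text file size."""
    return 100.0 * size_bytes / txt_size


for label, size in sizes.items():
    print(f"{label}: {percent_of_txt(size):.0f}% of the TXT file")
```

Using the text file as the common baseline is what makes the DOC and DOCX numbers comparable with each other.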

Round trips, whether into different formats or the same format, do vary these figures. I have not bothered to show repeated saves into the originating format, with the exception of .doc (MSO created) → .doc (LO saved) and .docx (MSO created) → .docx (LO saved). The trend in file sizes when saving to non-native formats will tend to be: earlier versions of LO will create smaller files and later versions larger files. This is due to improved understanding of the underlying specification and improved implementation of corner cases, etc.

Why are there differences?

So why do the two products differ when saving to the same file format? Why does a DOCX of plain text, as originally saved by MSW2011, come out at ~68%, yet become ~46% when it is subsequently saved by LOv4142? Conversely, why does a DOCX of plain text, as originally saved by LOv4142, come out at ~45%, yet become ~57% when it is subsequently saved by MSW2011? What is going on?

This is largely due to two different implementations of the underlying specification (OOXML in this case) and is best demonstrated by a simple example. Here is part of the XML generated by MSW2011 for a DOCX containing only a single lowercase “a” character (i.e., word/document.xml):

<w:p w:rsidR="00FC5F07" w:rsidRDefault="00C8189F">
    <w:r>
        <w:t>a</w:t>
    </w:r>
    <w:bookmarkStart w:id="0" w:name="_GoBack"/>
    <w:bookmarkEnd w:id="0"/>
</w:p>

… and here is the XML produced by LOv4142 for the same type of file with the same content:

<w:p>
    <w:pPr>
        <w:pStyle w:val="style0"/>
        <w:rPr/>
    </w:pPr>
    <w:r>
        <w:rPr/>
        <w:t>a</w:t>
    </w:r>
</w:p>

There are other differences in that file (and that is only one file among several in the DOCX) but hopefully it illustrates the point about implementations differing. Both are perfectly valid, but they contain slightly different information.
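To make the comparison a bit more concrete, here is a small Python sketch that parses the two fragments shown above and lists the elements each one contains. The fragments omit the namespace declaration, so the sketch wraps them in a throwaway root element that declares the WordprocessingML namespace; that wrapper and the function name summarize are my own additions for illustration only.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace that the "w:" prefix refers to in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

MSW_FRAGMENT = """<w:p w:rsidR="00FC5F07" w:rsidRDefault="00C8189F">
    <w:r>
        <w:t>a</w:t>
    </w:r>
    <w:bookmarkStart w:id="0" w:name="_GoBack"/>
    <w:bookmarkEnd w:id="0"/>
</w:p>"""

LO_FRAGMENT = """<w:p>
    <w:pPr>
        <w:pStyle w:val="style0"/>
        <w:rPr/>
    </w:pPr>
    <w:r>
        <w:rPr/>
        <w:t>a</w:t>
    </w:r>
</w:p>"""


def summarize(fragment):
    """Return (element names in document order, visible text) for one paragraph."""
    # Wrap the fragment so the "w:" prefix is declared and the XML parses.
    wrapped = f'<root xmlns:w="{W_NS}">{fragment}</root>'
    paragraph = ET.fromstring(wrapped)[0]
    names = [element.tag.split("}")[1] for element in paragraph.iter()]
    text = "".join(paragraph.itertext()).strip()
    return names, text


for label, fragment in (("MSW2011", MSW_FRAGMENT), ("LOv4142", LO_FRAGMENT)):
    names, text = summarize(fragment)
    print(label, len(names), "elements:", names, "text:", repr(text))
```

Both fragments carry the same visible text; only the surrounding markup differs, which is where the size difference comes from.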

If we now open the DOCX produced by LOv4142 (the immediately prior example) using MSW2011 and re-save it, again to the DOCX format, it will look like:

<w:p w:rsidR="00CD406C" w:rsidRDefault="002833E4">
    <w:bookmarkStart w:id="0" w:name="_GoBack"/>
    <w:bookmarkEnd w:id="0"/>
    <w:proofErr w:type="gramStart"/>
    <w:r>
        <w:t>a</w:t>
    </w:r>
    <w:proofErr w:type="gramEnd"/>
</w:p>

The XML is different again, which is typical for round trips like this. The same process also happens in reverse, but LO generally takes a simpler approach, tending to produce (IMO) more consistent and cleaner XML. This is not always the case though. What I have shown here is a basic example. Start adding in headers, footers, section breaks, paragraph styles, direct formatting and other elements (all still text) and the underlying markup can become quite complex and the file size in different formats can vary accordingly.
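If you want to see where the bytes go in your own files, keep in mind that DOCX and ODT are both ZIP archives of XML parts, so Python's standard zipfile module can list each part with its raw and compressed size. The following is only a sketch: the helper names are hypothetical, and the demo archive is built in memory as a stand-in for a real document. To inspect a real file, pass its path to member_sizes instead.

```python
import io
import zipfile


def member_sizes(source):
    """Return (name, uncompressed bytes, compressed bytes) for each archive member."""
    with zipfile.ZipFile(source) as archive:
        return [
            (info.filename, info.file_size, info.compress_size)
            for info in archive.infolist()
        ]


def build_demo_docx():
    """Build a tiny in-memory ZIP that mimics a DOCX layout (demo data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Highly repetitive markup, so it compresses well.
        archive.writestr("word/document.xml", "<w:t>a</w:t>" * 1000)
    buffer.seek(0)
    return buffer


for name, raw, packed in member_sizes(build_demo_docx()):
    print(f"{name}: {raw} bytes raw, {packed} bytes compressed")
```

Comparing the raw size of word/document.xml (or content.xml in an ODT) between two saves of the same text shows how much of the difference is markup rather than content.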

Please take a look at this thread:

Thanks for answering so promptly. However, the link you gave doesn’t actually address the problem I raised. My documents are text only, with no pictures or anything else in them. If I have created my documents myself then, depending on how I save them (odt or doc/docx), the file size is wildly different, with odt typically being twice the size of the doc/docx files. The size is also consistent: if I start with a doc file, edit it and resave it as a doc file, the size remains broadly the same. The same goes for odt files. If I start with an odt file and save it as a doc/docx file the size shrinks, and conversely if I start with a doc/docx file and save it as odt the size bloats.

@Enkel, can I get you to cut and paste the content from this answer back into your question, as additional information? Thanks for clarifying. I will see if I can expand on my original answer (in the linked thread by @mariosv), but want to do some testing of this first.

@Enkel, which file sizes are you talking about?

Enkel, you have such a high rank, yet you don’t have the smartness to NOT POST AN ANSWER TO A QUESTION OF YOURS THAT ISN’T AN ANSWER AT ALL !
