Here are some statistics based on tests with a text file version of The Histories by Herodotus. The UTF-8 text file is here (download, rename from JPG to ZIP and extract the TXT file) for others to test various versions of different products in a comparable manner.

Test parameters

The text file has these dimensions:

  • File size (bytes): 1540211 (on an ext3 volume).
  • Characters (including spaces): 1535760.
  • Words: 270607.
  • Lines: 33494 (it contains many CRLF line breaks, which try to replicate the page layout of the original work).
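For anyone wanting to collect the same kind of figures for their own test file, here is a minimal Python sketch. It is an assumption about how such counts could be made, not the method used for the numbers above: the function names are made up for illustration, words are counted as whitespace-separated tokens, and line breaks count as characters, so the totals may differ slightly from what a word processor or another tool reports.

```python
def text_stats(data: bytes) -> dict:
    """Return simple size statistics for UTF-8 encoded text."""
    text = data.decode("utf-8")
    return {
        "bytes": len(data),                # size of the encoded data
        "characters": len(text),           # Unicode code points, line breaks included
        "words": len(text.split()),        # whitespace-separated tokens
        "lines": len(text.splitlines()),   # a CRLF pair counts as one line break
    }


def stats_for_file(path: str) -> dict:
    """Read a file as raw bytes and compute its statistics."""
    with open(path, "rb") as handle:
        return text_stats(handle.read())


if __name__ == "__main__":
    # Small in-memory sample; pass your own path to stats_for_file() instead.
    print(text_stats("The Histories\r\nby Herodotus\r\n".encode("utf-8")))
```

Reading the file as bytes first keeps the byte count and the character count separate, which matters for UTF-8 text where some characters take more than one byte.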

I have tested opening this text file using:

  • LibreOffice Writer v4.1.4.2 Build ID: 0a0440ccc0227ad9829de5f46be37cfb6edcf72
  • MS Office 2011 Word for Mac v14.3.9

... and saving to these formats:

  • ODF v1.2 Extended i.e., ODT.
  • MS Office 97/2000/XP/2003 (.doc) i.e., the old binary format.
  • MS Office 2007/2010 XML (.docx) i.e., OOXML.

I have also tested re-saving the various DOC and DOCX files back to ODT.

Results

For the versions tested, and for the same process applied to plain text, ODT produces a noticeably smaller file than the DOC and DOCX equivalents. Note that while file sizes can be directly compared on a per-equivalent-format basis (e.g., DOC as saved by LO with DOC as saved by MSO), it is inaccurate to compare the percentage gains across formats where the original files differ in size. For this reason the percentages shown in the tables are all in relation to the size of the original text file. Cross-comparisons are more easily made in this way.

table of results

For the DOC and DOCX formats LO produces a:

  • Larger DOC file than MSO 2011 e.g., ~3830000 vs 2616320 bytes / ~250% vs ~170% of the TXT file.
  • Smaller DOCX file than MSO 2011 e.g., ~696000 vs 1043855 bytes / ~45% vs ~68% of the TXT file.
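The percentages quoted here are each file size expressed relative to the size of the original TXT file. A minimal Python sketch of that arithmetic follows; the function name is made up for illustration, and the two LO sizes are the approximate figures quoted above, so the printed values are approximate as well.

```python
ORIGINAL_TXT_BYTES = 1540211  # size of the TXT file, from the test parameters


def percent_of_original(size_bytes: int, original_bytes: int = ORIGINAL_TXT_BYTES) -> float:
    """Express a saved file's size as a percentage of the original text file."""
    return 100.0 * size_bytes / original_bytes


# Sizes in bytes as quoted in the list above (the LO figures are approximate).
sizes = {
    "DOC saved by LO": 3830000,
    "DOC saved by MSO 2011": 2616320,
    "DOCX saved by LO": 696000,
    "DOCX saved by MSO 2011": 1043855,
}

for label, size in sizes.items():
    print(f"{label}: {percent_of_original(size):.1f}% of the TXT file")
```

Using one fixed denominator for every format is what makes the cross-format comparison meaningful.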

Round trips, whether into a different format or the same format, do vary these figures. I have not bothered to show repeated saves into the originating format, with the exception of .doc (MSO created) -> .doc (LO saved) and .docx (MSO created) -> .docx (LO saved). When saving to non-native formats, the trend in file sizes tends to be that earlier versions of LO create smaller files and later versions create larger ones. This is due to improved understanding of the underlying specification and improved implementation of corner cases, etc.

Why are there differences?

So why do the two products differ when saving to the same file format? Why does a DOCX of plain text, as originally saved by MSW2011, come out at ~68% and then become ~46% when subsequently saved by LOv4142? Conversely, why does a DOCX of plain text, as originally saved by LOv4142, come out at ~45% and then become ~57% when subsequently saved by MSW2011? What is going on?

This is largely due to two different implementations of the underlying specification (OOXML in this case) and is best demonstrated by a simple example. Here is part of the XML generated by MSW2011 for a DOCX containing only a single lowercase "a" character (i.e., word/document.xml):

<w:p w:rsidR="00FC5F07" w:rsidRDefault="00C8189F">
    <w:r>
        <w:t>a</w:t>
    </w:r>
    <w:bookmarkStart w:id="0" w:name="_GoBack"/>
    <w:bookmarkEnd w:id="0"/>
</w:p>

... and here is the XML produced by LOv4142 for the same type of file with the same content:

<w:p>
    <w:pPr>
        <w:pStyle w:val="style0"/>
        <w:rPr/>
    </w:pPr>
    <w:r>
        <w:rPr/>
        <w:t>a</w:t>
    </w:r>
</w:p>

There are other differences in that file (and that is only one file among several in the DOCX) but hopefully it illustrates the point about implementations differing. Both are perfectly valid, but they contain slightly different information.
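A DOCX is a ZIP archive, so the individual files inside it can be listed and read directly. The following Python sketch shows one way to do that with the standard zipfile module. The function names are made up for illustration, and the tiny in-memory archive is only a stand-in for a real DOCX so that the example runs without any external file.

```python
import io
import zipfile


def docx_parts(source):
    """Return (name, uncompressed_size, compressed_size) for each file in the archive."""
    with zipfile.ZipFile(source) as archive:
        return [(info.filename, info.file_size, info.compress_size)
                for info in archive.infolist()]


def read_part(source, name="word/document.xml"):
    """Return one file from the archive as text (the main document body by default)."""
    with zipfile.ZipFile(source) as archive:
        return archive.read(name).decode("utf-8")


# Build a tiny stand-in archive in memory; a real DOCX contains more parts than this.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("word/document.xml", "<w:t>a</w:t>")
    archive.writestr("word/styles.xml", "<w:styles/>")

for name, size, packed in docx_parts(buf):
    print(name, size, packed)
print(read_part(buf))
```

With a real file, pass its path instead of buf, for example docx_parts("test.docx"), where test.docx is a hypothetical file name. Comparing the per-part sizes of two DOCX files saved by different products shows which parts account for the difference in overall size.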

If we now open the DOCX produced by LOv4142 (the second XML example above) using MSW2011 and re-save it, again to the DOCX format, it will look like:

<w:p w:rsidR="00CD406C" w:rsidRDefault="002833E4">
    <w:bookmarkStart w:id="0" w:name="_GoBack"/>
    <w:bookmarkEnd w:id="0"/>
    <w:proofErr w:type="gramStart"/>
    <w:r>
        <w:t>a</w:t>
    </w:r>
    <w:proofErr w:type="gramEnd"/>
</w:p>

The XML is different again, which is typical for round trips like this. The same process also happens in reverse, but LO generally takes a simpler approach, tending to produce (IMO) more consistent and cleaner XML. This is not always the case though. What I have shown here is a basic example. Start adding in headers, footers, section breaks, paragraph styles, direct formatting and other elements (all still text) and the underlying markup can become quite complex and the file size in different formats can vary accordingly.
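One way to make differences like these visible at a glance is to count the elements used in each paragraph. This is a minimal Python sketch using the standard xml.etree.ElementTree module. The function name is made up for illustration, and each fragment is wrapped in a root element that declares the w: prefix, because the snippets shown above omit the namespace declaration carried by the full document.xml.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace that the w: prefix refers to in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def tag_counts(fragment: str) -> Counter:
    """Count element names (without namespace) in a WordprocessingML fragment."""
    wrapped = f'<root xmlns:w="{W_NS}">{fragment}</root>'
    root = ET.fromstring(wrapped)
    counts = Counter()
    for element in root.iter():
        if element is root:
            continue  # skip the wrapper added for parsing
        counts[element.tag.split("}", 1)[-1]] += 1
    return counts


msw = ('<w:p w:rsidR="00FC5F07" w:rsidRDefault="00C8189F"><w:r><w:t>a</w:t></w:r>'
       '<w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p>')
lo = ('<w:p><w:pPr><w:pStyle w:val="style0"/><w:rPr/></w:pPr>'
      '<w:r><w:rPr/><w:t>a</w:t></w:r></w:p>')

print("MSW2011:", dict(tag_counts(msw)))
print("LOv4142:", dict(tag_counts(lo)))
```

The counts only summarise structure. They say nothing about which variant is more correct, since both are valid.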