Use soffice to convert from docx -> pdf deterministically.

mwakerman · July 13, 2018, 11:28am

We use soffice to convert office files to PDFs with the following command:

    soffice --headless --convert-to pdf /path/to/input_file.docx --outdir /tmp/

Running this command repeatedly on the same input file produces PDFs that differ at the byte level. Looking inside the PDF, the cause appears to be that the internal creation date, and therefore the checksum, differ between runs:

    <</Author<FEFF004E0061007400680061006E0020004D00650072007A00760069006E0073006B00690073>
    /Creator<FEFF005700720069007400650072>
    /Producer<FEFF004C0069006200720065004F0066006600690063006500200036002E0030>
    /CreationDate(D:20180713195647+02'00')>>
    ...
    <</Size 30/Root 28 0 R
    /Info 29 0 R
    /ID [ <C42F3EC6A5ECABE5AA0CE84B1584E2D7>
    <C42F3EC6A5ECABE5AA0CE84B1584E2D7> ]
    /DocChecksum /1EB116D6BE6120C6F961F99F14BCC7DD
    >>>>

We hash the files to prevent duplication and for some caching, but these differences stop the file hash from matching. Is there a way to have soffice ignore the creation date (and the rest of the author fields), or to derive them from the docx file instead?
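
In case it helps anyone with the same problem, here is a minimal workaround sketch on the hashing side rather than an soffice option. It assumes the only bytes that change between runs are the three fields in the dump above (/CreationDate, /ID and /DocChecksum) and that they are stored uncompressed, as they are in that dump. The helper name stable_pdf_digest is hypothetical, and the normalized bytes are meant only for hashing, not for writing back out as a PDF.

```python
import hashlib
import re

# Fields seen to differ between runs in the dump above.
# Assumption: they appear as plain (uncompressed) text in the file.
_VOLATILE_FIELDS = [
    (re.compile(rb"/CreationDate\s*\([^)]*\)"), b"/CreationDate()"),
    (re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]"), b"/ID[<><>]"),
    (re.compile(rb"/DocChecksum\s*/[0-9A-Fa-f]+"), b"/DocChecksum/0"),
]


def stable_pdf_digest(pdf_bytes: bytes) -> str:
    """Return a SHA-256 hex digest of the PDF with the volatile fields blanked.

    Hypothetical helper: the result is only useful as a cache or dedup key.
    """
    for pattern, placeholder in _VOLATILE_FIELDS:
        pdf_bytes = pattern.sub(placeholder, pdf_bytes)
    return hashlib.sha256(pdf_bytes).hexdigest()
```

Usage would be something like stable_pdf_digest(open("/tmp/input_file.pdf", "rb").read()). If the cache only needs to know whether a given input was already converted, hashing the input .docx instead avoids the problem entirely.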

mikekaganski · July 13, 2018, 11:33am

A nitpick: placing --outdir /tmp/ anywhere other than immediately after --convert-to pdf is an error.

And please don’t post as wiki.

mwakerman · July 14, 2018, 3:30am

Thanks Mike. The command above actually works for us with that argument order (I did have a typo: my input was written as .pdf instead of .docx). Is there a way I can change this from being a wiki post?

AlexKemp · August 10, 2020, 11:56am