Use soffice to convert from docx -> pdf deterministically.

asked 2018-07-13 13:28:06 +0100

this post is marked as community wiki


We use soffice to convert office files to PDFs with the following command:

    soffice --headless --convert-to pdf /path/to/input_file.docx --outdir /tmp/

Running this command repeatedly on the same input file produces PDFs that differ at the byte level. Inspecting the PDF suggests the cause is the embedded creation date, which changes on every run, so the checksum changes with it:

    <</Author<FEFF004E0061007400680061006E0020004D00650072007A00760069006E0073006B00690073>
    /Creator<FEFF005700720069007400650072>
    /Producer<FEFF004C0069006200720065004F0066006600690063006500200036002E0030>
    /CreationDate(D:20180713195647+02'00')>>
    ...
    <</Size 30/Root 28 0 R
    /Info 29 0 R
    /ID [ <C42F3EC6A5ECABE5AA0CE84B1584E2D7>
    <C42F3EC6A5ECABE5AA0CE84B1584E2D7> ]
    /DocChecksum /1EB116D6BE6120C6F961F99F14BCC7DD
    >>>>

We hash the output files to prevent duplication and for some caching, but these differences mean identical inputs never produce the same hash. Is there a way to make soffice omit the creation date (and the other metadata fields, such as the author), or derive them from the docx file instead?
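
As a stopgap on the hashing side (it does not change what soffice writes), one option is to hash a normalized copy of the bytes with the volatile fields blanked out. Below is a minimal Python sketch. It assumes the only differences between runs are the /CreationDate, /ID and /DocChecksum entries shown above, and that they are stored uncompressed as in that excerpt. The function name normalized_pdf_hash is hypothetical.

```python
import hashlib
import re

# Entries assumed to vary between runs, taken from the PDF excerpt in the question.
# Each is replaced by a fixed placeholder before hashing.
_VOLATILE_PATTERNS = [
    (re.compile(rb"/CreationDate\s*\([^)]*\)"), b"/CreationDate()"),
    (re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]"), b"/ID[<><>]"),
    (re.compile(rb"/DocChecksum\s*/[0-9A-Fa-f]+"), b"/DocChecksum/0"),
]


def normalized_pdf_hash(pdf_bytes: bytes) -> str:
    """Return a SHA-256 hex digest of the PDF with the volatile entries blanked.

    The normalized bytes are only used for hashing; they are not a valid PDF
    and should not be written back to disk.
    """
    for pattern, placeholder in _VOLATILE_PATTERNS:
        pdf_bytes = pattern.sub(placeholder, pdf_bytes)
    return hashlib.sha256(pdf_bytes).hexdigest()
```

For the command above, the input would be the bytes of /tmp/input_file.pdf, for example via pathlib.Path("/tmp/input_file.pdf").read_bytes(). If anything else differs between runs, the hashes will still differ, so it is worth checking on two real conversions first.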


Comments

A nitpick: placing --outdir /tmp/ anywhere other than immediately after --convert-to pdf will be an error.

And please don't post as a wiki.

Mike Kaganski ( 2018-07-13 13:33:41 +0100 )

Thanks Mike. The command above actually works for us with that argument order (I did have a typo: my input was .pdf instead of .docx). Is there a way I can change this from being a wiki post?

mwakerman ( 2018-07-14 05:30:46 +0100 )