We use soffice to convert office files to PDFs with the following command:
soffice --headless --convert-to pdf /path/to/input_file.docx --outdir /tmp/
Running this command twice in a row on the same input file produces PDFs that differ at the byte level. Looking inside the PDFs, the cause appears to be that the internal creation date, and therefore the checksum, differs between runs:
<</Author<FEFF004E0061007400680061006E0020004D00650072007A00760069006E0073006B00690073>
/Creator<FEFF005700720069007400650072>
/Producer<FEFF004C0069006200720065004F0066006600690063006500200036002E0030>
/CreationDate(D:20180713195647+02'00')>>
...
<</Size 30/Root 28 0 R
/Info 29 0 R
/ID [ <C42F3EC6A5ECABE5AA0CE84B1584E2D7>
<C42F3EC6A5ECABE5AA0CE84B1584E2D7> ]
/DocChecksum /1EB116D6BE6120C6F961F99F14BCC7DD
>>>>
We hash the files to prevent duplication and for some caching, but these differences mean that identical inputs no longer produce identical hashes. Is there a way to make soffice omit the creation date (and the other metadata fields, such as the author), or to derive them from the docx file instead?
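
To make the caching problem concrete, here is a minimal sketch of a workaround on the hashing side rather than a soffice option. It hashes the PDF bytes after blanking the fields that appear to vary in the excerpt above (/CreationDate, /ID and /DocChecksum). The function name normalized_pdf_hash and the regular expressions are hypothetical, and the sketch assumes those fields are stored uncompressed, as in the excerpt, and that nothing else changes between runs.

```python
import hashlib
import re

# Fields that appear to change between runs in the excerpt above.
# Assumption: they are stored as plain text, not inside compressed streams.
VOLATILE_PATTERNS = [
    re.compile(rb"/CreationDate\s*\(D:[^)]*\)"),
    re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]+>\s*<[0-9A-Fa-f]+>\s*\]"),
    re.compile(rb"/DocChecksum\s*/[0-9A-Fa-f]+"),
]


def normalized_pdf_hash(data: bytes) -> str:
    """Return a SHA-256 hex digest of the PDF bytes with volatile fields removed.

    The stripped bytes are used only for hashing; they are not a valid PDF
    and should never be written back to disk.
    """
    for pattern in VOLATILE_PATTERNS:
        data = pattern.sub(b"", data)
    return hashlib.sha256(data).hexdigest()
```

With this, two conversions of the same docx would be compared via normalized_pdf_hash(open(path, "rb").read()) instead of a hash of the raw file. It does not touch the author fields, which stay in the hash because they come from the document and should be stable across runs.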