PDF Encoding Issues with page-image matching when converting DOCX to PDF

christoph.gmeiner · August 30, 2024, 8:59am

Hi everyone,

We use the LibreOffice command line to convert several kinds of documents into PDF.

When we convert DOCX files that contain images to PDF and then open the PDF programmatically, e.g. with Python libraries, the matching between pages and images is messed up.

Concrete example:
1 document with 3 pages and 3 images. Every page has one image. When we transform the doc with MS Word to PDF and open it programmatically, we get the exact matching, meaning {page1: image1, page2: image2, page3: image3}.
When doing the same with libreoffice we get {page1: [image1, image2, image3], page2: [image1, image2, image3], page3: [image1, image2, image3]}

Has anyone experienced a similar situation and found a potential solution for it?

thanks and best regards

Christoph

mikekaganski · August 30, 2024, 9:02am

A concrete example would be: a sample DOCX; a specific command line to convert to PDF; and exact Python code to check.

christoph.gmeiner · August 30, 2024, 9:59am

Sorry, here are more concrete examples:
File:
test02.docx (62.0 KB)

CLI command:

libreoffice --headless --convert-to pdf:writer_pdf_Export --outdir test test02.docx

Python code:

from pypdf import PdfReader

local_pdf_path = "test02.pdf"
temp_dir = "imgs04"
temp_dir_path = "%s/*" % temp_dir
extracted_images = []

files = glob.glob(temp_dir_path)
for f in files:
    os.remove(f)

with open(local_pdf_path, "rb") as pdf_file:
    pdf_reader = PdfReader(pdf_file)

    for page in pdf_reader.pages:
        for img_idx, image_file_object in enumerate(page.images, start=1):

            print(page.page_number)
            print(page.images)
            # create a filename for the image
            img_file_path = (
                    f"page_{page.page_number}_"
                    + f"img_{img_idx}_"
                    + image_file_object.name
            )

            # create a path to the temporary file
            temp_img_file_path = os.path.join(temp_dir, img_file_path)

            # write the content to the temporary file
            with open(temp_img_file_path, "wb") as fp:
                fp.write(image_file_object.data)

            # Collect the image file path for later upload
            extracted_images.append(temp_img_file_path)

print(len(extracted_images))

Output:

0
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
0
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
0
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
1
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
1
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
1
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
2
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
2
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
2
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
9

ajlittoz · August 30, 2024, 10:06am

I can’t comment on macros etc. because I don’t use them. However, your document is the easy case: your images are anchored as characters in the text, so you avoid all the difficulties with frames (which don’t exist in DOCX).

mikekaganski · August 30, 2024, 10:52am

Looks like a pypdf error (unless proven otherwise). The PDF is displayed correctly, so I’d say ask the pypdf maintainers. Of course, it’s possible that they will point to a specific bug in the LibreOffice PDF export; in that case, a bug report would need to be filed.

(A note: the Python script also needs to import os and glob, and relies not only on pypdf, but also on Pillow.)

christoph.gmeiner · August 30, 2024, 11:09am

A pypdf error is very unlikely. When I convert the same DOCX file to PDF directly in MS Word and pass that PDF to the code above, the page-image matching is correct.

So the mismatch happens either in the DOCX import into LibreOffice or in the Writer PDF export.

christoph.gmeiner · August 30, 2024, 11:17am

I submitted a bug to pypdf here: Image/Page Matching Issue when using PDF file converted by LibreOffice · Issue #2822 · py-pdf/pypdf (github.com)

mikekaganski · August 30, 2024, 11:25am

Note that “some file behaves as expected” is not proof that “there’s no problem in the software”. As with any complex file format, there is usually more than one way to do the same thing, and a Word-generated PDF is not necessarily similar in internal structure to a same-looking PDF generated by Writer; hence, it’s not guaranteed that pypdf behaves correctly for a different PDF structure.

And thanks for filing the bug. For pypdf, it might be useful to simply attach the resulting PDF, because for them the generation process is unimportant; only the PDF data is needed.

mikekaganski · October 15, 2024, 12:22pm

Summing up the discussion on the pypdf GitHub issue page: “the standard provides “grammar” not “stories”. This approach is compliant and i’ve seen it many times.”

So:

There is no bug in pypdf. It does its job correctly.
There’s also no bug in LibreOffice. It emits a compliant PDF, even though it’s different from what Word generates.

The problem was in the incorrect expectation that resources attached to a page in PDF list all the resources of that page, and only them.

In fact, a page may have no resources attached at all - and use inherited resources. Also, the attached resources are not required to be “limited to this page”. See e.g. 4. Document Structure - PDF Explained [Book]
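
Since the resources dictionary only says what a page may use, one way to find out what a page actually paints is to look at its content stream for `Do` operators. The following is only a rough sketch, assuming pypdf's `PageObject.get_contents()` and `ContentStream.operations`; the function names are made up for illustration, and the sketch does not descend into form XObjects, so images drawn indirectly through a form would be missed.

```python
def used_xobject_names(operations):
    """Return the XObject names painted by `Do` operators, in order, without repeats.

    `operations` is an iterable of (operands, operator) pairs, which is the
    shape pypdf exposes through ContentStream.operations.
    """
    names = []
    for operands, operator in operations:
        # `Do` paints the XObject whose resource name is its single operand.
        if operator == b"Do" and operands:
            name = str(operands[0])
            if name not in names:
                names.append(name)
    return names


def images_used_per_page(pdf_path):
    """Map each 0-based page index to the image names its content stream paints."""
    from pypdf import PdfReader  # imported lazily; the helper above needs no library

    reader = PdfReader(pdf_path)
    result = {}
    for index, page in enumerate(reader.pages):
        contents = page.get_contents()
        operations = contents.operations if contents is not None else []

        # The XObject dictionary may be shared by all pages; use it only for lookup.
        xobjects = {}
        if "/Resources" in page and "/XObject" in page["/Resources"]:
            xobjects = page["/Resources"]["/XObject"]

        images = []
        for name in used_xobject_names(operations):
            if name not in xobjects:
                continue
            xobject = xobjects[name]
            # Keep images only; `Do` can also paint form XObjects.
            if "/Subtype" in xobject and xobject["/Subtype"] == "/Image":
                images.append(name)
        result[index] = images
    return result
```

Calling `images_used_per_page("test/test02.pdf")` on the converted sample should then list, per page, only the names that the page draws, regardless of how many entries the shared dictionary has.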

What LibreOffice does is create a single /Resources dictionary and reference it from every page, so yes, the pages share a common dictionary. As far as I understand, this was done to simplify deduplication; however, an alternative would be to reference the dictionary once, at the root node, and omit it from the pages altogether. This wouldn’t change anything for @christoph.gmeiner: there would still be no 1:1 relation between a page and its resources dictionary. Why wasn’t it implemented that way? No idea; maybe there are cases where the hierarchy is too complex, or there may be reasons to avoid accidental inheritance.
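
To check the sharing on a given file, one could compare the raw /Resources entry of each page. This is a hedged sketch, assuming pypdf's `DictionaryObject.raw_get()` and `IndirectObject`; both function names are hypothetical. A page whose resources are embedded directly or inherited from a parent node is reported as `None`, so the check only confirms the "one indirect dictionary referenced by every page" layout described above.

```python
def resources_reference_per_page(pdf_path):
    """Return, per page, the (object number, generation) of its /Resources reference."""
    from pypdf import PdfReader  # imported lazily; the helper below needs no library
    from pypdf.generic import IndirectObject

    reader = PdfReader(pdf_path)
    references = []
    for page in reader.pages:
        # raw_get returns the entry without resolving the indirect reference.
        raw = page.raw_get("/Resources") if "/Resources" in page else None
        if isinstance(raw, IndirectObject):
            references.append((raw.idnum, raw.generation))
        else:
            references.append(None)  # direct, inherited, or missing dictionary
    return references


def all_pages_share_resources(references):
    """True if every page points at the same indirect /Resources object."""
    if not references or references[0] is None:
        return False
    return all(reference == references[0] for reference in references)
```

For example, `all_pages_share_resources(resources_reference_per_page("test/test02.pdf"))` would be expected to return `True` for a file laid out as described, though that depends on the LibreOffice version that produced it.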

The PDF export code is here.