PDF Encoding Issues with page-image matching when converting DOCX to PDF

Hi everyone,

We use the LibreOffice command line to convert several kinds of documents to PDF.

When we convert DOCX files that contain images to PDF and then open the PDF programmatically, e.g. with Python libraries, the mapping between pages and images is wrong.

Concrete example:
One document with 3 pages and 3 images; every page has one image. When we convert the document to PDF with MS Word and open it programmatically, we get the exact matching, meaning {page1: image1, page2: image2, page3: image3}.
When we do the same with LibreOffice, we get {page1: [image1, image2, image3], page2: [image1, image2, image3], page3: [image1, image2, image3]}

Has anyone experienced a similar situation and found a potential solution?

thanks and best regards

Christoph

A concrete example would be: a sample DOCX; a specific command line to convert to PDF; and exact Python code to check.

Sorry, here are more concrete examples:
File:
test02.docx (62.0 KB)

CLI command:

libreoffice --headless --convert-to pdf:writer_pdf_Export --outdir test test02.docx
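
For anyone scripting this step, the same conversion can be driven from Python. This is only a minimal sketch: the helper names build_convert_command and convert_to_pdf are made up for illustration, and it assumes the libreoffice binary is on the PATH (on some systems it is called soffice).

```python
import subprocess


def build_convert_command(docx_path, outdir):
    """Return the argument list for the CLI call shown above."""
    return [
        "libreoffice",
        "--headless",
        "--convert-to", "pdf:writer_pdf_Export",
        "--outdir", outdir,
        docx_path,
    ]


def convert_to_pdf(docx_path, outdir):
    """Run the conversion; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_convert_command(docx_path, outdir), check=True)
```

Calling convert_to_pdf("test02.docx", "test") should then produce the PDF in the test directory, as the CLI command does.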

Python Code:

import glob
import os

from pypdf import PdfReader

local_pdf_path = "test02.pdf"
temp_dir = "imgs04"
temp_dir_path = "%s/*" % temp_dir
extracted_images = []

files = glob.glob(temp_dir_path)
for f in files:
    os.remove(f)

with open(local_pdf_path, "rb") as pdf_file:
    pdf_reader = PdfReader(pdf_file)

    for page in pdf_reader.pages:
        for img_idx, image_file_object in enumerate(page.images, start=1):

            print(page.page_number)
            print(page.images)
            # create a filename for the image
            img_file_path = (
                    f"page_{page.page_number}_"
                    + f"img_{img_idx}_"
                    + image_file_object.name
            )

            # create a path to the temporary file
            temp_img_file_path = os.path.join(temp_dir, img_file_path)

            # write the content to the temporary file
            with open(temp_img_file_path, "wb") as fp:
                fp.write(image_file_object.data)

            # Collect the image file path for later upload
            extracted_images.append(temp_img_file_path)

print(len(extracted_images))

Output:

0
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
0
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
0
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
1
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
1
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
1
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
2
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
2
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
2
[Image_0=/Im11, Image_1=/Im15, Image_2=/Im4]
9
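
The output above repeats the same three names for every page. A more compact way to check for that symptom is sketched below. The function names are hypothetical, and like the script above it relies on page.images from pypdf (which in turn needs Pillow to decode the images).

```python
def images_by_page(pdf_path):
    """Map each page number to the list of image names pypdf reports."""
    # Imported lazily so the pure helper below works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return {
        page.page_number: [image.name for image in page.images]
        for page in reader.pages
    }


def every_page_lists_same_images(mapping):
    """True if there are several pages and all report the identical list."""
    lists = list(mapping.values())
    return len(lists) > 1 and all(names == lists[0] for names in lists)
```

For example, every_page_lists_same_images(images_by_page("test02.pdf")) would flag the LibreOffice-generated file if it shows the behaviour described here.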

I can’t comment on macros etc. because I don’t use them. However, your document is the easy case: your images are anchored as characters, so they are part of the text. You therefore avoid all the difficulties with frames (which don’t exist in DOCX).

This looks like a pypdf error (unless proven otherwise). The PDF is displayed correctly, so I’d ask the pypdf maintainers. Of course, it’s possible that they will point to a specific bug in the LibreOffice PDF export; in that case, a bug report against LibreOffice would need to be filed.

(A note: the Python script needs os and glob to be imported, and it relies not only on pypdf but also on Pillow.)

A pypdf error is very unlikely. When I convert the same DOCX file to PDF directly in MS Word and pass that PDF to the code above, the page-image matching is correct.

So the mismatch happens either in the DOCX import into LibreOffice or in the Writer PDF export.

I submitted a bug report to pypdf here: Image/Page Matching Issue when using PDF file converted by LibreOffice · Issue #2822 · py-pdf/pypdf (github.com)

Note that “some file behaves as expected” is not proof that “there’s no problem in the software”. As with any complex file format, there is usually more than one way to do the same thing, and a Word-generated PDF is not necessarily similar in internal structure to a same-looking PDF generated by Writer; hence, it’s not guaranteed that pypdf behaves correctly for a different PDF structure.
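
One way to look at that internal structure is to compare, per page, the image XObjects listed in the page's resources with the names the page's content stream actually draws using the Do operator. The sketch below is an illustration only: the function names are hypothetical, and the idea that the two PDFs differ in how resources are shared between pages is an assumption that this thread has not confirmed. It also does not follow images drawn inside form XObjects.

```python
def listed_image_names(page):
    """Names of image XObjects present in the page's /Resources."""
    resources = page["/Resources"] if "/Resources" in page else {}
    xobjects = resources["/XObject"] if "/XObject" in resources else {}
    names = set()
    for name in xobjects:
        obj = xobjects[name]
        if "/Subtype" in obj and obj["/Subtype"] == "/Image":
            names.add(str(name))
    return names


def names_drawn(operations):
    """Names invoked by the Do operator in (operands, operator) pairs."""
    drawn = set()
    for operands, operator in operations:
        if operator == b"Do" and operands:
            drawn.add(str(operands[0]))
    return drawn


def compare_listed_and_drawn(pdf_path):
    """Print, for each page, the listed image names and the drawn names."""
    # Imported lazily so the pure helpers above work without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    for page in reader.pages:
        contents = page.get_contents()
        operations = contents.operations if contents is not None else []
        print(
            page.page_number,
            sorted(listed_image_names(page)),
            sorted(names_drawn(operations)),
        )
```

Running compare_listed_and_drawn on both the Word-generated and the Writer-generated PDF would show whether the listed names and the drawn names diverge in only one of them.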

And thanks for filing the bug. For pypdf, it might be useful to simply attach the resulting PDF, because for them, the process of generation is unimportant, only the PDF data is needed.