Why does copying a document cause parts of the copy to overlap?

PhilvanKleur · December 29, 2020, 7:09pm

I’m having trouble copying a document into a clean file, because when I do, parts of the copy overlap. This is on the latest Libre Writer for Windows 10, updated this evening. Should I report this as a bug? I’d also be grateful for work-arounds, as I have a student who needs these copies pretty soon.

More details: I produced my document by converting a PDF to Word format with a program called Kofax Power PDF. I don’t know whether that’s relevant. The original PDF was an academic paper, mostly formatted in two columns. I want to paste the converted version, and a number of other papers, into a single file for my student, as they have vision problems and find it easiest to search amongst papers when they’re collected together like that.

However, I found that when I copied and pasted it, the final two pages of the copy overlapped. You can see that in the screenshots below. On the left is the original. On the right is a copy, which I pasted into a blank document made with File > New. Note the position of the scroll bars: because the copy is shorter, the final two shots show the same parts of it. I copied by going to the original, pressing Control-A and then Control-C, and then moving the cursor to the start of the blank document and pressing Control-V.

I’ve attached the original and a copy. These demonstrate the problem: I’ve just checked by opening each one.

The details of my Libre Writer are:

Version: 7.0.4.2 (x64)
Build ID: dcf040e67528d9187c66b2379df5ea4407429775
CPU threads: 4; OS: Windows 10.0 Build 18363; UI render: Skia/Vulkan; VCL: win
Locale: en-GB (en_GB); UI: en-GB
Calc: threaded

dd_Van_der_Veen_Hall_and_May_1993_Woad_and_the_Britons_Painted_Blue.docx

Untitled 1.odt

ajlittoz · December 30, 2020, 1:57pm

I had a look at your original and copy documents. The problem comes from the conversion from PDF to .docx plus the conversion from .docx to ODF.

Let’s talk about the first conversion (the .docx document – unfortunately I have no Word here and could display it only with Writer, so my explanation may not be 100% exact).

Remember that PDF is a page description format and does not keep text flow. Text is broken into lines (or parts of lines), and these segments are positioned independently on the page.

Your converter application did quite a good job of collecting the line chunks into a vertical text flow. It did synthesize paragraphs within columns, but it failed to recognise that the text flow continued into the next column. And this is not the only failure.

It failed to identify “Oxford J. of Archaeology p. X” as being a footer. Similarly, it missed “Basil Blackwell …” as the publisher name and address.

It was disturbed by the varying number of columns between article text, title and illustration. Consequently (here I don’t know whether the converter or the Word/docx format is to blame), the document is spread over frames which are not related to one another at all, meaning text does not flow from one to the other. Text in any frame is self-contained; if you change the frame size, text will not spill over into another frame. Sometimes, data which has nothing to do semantically with the frame contents is included in the frame (e.g. “Oxford J. of Archaeology” and/or the page number).

Worse, all frames are anchored To character, which implies there must be some paragraph somewhere in the “standard text flow” to attach each frame to.
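If you want to see how heavily the converted file relies on frames, a minimal Python sketch (standard library only) can count frame-like elements in the .docx. Which element the converter actually used is an assumption on my part, so the sketch counts three common candidates: `w:framePr` (paragraph frames), `w:txbxContent` (text boxes) and `wp:anchor` (anchored drawings). The function name `count_frames` is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by WordprocessingML (.docx) documents.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"


def count_frames(docx):
    """Count frame-like elements in word/document.xml.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    return {
        "framePr": len(root.findall(f".//{{{W}}}framePr")),
        "txbxContent": len(root.findall(f".//{{{W}}}txbxContent")),
        "anchor": len(root.findall(f".//{{{WP}}}anchor")),
    }
```

Calling `count_frames("your-file.docx")` returns a small dictionary of counts. Large numbers for a short article would be consistent with the text having been cut into many independent frames.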

When you copy the .docx document to paste it into an .odt, you paste the content without change. But if your pages are not exactly identical, i.e. same size, same margins, same font, you may get slightly different spacing of the pasted elements, enough to make room for one line of text on the last page. And now, the supporting “empty” paragraph for the last page’s frames can be shifted to the preceding page. The attached frames will then be laid out with the preceding ones, resulting in overlap. I tried to fix this by unchecking the Allow overlap flag for frames, but it didn’t work because the frames are positioned with absolute coordinates in the page (thus they cannot move to prevent overlap). The only way was to add a manual page break, but it is not always obvious where the correct location for it is.
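The mechanism can be illustrated with a toy model. This is only a sketch of the reasoning above, not of Writer’s actual layout engine: the `Frame` class, the page numbers and the dimensions are all hypothetical. Each frame keeps fixed coordinates within its page, and the page is decided by where its anchor paragraph lands.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Frame:
    name: str
    anchor_page: int  # page on which the anchoring paragraph ends up
    x: float          # absolute position within the page (cm)
    y: float
    w: float          # width and height (cm)
    h: float


def overlaps(a, b):
    """True if both frames sit on the same page and their rectangles intersect."""
    if a.anchor_page != b.anchor_page:
        return False
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)


def find_overlaps(frames):
    """Return the names of every overlapping pair of frames."""
    return [(a.name, b.name)
            for i, a in enumerate(frames)
            for b in frames[i + 1:]
            if overlaps(a, b)]


# Hypothetical layout: the same column position on two consecutive pages.
first = Frame("penultimate page text", 11, 2.0, 3.0, 8.0, 20.0)
last = Frame("last page text", 12, 2.0, 3.0, 8.0, 20.0)

print(find_overlaps([first, last]))  # original: anchors on different pages
# After pasting, the anchor paragraph of the last frame moves up one page,
# but the frame keeps its absolute coordinates within the page.
print(find_overlaps([first, replace(last, anchor_page=11)]))
```

In this model the first call finds no overlap and the second reports the pair, because nothing but the anchor’s page has changed. A manual page break corresponds to forcing `anchor_page` back to its original value.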

You have other problems:

in Writer all frames have borders (too many frames to fix easily)
the conversion process (probably .docx → .odt) created one page style per page (once again, too many styles to fix)
the .odt document ends up direct formatted: no way to fix it conveniently, it needs to be fully reformatted first
the last page (bibliography) is offset to the left and is partly inside the left margin
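For the page-style point in the list above, a rough way to check a pasted .odt is to list the page styles it defines. In ODF, Writer’s page styles are stored as `style:master-page` elements in `styles.xml`. The sketch below assumes a standard, unencrypted .odt package; the function name `page_style_names` is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# ODF "style" namespace, used for both the element and its name attribute.
STYLE = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"


def page_style_names(odt):
    """Return the names of all page styles (master pages) in an .odt file.

    `odt` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(odt) as zf:
        root = ET.fromstring(zf.read("styles.xml"))
    return [mp.get(f"{{{STYLE}}}name")
            for mp in root.iter(f"{{{STYLE}}}master-page")]
```

If `len(page_style_names("Untitled 1.odt"))` comes out close to the page count, that would match the “one page style per page” observation.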

Your problem has no solution because the “original data” is not text-flow compliant and .docx format has brought its load of compatibility issues.

PS: Apparently, the PDF was acquired through some form of OCR processing, as shown by the “glitches” on pages 369 and 370.

To show the community your question has been answered, click the ✓ next to the correct answer, and “upvote” by clicking on the ^ arrow of any helpful answers. These are the mechanisms for communicating the quality of the Q&A on this site. Thanks!

In case you need clarification, edit your question (not an answer, which is reserved for solutions) or comment on the relevant answer.

PhilvanKleur · January 2, 2021, 9:34am

@ajlittoz, thanks for all your work on this. I accepted the answer, but don’t have enough reputation to upvote. I think I should send your analysis to Kofax, the people who make the converter.

Hrbrgr · December 29, 2020, 7:26pm

Why do you work with cut and paste?

I opened the DOCX file in LibreOffice and IMHO everything is displayed correctly.

Hrbrgr · December 29, 2020, 7:31pm

My setup: Windows 10 Home; Version 20H2; 64-bit;

LibreOffice - Version: 7.0.4.2 (x64)

Build ID: dcf040e67528d9187c66b2379df5ea4407429775
CPU threads: 8; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win
Locale: de-DE (de_DE); UI: de-DE
Calc: CL

PhilvanKleur · December 29, 2020, 8:35pm

How do you mean, why do I work with cut and paste? I want to copy a lot of documents into one file. Opening each document, then doing Control-A and Control-C, then Control-V into the new file: that’s convenient, because it’s a sequence that I’m used to doing for lots of Windows tasks, including image editing.

The point of my post is that the second file displays incorrectly. The first is my original, and is fine. The second is a copy thereof, and has something wrong. It should look the same as the first.

Hrbrgr · December 30, 2020, 5:26am

Sorry, my mistake. Can you still provide the pasted file here?

To do that, edit your Question.

Hrbrgr · December 30, 2020, 5:48am

I’ve now looked at your ODT file and I’m afraid there is no simple solution to this. Unfortunately the conversion of the PDF file does not result in any continuous text. It would be easier to merge PDF files with an appropriate program, e.g. PDF24.

PhilvanKleur · December 30, 2020, 9:12am

Thanks for looking, ebot. I appreciate you taking the time to do that. But surely this is a bug? Copy and paste is meant to make an exact copy, by definition. For it not to is like the Windows copy command not copying a file exactly, or like assignment in a programming language not copying its right-hand side. It violates the Law of Least Astonishment.

Hrbrgr · December 30, 2020, 9:28am

But surely this is a bug?

No, I don’t see it that way.

If the conversion from a PDF does not provide a reasonable body text (that could be an error), the result cannot be repaired by LibreOffice.

My suggestion above to merge the PDF files directly applies.

It is not a problem of LibreOffice!

PhilvanKleur · January 2, 2021, 6:29pm

ebot, thanks for recommending PDF24. I just tried converting a PDF about Roman art to DOCX. It had a complicated format and lots of footnotes. Power PDF mixed up the footnotes with the main text, seemingly unable to tell the difference. PDF24 kept them in their original position and type size, preserving most of the formatting. It’s the first conversion I’ve tried with it, but it definitely did better than Power PDF. And is free.