Anomalous Behavior - Total Page Number Changing Wildly & Images Moving After Being Laid Out Correctly in Large Document

I’m experiencing anomalous layout behavior in LibreOffice Writer. I’ll do my best to summarize the issue.

LibreOffice version: 6.4.7.2.

System Specs (in case it’s a memory issue) -

Processor: i5, 3.2 GHz, quad core
Memory: 8 GB DDR4 RAM
OS: Ubuntu 20.04.4 LTS
Gnome version: 3.36.8

My document is a book I’m writing. The raw manuscript is ~1600 pages which will likely be 3,000 pages once I correct all the missing formatting.

I used an official WordPress utility to export 12 years of WP articles as .DOCX documents. I imported those into a master document .ODM file in LibreOffice. Once I had them in the right order, I saved the project as an .ODT file and removed all the document links so that I could edit the book layout freely.

Due to a bug in the initial conversion, nearly all paragraph breaks were lost so all 3,000 pages of text are sentences butted up against each other. And similarly, the article images were jumbled and scattered at random throughout the document. I submitted a help request and a member of this community was amazing at helping me use the PicTool extension to batch-scale all 1300+ images and to dynamically assign their scale to the paragraph area. This way, if my publisher asks me to modify the margin measurements closer to publication, all images will snap to the new area. I was so grateful for that help!

The remaining task has been to go through the 1600+ page document, pull up each original WP article one at a time on the web, and correct all the missing paragraph breaks and to reposition all the images where they’re supposed to appear throughout each article. It’s a lot of work, but I blocked out several hours each weekend and created a spreadsheet to track which articles I’ve cleaned up. I’m tackling them in the order they appear so that I can move linearly through the document without compromising earlier pages as I work. I’ve been incrementally saving versions of the document as I go in case I suddenly discover that I’ve damaged something earlier in the file.

The document has an unnumbered title page, preface, and dynamically generated TOC. Numbering begins at the first article following the TOC. Periodically I know I need to right-click and Update Index so that the TOC reflects the adjusted pages as I clean them up.

This brings me to the anomalous behavior. I noticed that when I launched the document and scrolled to the very end, the final numbered page showed as 1685. I closed and reopened the file and the final page number jumped to 1597. I clicked on Update Index for the TOC and then scrolled to the end and the final page number then showed as 1593. Worried that I was losing content, I tried opening the next-to-most-recent version of the doc alongside the other to compare them. LibreOffice crashed and force-exited, most likely because it ran out of memory loading two 1600+ page documents. I relaunched them, recovered the lost files when prompted at launch, and updated the index for each file’s TOC. Now the final page number for both versions showed as 1593.

They crashed and force-quit again, and after recovering them one document showed 1594 pages instead of 1593. Then it crashed and exited again. The next time I opened it, it showed 1589 pages. Updating the TOC changed it to 1594. I closed and reopened the earlier version and then it showed 1592 numbered pages. I closed and reopened them without attempting recovery and it showed 1658 pages before I updated the TOC. After updating it showed 1591 pages. I should note that I’d made no changes to the layout of any numbered pages during any of this, so there is no reason the total number of pages should be changing at all.

Then I scrolled through the document reviewing the articles I’d recently tidied up and found that a few had stray images hanging off the page where I had previously worked incredibly hard to place them correctly. This is so disheartening as it takes me hours of painstaking work to correct the layout of each article. Checking the next-to-most recent version, one of the misplaced images was correct but the other had similarly shifted.

There’s no way for me to audit the thousands of pages of content after I’ve already proofed them and saved incrementally without re-reading my book hundreds of times, so I need a better solution to preserve the changes once I make them. And I need to determine why the page count is changing so wildly without any changes being made to the numbered pages of the document.

I really hope there is a solution so I can eventually publish my book. I sincerely appreciate any insight the community can offer.

Thank you!

These are only guesses, because updating a TOC should not change the final page number (you start page numbering after the TOC, and collecting TOC entries does not change the contents).

Only shifting images can cause the content layout to change. This depends on the anchor and wrap properties. The only way to get predictable, deterministic and reproducible positioning of images is through the use of frame styles, despite their difficult setup and weird behaviour. Unfortunately, I understand you tuned everything manually.
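
To see how the images are currently anchored and wrapped without opening the document in Writer, something like the following sketch may help. It relies on an .odt file being a ZIP archive with the body in content.xml, and it assumes the images are stored as draw:frame elements containing draw:image, which is how Writer normally saves them. The function name image_layout_summary is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

NS = {
    "draw": "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def qname(prefix, local):
    """Build the '{namespace}local' name that ElementTree expects."""
    return f"{{{NS[prefix]}}}{local}"


def image_layout_summary(odt_file):
    """Count images by (frame style, anchor type, wrap override).

    odt_file may be a path or a binary file object. A wrap value of None
    means the image inherits its wrap setting from the frame style.
    """
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    # Automatic graphic styles record direct formatting on top of a frame style.
    automatic = {}
    for style in root.iter(qname("style", "style")):
        if style.get(qname("style", "family")) != "graphic":
            continue
        props = style.find("style:graphic-properties", NS)
        wrap = props.get(qname("style", "wrap")) if props is not None else None
        parent = style.get(qname("style", "parent-style-name"))
        automatic[style.get(qname("style", "name"))] = (parent, wrap)

    summary = Counter()
    for frame in root.iter(qname("draw", "frame")):
        if frame.find("draw:image", NS) is None:
            continue  # a text frame or embedded object, not a picture
        style_name = frame.get(qname("draw", "style-name"))
        anchor = frame.get(qname("text", "anchor-type"))
        # Frames that use a frame style directly have no automatic style entry.
        frame_style, wrap = automatic.get(style_name, (style_name, None))
        summary[(frame_style, anchor, wrap)] += 1
    return summary
```

Calling image_layout_summary("book.odt") (the file name is a placeholder) returns counts keyed by frame style, anchor type and wrap. Many different combinations suggest manual tuning that a single frame style would not reproduce.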

Another possible cause of the page number change is an incorrectly specified page number field. A common error is to request a page offset instead of a page restart through a specific manual page break. But this will not move images.

A ~1600-page document certainly runs into performance issues. A way to address them and partially solve them is to systematically style the document, avoiding direct formatting. This means using all the possibilities: paragraph, character, page and frame styles. Don’t forget list styles if you need to number paragraphs.

When your articles have been saved to Writer’s native file format, it is not likely that content is lost when you save and reopen. Repagination after content reflow is the likely process at play here. This reflowing can occur when Writer attempts reasonable positioning of graphics. Another situation, which I suspect may be at play here, is when there is extensive use of flow-control formatting elements in your document.

My suspicion is based on your exporting content from WordPress via docx to odt. The three formats have inherent differences, which lead to conversion not being entirely straightforward.

  • WordPress uses HTML for structure and CSS for design (roughly). HTML/CSS contain mechanisms to handle the concept of pages, but as far as I have seen they are rarely used because HTML content is usually a web page, output to screen in a continuous stream. Also, the layout width for html content is often dynamic, so it adapts to the reader’s screen width, window size and display settings.
  • Word may use hybrid text/paragraph styles for design/structure, but exported content will often rely on direct formatting for structure and styling. Page layout is by direct formatting on a section. A Word section is a divider between (possibly named) parts of continued document flow.
  • Writer depends heavily on styles for document design and structure. Direct formatting can be used, but for page layout specifically, it is mandatory to use styles. A Writer section is a separate (named) part within the document flow.

Sections are important because in both steps of the translation process (HTML to DOCX and DOCX to ODT) they may have been inserted to convey page adjustments. This is often used to make a paper document into an optimal representation of the web/onscreen output. While it may help with being true to the source, extensive use of sections makes further editing a pain.

The “keep with next” text flow setting will usually do nothing on a web page, because there are no page breaks. Overusing it on a paper document makes an ugly mess.
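
A rough way to check whether sections and “keep with next” are heavily used is to count them in the saved file. The sketch below only looks at the automatic (direct-formatting) paragraph styles in content.xml. It does not follow inheritance from named styles in styles.xml, so it can undercount. The function name flow_control_report is invented for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "fo": "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def qname(prefix, local):
    """Build the '{namespace}local' name that ElementTree expects."""
    return f"{{{NS[prefix]}}}{local}"


def flow_control_report(odt_file):
    """Count sections and paragraphs whose automatic style keeps them with the next one.

    odt_file may be a path or a binary file object.
    """
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    sections = sum(1 for _ in root.iter(qname("text", "section")))

    # Collect automatic paragraph styles that set fo:keep-with-next="always".
    keep_styles = set()
    for style in root.iter(qname("style", "style")):
        props = style.find("style:paragraph-properties", NS)
        if props is not None and props.get(qname("fo", "keep-with-next")) == "always":
            keep_styles.add(style.get(qname("style", "name")))

    # Count the paragraphs and headings that use one of those styles.
    kept = 0
    for tag in (qname("text", "p"), qname("text", "h")):
        for element in root.iter(tag):
            if element.get(qname("text", "style-name")) in keep_styles:
                kept += 1

    return {"sections": sections, "keep_with_next_paragraphs": kept}
```

If either number is large compared with the number of articles, that supports the suspicion that flow-control formatting is contributing to the reflow.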

Solution

With your situation, I don’t believe there is any substitute for manual labor. On the upside, it is not likely that changing page count indicates content loss. It is useful to know what you are working with, and understand how and why changes (e.g. reflowing) happen.
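
If you want reassurance that two saved versions contain the same text regardless of page count, you can compare the text outside Writer, which also avoids loading two huge documents side by side. This is only a sketch: it relies on an .odt file being a ZIP archive with the body text in content.xml, and the function names are made up for this example.

```python
import difflib
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
PARAGRAPH_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def odt_paragraphs(odt_file):
    """Return the text of every non-empty paragraph and heading in an .odt file.

    odt_file may be a path or a binary file object.
    """
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    paragraphs = []
    for element in root.iter():
        if element.tag in PARAGRAPH_TAGS:
            # itertext() joins the text of nested spans; layout is ignored.
            text = "".join(element.itertext()).strip()
            if text:
                paragraphs.append(text)
    return paragraphs


def content_diff(old_file, new_file):
    """Return unified-diff lines for paragraphs that differ between two versions."""
    old = odt_paragraphs(old_file)
    new = odt_paragraphs(new_file)
    return list(
        difflib.unified_diff(old, new, fromfile="old", tofile="new", lineterm="", n=0)
    )
```

An empty result from content_diff means the paragraph text is identical in both files. Expect differences from the TOC whenever its page numbers changed, and from any paragraphs you edited between the two saves. Text inside a frame (a caption, for instance) is also counted as part of the paragraph the frame is anchored to, so moving such a frame shows up as a change.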

Good luck!

One clarification (to an otherwise excellent comment):

In Writer, a section is a part of a “region” controlled by a page style. This is fundamentally different from Word where a section is roughly equivalent to a page styled “region”.

Writer sections will not be used during the docx→odt translation. You’ll get many single-use page styles.
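
To get an idea of how many page styles the conversion left in your file, you could list them together with how often the text explicitly switches to each one. A sketch, assuming the usual ODF layout: page styles are style:master-page entries in styles.xml, and a page break that changes the page style is stored as a paragraph style with a style:master-page-name attribute in content.xml. The function name page_style_usage is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

STYLE = "{urn:oasis:names:tc:opendocument:xmlns:style:1.0}"
TEXT = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}"


def page_style_usage(odt_file):
    """Map each page style name to the number of explicit switches to it.

    odt_file may be a path or a binary file object.
    """
    with zipfile.ZipFile(odt_file) as archive:
        styles = ET.fromstring(archive.read("styles.xml"))
        content = ET.fromstring(archive.read("content.xml"))

    # Every page style defined in the document starts with a count of zero.
    usage = Counter(
        {page.get(STYLE + "name"): 0 for page in styles.iter(STYLE + "master-page")}
    )

    # Automatic paragraph styles that switch to a page style at a page break.
    switches = {}
    for style in content.iter(STYLE + "style"):
        target = style.get(STYLE + "master-page-name")
        if target:  # skip missing or empty values
            switches[style.get(STYLE + "name")] = target

    # Count the paragraphs and headings that use such a style.
    for element in content.iter():
        target = switches.get(element.get(TEXT + "style-name"))
        if target is not None:
            usage[target] += 1
    return usage
```

A count of 0 does not prove a page style is unused: the style of the first page is applied implicitly, a page style can be reached through the “next style” setting, and breaks attached to tables are not counted here. Still, a long list of styles with a count of 1 would match the single-use pattern described above.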

Follow @keme1’s advice and manually review your document. But before this, think carefully about the styles you’ll need. Many of them are available as built-ins, such as Heading n for paragraphs or Emphasis/Strong Emphasis for characters, not to mention page styles. You’ll only need to create variants of Text Body for specific paragraphs.

You’ll create fewer than 5 paragraph styles, tune built-in paragraph and character styles, and perhaps create a handful of character styles. Pay special attention to page styles (3-5 should be enough). The most difficult challenge is to design frame styles (or customise built-ins). If you can “standardise” your image properties, one or two frame styles should do.

After that, it is only a matter of removing direct formatting and assigning styles to your “objects” (a double-click and it’s done).

My sincerest gratitude for the prompt and impressively detailed responses!

It is such a tremendous relief that, as you’ve each said, it is unlikely that the page number anomaly is indicative of content loss, and instead, is more likely the result of content (particularly image) re-flow. That eases my mind that this year of manual cleanup will not be in vain.

I am taking your advice and employing styles to automate layout properties wherever possible. Working smarter, not harder, is critical. Fortunately, titles, headlines, and paragraph text were all translated to and assigned styles in the conversion process from DOCX, so I only needed to adjust those styles slightly to achieve the desired result.

The rest of the task, as you’ve each stated, will be the manual process of fixing all the sentences and missing / misplaced paragraph breaks for 3,000 pages of content. And while the PicTool extension made quick work of scaling all the images and dynamically centering them to the defined paragraph area as I said, there remains the task of selecting each image one at a time, cutting it to the clipboard, positioning the cursor where the image needs to go, pasting the image, and then tidying up the flow of adjacent text. That applies to all 1,300 images, many of which are invisibly stacked atop one another at the beginning of each article, sort of like layers in Photoshop.

It doesn’t appear, due to the seemingly random text breaks and image positioning, that there is a way to batch process either the text or the images with styles beyond what I’ve already done. It’s going to be a lot of manual work, but I’m 60 articles into the project already so I’m determined to make this work in the end.

Once again, thank you all for your expert insight. This community has twice now helped me move forward with this book authoring project, and your knowledge of the nuances of the software is unparalleled. Hopefully in about a year I can go to press and call this project a success thanks to your help!

-James

Be aware that the .docx step has introduced its own mess, notably because Word has no notion of character and page styles. Consequently, methodical styling will be a very tedious process. And even with careful styling, your document will remain clumsy because of the DOCX mess.

Good to see that you have found the situation workable.

Some tips from the outside which might make the workflow a little smoother:

  • Disable rendering of images (less load on the graphics subsystem; all pictures will show as placeholders):
    • Menu Tools - Options,
    • Select branch Writer - View,
    • untick Graphics/images
  • Split to subdocuments and master document. Edit the subdocuments individually. Reflowing would then be limited to the volume of the subdocument.
    I must say I haven’t worked with master-/subdocuments for this purpose, so I am not sure what gain you can expect. DO NOT abandon your original document for this! At least keep it as a backup.

If you experience frequent freezes while you edit, those tips may help. If the situation is workable with your current setup, perhaps no need to change.

If you don’t feel confident about working with master-/subdocuments, note that they have their own intricacies. Abiding by a common set of styles is important in order to achieve consistent output.

Thank you so much again for these brilliant solutions!

When I originally constructed the book, I used a Master Document and had all the individual article DOCXs linked. I found it difficult to grasp and control how they flowed from one to the next so for editing purposes I exported ODM to an ODT and removed all the document links from the headings. I found that gave me greater control, though at the expense of performance. I think I’ve gotten the hang of it now, so I’ll press on.

As for disabling the rendering of images, I am ever-impressed by the versatility of LibreOffice to incorporate features like these! Considering the function carefully, disabling rendering won’t work for my particular situation as many images are stacked atop each other at the beginning of each article with only the “highest” layer visible, so I’ll need to be able to see which images are which to cut them to the clipboard and position and paste them where they should be. Without seeing the images this would be difficult. But I sincerely appreciate all these wonderful suggestions!

All features in LO, Writer in particular, will work fine if you save files in native format, i.e. .odt for Writer. Having your subdocuments as .docx causes many of the problems you experienced.

Regarding picture stacking, you can fix it provided you didn’t “botch” your case by directly formatting the images. By default, pictures are assigned the Graphics frame style. Right-click on this style name in the frame list of the style side pane and choose Modify. Go to the Wrap tab and untick Allow overlap. If this is not enough to reflow the document, force it with Tools > Update > Update All.

Remember you can’t tame a long document without styling, more precisely full and integral styling without direct formatting. Writer has pushed the concept of styles far beyond what is available in Word (where direct formatting can’t be avoided because of the lack of character, page and frame styles among others).
