Advice on replacing one document's content.xml with another?

Thanks for explaining all that, especially the clear explanation of direct formatting.

Is there an extension which will distil a document down to a set of minimum styles?
E.g. define a single style for all text of a specific set of properties (Font, size, style), e.g. define an EmphasisTR12 style for all TimesRoman italic 12pt?

Or I suppose I could analyse the XML to work this out myself.
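
For what it's worth, here is a minimal sketch of that kind of analysis in Python. It assumes content.xml has already been extracted from the .odt, and it groups the automatic character styles (the T1, T2, ... entries Writer generates for direct formatting) by their exact set of text properties, so styles that carry identical formatting show up together. The function name and the sample XML are made up for illustration; the code only reports and changes nothing.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
}
STYLE_NAME = "{%s}name" % NS["style"]
STYLE_FAMILY = "{%s}family" % NS["style"]

# Hypothetical, heavily reduced content.xml used only to demonstrate the idea.
SAMPLE = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0">
  <office:automatic-styles>
    <style:style style:name="T1" style:family="text">
      <style:text-properties fo:font-style="italic"/>
    </style:style>
    <style:style style:name="T2" style:family="text">
      <style:text-properties fo:font-style="italic"/>
    </style:style>
    <style:style style:name="T3" style:family="text">
      <style:text-properties fo:font-weight="bold"/>
    </style:style>
  </office:automatic-styles>
</office:document-content>"""


def group_automatic_text_styles(content_xml):
    """Map each distinct set of text properties to the automatic styles using it."""
    root = ET.fromstring(content_xml)
    groups = defaultdict(list)
    for style in root.findall("office:automatic-styles/style:style", NS):
        if style.get(STYLE_FAMILY) != "text":
            continue  # only character-level automatic styles
        props = style.find("style:text-properties", NS)
        attributes = props.attrib if props is not None else {}
        # Two styles are treated as "the same" only if every property matches.
        signature = frozenset(attributes.items())
        groups[signature].append(style.get(STYLE_NAME))
    return dict(groups)


if __name__ == "__main__":
    for names in group_automatic_text_styles(SAMPLE).values():
        print(sorted(names))
```

Each group with more than one name is a candidate for a single named character style, though the script cannot tell whether the uses mean the same thing.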

I imagine the reason Writer creates a new style each time you make some text italic (which is great ergonomics, a single key stroke or button press), is just in case the user wants each piece of text they directly set to italic a possibly different italic? (I’m trying to guess why Writer would create a new style rather than reuse an existing one.)

In any case, I’ll definitely try out what you’ve described here.

I confess there would be a massive amount of work in finding and changing all directly formatted italics to use a style.
But you’ve given me heaps of great information, thank you!

The reason why Writer assigns a different style every time you press Ctrl+B is it can’t guess whether you consider that bold to have the same meaning as the previous one. This is where styles come in. Consider styles to be author annotations for the text: this sequence is “important”, that sequence is “sarcastic”. You chose that both “important” and “sarcastic” are displayed as bold. They look the same, but the meaning, the semantics is different. Afterwards, you can turn “important” red so that there is no longer ambiguity for the reader. You only change the style and all occurrences are simultaneously updated. This is great but you must be disciplined from the start.

Minimal set of styles: only you can tell. Think of styles as semantic markup, not as typographical effects. Your example of a TimesRoman italic 12 pt style starts from the wrong end. Why did you choose that formatting? For emphasis? For a foreign word? Then there are two semantic markups. Later, you map the styles to visual effects.

And since visual effects are rather limited in number, you inevitably end up with several styles looking the same. This is not important if context lets the reader tell them apart. But you, as an author, can change the look of a specific sequence of TR ital 12 to something else because you carefully marked up the document.

The big task is to correctly mark up the text. Don’t rely on any automatic tool. Don’t even try to scan the XML. It may exhibit insignificant style changes because of technical limitations. This is a manual task because you bring added value to it.

You’ll know you have eliminated all direct formatting when Ctrl+M on any selection does not show any formatting change.

Styles may be as handy as the shortcuts you already know because you can assign shortcuts to your favourite styles with Tools>Customize.

That’s exactly what I thought, thanks.
It’s a pity there’s not a setting in Writer to choose between an ‘every markup may be unique’ mode and an ‘every markup is the same’ mode. I strongly believe the latter is far more common than the former.
Because I know I only use a specific style change for a single purpose, I know that distilling my document to a minimal set of styles is what I want. There is no other semantic markup: the visible changes represent exactly the semantic differences, and if they look the same they are the same.
Even if only a majority of my uses of italics were for a single semantic purpose, it would be far easier to manually find and split those apart into separate styles than it is with the current implementation, where searching for italic text isn’t reliable.
If you’re saying though that I can use Tools>Customise to set CTRL-I as a shortcut for emphasis character style, then that would solve the usability issue I see in the current design.

I’ve also installed DocumentTemplateChanger and have restarted Writer, but can’t find any documentation on how to use it, nor any change to the Writer UI that would let me experiment with it.
No, wait: by looking at the TemplateChanger extension I found the functionality is accessed via a File>Templates item.

I’m currently drawing a blank in trying to find what you mean by
“make sure that the styles in the templates take precedence on those in the document.”

I take your point about the danger of setting myself up for a trap in saving a .odt version. I’ll think about that. The opposing risk is forgetting to change a (probably small) set of text items that must be changed, such as the ISBN. Maybe I could manage that risk by marking such items with a comment containing some marker text.
OTOH, I’d need to save a .docx for the epub and the mobi versions too, and they have more substantive changes that would need to be reapplied each time, so I think on balance a .odt version for each version would be less likely to lead to errors.
Ideally I could have a front and back section for each version of each book (since these are version specific, with different ISBN and links), saved as .odt, so that when using styles exclusively, copying and pasting the bulk of the text might be reliable instead of unusable.

Style conflicts: without entering into details, make any style adjustment only in the template. The drawback is that you must close the document and reopen it for the changes in the template to take effect. When you edit your template, use the Open Template command, not the usual Open commands; otherwise you get a new document based on the template and you must go through the full procedure to reinstall it.

Multiple versions: I forgot the ISBN issue. It means your books differ, so you need different documents. You end up with 4 files (at least): template .ott, common content .odt and 2 master documents .odm. The master documents will contain the front and back covers plus the “directive(s)” to import the common content. In this case, the page styles will be in the master document and you no longer need to change the template.

Caution! If you’re not familiar with templates and master documents, experiment on copies first.

ODF/docx: there are differences between the formats. If you keep a simple structure for your content (template/master/sub-document is not important when exporting), there should be no problems. By simple structure, I mean a linear sequence of paragraphs, few inserted objects (tables are the most problematic), few or no nested frames/objects (avoid tables within tables), and a simple numbering scheme.

Be aware that exporting to PDF and saving as .docx are not the same and may show differences. I have no experience with ePub.

Many thanks regarding the master documents: that sounds like the solution I need long term, in combination with rigorous use of styles.

Regarding ODF/docx: yes, they’re just fiction books, so they’re a simple structure.
The most complex thing is a small table right at the end, in the ebook editions.

Thanks again, you’ve been a great help. I’ll be doing all this over the next few days.

Thanks also for the note about File>Templates>Open Template. I was doing the wrong thing until you pointed that out, too.
Ah, wait, maybe it’s too late: now, when I open the template file it warns me

“The template ‘Book-5x8’ on which this document is based, has been modified. Do you want to update style based formatting according to the modified template?”

But even if I choose Update styles and then Save, next time I Open Template I still get the same warning.

Ah, maybe fixed it myself: I see there’s also a File>Templates>Save as Template. By doing that, and then choosing to ignore the warning when I close the file that I need to save the document or my changes will be lost, it seems to be okay.

Being rigorous about styles though will probably take a day or two per book, because Writer can’t reliably find italic text. So finding all cases of direct format italics will have to be manual, and error prone.
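
As a rough aid for that hunt, a sketch like the following could list the text runs that carry directly applied italics. It assumes content.xml has been extracted from the .odt and that the italics were applied at character level, which Writer stores as text:span elements pointing at automatic styles with fo:font-style="italic". The function name and sample are invented for illustration; italics set through a paragraph-level automatic style or through a named style are deliberately not reported.

```python
import xml.etree.ElementTree as ET

OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
FO = "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"

# Hypothetical, heavily reduced content.xml used only to demonstrate the idea.
SAMPLE = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0">
  <office:automatic-styles>
    <style:style style:name="T1" style:family="text">
      <style:text-properties fo:font-style="italic"/>
    </style:style>
    <style:style style:name="T2" style:family="text">
      <style:text-properties fo:font-weight="bold"/>
    </style:style>
  </office:automatic-styles>
  <office:body><office:text>
    <text:p>The <text:span text:style-name="T1">Titanic</text:span> was
      <text:span text:style-name="T2">big</text:span>.</text:p>
    <text:p>I <text:span text:style-name="T1">really</text:span> mean it, with
      <text:span text:style-name="Emphasis">style</text:span>.</text:p>
  </office:text></office:body>
</office:document-content>"""


def find_direct_italics(content_xml):
    """Return the text of every span formatted by an italic automatic style."""
    root = ET.fromstring(content_xml)

    # Step 1: collect the names of automatic character styles that set italic.
    italic_styles = set()
    automatic = root.find("{%s}automatic-styles" % OFFICE)
    if automatic is not None:
        for style in automatic.iter("{%s}style" % STYLE):
            props = style.find("{%s}text-properties" % STYLE)
            if (style.get("{%s}family" % STYLE) == "text"
                    and props is not None
                    and props.get("{%s}font-style" % FO) == "italic"):
                italic_styles.add(style.get("{%s}name" % STYLE))

    # Step 2: report spans that reference one of those styles, in document order.
    hits = []
    for span in root.iter("{%s}span" % TEXT):
        if span.get("{%s}style-name" % TEXT) in italic_styles:
            hits.append("".join(span.itertext()))
    return hits


if __name__ == "__main__":
    print(find_direct_italics(SAMPLE))
```

The list it produces is only a worklist for the manual pass: the decision about which character style each run should get still has to be made by the author.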

Then be kind enough to check my answer as satisfactory (click only once on the gray check mark. It may take a while before it turns green).

I’m getting close now to correcting all the problems for all four editions of my first book. (That is, 5"x8" and 4.24"x7" paperback editions, epub edition, and kindle edition).
The problem I’m facing now is that when I tidied up all the page styles in the .docx file (which was all I had for the epub edition), I now have random page breaks in the HTML produced by Calibre after importing the .docx file.
I’m still trying to work out why.
I notice in the XML (I got desperate and saved as .odt and then looked inside that), that there are lots of places where a text:soft-page-break/ appears in the middle of a paragraph. I certainly would not have inserted them, and I can’t find any information about them. What are they for? Are they relevant to my random page-breaks problem? They don’t seem to match up with the page breaks Calibre generates in the HTML for the epub.
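
For diagnosis only, a small sketch like this could show which paragraphs contain a text:soft-page-break, which makes it easier to compare their positions with the breaks Calibre produces. It assumes content.xml has been extracted from the .odt; the function name and the sample are invented for illustration, and nothing is modified.

```python
import xml.etree.ElementTree as ET

TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hypothetical, heavily reduced content.xml used only to demonstrate the idea.
SAMPLE = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body><office:text>
    <text:p>No break in this paragraph.</text:p>
    <text:p>First half <text:soft-page-break/>second half.</text:p>
  </office:text></office:body>
</office:document-content>"""


def paragraphs_with_soft_page_break(content_xml, preview=60):
    """Return the leading text of each paragraph that contains a soft page break."""
    root = ET.fromstring(content_xml)
    found = []
    for para in root.iter("{%s}p" % TEXT):
        # Search the paragraph's descendants for the marker element.
        if para.find(".//{%s}soft-page-break" % TEXT) is not None:
            found.append("".join(para.itertext())[:preview])
    return found


def count_soft_page_breaks(content_xml):
    """Count every soft page break, whether inside or between paragraphs."""
    root = ET.fromstring(content_xml)
    return sum(1 for _ in root.iter("{%s}soft-page-break" % TEXT))


if __name__ == "__main__":
    print(count_soft_page_breaks(SAMPLE))
    print(paragraphs_with_soft_page_break(SAMPLE))
```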
The old version of the .docx used to convert correctly using Calibre; the new version has these random page breaks inserted.

Tomorrow I plan to convert the older version of the .docx file to .odt and look inside that, to see if I can see what’s different to the new (supposedly clean and greatly fixed) .docx version.

But this bug sounds highly relevant:
https://bugs.documentfoundation.org/show_bug.cgi?id=43692

If so, maybe the solution will be for me to edit the content.xml and remove all the soft page breaks.
I can try that tomorrow, too.
(1am here now.)

I try to avoid tweaking the XML: my strategy is to use the styles to express my formatting goal, extracting maximum juice out of them. Then I don’t try to figure out how the style applications will be translated into ODF and its incarnation as XML. I prefer to stay with a single paradigm and not be bothered with the interactions between two.

One point you should consider: apparently your original book is .docx. This is very important and might explain some anomalies. Writer is much richer style-wise than Word and exact conversion cannot be achieved. Notably, Word has no notion of page style. I suspect that part of your problems is a consequence of the conversion.

I had a look at one of my complex documents. The <text:soft-page-break/> elements seem to be reminders set by Writer where page breaks occur. When one is met, the current page style footer and header are laid out on the page. They are “soft” in the sense that Writer has full freedom to move them, as opposed to your manual page break, which must inflexibly remain where you put it to respect your formatting.

My advice: don’t remove them or Writer will need to recompute the page breaks. More generally, don’t play with the XML; this is Writer’s playground. Yours is the document in plain English. Concentrate on the styles (all of them: paragraph, character, page). They are made to represent your expected formatting. Avoid conversions; always save as .odt. Convert only if an addressee requires it, and double-check the formatting to make sure.

I’m using the XML for diagnosis and understanding.
The original format was .odt; I write my books in LibreOffice. I can’t remember exactly the problem that made me lose the source .odt file for the epub edition of my book. But because Calibre works much better with .docx files than .odt (as the developer, Kovid Goyal, advised me back in 2019), I convert the .odt to .docx to make a Word file for Calibre to import.
I insert a manual page break before each new chapter.
If Writer only uses soft-pagebreak in docs converted from Word, why leave them in? Doesn’t Writer normally have to work out where the page breaks go anyway? The soft-pagebreaks are all in stupid places where the page shouldn’t break, so I can’t see how they help Writer.
I’ve now looked inside the .docx too: each spurious page break has a moderate-sized slab of XML including a <w:sectPr> section; these also occur just before genuine chapter starts.
The text:soft-page-breaks are more frequent in the .odt and more random.

I’ve also just now looked inside the .docx for the previous version, that doesn’t get the spurious page breaks added by Calibre during the conversion process.

In that .docx file, the <w:sectPr>…</w:sectPr> always encloses the paragraph before a section break (i.e. before each chapter). There are none sprinkled elsewhere. NB: there was always exactly one extra added, on the Ch 1st page
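
To compare the two files without reading the raw XML by eye, a sketch along these lines could list every paragraph that carries a section break. It assumes word/document.xml has been extracted from the .docx, and relies on the fact that a <w:sectPr> inside a paragraph's <w:pPr> marks that paragraph as the last one of a section, while the final section's <w:sectPr> sits directly under <w:body>. The function name and sample are invented for illustration; deciding which breaks are spurious is still up to the author.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, heavily reduced word/document.xml used only to demonstrate the idea.
SAMPLE = """<w:document
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>Chapter text.</w:t></w:r></w:p>
    <w:p><w:pPr><w:sectPr/></w:pPr><w:r><w:t>End of chapter one.</w:t></w:r></w:p>
    <w:p><w:r><w:t>Chapter two.</w:t></w:r></w:p>
    <w:sectPr/>
  </w:body>
</w:document>"""


def paragraphs_ending_a_section(document_xml):
    """Return (paragraph index, paragraph text) for each paragraph holding a sectPr."""
    root = ET.fromstring(document_xml)
    found = []
    for index, para in enumerate(root.iter("{%s}p" % W)):
        # Only a sectPr nested in the paragraph properties ends a section here;
        # the body-level sectPr of the last section is intentionally ignored.
        if para.find("{%s}pPr/{%s}sectPr" % (W, W)) is not None:
            text = "".join(t.text or "" for t in para.iter("{%s}t" % W))
            found.append((index, text))
    return found


if __name__ == "__main__":
    for index, text in paragraphs_ending_a_section(SAMPLE):
        print(index, text)
```

Running it on both the old and the new .docx and comparing the two lists should show which section breaks were introduced by the later edits or round trips.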

So it seems during my edits or round trips between .docx and .odt, something introduced a whole bunch of these.

I’m going to try deleting them from the problematic .docx file’s XML and see how I go.

Fascinating:

When I simply re-zipped the .docx from the unzipped elements and opened that new file in Writer, it showed me each of the spurious page breaks: they’re visible as page breaks in Writer!

When I re-zip to create a .docx from my edited word/document.xml and reconvert in Calibre, it’s perfect.

In Writer the conversion messed up many headers and footers, but they’re ignored in the epub anyway.

Removing the soft-page-breaks from the .odt file made no difference to the .docx file produced, as far as the spurious page breaks are concerned.
When I look at the source .odt’s XML, I can’t see any markup that would induce a page break at the spurious points. But if I Save As .docx from the .odt file and look at that .docx in Writer itself, it too shows the spurious page breaks.
And I see it’s Writer that breaks the page styling and messes up the headers and footers.
So I’m thinking I need to file a bug report for Writer’s .docx generation.

I’d like to produce variant versions of my book from a single master version, where the main differences are just in the formatting. The simplest example is creating a 4"x7" edition by just unzipping a 4x7 template .odt and the book’s master 5"x8" edition, copying the 5"x8" edition’s content.xml file into the unzipped template, editing the ISBN and then re-zipping. I use the same name paragraph styles in all variants.
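
The manual steps above could be scripted roughly as follows. This is only a sketch under several assumptions: the ISBN appears in content.xml as one unbroken string (not split across spans), the book content references no embedded images or other files missing from the template, and both files use identical style names. The function name, file names and ISBN values are placeholders. Entries are copied in the template's original order and with their original compression, so the mimetype entry stays first and uncompressed, as an ODF package expects.

```python
import zipfile


def build_variant(template_odt, master_odt, output_odt, old_isbn, new_isbn):
    """Write output_odt: the template package with the master's content.xml inside."""
    # Take the text of the book from the master edition.
    with zipfile.ZipFile(master_odt) as master:
        content = master.read("content.xml").decode("utf-8")

    # Assumption: the ISBN is stored as one contiguous string in the XML.
    if old_isbn not in content:
        raise ValueError("old ISBN not found in the master's content.xml")
    content = content.replace(old_isbn, new_isbn)

    # Copy every entry of the template, substituting only content.xml.
    with zipfile.ZipFile(template_odt) as src, \
            zipfile.ZipFile(output_odt, "w") as dst:
        for info in src.infolist():
            if info.filename == "content.xml":
                data = content.encode("utf-8")
            else:
                data = src.read(info.filename)
            # Passing the ZipInfo keeps the entry's name, order and compression.
            dst.writestr(info, data)


if __name__ == "__main__":
    # Placeholder file names and ISBNs; adjust before trying this on copies.
    build_variant("template-4x7.odt", "book-5x8.odt", "book-4x7.odt",
                  "OLD-ISBN", "NEW-ISBN")
```

Work on copies: the script never touches the two input files, but the result still needs to be opened in Writer and checked page by page.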

If the differences are simply different formatting, and if you do all your formatting with styles, and if you do in fact use the same named styles, then there is a much easier way to change from your master document to a differently formatted version without having to hack the content.xml file.

  1. Download and install the Template Changer extension.

  2. Create a template for each of the different formats you want, using exactly the same style names in each template.

Now you can use the Template Changer extension to change the template assigned to the document, if you want to change the formatting of the document; or you can make a copy of the document and use Template Changer on the copy if you want a separate document in a different format.