How to match everything up to end of paragraph?

Using RegEx to match some text and then everything up to the end of the paragraph sounds, on the face of it, like it should be a fairly simple thing to do, and in most flavours of RegEx that I’m familiar with (like InDesign’s), it is.

For example, if I want to match all figure captions, I might do something like this:

^Figure \d+\..+$

That is, match “Figure [numbers].” at the start of a paragraph, followed by any larger-than-zero number of any character (.+), followed by end of paragraph ($). This works fine in InDesign.
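For reference, here is how the same pattern behaves in a conventional engine. This is a minimal sketch using Python's re module rather than InDesign or LibreOffice; the sample text is made up, and re.MULTILINE is used so that lines stand in for paragraphs.

```python
import re

# The pattern from the post. With re.MULTILINE, ^ and $ anchor at line
# boundaries, which stand in for paragraph boundaries in this sketch.
CAPTION = re.compile(r"^Figure \d+\..+$", re.MULTILINE)

# Made-up sample text: one real caption, plus a line that only mentions
# a figure in the middle and so must not match.
text = (
    "Some body text.\n"
    "Figure 25. A picture of a thing and something else entirely.\n"
    "More body text mentioning Figure 3. in passing.\n"
)

matches = [m.group(0) for m in CAPTION.finditer(text)]
print(matches)
```

Only the caption line is matched, and it is matched in full, up to the end of the line.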

In LibreOffice, however, it always matches only “Figure [number].” and exactly one character (or sometimes no characters at all) after that, regardless of how many characters follow before the paragraph ends. It seemingly ignores the + and the $ entirely. I don’t really see how a RegEx pattern ending in $ can match a string that ends in the middle of a paragraph, but it does, somehow.

Looking closer, it seems that it’s not actually ignoring +$ as such; rather, this is to do with formatting. In my case, I’m specifically looking for instances in paragraphs that have the colour red. The pattern that I’m matching on (e.g. “Figure [number].”) happens to also be bold, while the following text is not, but the entire paragraph is red. If I limit the search to red text, however, it seemingly considers the end of the first direct formatting (= where the bold text ends) to be the end of the matchable string.

I recall finding out the hard way before that LibreOffice’s RegEx search works in unexpected ways and applies criteria in a nested way rather than cumulatively, but I would still consider this a violation of RegEx principles: if there is no end of a paragraph in the matchable string, there should be no match – the end of a matchable string is not the same as the end of a paragraph.

Regardless of the cause, is there a way to get past this and actually match a specific pattern plus the rest of the paragraph, specifying formatting for the entire paragraph and ignoring any additional formatting there may be?

Your regexp seems correct, but do you have any other criteria checked in the Edit>Find & Replace dialog?

Method-wise, all captions like Figure xxx should be created with Insert>Caption which would 1) automatically number the figure, 2) automatically manage the number (if figure is moved or deleted, if a new figure is inserted before, the whole numbering sequence is recomputed), 3) assign a dedicated paragraph style to the caption.

The last point is the most important because it allows you to build an index of figures automatically. You can also search for specific paragraph styles in Find & Replace and thus hit on the figure captions.

Also, you seem to rely heavily on direct formatting for your captions. You’re condemned to do so in Word, but Writer has the notion of character styles to ease your formatting job.

Apart from the formatting (colour: red), no – no other criteria.

I was aware (dimly, in the back of my mind) that when you specify formatting as well as RegEx in LibreOffice, it first chops the text up and discards everything that doesn’t satisfy the formatting requirement, and only then tests the RegEx pattern individually against each remaining block of text.

What I didn’t realise was that additional formatting (e.g., a paragraph of all-red text where some words are also bold or italic) seemingly acts as ‘borders’ for the purpose of chopping up the text, so that the RegEx matching will be done individually to each specific type of formatting.

So imagine I have a paragraph like this (pretending this is all in red):

Figure 25. A picture of a thing and something else entirely.

If I then specify ‘colour: red’ and use ^Figure \d+\..+$ as the RegEx pattern, the first thing LibreOffice will do is chop the paragraph up into six chunks:

  1. Figure 25.
  2. A picture of a
  3. thing
  4. and
  5. something else
  6. entirely.

Then it will test the RegEx pattern against each of these six chunks individually and (incorrectly, in my view) consider the end of each of them to be equal to an end-of-paragraph ($).
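To make the observed behaviour concrete, here is a rough model of it in Python. It is only a sketch of what I described, not LibreOffice's actual implementation. The run boundaries are an assumption on my part; in particular I assume the space after “Figure 25.” belongs to the bold run, which would explain the “exactly one character” match.

```python
import re

CAPTION = re.compile(r"^Figure \d+\..+$")

# Hypothetical formatting runs for the example paragraph. Where exactly
# each run starts and ends (including which run owns each space) is assumed.
runs = [
    "Figure 25. ",
    "A picture of a ",
    "thing",
    " and ",
    "something else",
    " entirely.",
]

paragraph = "".join(runs)

# Expected behaviour: test the pattern against the whole paragraph.
whole = CAPTION.search(paragraph)

# Observed behaviour (as modelled here): test each run separately, with
# the end of the run treated as if it were the end of the paragraph.
per_run_hits = [m.group(0) for m in map(CAPTION.search, runs) if m]

print(repr(whole.group(0)))
print(per_run_hits)
```

Under these assumptions the whole-paragraph test returns the full caption, while the per-run test returns only the first run, with .+ consuming just the trailing space.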

Method-wise, using figure captions is not the way to go here, for several reasons. For one thing, there are no actual figures in the document, only captions (which are only in the document to indicate where figures should appear in relation to the main text).

This is not my own document. It’s a Word file I’ve received, formatted almost entirely with direct formatting, and my purpose is only to transform this direct formatting (which, thankfully, is at least consistent) into character and paragraph styles in preparation for importing it into InDesign for proper typesetting.

I would actually prefer to do this in Word as well, since Word has superior functionality when it comes to searching for and applying styles; but Word is so finicky and bug-prone when doing advanced find-and-replace involving styles that I daren’t – last time I tried, it randomly started replacing all (directly formatted) italics with Heading 1 (yes, a paragraph style applied to individual words in the middle of a paragraph with a different paragraph style applied!) and deleting entire pages of text. </rant>

(It’s not true, incidentally, that you cannot use character styles in Word; you can. When I write documents myself, I use structural styling exclusively, but alas, most people do not. What I really want is for Word and LibreOffice to just support styles properly throughout, the way InDesign does.)

Then it is possible that the load-time conversion between Word format and ODF creates something like the boundaries you describe. I have never experienced such search trouble but I rarely submit complex search criteria involving both regexp and formatting.

It seems like the Word conversion might be responsible. Even if I don’t use RegEx or even specify a search string (just search for ‘colour: red’ and nothing else), the result is the same.

Creating a new document in LibreOffice, typing in the example sentence from my previous comment (formatted as above, in red), and then searching for ‘colour: red’ and nothing else, the first “Find next” highlights the entire line in one go. In the Word-originating file, the same search consistently only finds sub-blocks by their additional formatting, and I have to click through “Find next” six times to get to the end of the line, as per the list above.

I have saved the document as an .odt file, but that didn’t fix it, so once it’s there in the document, it’s seemingly there to stay.

I can only guess that this must be some sort of conversion bug, which I can sympathise with. The .docx format is so bloated and insane I have nothing but sympathy for anyone attempting to write a converter to force its madness into a sensible format – it’s inevitable that bugs will appear.

As you have experienced, exact conversion between formats is next to impossible. In the case of .docx → ODF, the attempt to translate formatting results in zillions of single-use paragraph, character and page styles. Once this mess is part of your file, saving it as .odt will only “freeze” the situation, avoiding cumulative conversions when loading from and saving to .docx.
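If you want a rough idea of how much of this a converted file contains, one option is to count the automatic styles stored in the .odt package. The following is a minimal sketch using only the Python standard library. The function name is hypothetical, and it only looks at content.xml, so it gives an indication rather than a complete inventory.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# ODF XML namespaces used in content.xml.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"


def count_automatic_styles(odt_path):
    """Count automatic styles in content.xml, grouped by style family.

    odt_path may be a file name or a file-like object, since an .odt
    file is an ordinary ZIP archive.
    """
    with zipfile.ZipFile(odt_path) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    auto = root.find(f"{{{OFFICE_NS}}}automatic-styles")
    if auto is None:
        return Counter()

    # Each <style:style> carries a style:family such as "paragraph" or "text".
    return Counter(
        style.get(f"{{{STYLE_NS}}}family")
        for style in auto.findall(f"{{{STYLE_NS}}}style")
    )
```

Calling it on the converted file and on a freshly typed test document (for example, count_automatic_styles("converted.odt"), where the file name is a placeholder) should make the difference in the "text" and "paragraph" counts visible.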

You can fix the file by restyling it consistently, but this is equivalent to rewriting it. And you had better not miss a single pseudo-style application, however modest! The worst trouble is with page styles, because nearly every page gets its own style.

But since your goal is to export to InDesign, do your best. Are styles understood by InDesign, or must you reformat everything again?

Yes, that’s the whole point of converting to styles: to ease the import.

You can get InDesign to bring in all the direct formatting, but then you end up with a humongous mess of InDesign styles you actually want and pseudo-style crap that you absolutely don’t. Sometimes you even get direct formatting pulled through that isn’t available in InDesign, and which, consequently, you can’t get rid of (usually related to the “Complex Script” options in Word, some of which only have equivalents in the CJK version of InDesign). It’s a dreadful mess.

But InDesign does recognise and understand character and paragraph styles even if your import settings discard all direct formatting (well, most of it, anyway; .docx → .indd conversion is also not free of problems, so some direct formatting always slips through, but that’s easy to remove). You can specify precisely which styles in the Word file should be mapped to which styles in your InDesign document, which to ignore (= apply ‘none’ to), and which to auto-create new styles for. It’s very flexible.

(InDesign is also the only text editing/layout app I know of that allows me to specify both paragraph style, character style, direct formatting and text at the same time, in both the ‘search’ and the ‘replace’ parts of the find-and-replace functionality. E.g., search for word A with paragraph style B and character style C applied, and replace it with word X with paragraph style Y and character style Z applied.)