Regex ignores $ when formatting is specified?

Oisín · April 26, 2020, 6:38am

I’m trying to regex my way through a very long document in order to convert manually created headings (plain-text numbering, direct formatting instead of styles) to proper headings.

The direct formatting is a mess – different fonts and sizes all over the place – but the one thing all the headings have in common is that they’re in bold. The format is:

1(.) Heading 1
1.1 Heading 2
1.1.1 Heading 3

(Occasionally the dot will be missing from Heading 1, but it’s mostly there.)

There are perhaps something like 700 headings in total, which is an awful lot to go through manually. So I figured the easiest way to go through them would be regex: find all occurrences of start-of-paragraph + number(s) + dot(s) + space + text + end-of-paragraph, limit to occurrences where the whole thing is bold.

As far as I can think, this regex ought to do that for Heading 1:

^\d\.? +(.+)$

But the end-of-paragraph marker $ seems to be ignored here – it matches a whole bunch of non-heading paragraphs that happen to start with number + dot + bold text, but then continues with non-bold text, like so:

3. This is part of a list and not a heading, and only the first part is bold.

I would have expected my regex not to match that at all, since the group (.+) with bold formatting does not appear right before the end of the paragraph.

What gives? How can I make LibreOffice’s regex engine match only cases where the formatting applies to the entire paragraph, up until the end-of-paragraph marker?

Also, when I replace the text with $1, which I would think ought to be the first matched group, (.+), i.e., any text that comes after number + dot + space, what it does rather puzzles me: it leaves the entire paragraph intact, i.e., doesn’t change anything in the text itself, but resets all formatting completely, removing direct formatting and applying Default Style instead of whatever style was applied before. Huh?

mikekaganski · April 26, 2020, 7:32am

Do you mean that you look for text that both has bold formatting and matches a regex? The search engine first finds bold text, and then applies the regex to the resulting text. So in the example you provided, the search for bold text extracts “3. This is part of a list” from that paragraph, and this then matches the regex you use (the final “t” being the end of the string passed to the regex engine).

But that changes in the upcoming v.7.0, where it works as you expect (the fixes for tdf#65038, tdf#75806 and tdf#130984 together changed that).

Oisín · April 26, 2020, 7:38am

Oh, I see! Well, I would never have guessed that. So there is currently no way to achieve what I’m looking for? I suppose I’ll have to look forward to 7.0 coming out, then.

Oisín · April 26, 2020, 12:45pm

Actually, come to think of it, even with the order that it first finds all bold text and then applies the regex to only that text, this still doesn’t quite make sense. If that’s what it’s doing, I wouldn’t expect it to match the example I gave either, since, in the text that matches the formatting (bold), there is no end of paragraph position: since the $ in the regex pattern does not match anything in the bold text, the thing as a whole shouldn’t match either… right?

mikekaganski · April 27, 2020, 8:42am

The regex engine is given the short string: “3. This is part of a list”. This is all the engine knows. It has no idea about paragraphs or whatever. The text it works with ends with the “t”, and that character is, from the engine’s PoV, the end of text - thus it matches the $. That was a bug, and it was fixed as I mentioned above, but that’s what was happening. Thinking about it like “I wouldn’t expect it” is logical from the user’s PoV, but doesn’t make sense when you are trying to understand why it doesn’t work (i.e., looking under the hood) - that’s what we do when we look for the reasons for bugs, and then what we expect means little, and what actually happens is the most important.
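
For readers who want to see the difference outside LibreOffice, here is a rough Python sketch of the two orders of operation described in this thread. It is not LibreOffice code: the paragraph model (a list of `(text, is_bold)` runs) and the function names `old_behaviour` and `expected_behaviour` are hypothetical, and Python's `re` module only stands in for the regex engine LibreOffice uses. The pattern is the one from the question.

```python
import re

# The Heading 1 pattern from the question.
HEADING_1 = re.compile(r"^\d\.? +(.+)$")

# Hypothetical model: a paragraph is a list of (text, is_bold) runs.
list_item = [
    ("3. This is part of a list", True),
    (" and not a heading, and only the first part is bold.", False),
]
heading = [("1. Heading 1", True)]


def old_behaviour(runs):
    """Isolate each bold run first, then apply the regex to that run alone.

    The engine only sees the short string, so $ matches at its last character.
    """
    return any(
        is_bold and HEADING_1.search(text) is not None
        for text, is_bold in runs
    )


def expected_behaviour(runs):
    """Apply the regex to the whole paragraph, then require an all-bold match."""
    full_text = "".join(text for text, _ in runs)
    match = HEADING_1.search(full_text)
    if match is None:
        return False
    # One bold flag per character of the paragraph.
    bold_flags = [is_bold for text, is_bold in runs for _ in text]
    return all(bold_flags[match.start():match.end()])


print(old_behaviour(list_item), expected_behaviour(list_item))
print(old_behaviour(heading), expected_behaviour(heading))
```

With the first strategy the partly bold list item counts as a match, because the bold run on its own satisfies `^...$`. With the second it does not, since the match would have to extend into non-bold text to reach the real end of the paragraph. The true heading matches under both.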