I’m trying to regex my way through a very long document in order to convert manually created headings (plain-text numbering, direct formatting instead of styles) to proper headings.
The direct formatting is a mess – different fonts and sizes all over the place – but the one thing all the headings have in common is that they’re in bold. The format is:
1(.) Heading 1
1.1 Heading 2
1.1.1 Heading 3
(Occasionally the dot will be missing from Heading 1, but it’s mostly there.)
There are roughly 700 headings in total, which is an awful lot to go through manually. So I figured the easiest way to go through them would be regex: find all occurrences of start-of-paragraph + number(s) + dot(s) + space + text + end-of-paragraph, limited to occurrences where the whole thing is bold.
As far as I can tell, this regex ought to do that for Heading 1:
^\d\.? +(.+)$
But the end-of-paragraph marker $ seems to be ignored here – it matches a whole bunch of non-heading paragraphs that happen to start with number + dot + bold text, but then continue with non-bold text, like so:
3. This is part of a list and not a heading, and only the first part is bold.
I would have expected my regex not to match that at all, since the group (.+) with bold formatting does not appear right before the end of the paragraph.
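To rule out the pattern itself, it can be checked against the plain text alone. The following is a minimal sketch in Python, assuming Python's re module treats this simple pattern the same way LibreOffice's engine does; the sample strings are the examples from this post, and the names HEADING1 and samples are only for illustration. On text alone, the list paragraph matches just like a real heading, which suggests the regex by itself knows nothing about where the bold formatting ends.

```python
import re

# The Heading 1 pattern from the post, unchanged.
HEADING1 = re.compile(r"^\d\.? +(.+)$")

samples = [
    "1. Heading 1",   # heading with dot
    "1 Heading 1",    # heading with the dot missing
    "3. This is part of a list and not a heading, and only the first part is bold.",
    "1.1 Heading 2",  # second-level heading, should not match
]

# The regex only sees characters, so the list item matches too.
matches = [HEADING1.match(s) is not None for s in samples]
for sample, matched in zip(samples, matches):
    print(matched, "|", sample)
```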
What gives? How can I make LibreOffice’s regex engine match only cases where the formatting applies to the entire paragraph, up until the end-of-paragraph marker?
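The condition being asked for can be written down independently of LibreOffice. This is only a sketch of the logic, not LibreOffice's API: it assumes each paragraph is available as a list of (text, is_bold) runs, which is a hypothetical model, and is_bold_heading is a made-up helper name. A paragraph counts as a heading only if its full text matches the pattern and every run that carries text is bold.

```python
import re

# The Heading 1 pattern from the post, unchanged.
HEADING1 = re.compile(r"^\d\.? +(.+)$")

def is_bold_heading(runs):
    """runs: list of (text, is_bold) tuples making up one paragraph (hypothetical model)."""
    text = "".join(t for t, _ in runs)
    # Empty runs are ignored; every run that carries text must be bold.
    all_bold = all(bold for t, bold in runs if t)
    return bool(text) and all_bold and HEADING1.match(text) is not None

heading = [("1. ", True), ("Introduction", True)]
list_item = [("3. This is part of a list", True), (" and not a heading.", False)]

print(is_bold_heading(heading))    # whole paragraph bold
print(is_bold_heading(list_item))  # bold stops before the end of the paragraph
```

In a real macro the runs would have to be read from the document itself; that part is not shown here.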
Also, when I replace the text with $1, which I would think ought to be the first matched group, (.+), i.e., any text that comes after number + dot + space, what it does rather puzzles me: it leaves the entire paragraph intact, i.e., doesn’t change anything in the text itself, but resets all formatting completely, removing direct formatting and applying Default Style instead of whatever style was applied before. Huh?
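For comparison, this is the result the replacement would be expected to give on plain text. Again a minimal sketch in Python rather than LibreOffice itself; note that Python writes the backreference as \1 where the Replace box uses $1, and the variable names are only for illustration.

```python
import re

# The Heading 1 pattern from the post, unchanged.
HEADING1 = re.compile(r"^\d\.? +(.+)$")

paragraph = "1. Heading 1"

# Replace the whole match with the first group, dropping number, dot and space.
result = HEADING1.sub(r"\1", paragraph)
print(result)
```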