How to find three optional patterns in one regex (abc–abc, abc –abc, abc– abc), but not a fourth one (abc – abc)

Earendil · July 4, 2023, 4:45pm

Hi,
I want to find

1. word–word
2. word –word
3. word– word

but not

4. word – word

Presently I am using
([a-z]{1,}|[\u00F0-\u02AF]{1,})–([a-z]{1,}|[\u00F0-\u02AF]{1,})

which may be simplified in some way.

I replace the result with:

$1 ⁠–⁠ $2
in order to get 4.

Now, is there a way to write one regex that also covers 2. and 3. (but not 4.) and still allows a replacement that produces 4.?

ajlittoz · July 4, 2023, 4:53pm

([a-z\u00F0-\u02AF]+)\s?–\s?([a-z\u00F0-\u02AF]+)

It also matches 4., but the replacement then reproduces the same text, so it does not matter, apart from a negligible loss of performance.
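
As a quick check outside LibreOffice, the merged character class can be tried with Python's `re` module. This is only a sketch under assumptions: LibreOffice uses the ICU regex engine and writes backreferences as `$1`, whereas Python writes `\1`, and the sample strings and the helper name `normalize` are made up for illustration.

```python
import re

# ajlittoz's pattern: one character class instead of two alternatives,
# with an optional whitespace character on either side of the en dash.
PATTERN = re.compile(r"([a-z\u00F0-\u02AF]+)\s?–\s?([a-z\u00F0-\u02AF]+)")


def normalize(text):
    """Rewrite every match as 'word – word' (hypothetical helper name)."""
    return PATTERN.sub(r"\1 – \2", text)


samples = ["abc–abc", "abc –abc", "abc– abc", "abc – abc"]
for sample in samples:
    # The fourth sample also matches, but is replaced by identical text.
    print(repr(sample), "->", repr(normalize(sample)))
```

All four samples come out as `abc – abc`; the last one is matched and rewritten unchanged, which is the behaviour described above.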

mikekaganski · July 4, 2023, 5:04pm

Another option:

(?:(?<=\w)–\s*)|(\s*–(?=\w))
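
To see why this pattern leaves case 4 alone, it may help to run it over the four sample strings. The sketch below uses Python's `re` module as a stand-in for LibreOffice's ICU engine (an assumption; the two engines agree on the constructs used here), and `space_dash` is a hypothetical helper name.

```python
import re

# Match only the dash plus adjacent whitespace, and only when a word
# character touches the dash directly on at least one side.
DASH = re.compile(r"(?:(?<=\w)–\s*)|(\s*–(?=\w))")


def space_dash(text):
    """Replace each matched dash (and its stray whitespace) with ' – '."""
    return DASH.sub(" – ", text)


for sample in ["abc–abc", "abc –abc", "abc– abc", "abc – abc"]:
    found = DASH.search(sample)
    matched = repr(found.group(0)) if found else "no match"
    print(repr(sample), "| matched:", matched, "| result:", repr(space_dash(sample)))
```

The words themselves are never part of the match, so the replacement is the plain string " – " with no backreferences. In "abc – abc" the dash has whitespace on both sides, so neither the lookbehind nor the lookahead succeeds and nothing is found.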

Earendil · July 5, 2023, 10:18am

Thanks, @ajlittoz. I needed to avoid matching 4. so that I can concentrate on the expressions that need work.
And thank you, @mikekaganski. I actually needed to capture the words in a specific order so that I replace only the part I need.

mikekaganski · July 5, 2023, 10:58am

Hmm.

That was your original question. Was it misleading or incorrect? My pattern lets you replace each match with " – ", without any backreferences, and get 4.

erAck · July 4, 2023, 5:06pm

Your definition of word characters as ([a-z]{1,}|[\u00F0-\u02AF]{1,}) is somewhat odd; I’d simply use \w+ instead, but… anyway, the dash part is just a three-case alternation, so
\w+(?:–| –|– )\w+
(here with the (?:) non-capturing parentheses, which have less overhead than () capturing parentheses) or, if you insist,
([a-z]{1,}|[\u00F0-\u02AF]{1,})(?:–| –|– )([a-z]{1,}|[\u00F0-\u02AF]{1,})

Edit: changed outer non-capturing parentheses to capturing due to replacement requirement.

Earendil · July 5, 2023, 10:26am

Hi @erAck, working from your suggestion I came up with something I can use (I needed to capture the words for the replacement):

(\w+)(–| –|– )(\w+)
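
A small sketch of how the three capture groups behave, using Python's `re` module as a stand-in for LibreOffice's engine (an assumption). The replacement `\1 – \3` is the Python spelling of what would presumably be `$1 – $3` in LibreOffice; the thread does not state the replacement string, and `respace` is a hypothetical helper name.

```python
import re

# Group 1 and group 3 hold the words; group 2 holds the dash variant.
THREE_CASES = re.compile(r"(\w+)(–| –|– )(\w+)")


def respace(text):
    """Keep both words and put exactly one space on each side of the dash."""
    return THREE_CASES.sub(r"\1 – \3", text)


for sample in ["word–word", "word –word", "word– word", "word – word", "word–, next"]:
    found = THREE_CASES.search(sample)
    groups = found.groups() if found else None
    print(repr(sample), "| groups:", groups, "| result:", repr(respace(sample)))
```

Case 4 ("word – word") is not matched, because none of the three alternatives allows a space on both sides. The last sample is not matched either, since a comma is not a `\w` character.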

Now this works as expected, but I need to solve a problem that I recently noticed.
Sometimes – are not followed or preceded by a word but by a comma or a full stop (or other typographical characters).

I still want to avoid finding 4. but want to include those cases. Would I have to include all possible characters within the brackets separated by | or is there some other way to do that?

mikekaganski · July 5, 2023, 10:49am

Use \S instead of \w. Or, if you specifically need to match (or exclude) the SPACE (U+0020) character, not any whitespace character, then [^ ].
More information is in the documentation.

Note that this is based on reading your “Sometimes – are not followed or preceded by a word but by a comma or a full stop (or other typographical characters)” as “I want to treat those typographical characters the same as words”; but it could be different. You need to explain your patterns better, and explain what you expect in all cases like “.–,”, “foo –.”, etc.
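
The difference between \S and [^ ] mentioned above can be checked character by character. This is a minimal sketch in Python's `re` module, assumed here to treat these four test characters the same way as LibreOffice's engine; `classify` is a hypothetical helper name.

```python
import re


def classify(ch):
    """Return (matches \\S, matches [^ ]) for a single character."""
    non_whitespace = re.fullmatch(r"\S", ch) is not None
    not_plain_space = re.fullmatch(r"[^ ]", ch) is not None
    return non_whitespace, not_plain_space


for name, ch in [("letter", "a"), ("comma", ","), ("space", " "), ("tab", "\t")]:
    print(name, classify(ch))
```

Letters and punctuation pass both tests and the plain space fails both. The tab is where they differ: it is whitespace, so \S rejects it, but it is not U+0020, so [^ ] accepts it.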

Earendil · July 5, 2023, 11:53am

Hi @mikekaganski, I had actually tried your previous answer (?:(?<=\w)–\s*)|(\s*–(?=\w)) and got no results in my test file. I now realize something was amiss (I probably had some formatting option switched on in the LO search and replace window), and it works now. I also took some time to understand it and looked at the documentation. So now I am using (?:(?<=\S)–\s*)|(\s*–(?=\S)) and have no problems at all. Thanks a lot!
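
For readers who want to see what the final \S variant does with punctuation, here is a sketch in Python's `re` module (an assumption; LibreOffice itself uses ICU). The helper name `space_dash_any` is hypothetical, and the punctuation samples are the edge cases raised earlier in the thread.

```python
import re

# Same structure as the \w version, but any non-whitespace character
# next to the dash (letters, commas, full stops, ...) now counts.
FINAL = re.compile(r"(?:(?<=\S)–\s*)|(\s*–(?=\S))")


def space_dash_any(text):
    """Replace each matched dash (and its stray whitespace) with ' – '."""
    return FINAL.sub(" – ", text)


samples = [
    "word–word", "word –word", "word– word",  # cases 1 to 3
    "word – word",                            # case 4, left untouched
    "word–, next", "foo –.", ".–,",           # punctuation next to the dash
]
for sample in samples:
    print(repr(sample), "->", repr(space_dash_any(sample)))
```

Note that punctuation is spaced like a word, so "foo –." becomes "foo – ." with a space before the full stop. Whether that is the desired typography depends on the text being cleaned.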