Why doesn't LibreOffice find word boundaries correctly?

asked 2017-07-17 09:01:43 +0100

lomacar

updated 2017-07-17 14:05:09 +0100

If I search for the regular expression \bhey\b, it matches " .hey. " and of course " hey ", but not " hey.hey ".

Is this a bug?

Furthermore, if I change the search to (\b|\.)hey(\b|\.) it will match the first "hey" in "hey.hey", but not the second one!

UPDATE: It seems that periods/full stops "erase" word boundaries when they occur in the middle of a word, even though the period itself isn't treated as a word character there. Sounds like a bug to me, but I need a way to select words separated by periods. My "furthermore" search above doesn't work because when it searches "hey.hey" it first matches "hey." (including the period), leaving a second "hey" that is mysteriously missing a word boundary on its left.

The best I can manage is hey(\b|\.) which will match everything I need it to, but will also match "they", which is hopefully OK for my purposes. ("Hey" would actually be a long list of abbreviations like (ABC|XYZ|ETC).)


1 Answer

answered 2017-07-17 14:39:32 +0100

Lupp

updated 2017-07-18 11:02:40 +0100

LibreOffice does not ship its own RegEx engine; it uses the one provided by ICU. If there is a bug, it is most likely in that third-party library.

Because \b in ICU Regex acts as a lookbehind assertion and a lookahead assertion at the same time, it may be that the engine cannot recognize that a single period (\.) sitting between two \w characters satisfies \b on both of its sides. That, however, should equally affect cases where the period is replaced by a different \W character, and in some cases I tested it doesn't.

Thus the best I can do for you without a lot of research is to suggest the workaround
(\b|(?<=\W))hey(\b|(?=\W)) or a similar one.
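
To see what the lookarounds in this workaround do, here is a minimal sketch in Python. Note the assumptions: Python's re module is not the ICU engine that LibreOffice uses, and in Python a plain \bhey\b already matches both words in "hey.hey", so this does not reproduce the LibreOffice behaviour. It only illustrates that the pattern matches the word without swallowing the neighbouring period. The helper name find_spans and the sample strings are made up for the illustration.

```python
import re

# The workaround pattern from above, copied unchanged.
# Assumption: Python's `re` stands in for ICU here purely for illustration.
WORKAROUND = re.compile(r"(\b|(?<=\W))hey(\b|(?=\W))")


def find_spans(text):
    """Return the (start, end) span of every non-overlapping match in text."""
    return [m.span() for m in WORKAROUND.finditer(text)]


if __name__ == "__main__":
    for sample in [" hey ", " .hey. ", "hey.hey", "they"]:
        # Each span is exactly three characters long: the lookarounds
        # check the context but never become part of the match.
        print(repr(sample), find_spans(sample))
```

Because the period is only checked and never consumed, it stays available as left-hand context for the next "hey", and "they" is not matched because neither alternative on the left side succeeds after the letter "t".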


Comments

Thanks, I don't know what that crazy stuff with the question marks and equal signs is, but it works!

lomacar (2017-07-18 04:11:43 +0100)

I am not an "educated specialist" in RegEx, but I am well pleased by their usefulness. Searching for reliable information, I found this site some time ago: http://www.regular-expressions.info/ It's by far the best work on RegEx I have ever seen.
You may want to visit it. The expressions starting with (?<= and (?= are "zero-length" lookbehind/lookahead assertions, used to restrict matches to a context without including that context in the search results.

Lupp (2017-07-18 10:24:09 +0100)