How to search sentences with optional endings and words in them

Earendil · June 15, 2023, 8:55am

Hi,
I am trying to find barbarisms in my texts.
I am currently working with Spanish, so the examples will be in that language.
These are some of the sentences (in the regex you’ll see more words) I would like to find (part in bold):

Decia que bajo esta circunstancia nal
abc Bajo preparación abc bajo revisión. (Update, this was not contemplated by this regex)

I do not want to find:

Bajo esta idea

Currently the regex I have developed exceeds the boundaries of the bold sentences and also finds example 3, which I do not want.

This is the regex string:

(bajo .{1,} [\bbase\b \bcircunstancia[s]?\b \bcondici[ón ones]?\b \baspecto[s]?\b \bconcepto[s]?\b \bsupuesto[s]?\b]+)

In square brackets I’ve put the optional endings I am looking for.

Can someone explain why? Thanks!

mikekaganski · June 15, 2023, 8:57am

Is this meant to list words? [] is a list of characters. So your regex converts this way:

[\bbase\b \bcircunstancia[s] turns into [abceinrstu[\ ] (note that \b is not “inside sets” in the documentation, so the \ will likely be treated as a literal character);
[ón ones] turns into [enoós ].

(or something like that; I’d even expect the whole regex to be faulty.)
And the whole piece will not work the way you expect.
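
To make the set-versus-group difference concrete, here is a minimal sketch in Python. Python's `re` module is only a stand-in (an assumption: Writer uses its own regex engine, and details may differ), but the basic rule that a set matches a single character applies to both.

```python
import re

# A set [...] matches exactly ONE character drawn from its members.
ending_set = re.compile(r"condici[ón ones]?")
# A group with alternation matches one of the whole endings.
ending_group = re.compile(r"condici(ón|ones)?")

# None: the set can consume only one character after "condici".
set_full = ending_set.fullmatch("condiciones")
# Match: "ones" is one of the alternatives.
group_full = ending_group.fullmatch("condiciones")
# Match: "s" is a member of the set, so this non-word is accepted.
set_odd = ending_set.fullmatch("condicis")

print(set_full, group_full, set_odd)
```

The set version both misses the intended word and accepts strings that were never meant to match, which is why the endings need `(ón|ones)` rather than `[ón ones]`.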

.{1,} is equivalent to .+
There’s no need to put \b after a space before word characters, because this combination implies word break.
There’s no need to put \b after each of the searched variants - you may put it once at the end.
There’s no need to put a single character into [] before ? - the latter makes one previous term (character in this case) optional anyway.

And the change could look like

(bajo .+ (base|circunstancias?|condici(ón|ones)?|aspectos?|conceptos?|supuestos?)\b)

which indeed doesn’t handle your #2, because neither preparación nor revisión is in the list you have authored so far. (I kept a suspicious ? after (ón|ones) because I know nothing about Spanish, let alone its barbarisms, so I don’t know whether a bare condici without an ending can be expected.)
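
As a quick check of the rewritten pattern against the sample sentences from the first post, here is a hedged sketch in Python. The `re.IGNORECASE` flag is an assumption that mimics a search with "Match case" left unchecked, and Python's `re` is used only as a stand-in for Writer's engine.

```python
import re

# Pattern copied from the post above; the flag is an assumption.
pattern = re.compile(
    r"(bajo .+ (base|circunstancias?|condici(ón|ones)?|aspectos?|conceptos?|supuestos?)\b)",
    re.IGNORECASE,
)

samples = [
    "Decia que bajo esta circunstancia nal",
    "abc Bajo preparación abc bajo revisión.",
    "Bajo esta idea",
]

# Map each sample to the matched text, or None when nothing matches.
results = {}
for text in samples:
    m = pattern.search(text)
    results[text] = m.group(0) if m else None
    print(repr(text), "->", results[text])
```

Only the first sample produces a match, and the match stops at the listed word instead of running to the end of the line.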

Earendil · June 15, 2023, 11:35am

Hi @mikekaganski and thanks a million for your explanation.

Correct about the fact that your solution does not handle #2, I updated my previous post to say this.

I tried your search string and it works as expected except for a case like this one:

Bajo estas condiciones o bajo tu condición

where it selects the whole, instead of selecting the sentence before “o” and then the sentence after “o”.

If it is the best and most economical way to write it, I will mark your answer as the solution anyway; it really helped me.

Could you please explain better your comment below?

ajlittoz · June 15, 2023, 11:41am

IMHO your goal exceeds the capabilities of the simple regexp engine in Writer. This engine works with characters, while your purpose addresses higher-level components such as words and their inflections.

Your problem operates on two levels: lexical level to recognise words and syntactical level to group words into grammatical sequences.

I faced such an issue but could not find any suitable tool. Consequently I created one. If you’re interested I can send you its description by private mail so that you can see if it fits your need.

PS: you didn’t mention your OS nor your LO version. My tool works under Linux.

mikekaganski · June 15, 2023, 11:53am

Indeed: .{1,} (and .+) are “greedy”, so they will try to match as much as possible. To avoid this, replace .+ with the non-greedy .+? (+? is also described in the mentioned documentation).
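
The greedy and non-greedy behaviour can be compared side by side with a small Python sketch. The names `WORDS`, `greedy` and `lazy` are made up for illustration, the groups are written as non-capturing `(?:...)` for brevity, and `re.IGNORECASE` is an assumption standing in for a case-insensitive search.

```python
import re

# Word list from the thread, with non-capturing groups (illustrative choice).
WORDS = r"(?:base|circunstancias?|condici(?:ón|ones)?|aspectos?|conceptos?|supuestos?)"

greedy = re.compile(rf"bajo .+ {WORDS}\b", re.IGNORECASE)
lazy = re.compile(rf"bajo .+? {WORDS}\b", re.IGNORECASE)

text = "Bajo estas condiciones o bajo tu condición"

# Greedy .+ runs to the last listed word; lazy .+? stops at the first one.
greedy_hits = [m.group(0) for m in greedy.finditer(text)]
lazy_hits = [m.group(0) for m in lazy.finditer(text)]

print(greedy_hits)
print(lazy_hits)
```

With the greedy pattern the whole line comes back as a single match, while the non-greedy one yields the two phrases separately.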

The set - the group of characters between [ and ] - uses a different set of rules compared to those outside of []. So outside of [], \b means word boundary; but [\b] would not treat it as a boundary. Reading the documentation again, I think that in this case, the \ will be simply ignored, and [\b] will be equivalent to [b].
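
For comparison, here is how one other engine handles the same construct. This sketch uses Python's `re`, which is not the engine Writer uses, so the result inside the set may well differ from the guess above. It only illustrates that escapes change meaning between the two contexts.

```python
import re

# Outside a set, \b is a zero-width word boundary.
outside = re.findall(r"\bbase\b", "base basement database")

# Inside a set, Python's `re` reads \b as the backspace character (U+0008),
# so this set matches a backspace or one of the letters a, s, e (not "b").
inside = re.findall(r"[\base]", "b\x08ase")

print(outside)
print(inside)
```

Whatever a given engine does with `\b` inside a set, it is not a word boundary there, so boundaries have to stay outside the brackets.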

Earendil · June 15, 2023, 12:17pm

Hi @ajlittoz, happy to read your note. Yes, please send its description by mail. How do I share my mail with you?

My specs:
macOS, M1 (Apple silicon), Ventura 13.4
Version: 7.4.7.2 / LibreOffice Community
Build ID: 723314e595e8007d3cf785c16538505a1c878ca5
CPU threads: 8; OS: Mac OS X 13.4; UI render: default; VCL: osx
Locale: en-US (en_IT.UTF-8); UI: en-US
Calc: threaded

Earendil · June 15, 2023, 12:24pm

Thanks @mikekaganski I now get it: I had not seen the outside of sets and [inside sets] columns in the documentation!
Have a good day.

LeroyG · June 15, 2023, 12:54pm

Try with:

> (bajo (.*? )*?(preparación?|revisión?|base?|circunstancias?|condici(ón|ones)?|aspectos?|conceptos?|supuestos?){1})

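
The proposed string can be tried outside Writer with a short Python sketch. Python's `re` and the `re.IGNORECASE` flag are assumptions standing in for Writer's search, and `hits` is a hypothetical helper name.

```python
import re

# Pattern copied from the post above; the flag is an assumption.
pattern = re.compile(
    r"(bajo (.*? )*?(preparación?|revisión?|base?|circunstancias?"
    r"|condici(ón|ones)?|aspectos?|conceptos?|supuestos?){1})",
    re.IGNORECASE,
)

def hits(text):
    """Return every full match of the pattern in text, in order."""
    return [m.group(0) for m in pattern.finditer(text)]

print(hits("abc Bajo preparación abc bajo revisión ."))
print(hits("Bajo estas condiciones o bajo tu condición"))
print(hits("Bajo esta idea"))
# A trailing ? makes only the LAST letter optional, and there is no \b,
# so a partial word such as "bas" inside "basura" is accepted too.
print(hits("bajo la basura"))
```

The last call shows a side effect worth keeping in mind: `base?` and `revisión?` make the final letter optional, and without a closing `\b` the match can end in the middle of a longer word.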

Earendil · June 15, 2023, 1:15pm

Hi @LeroyG, thank you for proposing your solution. It does work for all cases and shows another way to get the desired result. Unfortunately, in my case I fear that I need a separate string for:

abc Bajo preparación abc bajo revisión .

since for them I should not have anything between bajo and the substantive.
I am also following this list of barbarisms in Spanish for the revision work I do.

I guess that it would be easier to add a separate string than to “patch” the one you propose, right?

Anyhow, thanks again for your proposed solution.

LeroyG · June 15, 2023, 1:50pm

Then, do you need to find both as the same instance?

Earendil · June 15, 2023, 2:19pm

Well, to be fair, I am managing with two separate strings, the second one being
bajo (preparación|revisión)
So, I think that with both strings I got what I am looking for, covered!
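
For anyone who wants to run both strings over plain text outside Writer, here is a minimal sketch in Python. The function name `find_barbarisms` is hypothetical, the first pattern is the non-greedy variant discussed earlier in the thread, and `re.IGNORECASE` is an assumption standing in for a case-insensitive search.

```python
import re

FLAGS = re.IGNORECASE  # assumption: case-insensitive search

# First string: words allowed between "bajo" and the noun (non-greedy).
with_words_between = re.compile(
    r"bajo .+? (base|circunstancias?|condici(ón|ones)?|aspectos?|conceptos?|supuestos?)\b",
    FLAGS,
)
# Second string, as posted: nothing allowed between "bajo" and the noun.
direct = re.compile(r"bajo (preparación|revisión)", FLAGS)

def find_barbarisms(text):
    """Return (start, matched_text) pairs from both patterns, in text order."""
    found = []
    for pattern in (with_words_between, direct):
        found.extend((m.start(), m.group(0)) for m in pattern.finditer(text))
    return sorted(found)

for sample in (
    "Decia que bajo esta circunstancia nal",
    "abc Bajo preparación abc bajo revisión.",
    "Bajo esta idea",
):
    print(repr(sample), "->", [t for _, t in find_barbarisms(sample)])
```

Keeping the two patterns separate mirrors the approach above: each string stays short and can be adjusted without affecting the other.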