Regex: behaviour of \w ("any word character")

eteb3 · October 30, 2024, 8:18am

Word characters are [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\u200c\u200d].

Can anyone translate this ^ for me?
What is in the classes \p{Mark}, and \p{Connector_Punctuation}?

Reason for asking:

I have this string:
The Old Gymnasium, Vicarage Playing Fields, Llanbadarn Road.

I want to return Llanbadarn Road only.

I’m trying to tell the regex I want “a string starting just after comma-space and ending just before full stop/period, but ignore all commas except the last one”.

I thought I could do that by returning everything between ,_ and . if the only characters found are 0-9A-Za-z (it would be nice to add ' and - too).

"(?<=,\s).*(?=\.)" returns Vicarage Playing Fields, Llanbadarn Road
"(?<=,\s)\w*(?=\.)" returns #NA
"(?<=,\s)[0-9A-Za-z]*(?=\.)" returns #NA

Any tips? Thanks.

*(underscore is a space: markdown is unhappy without it)

mikekaganski · October 30, 2024, 10:24am

The ICU regex documentation page explains the [\p...] metacharacter in its Set Expressions (Character Classes) section. It mentions Unicode classes and properties, which is itself a huge topic. One starting point could be another ICU page, Properties | ICU Documentation, which has some tabular data and further links, e.g. UAX #44: Unicode Character Database.
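
For a quick look at what those classes contain, here is a minimal Python sketch using the standard `unicodedata` module. It assumes that `\p{Mark}` corresponds to the general categories Mn, Mc and Me, and `\p{Connector_Punctuation}` to Pc, as laid out in UAX #44. The sample characters are picked for illustration and do not come from the thread, and Python stands in only for inspection, since LibreOffice itself uses ICU.

```python
import unicodedata

# Sample characters chosen for illustration (an assumption, not from the thread).
samples = {
    "underscore": "_",
    "undertie": "\u203f",
    "combining acute accent": "\u0301",
    "zero width non-joiner": "\u200c",
    "zero width joiner": "\u200d",
    "hyphen-minus": "-",
    "apostrophe": "'",
    "space": " ",
}


def general_category(ch: str) -> str:
    """Return the two-letter Unicode general category of one character."""
    return unicodedata.category(ch)


categories = {name: general_category(ch) for name, ch in samples.items()}

for name, cat in categories.items():
    print(f"{name}: {cat}")
```

Categories beginning with M are marks such as combining accents, and Pc is connector punctuation, of which the underscore is the most familiar member. The space, apostrophe and hyphen fall into none of the classes listed in the definition of `\w`, which is why `\w` alone cannot span "Llanbadarn Road".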

fpy · October 30, 2024, 8:39am

UTS #18: Unicode Regular Expressions Compatibility_Properties
UTS #18: Unicode Regular Expressions General_Category_Property

[^,]+

, [^,]+$

still …
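
To see what the two patterns above pick out of the example address, here is a small sketch. It uses Python's `re` module rather than ICU, on the assumption that these basic constructs (negated class, `+`, `$`) behave the same way in both engines.

```python
import re

address = "The Old Gymnasium, Vicarage Playing Fields, Llanbadarn Road."

# [^,]+ matches each maximal run of characters that are not commas.
runs = re.findall(r"[^,]+", address)

# ", [^,]+$" ties such a run to the end of the string, so only the text
# after the last comma can match. The leading ", " and the final full
# stop are still part of the match.
last = re.search(r", [^,]+$", address)
last_text = last.group(0) if last else None

print(runs)
print(last_text)
```

The second pattern isolates the last segment, but the comma-space and the full stop would still need trimming, for example with a capture group or lookarounds.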

eteb3 · October 30, 2024, 10:06am

Thanks, ordered, arriving in a few days. (Yes, I know it’s there in PDF, but that’s going to break my head even more than reading about regular expressions…)

karolus · October 30, 2024, 8:53am

=REGEX(A1;"(.*?, )([ \w'-]+)\." ; "$2")
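
The same substitution can be tried outside Calc. The sketch below is a Python approximation, not the Calc implementation: `last_part` is a hypothetical helper name, the replacement is written `\2` where Calc uses `$2`, and `count=1` is an assumption meant to mirror replacing only the first match.

```python
import re

# The pattern from the formula above, unchanged.
PATTERN = re.compile(r"(.*?, )([ \w'-]+)\.")


def last_part(address: str) -> str:
    """Replace the matched text with group 2, the run before the full stop.

    If the pattern does not match (for example, no comma in the input),
    the input is returned unchanged.
    """
    return PATTERN.sub(r"\2", address, count=1)


print(last_part("The Old Gymnasium, Vicarage Playing Fields, Llanbadarn Road."))
print(last_part("Llanbadarn Road."))
```

The lazy `(.*?, )` first tries to stop at the earliest comma, but group 2 cannot cross the next comma to reach the full stop, so the engine backtracks until group 1 ends at the last comma. Because a replacement leaves non-matching input as it is, an address without commas comes back unchanged instead of producing an error.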

jeshkhol · October 30, 2024, 8:54am

Try this:

(?<=,\s)[\p{L} ’-]+(?=\.)

eteb3 · October 30, 2024, 10:21am

This is actually closer to my first approach. It works except where other addresses I have include numbers. Adding \p{Nd} to the regex after the \p{L} fixes this.
But it returns a #NA error if there are no commas, which karolus’s doesn’t, so I’ve preferred that one.
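
The behaviour described here can be reproduced roughly in Python. This is only a sketch: Python's `re` module has no `\p{...}` syntax, so `[^\W\d_]` stands in for `\p{L}` and `[^\W_]` for `\p{L}\p{Nd}` (both are approximations), `extract` is a hypothetical helper, and the address containing numbers is invented for illustration. `None` plays the role of #NA.

```python
import re

# Rough stand-ins for the ICU classes (assumptions, see above).
LETTERS_ONLY = re.compile(r"(?<=,\s)(?:[^\W\d_]|[ ’-])+(?=\.)")
WITH_DIGITS = re.compile(r"(?<=,\s)(?:[^\W_]|[ ’-])+(?=\.)")
# The earlier attempt with \w*, which cannot cross a space.
WORD_CHARS = re.compile(r"(?<=,\s)\w*(?=\.)")


def extract(pattern: re.Pattern, text: str):
    """Return the first match of pattern in text, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None


address = "The Old Gymnasium, Vicarage Playing Fields, Llanbadarn Road."
numbered = "Flat 2, 14 High Street."  # invented example with digits

print(extract(WORD_CHARS, address))      # no match: space is not a word character
print(extract(LETTERS_ONLY, address))
print(extract(LETTERS_ONLY, numbered))   # no match: digits are not letters
print(extract(WITH_DIGITS, numbered))
print(extract(WITH_DIGITS, "Llanbadarn Road."))  # no match: no comma for the lookbehind
```

Since the character class excludes commas, the run between the lookbehind and the full stop can only be the text after the last comma. The lookbehind also makes a comma mandatory, which accounts for the failure on addresses that have none.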

Thanks all for the help. I’ll try not to bother you again until I’ve read the Friedl book…