Regex: behaviour of \w ("any word character")

Word characters are [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\u200c\u200d].

Can anyone translate this ^ for me?
What is in the classes \p{Mark} and \p{Connector_Punctuation}?

Reason for asking:

I have this string:
The Old Gymnasium, Vicarage Playing Fields, Llanbadarn Road.

I want to return Llanbadarn Road only.

I’m trying to tell the regex I want “a string starting just after comma-space and ending just before full stop/period, but ignore all commas except the last one”.

I thought I could do that by returning everything between ,_* and . if the only characters found are 0-9A-Za-z (but it would be nice to add ' and - too).

"(?<=,\s).*(?=\.)" returns Vicarage Playing Fields, Llanbadarn Road
"(?<=,\s)\w*(?=\.)" returns #NA
"(?<=,\s)[0-9A-Za-z]*(?=\.)" returns #NA

Any tips? Thanks.

*(underscore is a space: markdown is unhappy without it)

The ICU regex documentation page explains the [\p...] metacharacter in its Set Expressions (Character Classes) section. It mentions Unicode classes and properties, which is itself a huge topic. One starting point could be another ICU page, Properties | ICU Documentation, which has some tabular data and further links, e.g. UAX #44: Unicode Character Database.
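
As a rough way to poke at these classes outside Calc, Python's standard `unicodedata` module reports the Unicode general category of a character. This is only a sketch under some assumptions: it is Python, not ICU, and `\p{Alphabetic}` is a derived property that `unicodedata` does not expose, so only the category-based pieces are shown (`\p{Mark}` is the categories starting with M, `\p{Decimal_Number}` is Nd, `\p{Connector_Punctuation}` is Pc). The helper names are made up for illustration.

```python
import unicodedata


def category(ch: str) -> str:
    """Return the two-letter Unicode general category of one character."""
    return unicodedata.category(ch)


def connector_punctuation_bmp() -> list[str]:
    """List every Basic Multilingual Plane character whose category is Pc."""
    return [chr(cp) for cp in range(0x10000)
            if unicodedata.category(chr(cp)) == "Pc"]


if __name__ == "__main__":
    # Underscore, combining acute accent, digit, hyphen, apostrophe,
    # space, zero-width non-joiner.
    for ch in ["_", "\u0301", "7", "-", "'", " ", "\u200c"]:
        print(f"U+{ord(ch):04X} {category(ch)} {unicodedata.name(ch, '?')}")
    print([f"U+{ord(c):04X}" for c in connector_punctuation_bmp()])
```

The underscore is the familiar member of Pc, and combining accents such as U+0301 fall under Mark. Space, apostrophe and hyphen are in none of the listed classes, so `\w` on its own cannot span two words separated by a space.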


[^,]+

, [^,]+$
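
To see what those two patterns pick up, here is a small Python sketch. It uses Python's `re` rather than Calc's ICU engine, on the assumption that these simple constructs behave the same in both. The capture-group variant in the last function is an addition for trimming the comma-space and the full stop; it is not part of the suggestion above, and the function names are hypothetical.

```python
import re

ADDRESS = "The Old Gymnasium, Vicarage Playing Fields, Llanbadarn Road."


def first_segment(text: str):
    """[^,]+ alone: the first run of characters that contains no comma."""
    m = re.search(r"[^,]+", text)
    return m.group(0) if m else None


def last_segment_raw(text: str):
    """', [^,]+$': from the last comma-space to the end of the text."""
    m = re.search(r", [^,]+$", text)
    return m.group(0) if m else None


def last_segment_trimmed(text: str):
    """Same idea, but capture only what lies between ', ' and the final '.'."""
    m = re.search(r", ([^,]+)\.$", text)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(first_segment(ADDRESS))
    print(last_segment_raw(ADDRESS))
    print(last_segment_trimmed(ADDRESS))
```

Anchoring with `$` is what forces the match onto the last comma: any earlier comma-space is followed by another comma before the end of the text, so `[^,]+$` cannot succeed from there.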

still …

image


Thanks, ordered, arriving in a few days. (Yes, I know it’s there in PDF, but that’s going to break my head even more than reading about regular expressions…)

=REGEX(A1;"(.*?, )([ \w'-]+)\." ; "$2")
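
For anyone who wants to step through that formula outside Calc, here is a rough Python equivalent. It is a sketch, not the Calc implementation: Python's `\w` is not defined exactly like ICU's, although the two agree on plain ASCII letters and digits, and `last_part` is a hypothetical name.

```python
import re

# Group 1: lazily, everything up to and including a comma-space.
# Group 2: a run of spaces, word characters, apostrophes and hyphens.
# Then a literal full stop.
PATTERN = re.compile(r"(.*?, )([ \w'-]+)\.")


def last_part(text: str) -> str:
    """Replace the first match with group 2, like a "$2" replacement."""
    return PATTERN.sub(r"\2", text, count=1)


if __name__ == "__main__":
    print(last_part("The Old Gymnasium, Vicarage Playing Fields, Llanbadarn Road."))
    print(last_part("1 High Street, St John's-in-the-Vale."))
    print(last_part("Llanbadarn Road."))
```

Group 2 cannot contain a comma, so the lazy group 1 is forced to grow until it ends at the last comma-space. In this Python version, a text with no comma-space does not match at all and comes back unchanged, full stop included.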

Try this:

(?<=,\s)[\p{L} ’-]+(?=\.)
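
The difference from the original attempts is the space inside the character class. A Python sketch of the comparison follows. Note the assumptions: Python's built-in `re` does not support `\p{L}`, so the ASCII class from the question stands in for it here, and `find` is a hypothetical helper.

```python
import re

ADDRESS = "The Old Gymnasium, Vicarage Playing Fields, Llanbadarn Road."


def find(pattern: str, text: str = ADDRESS):
    """Return the first match of pattern in text, or None if there is none."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


if __name__ == "__main__":
    # Greedy .* starts right after the FIRST comma-space and runs to the
    # last full stop, so it swallows the second comma too.
    print(find(r"(?<=,\s).*(?=\.)"))
    # Neither \w nor [0-9A-Za-z] includes the space, so the match stops at
    # "Llanbadarn", which is followed by a space rather than a full stop.
    print(find(r"(?<=,\s)\w*(?=\.)"))
    print(find(r"(?<=,\s)[0-9A-Za-z]*(?=\.)"))
    # With space, apostrophe and hyphen allowed (but no comma), only the
    # segment after the last comma-space can reach the full stop.
    print(find(r"(?<=,\s)[0-9A-Za-z '-]+(?=\.)"))
```

The two `None` results in this sketch correspond to the #NA that Calc reported for the same patterns.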


This is actually closer to my first approach. It works except where other addresses I have include numbers. Adding \p{Nd} to the regex after the \p{L} fixes this.
But it returns a #NA error if there are no commas, which karolus’s doesn’t, so I’ve preferred that one.

Thanks all for the help. I’ll try not to bother you again until I’ve read the Friedl book…