In a list of 19000 street addresses to get a list of streets and how many times the street repeats?

Hi, English is not my first language. I was given a list of delivery addresses that has 19000 entries.
I was asked to know how many streets are and how many times every street name repeats.
The data is enter as follow : streetname adress, city ( Nothingham 1234, Utopia)
I know that i need to remove the duplicates from the list so i have the names of the street, and then using that list to count how many time a street name appears.
How can i isolate the street name from the rest, to then eliminate the duplicated?
Is there a way to do it that doesn’t involved other programs or complex scripting?
i use LibreOffice version 7.2.1.2 in Windows10
All help is appreciated
-edit for clarification

1 Like

Obviously this is not a question for Writer because it doesn’t involve any pretty formatting.

I’d suggest first pre-processing your list with a macro-processor (perhaps regexp are powerful enough). Under Linux, writing a small shell script should do the trick. You may eventually invoke awk for sophisticated extraction but the shell script language already contains basic instruction for that. Sorting the result with duplicate elimination and counting the lines will give you the total number of streets.

Depending on what you want to do with the list, you can also load it into a database for increased versatility.

There immediately come a few questions to my mind:

Can you assure the StreetName in the first place (like usual in some European (continental) countries? (No number or BuildingName prefixed?)
What countrie’s standard is applied? Is it assured to be the same for all your addresses?
I would suppose you can’t assure that StreetName is a single word, can you?
Is the StreetName assured to NOT contain a number (think of “Via 20 Settembre” e.g.)?
In what way would a StreetName containing digits and/or multiple words be syntactically delimited behind its end? (Specific character? LF?) Spreadsheets don’t know magic spells.

Please edit your question to supply additional information.
Best also attach an .ods containing a substantial list of example addresses.

If there is a clear and useful answers concerning the “separation” of the StreetName, there isn’t a serious problem. Otherwise there may not be a satisfying solution at all. (Depending.)

“Can you assure the StreetName in the first place (like usual in some European (continental) countries?”
-yes i can
“What countries’ standard is applied? Is it assured to be the same for all your addresses?”
-Spanish standards and it is the same for all addresses
“I would suppose you can’t assure that StreetName is a single word, can you?”
-I can assure that it is not a single word. Some are names.
" Is the StreetName assured to NOT contain a number (think of “Via 20 Settembre” e.g.)"
-I can assure that contains number , usually are dates important to the country.
" In what way would a StreetName containing digits and/or multiple words be syntactically delimited behind its end? (Specific character? LF?)"
-sadly there is only a space between the name and the number of the house, follow by the coma and the city .
“Spreadsheets don’t know magic spells.”
:frowning: i know
" Best also attach an .ods containing a substantial list of example addresses."
Sorry I don’t feel comfortable doing this, there is real house addresses in the sheet.

Well, I don’t expect you to attach your actual data, but should the volunteer trying to help create a valid set of sample data, or should the questioner?

Anyway:

If your above assurances hold, and you can also assure that StreetName never can contain a comma (NOT “Street hailing Aperson, Otherperson, and Thirdperson”), and that it’s always followed by zero or more decimal digits, probably spaces, and a comma, the formula =TRIM(REGEX(A2;"(( |\d)*,.*$)";"";1)) should extract the StreetName.
Thinking of the (similar) German standards, this would not work reliably because a HouseNumber in the needed sense may contain a disambiguating letter at the end, or a dash abbreviating for a sequence ( “15c”, “21 a”, “12-14” e.g.).
I would never feel sure there aren’t additional complications!

1 Like

You are a savior!!
=TRIM(REGEX(A2;"(( |\d),.$)";"";1))
it work!
the only thing that I see now is that some addresses have double number and looks like this:
Streetname number/number, city ( 12 de Octubre 1076/78, Cataluña)
Using the formula i get this:
Street name /number ( 12 de Octubre /78 )
Again thank you soo much!!

I told you!

You will see another “thing” next time. Compound addresses are a hell of a mess. Never accept them!

1 Like

Hahahaha yes i can attest that compound addresses are hell!!
If only I could not accept them, but I’m under the thumb of bureaucracy!!
But thankfully there is not so much “cleaning” to do so I feel confident that I can get the data i need.
Thank you soo much and have an awesome Sunday!!

If you want to get along with similar problems, you should probably study https://www.regular-expressions.info/ (not only the “Quick Start”).

See also attachment. (My sunday is over now. Have 2021-09-27T00:03 now.

disask68513splitCompoundAddressesAfewExamples.ods (9.8 KB)

2 Likes

Thanks for the additional info, you went above and beyond!! I will study them!! :smiley:

Having sufficient assured information about the way the wanted StreetName is separated from different parts of te compound address, it can be extracted with the help of a REGEX() formula.
The attached example assumes that the wanted name is at te beginning of the string (or preceeded by spaces only), and that the final delimiter may contain a HouseNumber (or an extended version of it), but surely ends with a comma.
disask68513splitCompoundAddressesAfewExamples_2.ods (12.7 KB)

3 Likes