In a list of 19000 street addresses to get a list of streets and how many times the street repeats?

Alxacha · September 26, 2021, 4:58pm

Hi, English is not my first language. I was given a list of delivery addresses that has 19000 entries.
I was asked to know how many streets are and how many times every street name repeats.
The data is enter as follow : streetname adress, city ( Nothingham 1234, Utopia)
I know that i need to remove the duplicates from the list so i have the names of the street, and then using that list to count how many time a street name appears.
How can i isolate the street name from the rest, to then eliminate the duplicated?
Is there a way to do it that doesn’t involved other programs or complex scripting?
i use LibreOffice version 7.2.1.2 in Windows10
All help is appreciated
-edit for clarification

ajlittoz · September 26, 2021, 5:38pm

Obviously this is not a question for Writer because it doesn’t involve any pretty formatting.

I’d suggest first pre-processing your list with a macro-processor (perhaps regexp are powerful enough). Under Linux, writing a small shell script should do the trick. You may eventually invoke awk for sophisticated extraction but the shell script language already contains basic instruction for that. Sorting the result with duplicate elimination and counting the lines will give you the total number of streets.

Depending on what you want to do with the list, you can also load it into a database for increased versatility.

Lupp · September 26, 2021, 8:17pm

There immediately come a few questions to my mind:

Can you assure the StreetName in the first place (like usual in some European (continental) countries? (No number or BuildingName prefixed?)
What countrie’s standard is applied? Is it assured to be the same for all your addresses?
I would suppose you can’t assure that StreetName is a single word, can you?
Is the StreetName assured to NOT contain a number (think of “Via 20 Settembre” e.g.)?
In what way would a StreetName containing digits and/or multiple words be syntactically delimited behind its end? (Specific character? LF?) Spreadsheets don’t know magic spells.

Please edit your question to supply additional information.
Best also attach an .ods containing a substantial list of example addresses.

If there is a clear and useful answers concerning the “separation” of the StreetName, there isn’t a serious problem. Otherwise there may not be a satisfying solution at all. (Depending.)

Alxacha · September 26, 2021, 9:02pm

“Can you assure the StreetName in the first place (like usual in some European (continental) countries?”
-yes i can
“What countries’ standard is applied? Is it assured to be the same for all your addresses?”
-Spanish standards and it is the same for all addresses
“I would suppose you can’t assure that StreetName is a single word, can you?”
-I can assure that it is not a single word. Some are names.
" Is the StreetName assured to NOT contain a number (think of “Via 20 Settembre” e.g.)"
-I can assure that contains number , usually are dates important to the country.
" In what way would a StreetName containing digits and/or multiple words be syntactically delimited behind its end? (Specific character? LF?)"
-sadly there is only a space between the name and the number of the house, follow by the coma and the city .
“Spreadsheets don’t know magic spells.”
i know
" Best also attach an .ods containing a substantial list of example addresses."
Sorry I don’t feel comfortable doing this, there is real house addresses in the sheet.

Lupp · September 26, 2021, 9:30pm

Well, I don’t expect you to attach your actual data, but should the volunteer trying to help create a valid set of sample data, or should the questioner?

Anyway:

If your above assurances hold, and you can also assure that StreetName never can contain a comma (NOT “Street hailing Aperson, Otherperson, and Thirdperson”), and that it’s always followed by zero or more decimal digits, probably spaces, and a comma, the formula =TRIM(REGEX(A2;"(( |\d)*,.*$)";"";1)) should extract the StreetName.
Thinking of the (similar) German standards, this would not work reliably because a HouseNumber in the needed sense may contain a disambiguating letter at the end, or a dash abbreviating for a sequence ( “15c”, “21 a”, “12-14” e.g.).
I would never feel sure there aren’t additional complications!

Alxacha · September 26, 2021, 9:41pm

You are a savior!!
=TRIM(REGEX(A2;"(( |\d),.$)";"";1))
it work!
the only thing that I see now is that some addresses have double number and looks like this:
Streetname number/number, city ( 12 de Octubre 1076/78, Cataluña)
Using the formula i get this:
Street name /number ( 12 de Octubre /78 )
Again thank you soo much!!

Lupp · September 26, 2021, 9:44pm

I told you!

You will see another “thing” next time. Compound addresses are a hell of a mess. Never accept them!

Alxacha · September 26, 2021, 9:54pm

Hahahaha yes i can attest that compound addresses are hell!!
If only I could not accept them, but I’m under the thumb of bureaucracy!!
But thankfully there is not so much “cleaning” to do so I feel confident that I can get the data i need.
Thank you soo much and have an awesome Sunday!!

Lupp · September 26, 2021, 10:02pm

If you want to get along with similar problems, you should probably study https://www.regular-expressions.info/ (not only the “Quick Start”).

See also attachment. (My sunday is over now. Have 2021-09-27T00:03 now.

disask68513splitCompoundAddressesAfewExamples.ods (9.8 KB)

Alxacha · September 26, 2021, 10:04pm

Thanks for the additional info, you went above and beyond!! I will study them!!

Lupp · September 26, 2021, 10:39pm

Having sufficient assured information about the way the wanted StreetName is separated from different parts of te compound address, it can be extracted with the help of a REGEX() formula.
The attached example assumes that the wanted name is at te beginning of the string (or preceeded by spaces only), and that the final delimiter may contain a HouseNumber (or an extended version of it), but surely ends with a comma.
disask68513splitCompoundAddressesAfewExamples_2.ods (12.7 KB)