CSV file: keep only the last entry for each country

I am trying to update a country list at Wikipedia:
*List of countries by firearm-related death rate - Wikipedia

The data reference is the csv file (homicide-rates-from-firearms.csv) found via the download link below the table/map/chart here:
*Homicide rate from firearms - Our World in Data

I know how to keep multiple years (one year per column) in order to get the latest available data for all countries in the list. But that would mean keeping these 11 years: 2008, 2011, 2012, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021. That is too many columns.

So I want to keep only the latest year for each country.

When you open the CSV file in Calc, you see that it is in alphabetical order by country in the leftmost column. Within each country, the years are in ascending order (latest year at the bottom).

I want to keep only the last entry for each country. It would end up being a 3-column table: Country, Year, Rate.

Who the f… did that?

Within seconds, you create a pivot table from the source data.
homicide-rates-from-firearms.ods (69.4 KB)

Source data on Sheet1, pivot with 2 data fields (sum, count) on third sheet. The count field verifies that there are no duplicates summed up in the sum field. Use menu:Data>Pivot>Edit… to remove the count field.
Use the filter button in C1 to filter by years…

The second sheet gets the max date for each entity and looks up the rate.
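
For readers who want to see the same logic outside of Calc: below is a minimal pandas sketch of "max date for each entity, then look up the rate". It assumes the OWID file name from above and its Entity/Year columns; it is only an illustration, not the steps used in the attached spreadsheet.

import pandas as pd

# Entity, Year and the rate column only (column 1 of the OWID file is the country code)
df = pd.read_csv('homicide-rates-from-firearms.csv', usecols=[0, 2, 3])

# the max year for each entity ...
latest_year = df.groupby('Entity', as_index=False)['Year'].max()

# ... then look up the rate for that Entity/Year pair
result = latest_year.merge(df, on=['Entity', 'Year'], how='left')
result.to_csv('latest-per-entity.csv', index=False)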

2 Likes

Since csv is a database exchange format and we have database tools at hand:
Latest_Rates.odb (39.8 KB) (query and report)
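
A rough sketch of the kind of grouped-max query such a Base document might contain, shown here with Python's built-in sqlite3 module so it runs without an .odb. The table name and the Entity/Code/Year/Rate column names are assumptions, not necessarily what Latest_Rates.odb actually uses.

import csv
import sqlite3

# load the CSV into an in-memory table (column names are assumptions based on the OWID file)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE rates (Entity TEXT, Code TEXT, Year INTEGER, Rate REAL)')
with open('homicide-rates-from-firearms.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    next(reader)                      # skip the header row
    conn.executemany('INSERT INTO rates VALUES (?, ?, ?, ?)', reader)

# latest year per entity, joined back to fetch the matching rate
query = """
    SELECT r.Entity, r.Year, r.Rate
    FROM rates AS r
    JOIN (SELECT Entity, MAX(Year) AS Year FROM rates GROUP BY Entity) AS latest
      ON r.Entity = latest.Entity AND r.Year = latest.Year
    ORDER BY r.Entity
"""
for row in conn.execute(query):
    print(row)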

Villeroy: I finally figured out how to filter better and get to a smaller file size. The attached file goes back more years, and goes up to 2022. I would really like to know all the steps you take with LibreOffice Calc to pull out only the latest available year for each country. Then I can add the how-to info to one of the Help:Table pages at Wikipedia. There are many tables at Wikipedia with the same format. Here are the instructions I have written up so far:
User:Timeshifter/Sandbox249#Instructions
data_cts_intentional_homicide. Victims. Firearms. Rates. 3 columns. A-Z. Freeze.ods (30.2 KB)

The most straightforward solution in this case (note: the fields contain neither embedded commas nor line feeds) is not to use Calc, but to pre-process the data before importing it, i.e. to pipe it through this little gawk script:

# Usage example (the script file name is arbitrary):
#   gawk -f latest.awk homicide-rates-from-firearms.csv > latest.csv

BEGIN {
    FS = ","                # input fields are separated by commas
}

FNR == 1 { next }           # skip the header line

# Whenever a line's year ($3) is newer than the year stored so far for this
# country ($1), remember the country, year and rate.
entity[$1][1] < $3 {
    entity[$1][0] = $1
    entity[$1][1] = $3
    entity[$1][2] = $4
}

END {
    OFS = ",";
    print "Entity", "Year", "Rate"
    PROCINFO["sorted_in"] = "@ind_str_asc"    # output sorted by country name
    for (e in entity) {
        for (i in entity[e]) {
            printf "%s", entity[e][i]
            if (i < 2)
                printf "%s", OFS              # comma between fields
            else
                printf "%s", "\n"             # newline after the last field
        }
    }
}

Attached is the output
105990-ask-processed.csv.ods (2.3 KB)
Note: after downloading, rename the file to .csv; the .ods extension was only added to be able to upload the file here.

1 Like

Thanks. This is what I have so far:
User:Timeshifter/Sandbox248

I dug up the UNODC source. It has later and earlier data. I filtered for just
“Firearms or explosives - firearms”

It has different column heads. Do I just substitute them for “Entity”, “Year”, “Rate” in the gawk script?

I will be linking to this thread from the list article talk page, and one of the Help:Table pages on Wikipedia. There are many list articles that could use this.

Is there a way and/or place for a newb (like me, as far as gawk is concerned) to plug in your gawk code and a csv file (or ods file)?

I attached the filtered UNODC ods file.
data_cts_intentional_homicide. Firearms filter 2.ods (52.8 KB)

I didn’t think to sort it alphabetically. Does it need to be in order for the gawk script to work?

The print “Entity”, “Year”, “Rate” just outputs the column names; you can name them anything you like. But if the input data has different columns, the script needs to be modified, replacing

entity[$1][1] < $3 {
    entity[$1][0] = $1
    entity[$1][1] = $3
    entity[$1][2] = $4
}

(where $1 is Entity/Country, $3 is Year, $4 is Rate, and $2 is the unused country code), with

entity[$1][1] < $2 {
    entity[$1][0] = $1
    entity[$1][1] = $2
    entity[$1][2] = $3
}

(where $1 is Country, $2 is Year, and $3 is Value).
The input data doesn’t need to be sorted, the script sorts the output.

However, note that this is fragile. With this second set of data it already fails on the "China, Hong Kong Special Administrative Region" and "China, Macao Special Administrative Region" entries: they contain an embedded comma, and the script blindly takes every comma as a field separator, so it eliminates one record and one value. This is not a general problem solver; it was written specifically for your first set of data.
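
Side note: the embedded-comma problem goes away with any CSV-aware reader. Below is a minimal Python sketch of the same "keep the newest year per country" pass that picks columns by header name, so the column order does not matter; the file name and the Country/Year/Value header names are assumptions and have to be adjusted to the real data.

import csv

latest = {}
with open('firearm-homicide-rates.csv', newline='', encoding='utf-8') as f:
    # DictReader handles quoted fields such as "China, Hong Kong Special Administrative Region"
    for row in csv.DictReader(f):
        country, year = row['Country'], int(row['Year'])
        if country not in latest or year > latest[country][0]:
            latest[country] = (year, row['Value'])

with open('latest-per-country.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Country', 'Year', 'Value'])
    for country in sorted(latest):
        year, value = latest[country]
        writer.writerow([country, year, value])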

Anyway, attached
105990-ask-data_cts_intentional_homicide. Firearms filter 2-processed.csv.ods (2.0 KB)

I don’t understand. Plug in what, where?

Thanks again. I messed up the last file I uploaded. Counts and rates were both in the value column. Sorry.

Here is link to a new file:
*https://drive.google.com/file/d/1lYy_-UAlgXa5dtBFdIyS7AJfIyOAjsb7/view?usp=sharing

It is filtered for firearms and rates. I also removed all the regions. I hadn’t noticed them at the end. I also removed all the commas.

I labeled the headers as Location, Year, and Value. Can you give me the gawk code for it, and the result? That way I may be able to use it as a more general solution. Does the order of the columns matter? Or does the gawk code depend solely on the column header name?

I noticed the spam filter got me. I work on several wikis, and have set up spam filters myself. So I understand. Maybe they can put Wikipedia and Wikimedia on the spam whitelist. Wikimedia is nonprofit, and is in the top ten of most-visited websites.

So Wikimedia doesn’t need any help to become more popular, and should not be considered spam here. LibreOffice Calc is mentioned many times in the many Help:Table pages. I am a donor to LibreOffice (though not with the alias email address I use to log in here and elsewhere to avoid problems).

Others may have tried to ask questions here about Wikipedia tables, etc., and given up if they got hit with the spam filter. The filter message doesn’t say which URL is the problem. I had no idea which one it was, since I had left data source URLs in too. It reminds me of those annoying Facebook post deletions that don’t tell you what post was deleted, or why.

I only have one other thread in this forum. It is from 2021, and is about another Wikipedia list article that also happens to use a UNODC data source. So I guess the duplication is what triggered the spam filter. The other forum thread was much more complicated. I was never able to use it elsewhere since it is so complicated, and since I only regularly use very basic Calc functions. But I am very skilled with Wikipedia tables, and have written large parts of the various Help:Table pages.

So I hope to add some basic gawk info to the Help:Table pages.

About gawk: I would like to be able to copy the latest gawk code from this thread and use it, but I have no clue how to use it. What would be the simplest way? Let’s say I have another 3-column file with Location, Year, and Value, and I want only the latest year for each country, state, province, region, or whatever locations are in the file.

I got this notice: “Sorry you cannot post a link to that host.” I removed the Wikipedia link, and then got this message: “Sorry, new users can only put 2 links in a post.” I now have only one link: to the Google Drive location of the 1 MB file that the forum wouldn’t let me upload, even though it said 4 MB was the limit. Oops, I put .org and .com after Wikipedia and Wikimedia. I removed that. Let’s try again. It worked. I edited the comment and added back the Wikipedia link to “List of most-visited websites”, and it was allowed. I did not get the “Sorry you cannot post a link to that host” message this time. So I have no idea what host they were talking about originally.

I won’t log in to Google to obtain your data.

Yes (the order of the columns matters).

No (the script does not depend on the column header names). Though with some extra logic it could be made to.

erAck: I just changed the Google Drive file to the least restricted access. Supposedly anyone with the link can now download it. I don’t know if login is required now.

105990-ask-data_cts_intentional_homicide - Rates countries headers commas-processed.csv.ods (6.8 KB)

Hello
Reading the original …csv and dumping only the latest-year data for each country into the current sheet:

import pandas as pd

def group_entity_by_last_Year():
    # Entity, Year and the rate column only (column 1, the country code, is skipped)
    df = pd.read_csv('homicide-rates-from-firearms.csv', usecols=[0, 2, 3])
    # keep, for every Entity, the row with the highest Year
    # (a plain column-wise .max() could pair a country with a rate from a different year)
    latest = df.loc[df.groupby('Entity')['Year'].idxmax()]

    latest_list = [('country', 'Year', 'homicide_rate')]
    for entry in latest.itertuples(index=False):
        latest_list.append(tuple(entry))

    # write the result into the active sheet of the current document
    doc = XSCRIPTCONTEXT.getDocument()
    sheet = doc.CurrentController.ActiveSheet
    sheet[:len(latest_list), :len(latest_list[0])].DataArray = latest_list

groupby_pandas.ods (34.4 KB)

3 Likes

How do I ping someone in these forums? I want to ping Villeroy. I responded to him higher up a few days ago, but I don’t know if he has seen my reply. He was trying to explain how to use LibreOffice to get the latest available year for all countries. I want to use LibreOffice, so if anybody can tell me how to do this with it, I would greatly appreciate it.

I finally got the dataset narrowed down correctly; I attached it here. My older datasets were all incorrect in various ways. Of course, I want results from any method: I can use those now in the Wikipedia article table while I figure out how to do it in LibreOffice, and then write it up further on a Help:Table page on Wikipedia. Since the columns are named Location, Year, and Value, the method should work on many country, state, and province lists.
data_cts_intentional_homicide. 3 column. Corrected.ods (30.2 KB)

@Villeroy provided a sample document with a pivot table for you to look at. You might also want to look at Chapter 8 Pivot Tables - LibreOffice for more information on how to work with Pivot Tables.

@robleyd - Thanks. I see the user popup list when I type @. Villeroy was working with a limited dataset coming directly from OWID, not from me; it had 4 columns. My last reply to him has an expanded dataset covering more years, with 3 columns. My other datasets that I uploaded or linked to in this thread were completely screwed up, since I hadn’t learned how to filter correctly.