The Counted Part 2: Analysis Using R
July 14, 2015 Leave a comment
Following up on this post last week on analyzing The Counted using F# and R, I decided to look a bit closer at the data. In last week’s post, I had a data frame of all of the people killed by law enforcement collected by The Guardian for the 1st half of 2015. Although interesting to looks at, I am not sure that the map tells us anything. The data frame looks like this:
My first thought is to look at killing by population by US State. Step #1 was to sum up the number of rows by state code:
1 the.counted.state <- data.frame(table(the.counted$state)) 2 colnames(the.counted.state ) <- c("StateCode","NumberKilled") 3 summary(the.counted.state) 4
I then brought in the latest population figures by state from the US Census:
1 state.population <- read.csv("http://www.census.gov/popest/data/state/asrh/2014/files/SCPRC-EST2014-18+POP-RES.csv") 2 state.population 3
And finally I bought in a cross walk table of US State Codes (what The.Counted data is in and the US State Names, which is what US Census data is in)
1 state.crosswalk <- read.csv("http://www.fonz.net/blog/wp-content/uploads/2008/04/states.csv") 2 state.crosswalk 3
I then merged all three data frames together using the state name and state code as the common key:
1 state.population.2 <- state.population[c(5,6)] 2 state.population.3 <- merge(x=state.population.2, 3 y=state.crosswalk, 4 by.x="NAME", 5 by.y="State") 6 #The Counted With Population 7 the.counted.state <- merge(x=the.counted.state, 8 y=state.population.3, 9 by.x="StateCode", 10 by.y="Abbreviation") 11
I then tried to add in a column that took the total number of killed individuals by the number of people in the state.
1 the.counted.state.2 <- the.counted.state 2 the.counted.state.2$KilledRatio <- the.counted.state.2$NumberKilled/the.counted.state.2$POPESTIMATE2014 3
The problem became quickly obvious: there were not enough people in the numerator to make a meaningful straight division. To compensate for this issue, I divided the number of people in each state by 10,000. I also increased the kill ratio by a factor of 10 so that we have a scale between of 0 and 1 of .1 which is easily digestible. Finally, I renamed the variable “Name” to “StateName” because my OCD couldn’t let such an affront to the naming gods go unpunished.
1 the.counted.state.3 <- the.counted.state 2 the.counted.state.3$AdjustedPopulation <- the.counted.state.2$POPESTIMATE2014/10000 3 the.counted.state.3$KilledRatio <- the.counted.state.3$NumberKilled/the.counted.state.3$AdjustedPopulation 4 the.counted.state.3$AdjKilledRatio <- the.counted.state.3$KilledRatio * 10 5 6 names(the.counted.state.3)[names(the.counted.state.3)=="NAME"] <- "StateName" 7 the.counted.state.3$StateName <- tolower(the.counted.state.3$StateName)
With the data prepped, I created a choropleth to show the kill ratio by state using a gradient scale:
1 choropleth <- merge(x=all.states, 2 y=the.counted.state.3, 3 sort = FALSE, 4 by.x = "region", 5 by.y = "StateName", 6 all.x=TRUE) 7 choropleth <- choropleth[order(choropleth$order), ] 8 summary(choropleth) 9 10 qplot(long, lat, data = choropleth, group = group, fill = AdjKilledRatio, 11 geom = "polygon")
Note that I had to use the all.x=TRUE to account for the fact that South Dakota and Vermont did not have any killings so far in 2015. This is equiv to a left outer join for you Sql folks. On a side note, what’s up with Oklahoma?
I then decided to bin the data into high,medium, and low categories. Looking at the detail of the adjustedKillRatio, there seems to be some natural breaks around 10% and 20%:
1 the.counted.state.4$AdjKilledRatio 2 summary(the.counted.state.4$AdjKilledRatio) 3
So I binned like that:
1 the.counted.state.4$KilledBin <- cut(the.counted.state.4$AdjKilledRatio, 2 breaks=seq(0,1,.1)) 3 summary(the.counted.state.4$KilledBin) 4
The problem with my code is that this gives me 10 bins and I only really need 3. Fortunately, this stack overflow post helped me re-write the bin into 3 factors. Note the Inf on the high side and the labels.
1 the.counted.state.4$KilledBin <- cut(the.counted.state.4$AdjKilledRatio, 2 breaks=c(seq(0,.2,.1),Inf), 3 labels=c("low","med","high")) 4
And this gives me a pretty good distribution of bins:
With things binned up, I added another chiropleth and map:
1 choropleth.2 <- merge(x=all.states, 2 y=the.counted.state.4, 3 sort = FALSE, 4 by.x = "region", 5 by.y = "StateName", 6 all.x=TRUE) 7 choropleth.2 <- choropleth.2[order(choropleth.2$order), ] 8 summary(choropleth.2) 9 10 qplot(long, 11 lat, 12 data = choropleth.2, 13 group = group, 14 fill = KilledBin, 15 geom = "polygon")
If your squint, it almost looks like a map of the civil war, no?