The Counted Part 2: Analysis Using R

Following up on this post last week on analyzing The Counted using F# and R, I decided to look a bit closer at the data.  In last week’s post, I had a data frame of all of the people killed by law enforcement collected by The Guardian for the 1st half of 2015.  Although interesting to looks at, I am not sure that the map tells us anything.  The data frame looks like this:

image

My first thought is to look at killing by population by US State.  Step #1 was to sum up the number of rows by state code:

1 the.counted.state <- data.frame(table(the.counted$state)) 2 colnames(the.counted.state ) <- c("StateCode","NumberKilled") 3 summary(the.counted.state) 4

image

I then brought in the latest population figures by state from the US Census:

1 state.population <- read.csv("http://www.census.gov/popest/data/state/asrh/2014/files/SCPRC-EST2014-18+POP-RES.csv") 2 state.population 3

image

And finally I bought in a cross walk table of US State Codes (what The.Counted data is in and the US State Names, which is what US Census data is in)

1 state.crosswalk <- read.csv("http://www.fonz.net/blog/wp-content/uploads/2008/04/states.csv") 2 state.crosswalk 3

image

 

I then merged all three data frames together using the state name and state code as the common key:

1 state.population.2 <- state.population[c(5,6)] 2 state.population.3 <- merge(x=state.population.2, 3 y=state.crosswalk, 4 by.x="NAME", 5 by.y="State") 6 #The Counted With Population 7 the.counted.state <- merge(x=the.counted.state, 8 y=state.population.3, 9 by.x="StateCode", 10 by.y="Abbreviation") 11

I then tried to add in a column that took the total number of killed individuals by the number of people in the state.

1 the.counted.state.2 <- the.counted.state 2 the.counted.state.2$KilledRatio <- the.counted.state.2$NumberKilled/the.counted.state.2$POPESTIMATE2014 3

image

The problem became quickly obvious: there were not enough people in the numerator to make a meaningful straight division.  To compensate for this issue, I divided the number of people in each state by 10,000.  I also increased the kill ratio by a factor of 10 so that we have a scale between of 0 and 1 of .1 which is easily digestible.  Finally, I renamed the variable “Name” to “StateName” because my OCD couldn’t let such an affront to the naming gods go unpunished.

1 the.counted.state.3 <- the.counted.state 2 the.counted.state.3$AdjustedPopulation <- the.counted.state.2$POPESTIMATE2014/10000 3 the.counted.state.3$KilledRatio <- the.counted.state.3$NumberKilled/the.counted.state.3$AdjustedPopulation 4 the.counted.state.3$AdjKilledRatio <- the.counted.state.3$KilledRatio * 10 5 6 names(the.counted.state.3)[names(the.counted.state.3)=="NAME"] <- "StateName" 7 the.counted.state.3$StateName <- tolower(the.counted.state.3$StateName)

 

image

With the data prepped, I created a choropleth to show the kill ratio by state using a gradient scale:

1 choropleth <- merge(x=all.states, 2 y=the.counted.state.3, 3 sort = FALSE, 4 by.x = "region", 5 by.y = "StateName", 6 all.x=TRUE) 7 choropleth <- choropleth[order(choropleth$order), ] 8 summary(choropleth) 9 10 qplot(long, lat, data = choropleth, group = group, fill = AdjKilledRatio, 11 geom = "polygon")

image

Note that I had to use the all.x=TRUE to account for the fact that South Dakota and Vermont did not have any killings so far in 2015.  This is equiv to a left outer join for you Sql folks.  On a side note, what’s up with Oklahoma? 

I then decided to bin the data into high,medium, and low categories.  Looking at the detail of the adjustedKillRatio, there seems to be some natural breaks around 10% and 20%:

1 the.counted.state.4$AdjKilledRatio 2 summary(the.counted.state.4$AdjKilledRatio) 3

image

So I binned like that:

1 the.counted.state.4$KilledBin <- cut(the.counted.state.4$AdjKilledRatio, 2 breaks=seq(0,1,.1)) 3 summary(the.counted.state.4$KilledBin) 4

image

The problem with my code is that this gives me 10 bins and I only really need 3.  Fortunately, this stack overflow post helped me re-write the bin into 3 factors.  Note the Inf on the high side and the labels.

1 the.counted.state.4$KilledBin <- cut(the.counted.state.4$AdjKilledRatio, 2 breaks=c(seq(0,.2,.1),Inf), 3 labels=c("low","med","high")) 4

And this gives me a pretty good distribution of bins:

image

With things binned up, I added another chiropleth and map:

1 choropleth.2 <- merge(x=all.states, 2 y=the.counted.state.4, 3 sort = FALSE, 4 by.x = "region", 5 by.y = "StateName", 6 all.x=TRUE) 7 choropleth.2 <- choropleth.2[order(choropleth.2$order), ] 8 summary(choropleth.2) 9 10 qplot(long, 11 lat, 12 data = choropleth.2, 13 group = group, 14 fill = KilledBin, 15 geom = "polygon")

image

If your squint, it almost looks like a map of the civil war, no?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: