← The Counted: Initial Analysis Using FSharp and R

The Counted Part 3: Law Enforcement Officers Killed In Line Of Duty →

The Counted Part 2: Analysis Using R

July 14, 2015 Leave a comment

Following up on this post last week on analyzing The Counted using F# and R, I decided to look a bit closer at the data. In last week’s post, I had a data frame of all of the people killed by law enforcement collected by The Guardian for the 1st half of 2015. Although interesting to looks at, I am not sure that the map tells us anything. The data frame looks like this:

My first thought is to look at killing by population by US State. Step #1 was to sum up the number of rows by state code:

1 the.counted.state <- data.frame(table(the.counted$state))
2 colnames(the.counted.state ) <- c("StateCode","NumberKilled")
3 summary(the.counted.state)
4

I then brought in the latest population figures by state from the US Census:

1 state.population <- read.csv("http://www.census.gov/popest/data/state/asrh/2014/files/SCPRC-EST2014-18+POP-RES.csv")
2 state.population
3

And finally I bought in a cross walk table of US State Codes (what The.Counted data is in and the US State Names, which is what US Census data is in)

1 state.crosswalk <- read.csv("http://www.fonz.net/blog/wp-content/uploads/2008/04/states.csv")
2 state.crosswalk
3

I then merged all three data frames together using the state name and state code as the common key:

 1 state.population.2 <- state.population[c(5,6)]
 2 state.population.3 <- merge(x=state.population.2,
 3                             y=state.crosswalk,
 4                             by.x="NAME",
 5                             by.y="State")
 6 #The Counted With Population
 7 the.counted.state <- merge(x=the.counted.state,
 8                            y=state.population.3,
 9                             by.x="StateCode",
10                             by.y="Abbreviation")
11

I then tried to add in a column that took the total number of killed individuals by the number of people in the state.

1 the.counted.state.2 <- the.counted.state
2 the.counted.state.2$KilledRatio <- the.counted.state.2$NumberKilled/the.counted.state.2$POPESTIMATE2014
3

The problem became quickly obvious: there were not enough people in the numerator to make a meaningful straight division. To compensate for this issue, I divided the number of people in each state by 10,000. I also increased the kill ratio by a factor of 10 so that we have a scale between of 0 and 1 of .1 which is easily digestible. Finally, I renamed the variable “Name” to “StateName” because my OCD couldn’t let such an affront to the naming gods go unpunished.

1 the.counted.state.3 <- the.counted.state
2 the.counted.state.3$AdjustedPopulation <- the.counted.state.2$POPESTIMATE2014/10000
3 the.counted.state.3$KilledRatio <- the.counted.state.3$NumberKilled/the.counted.state.3$AdjustedPopulation
4 the.counted.state.3$AdjKilledRatio <- the.counted.state.3$KilledRatio * 10
5 
6 names(the.counted.state.3)[names(the.counted.state.3)=="NAME"] <- "StateName"
7 the.counted.state.3$StateName <- tolower(the.counted.state.3$StateName)

With the data prepped, I created a choropleth to show the kill ratio by state using a gradient scale:

 1 choropleth <- merge(x=all.states, 
 2                y=the.counted.state.3, 
 3                sort = FALSE, 
 4                by.x = "region", 
 5                by.y = "StateName",
 6                all.x=TRUE)
 7 choropleth <- choropleth[order(choropleth$order), ]
 8 summary(choropleth)
 9 
10 qplot(long, lat, data = choropleth, group = group, fill = AdjKilledRatio,
11         geom = "polygon")

Note that I had to use the all.x=TRUE to account for the fact that South Dakota and Vermont did not have any killings so far in 2015. This is equiv to a left outer join for you Sql folks. On a side note, what’s up with Oklahoma?

I then decided to bin the data into high,medium, and low categories. Looking at the detail of the adjustedKillRatio, there seems to be some natural breaks around 10% and 20%:

1 the.counted.state.4$AdjKilledRatio
2 summary(the.counted.state.4$AdjKilledRatio)
3

So I binned like that:

1 the.counted.state.4$KilledBin <- cut(the.counted.state.4$AdjKilledRatio,
2                                      breaks=seq(0,1,.1))
3 summary(the.counted.state.4$KilledBin)
4

The problem with my code is that this gives me 10 bins and I only really need 3. Fortunately, this stack overflow post helped me re-write the bin into 3 factors. Note the Inf on the high side and the labels.

1 the.counted.state.4$KilledBin <- cut(the.counted.state.4$AdjKilledRatio,
2                                      breaks=c(seq(0,.2,.1),Inf),
3                                      labels=c("low","med","high"))
4

And this gives me a pretty good distribution of bins:

With things binned up, I added another chiropleth and map:

 1 choropleth.2 <- merge(x=all.states, 
 2                     y=the.counted.state.4, 
 3                     sort = FALSE, 
 4                     by.x = "region", 
 5                     by.y = "StateName",
 6                     all.x=TRUE)
 7 choropleth.2 <- choropleth.2[order(choropleth.2$order), ]
 8 summary(choropleth.2)
 9 
10 qplot(long, 
11       lat, 
12       data = choropleth.2,
13       group = group,
14       fill = KilledBin,
15       geom = "polygon")

If your squint, it almost looks like a map of the civil war, no?

Filed under Analytics, R

Jamie Dixon's Home

The Counted Part 2: Analysis Using R

Leave a comment Cancel reply

Categories

Recent Posts

Archives

Blogroll

Meta

Jamie Dixon's Home

The Counted Part 2: Analysis Using R

Share this:

Related

Leave a comment Cancel reply

Categories

Recent Posts

Archives

Blogroll

Meta