The Counted Part 2: Analysis Using R

Following up on this post last week on analyzing The Counted using F# and R, I decided to look a bit closer at the data.  In last week’s post, I had a data frame of all of the people killed by law enforcement collected by The Guardian for the 1st half of 2015.  Although interesting to looks at, I am not sure that the map tells us anything.  The data frame looks like this:

My first thought is to look at killing by population by US State.  Step #1 was to sum up the number of rows by state code:

```1 the.counted.state <- data.frame(table(the.counted\$state))
2 colnames(the.counted.state ) <- c("StateCode","NumberKilled")
3 summary(the.counted.state)
4 ```

I then brought in the latest population figures by state from the US Census:

```1 state.population <- read.csv("http://www.census.gov/popest/data/state/asrh/2014/files/SCPRC-EST2014-18+POP-RES.csv")
2 state.population
3 ```

And finally I bought in a cross walk table of US State Codes (what The.Counted data is in and the US State Names, which is what US Census data is in)

```1 state.crosswalk <- read.csv("http://www.fonz.net/blog/wp-content/uploads/2008/04/states.csv")
2 state.crosswalk
3 ```

I then merged all three data frames together using the state name and state code as the common key:

``` 1 state.population.2 <- state.population[c(5,6)]
2 state.population.3 <- merge(x=state.population.2,
3                             y=state.crosswalk,
4                             by.x="NAME",
5                             by.y="State")
6 #The Counted With Population
7 the.counted.state <- merge(x=the.counted.state,
8                            y=state.population.3,
9                             by.x="StateCode",
10                             by.y="Abbreviation")
11 ```

I then tried to add in a column that took the total number of killed individuals by the number of people in the state.

```1 the.counted.state.2 <- the.counted.state
2 the.counted.state.2\$KilledRatio <- the.counted.state.2\$NumberKilled/the.counted.state.2\$POPESTIMATE2014
3 ```

The problem became quickly obvious: there were not enough people in the numerator to make a meaningful straight division.  To compensate for this issue, I divided the number of people in each state by 10,000.  I also increased the kill ratio by a factor of 10 so that we have a scale between of 0 and 1 of .1 which is easily digestible.  Finally, I renamed the variable “Name” to “StateName” because my OCD couldn’t let such an affront to the naming gods go unpunished.

```1 the.counted.state.3 <- the.counted.state
4 the.counted.state.3\$AdjKilledRatio <- the.counted.state.3\$KilledRatio * 10
5
6 names(the.counted.state.3)[names(the.counted.state.3)=="NAME"] <- "StateName"
7 the.counted.state.3\$StateName <- tolower(the.counted.state.3\$StateName)```

With the data prepped, I created a choropleth to show the kill ratio by state using a gradient scale:

``` 1 choropleth <- merge(x=all.states,
2                y=the.counted.state.3,
3                sort = FALSE,
4                by.x = "region",
5                by.y = "StateName",
6                all.x=TRUE)
7 choropleth <- choropleth[order(choropleth\$order), ]
8 summary(choropleth)
9
10 qplot(long, lat, data = choropleth, group = group, fill = AdjKilledRatio,
11         geom = "polygon")```

Note that I had to use the all.x=TRUE to account for the fact that South Dakota and Vermont did not have any killings so far in 2015.  This is equiv to a left outer join for you Sql folks.  On a side note, what’s up with Oklahoma?

I then decided to bin the data into high,medium, and low categories.  Looking at the detail of the adjustedKillRatio, there seems to be some natural breaks around 10% and 20%:

```1 the.counted.state.4\$AdjKilledRatio
3 ```

So I binned like that:

```1 the.counted.state.4\$KilledBin <- cut(the.counted.state.4\$AdjKilledRatio,
2                                      breaks=seq(0,1,.1))
3 summary(the.counted.state.4\$KilledBin)
4 ```

The problem with my code is that this gives me 10 bins and I only really need 3.  Fortunately, this stack overflow post helped me re-write the bin into 3 factors.  Note the Inf on the high side and the labels.

```1 the.counted.state.4\$KilledBin <- cut(the.counted.state.4\$AdjKilledRatio,
2                                      breaks=c(seq(0,.2,.1),Inf),
3                                      labels=c("low","med","high"))
4 ```

And this gives me a pretty good distribution of bins:

With things binned up, I added another chiropleth and map:

``` 1 choropleth.2 <- merge(x=all.states,
2                     y=the.counted.state.4,
3                     sort = FALSE,
4                     by.x = "region",
5                     by.y = "StateName",
6                     all.x=TRUE)
7 choropleth.2 <- choropleth.2[order(choropleth.2\$order), ]
8 summary(choropleth.2)
9
10 qplot(long,
11       lat,
12       data = choropleth.2,
13       group = group,
14       fill = KilledBin,
15       geom = "polygon")```

If your squint, it almost looks like a map of the civil war, no?