The Wright Brothers and Scrum

I recently read two books that, on the surface, have nothing to do with each other but actually teach very similar lessons.  The first is David McCullough's The Wright Brothers and the second is Jeff Sutherland's Scrum: The Art of Doing Twice the Work in Half the Time.

[Book covers: The Wright Brothers and Scrum]

Although the human-interest side of the story was interesting to me, what really stood out was how the Wright brothers got their machine into the air.  If you are not familiar with the details of how they constructed the Wright Flyer, there are some pretty interesting points:

1) The Wright brothers were a small, agile team comprising no fewer than two people (the brothers themselves) and no more than seven.  I did not realize how important Charlie Taylor and William Tate were to different pieces of the project.

2) The Wright brothers spent the first part of their journey doing research, combing all of the scientific literature, and engaging with the thought leaders of the day in a very open-source style, where they would freely share knowledge but retain the final product for themselves.  This was in direct contrast to other teams that operated in silos and secrecy.

3) The Wright Brothers believed in doing one thing at a time well.  They realized that there were two major problems with heavier-than-air flight –> thrust and balance.  They separated these two concerns and tackled the balance problem first.  Once they figured out how to make a glider stable in flight, they then tackled how to add a motor to it.

4) The Wright Brothers made hundreds of small incremental changes, with each change able to stand on its own.  For example, they went out to Kitty Hawk in the summers of 1900, 1901, and 1902 with gliders before going out a fourth time in 1903 with their airplane.  Each time, the designs got bigger and closer to the final goal.

5) The Wright brothers were willing to challenge conventional and commonly accepted “facts” when their evidence did not support them.  The Wright brothers relied heavily on the calculations of Lilienthal and Chanute to measure lift and drag.  After several failed experiments, the Wright brothers ditched those and went with their own, painstakingly researched, measurement tables.

6) The Wright Brothers were in direct competition with Samuel Langley’s airplane.  In contrast to the Wright Brothers’ agile approach, Langley had a large team that operated in absolute secrecy while taking massive (at the time) amounts of public funds.  When Langley finally rolled out his “final product”, it failed miserably every time.

So what does the Wright Brothers’ methodology have to do with Scrum?  Everything.  If you look at the core tenets of Sutherland’s book, most of them can be found in how the Wrights conquered the air.  I went through the end of each chapter of Scrum and pulled out some of the take-away points that directly match how Orville and Wilbur did things:

[Table: Scrum take-away points matched to the Wright brothers’ practices]

The Wright brothers were doing Scrum a full hundred years before it became a thing.  As amazing as what they created is, how they did it is even more remarkable.  Interestingly for me, the “It’s the journey, not the destination” point really rang home.  As I write this blog, it is Friday night and I am on my front porch.  A neighbor stopped by to say “hello”.  When I told her I was working on a blog post related to my profession, she said “Oh, I am sorry you are not doing anything fun tonight.”  And I said “But this is fun.”  And internally I was thinking “I wonder why so many people think work is not fun?  Why are so many people socialized that way?  I hope my kids don’t wind up like that.”

The Counted Part 3: Law Enforcement Officers Killed In Line Of Duty

As a follow-up to this post, I decided to look at the other side of the gun –> police officers killed in the line of duty.  Fortunately, the FBI collects this data here.  It looks like the FBI is a bit behind on their summary reports:

[Screenshot: FBI summary reports page, with 2013 as the most recent year available]

So, taking the 2013 data as the closest data point to The Counted’s 2015 data, it took a couple of minutes to download the Excel spreadsheet and format it as a usable .csv:

[Screenshots: the original FBI Excel table and the reformatted .csv]
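
As an aside, the hand-formatting step could probably be skipped by reading the Excel file directly.  A minimal sketch, assuming the readxl package is available; the file name is illustrative and the FBI table’s title rows may need a skip= value:

library(readxl)  # assumption: readxl is installed
# File name is a placeholder for the downloaded FBI table
officers.raw <- read_excel("./Data/table_1_leos_fk_2013.xlsx")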

After importing the data into RStudio, I did a quick summary on the data frame.  The most striking thing out of the gate is how few officers are killed.  There were 27 in 2013, compared to over 500 people killed by police officers in the first half of 2015:

officers.killed <- read.csv("./Data/table_1_leos_fk_region_geographic_division_and_state_2013.csv")
sum(officers.killed$OfficersKilled)

I then added in the state population to do a similar ratio and map:

# Join officer fatalities to state populations, then scale per 10,000 residents
officers.killed.2 <- merge(x=officers.killed,
                           y=state.population.3,
                           by.x="StateName",
                           by.y="NAME")

officers.killed.2$AdjustedPopulation <- officers.killed.2$POPESTIMATE2014/10000
officers.killed.2$KilledRatio <- officers.killed.2$OfficersKilled/officers.killed.2$AdjustedPopulation
officers.killed.2$AdjKilledRatio <- officers.killed.2$KilledRatio * 10
officers.killed.2$StateName <- tolower(officers.killed.2$StateName)

# Merge onto the map polygons (all.x=TRUE keeps states with no fatalities)
choropleth.3 <- merge(x=all.states,
                      y=officers.killed.2,
                      sort = FALSE,
                      by.x = "region",
                      by.y = "StateName",
                      all.x=TRUE)
choropleth.3 <- choropleth.3[order(choropleth.3$order), ]
summary(choropleth.3)

qplot(long, lat, data = choropleth.3, group = group, fill = AdjKilledRatio,
      geom = "polygon")

[Choropleth: officers killed per capita by state]

So Louisiana and West Virginia seem to have the highest number of officers killed per capita.  I am not surprised, if only because I had no expectations about which states would have higher or lower numbers.  It seems like a case of “gee-whiz” data.
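
For the exact ranking rather than eyeballing the map, a quick sort of the merged data frame works.  A minimal sketch, reusing the officers.killed.2 frame from above:

# States ordered by officers killed per 10,000 residents, highest first
ranked <- officers.killed.2[order(-officers.killed.2$AdjKilledRatio),
                            c("StateName", "OfficersKilled", "AdjKilledRatio")]
head(ranked)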

Since there are so few instances, I decided to forgo any more analysis of police killed and instead combine this data with the people who were killed by police:

# Combine people-killed-by-police ratios with officers-killed ratios, state by state
the.counted.state.5 <- merge(x=the.counted.state.4,
                             y=officers.killed.2,
                             by.x="StateName",
                             by.y="StateName")

names(the.counted.state.5)[names(the.counted.state.5)=="AdjKilledRatio.x"] <- "NonPoliceKillRatio"
names(the.counted.state.5)[names(the.counted.state.5)=="AdjKilledRatio.y"] <- "PoliceKillRatio"

# Keep the raw and log-transformed ratios side by side for the scatterplot matrix
the.counted.state.6 <- data.frame(the.counted.state.5$NonPoliceKillRatio,
                                  the.counted.state.5$PoliceKillRatio,
                                  log(the.counted.state.5$NonPoliceKillRatio),
                                  log(the.counted.state.5$PoliceKillRatio))

colnames(the.counted.state.6) <- c("NonPoliceKilledRatio","PoliceKilledRatio",
                                   "LoggedNonPoliceKilledRatio","LoggedPoliceKilledRatio")

plot(the.counted.state.6)

The log transform certainly helps, and there seems to be a relationship between states where police are killed and states where people are killed by police (my hand-drawn red lines added):

[Scatterplot matrix of the.counted.state.6, with hand-drawn red trend lines]

With that in mind, I created a couple of linear models:

non.police <- the.counted.state.6$LoggedNonPoliceKilledRatio
police <- the.counted.state.6$LoggedPoliceKilledRatio
police[police==-Inf] <- NA   # states with zero officer fatalities log to -Inf; treat as missing

model <- lm( non.police ~ police )
summary(model)

model.2 <- lm( police ~ non.police )
summary(model.2)

Since there are only two variables, the adjusted R-squared is the same for x~y and y~x.

[Model summaries for non.police ~ police and police ~ non.police]
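
A quick sanity check of that claim, assuming the model and model.2 objects from the block above:

summary(model)$adj.r.squared    # adjusted R-squared for non.police ~ police
summary(model.2)$adj.r.squared  # adjusted R-squared for police ~ non.police; same value
cor(non.police, police, use = "complete.obs")^2  # the unadjusted R-squared, computed directly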

The interesting thing is that the model has to account for the fact that many states had 0 police fatalities but at least 1 person killed by the police.  The next interesting thing is the value of the coefficient: in states with at least 1 police fatality and at least 1 person killed by the police, each unit increase in the police-fatality ratio is matched by a .96 increase in the ratio of people killed by police –> and both ratios are population-adjusted and on the log scale.  So it shows that the police are better at killing than getting killed, which makes sense.
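
Since both variables are logged, the slope can also be read as an elasticity.  A minimal sketch of pulling it out of the fitted model (interpretation only, no new data; the .96 figure comes from the summary above):

slope <- coef(model)["police"]   # roughly .96 per the model summary
# In a log-log model, a 1% increase in the officers-killed ratio is associated with
# roughly a slope-percent increase in the people-killed-by-police ratio:
(1.01 ^ slope - 1) * 100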

The full gist is found here.

The Counted Part 2: Analysis Using R

Following up on this post from last week on analyzing The Counted using F# and R, I decided to look a bit closer at the data.  In last week’s post, I had a data frame of all of the people killed by law enforcement, collected by The Guardian, for the first half of 2015.  Although interesting to look at, I am not sure that the map tells us anything.  The data frame looks like this:

[The Counted data frame with geolocations]

My first thought was to look at killings relative to population by US state.  Step #1 was to sum up the number of rows by state code:

the.counted.state <- data.frame(table(the.counted$state))
colnames(the.counted.state) <- c("StateCode","NumberKilled")
summary(the.counted.state)

[Summary of the.counted.state]

I then brought in the latest population figures by state from the US Census:

state.population <- read.csv("http://www.census.gov/popest/data/state/asrh/2014/files/SCPRC-EST2014-18+POP-RES.csv")
state.population

[Preview of state.population]

And finally I brought in a crosswalk table of US state codes (which is what The Counted data uses) and US state names (which is what the US Census data uses):

state.crosswalk <- read.csv("http://www.fonz.net/blog/wp-content/uploads/2008/04/states.csv")
state.crosswalk

[Preview of state.crosswalk]

I then merged all three data frames together using the state name and state code as the common key:

state.population.2 <- state.population[c(5,6)]
state.population.3 <- merge(x=state.population.2,
                            y=state.crosswalk,
                            by.x="NAME",
                            by.y="State")
#The Counted With Population
the.counted.state <- merge(x=the.counted.state,
                           y=state.population.3,
                           by.x="StateCode",
                           by.y="Abbreviation")

I then tried to add a column that divided the total number of killed individuals by the number of people in the state:

the.counted.state.2 <- the.counted.state
the.counted.state.2$KilledRatio <- the.counted.state.2$NumberKilled/the.counted.state.2$POPESTIMATE2014

[Preview of the.counted.state.2]

The problem quickly became obvious: there were not enough people in the numerator to make a meaningful straight division.  To compensate, I divided the number of people in each state by 10,000.  I also increased the kill ratio by a factor of 10 so that we have a scale between 0 and 1, in steps of .1, which is easily digestible.  Finally, I renamed the variable “NAME” to “StateName” because my OCD couldn’t let such an affront to the naming gods go unpunished.

the.counted.state.3 <- the.counted.state
the.counted.state.3$AdjustedPopulation <- the.counted.state.2$POPESTIMATE2014/10000
the.counted.state.3$KilledRatio <- the.counted.state.3$NumberKilled/the.counted.state.3$AdjustedPopulation
the.counted.state.3$AdjKilledRatio <- the.counted.state.3$KilledRatio * 10

names(the.counted.state.3)[names(the.counted.state.3)=="NAME"] <- "StateName"
the.counted.state.3$StateName <- tolower(the.counted.state.3$StateName)

[Preview of the.counted.state.3]
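
To make the scaling concrete, here is the arithmetic for a purely hypothetical state (made-up numbers, not from the dataset):

adjusted.population <- 5000000 / 10000    # 5,000,000 residents -> 500 units of 10,000
killed.ratio <- 25 / adjusted.population  # 25 killings -> 0.05 per 10,000 residents
adj.killed.ratio <- killed.ratio * 10     # 0.5 on the 0-to-1 display scale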

With the data prepped, I created a choropleth to show the kill ratio by state using a gradient scale:

choropleth <- merge(x=all.states,
                    y=the.counted.state.3,
                    sort = FALSE,
                    by.x = "region",
                    by.y = "StateName",
                    all.x=TRUE)
choropleth <- choropleth[order(choropleth$order), ]
summary(choropleth)

qplot(long, lat, data = choropleth, group = group, fill = AdjKilledRatio,
      geom = "polygon")

[Choropleth: adjusted kill ratio by state, gradient scale]

Note that I had to use all.x=TRUE to account for the fact that South Dakota and Vermont did not have any killings so far in 2015.  This is equivalent to a left outer join, for you SQL folks.  On a side note, what’s up with Oklahoma?
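
If you have not seen that merge option before, here is a tiny, self-contained sketch of the difference (toy data frames, unrelated to the real dataset):

left  <- data.frame(region = c("oklahoma", "vermont"))
right <- data.frame(region = "oklahoma", AdjKilledRatio = 0.4)
merge(left, right, by = "region")                 # inner join: vermont drops out
merge(left, right, by = "region", all.x = TRUE)   # left outer join: vermont kept, ratio is NA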

I then decided to bin the data into high, medium, and low categories.  Looking at the detail of AdjKilledRatio, there seem to be some natural breaks around 10% and 20%:

the.counted.state.4$AdjKilledRatio
summary(the.counted.state.4$AdjKilledRatio)

[Summary of AdjKilledRatio]

So I binned it like that:

the.counted.state.4$KilledBin <- cut(the.counted.state.4$AdjKilledRatio,
                                     breaks=seq(0,1,.1))
summary(the.counted.state.4$KilledBin)

[Summary of the ten-level KilledBin]

The problem with my code is that this gives me 10 bins and I only really need 3.  Fortunately, this Stack Overflow post helped me re-write the binning into 3 factors.  Note the Inf on the high side and the labels.

the.counted.state.4$KilledBin <- cut(the.counted.state.4$AdjKilledRatio,
                                     breaks=c(seq(0,.2,.1),Inf),
                                     labels=c("low","med","high"))

And this gives me a pretty good distribution of bins:

[Summary of the low/med/high KilledBin]

With things binned up, I created another choropleth and map:

choropleth.2 <- merge(x=all.states,
                      y=the.counted.state.4,
                      sort = FALSE,
                      by.x = "region",
                      by.y = "StateName",
                      all.x=TRUE)
choropleth.2 <- choropleth.2[order(choropleth.2$order), ]
summary(choropleth.2)

qplot(long,
      lat,
      data = choropleth.2,
      group = group,
      fill = KilledBin,
      geom = "polygon")

[Choropleth: KilledBin by state]

If you squint, it almost looks like a map of the Civil War, no?

The Counted: Initial Analysis Using FSharp and R

(Note: this is post one of three.  Next week is a deeper dive into the data and the following week is an analysis of law enforcement officers killed in the line of duty)

Andrew Oliver hit me up on Twitter with a new dataset that he stumbled across.  The dataset is called “The Counted” and it is an attempt to count all of the deaths at the hands of police in America in 2015.  Apparently, this data is not collected systematically by the US government, which is kind of puzzling.  You can read about and download the data here.  A sample looks like:

[Sample rows from The Counted dataset]

John asked what we could do with the dataset –> especially when comparing it to other variables like socio-economic status.  Step #1 in my mind was to geo-locate the data.  Since this is a .csv, the very first thing was to remove extra commas and replace them with semicolons or blank spaces (for example, US Marshals Service, Pennsylvania State Police, Allegheny County Sheriff’s Office became US Marshals Service; Pennsylvania State Police; Allegheny County Sheriff’s Office and Corrections Department, 1400 E 4th Ave became Corrections Department 1400 E 4th Ave).
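
That substitution could also be scripted rather than done in an editor.  A minimal sketch in R (the column names follow the raw file’s headers; the file names are assumptions):

# Hypothetical scripted clean-up of embedded commas before re-exporting the file
raw <- read.csv("TheCounted.csv", stringsAsFactors = FALSE)
raw$lawenforcementagency <- gsub(",", ";", raw$lawenforcementagency)
raw$streetaddress        <- gsub(",", "",  raw$streetaddress)
write.csv(raw, "TheCounted-clean.csv", row.names = FALSE)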

Adding Geolocations

Drawing on the code I wrote using Texas A&M’s Geoservice (found here), I converted the json type provider script into a function that takes address info and returns a geolocation:

let getGeoCoordinates(streetAddress:string, city:string, state:string) =
    let apiKey = "xxxxx"
    let stringBuilder = new StringBuilder()
    stringBuilder.Append("https://geoservices.tamu.edu/Services/Geocode/WebService/GeocoderWebServiceHttpNonParsed_V04_01.aspx") |> ignore
    stringBuilder.Append("?streetAddress=") |> ignore
    stringBuilder.Append(streetAddress) |> ignore
    stringBuilder.Append("&city=") |> ignore
    stringBuilder.Append(city) |> ignore
    stringBuilder.Append("&state=") |> ignore
    stringBuilder.Append(state) |> ignore
    stringBuilder.Append("&apiKey=") |> ignore
    stringBuilder.Append(apiKey) |> ignore
    stringBuilder.Append("&version=4.01") |> ignore
    stringBuilder.Append("&format=json") |> ignore

    let searchUri = stringBuilder.ToString()
    let searchResult = GeoLocationServiceContext.Load(searchUri)

    let firstResult = searchResult.OutputGeocodes |> Seq.head
    firstResult.OutputGeocode.Latitude, firstResult.OutputGeocode.Longitude, firstResult.OutputGeocode.MatchScore

I then loaded in the dataset via the .csv type provider:

[<Literal>]
let theCountedSample = "..\Data\TheCounted.csv"
type TheCountedContext = CsvProvider<theCountedSample>
let theCountedData = TheCountedContext.Load(theCountedSample)

I then mapped the geo function over the imported dataset:

let theCountedGeoLocated =
    theCountedData.Rows
    |> Seq.map(fun r -> r, getGeoCoordinates(r.Streetaddress, r.City, r.State))
    |> Seq.toList
    |> Seq.map(fun (r,(lat,lon,ms)) ->
        String.Format("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15}",
                      r.Name, r.Age, r.Gender, r.Raceethnicity, r.Month, r.Day, r.Year,
                      r.Streetaddress, r.City, r.State, r.Cause, r.Lawenforcementagency,
                      r.Armed, lat, lon, ms))

And then finally exported the data:

let baseDirectory = __SOURCE_DIRECTORY__
let baseDirectory' = Directory.GetParent(baseDirectory)
let filePath = "Data\TheCountedWithGeo.csv"
let fullPath = Path.Combine(baseDirectory'.FullName, filePath)
File.WriteAllLines(fullPath, theCountedGeoLocated)

[Preview of TheCountedWithGeo.csv with appended latitude, longitude, and match score]

The gist is here.  Using the csv and json type providers made the analysis a snap –> the majority of the code is just building up the string for the service call.  +1 for simplicity.

Analyzing The Results

After adding geolocations to the dataset, I opened RStudio and imported the data:

theCounted <- read.csv("./Data/TheCountedWithGeo.csv")
summary(theCounted)

[Summary of theCounted]

So the good news is that we have good confidence in all of the observations, so we don’t have to drop any records (making the counted un-counted, as it were).
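
If some of the geocoder matches had come back weak, dropping them would be a one-liner.  A hedged sketch (the MatchScore column name and the cutoff of 80 are assumptions about how the exported file is labeled):

theCounted.good <- subset(theCounted, MatchScore >= 80)  # hypothetical filter on match quality
nrow(theCounted) - nrow(theCounted.good)                 # rows that would be dropped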

I then googled how to create a US map and put some data points on it, and ran across this post.  I copied and pasted the code, changed the variable names, said “there is no way it is this easy” out loud, and hit CTRL+ENTER.

library(ggplot2)
library(maps)

all.states <- map_data("state")
plot <- ggplot()
plot <- plot + geom_polygon(data=all.states, aes(x=long, y=lat, group = group),
                            colour="grey", fill="white" )
plot <- plot + geom_point(data=theCounted, aes(x=lon, y=lat),
                          colour="#FF0040")
plot <- plot + guides(size=guide_legend(title="Homicides"))
plot

[Map: The Counted homicide locations plotted on a US state map]

The gist is here.