Wake County Voter Analysis Using FSharp, AzureML, and R

One of the real strengths of F# is its ability to plow through and transform data in a very intuitive way.  I was recently looking at Wake County voter data, found here, to do some basic voter analysis.  My first thought was to download the data into R Studio.  Easy?  Not really.  The data is available as a ginormous Excel spreadsheet of about 154 MB.  I wanted to slim the dataset down and turn it into a .csv for easy import into R, but using Excel to export the data as a .csv kept screwing up the formatting, and importing it directly into R Studio from Excel resulted in out-of-memory crashes.  Also, the results of the different election dates were not consistent –> sometimes null, sometimes not.   I managed to get the data into R Studio without a crash and wrote a function that returns either voted (1) or did not vote (0) for each election:

#V = voted in-person on Election Day
#A = voted absentee by mail or early voting (through May 2006)
#M = voted absentee by mail (November 2006 - present)

#O = voted One-Stop early voting (November 2006 - present)
#T = voted at a transfer precinct on Election Day
#P = voted a provisional ballot
#L = Legacy data (prior to 2006)
#D = Did not show

votedIndicated <- function(votedCode) {
  switch(votedCode,
         "V" = 1,
         "A" = 1,
         "M" = 1,
         "O" = 1,
         "T" = 1,
         "P" = 1,
         "L" = 1,
         "D" = 0)
}

However, every time I tried to run it, the IDE would crash with an out of memory issue. 

Stepping back, I decided to transform the data in Visual Studio using F#. I created a sample from the ginormous Excel spreadsheet and then imported the data using a type provider.  No memory crashes!

#r "../packages/ExcelProvider.0.1.2/lib/net40/ExcelProvider.dll"
open FSharp.ExcelProvider

[<Literal>]
let samplePath = "../../Data/vrdb-Sample.xlsx"

open System.IO
let baseDirectory = __SOURCE_DIRECTORY__
let baseDirectory' = Directory.GetParent(baseDirectory)
let baseDirectory'' = Directory.GetParent(baseDirectory'.FullName)
let inputFilePath = @"Data\vrdb.xlsx"
let fullInputPath = Path.Combine(baseDirectory''.FullName, inputFilePath)

type WakeCountyVoterContext = ExcelFile<samplePath>
let context = new WakeCountyVoterContext(fullInputPath)
let row = context.Data |> Seq.head

I then applied a similar voted/did-not-vote function and exported the data as a .csv:

let voted (voteCode:obj) =
    match voteCode = null with
    | true -> "0"
    | false -> "1"

open System
let header = "Id,Race,Party,Gender,Age,20080506,20080624,20081104,20091006,20091103,20100504,20100622,20101102,20111011,20111108,20120508,20120717,20121106,20130312,20131008,20131105,20140506,20140715,20141104"

let createOutputRow (row:WakeCountyVoterContext.Row) =
    String.Format("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23}",
        row.voter_reg_num,
        row.race_lbl,
        row.party_lbl,
        row.gender_lbl,
        row.eoy_age,
        voted(row.``05/06/2008``),
        voted(row.``06/24/2008``),
        voted(row.``11/04/2008``),
        voted(row.``10/06/2009``),
        voted(row.``11/03/2009``),
        voted(row.``05/04/2010``),
        voted(row.``06/22/2010``),
        voted(row.``11/02/2010``),
        voted(row.``10/11/2011``),
        voted(row.``11/08/2011``),
        voted(row.``05/08/2012``),
        voted(row.``07/17/2012``),
        voted(row.``11/06/2012``),
        voted(row.``03/12/2013``),
        voted(row.``10/08/2013``),
        voted(row.``11/05/2013``),
        voted(row.``05/06/2014``),
        voted(row.``07/15/2014``),
        voted(row.``11/04/2014``)
        )

let outputFilePath = @"Data\vrdb.csv"
let fullOutputPath = Path.Combine(baseDirectory''.FullName, outputFilePath)

let file = new StreamWriter(fullOutputPath,true)

file.WriteLine(header)
context.Data
|> Seq.map(fun row -> createOutputRow(row))
|> Seq.iter(fun r -> file.WriteLine(r))

// make sure everything is flushed to disk
file.Flush()
file.Close()

The really great thing is that each line could be written and then disposed of, so the whole file was processed without any crashes.  Once the data was in a .csv (about 10% the size of the Excel file), I could import it into R Studio without a problem.  It is a common lesson, but it really shows that using the right tool for the job saves tons of headaches.
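On the R side, pulling the slimmed-down file back in is then a one-liner.  A minimal sketch (the path and the all.voters name are my assumptions; all.voters is the frame used in the decision tree below):

# read the .csv produced by the F# script; the election columns come through
# with an X prefix (e.g. X20131008) because they start with a digit
all.voters <- read.csv("Data/vrdb.csv")
summary(all.voters)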

I knew from a previous analysis of voter data that the #1 determinant of a person from Wake County voting in an off-cycle election was their age:

image

image

image

So then in R, I created a decision tree for just age to see what the split was:

library(rpart)
temp <- rpart(all.voters$X20131008 ~ all.voters$Age)
plot(temp)
text(temp)

Thanks to Placidia for answering my question on Stats.StackExchange

image

So basically politicians should be targeting people 50 years or older or perhaps emphasizing issues that appeal to the over 50 crowd.
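As a quick sanity check on that split, a sketch reusing the all.voters frame and the X-prefixed column name assumed above:

# share of each age group that voted in the 10/08/2013 election
mean(all.voters$X20131008[all.voters$Age >= 50], na.rm = TRUE)
mean(all.voters$X20131008[all.voters$Age < 50], na.rm = TRUE)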


Kaggle and R

Following up on last week’s post on doing a Kaggle competition, I then decided to see if I could explore the data more in R on my local desktop.  The competition is about analyzing a large group of house claims to give them a risk score.

I started up R Studio to take a look at the initial data:

train <- read.csv("../Data/train.csv")
head(train)
summary(train)

plot(train$Hazard)

image

A couple of things popped out.  All of the X variables look to be categorical.  Even the result “Hazard” is an integer with most of the values falling between 1 and 9.

With that in mind, I decided to split the dataset into two sections: the majority and the minority.

train.low <- subset(train, Hazard < 9)
train.high <- subset(train, Hazard >= 9)

plot(train.low$Hazard)
plot(train.high$Hazard)

The under-9 group looks like this:

image

And the 9-and-over group looks like this:

image

I also wanted to look at the Hazard score from a distribution point of view:

hazard.frame <- as.data.frame(table(train$Hazard))
colnames(hazard.frame) <- c("hazard","freq")
hist(hazard.frame$freq)
plot(x=hazard.frame$hazard, y=hazard.frame$freq)
plot(x=hazard.frame$hazard, y=log(hazard.frame$freq))

The hist shows the skew, with most of the mass piled up at the low end

image

 

image

and the log plot really shows the distribution

image

So there is clearly a diminishing-returns pattern going on.   As of this writing, the leader is at 40%, which works out to about 20,400 of the 51,000 entries.   So if you could identify all of the ones correctly, you should get 37% of the way there.  To test it out, I submitted all ones to Kaggle:
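For reference, the all-ones file takes about three lines of R to build.  A sketch, assuming the test file is test.csv and the submission format is Id,Hazard:

test <- read.csv("../Data/test.csv")
submission <- data.frame(Id = test$Id, Hazard = 1)  # predict 1 for every property
write.csv(submission, "../Data/all_ones.csv", row.names = FALSE)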

image

LOL, so they must take away points for incorrect answers, as it scores the same as the “all 0” benchmark.  So going back, I know that if I can predict the ones correctly and make a reasonable guess at the rest, I might be OK.   I went back, tuned my model some to get out of the bottom 25%, and then let it be.  I assume that there is something obvious/industry-standard that I am missing, because there are so many people between my position and the top 25%.

The Counted Part 2: Analysis Using R

Following up on this post from last week on analyzing The Counted using F# and R, I decided to look a bit closer at the data.  In last week’s post, I had a data frame of all of the people killed by law enforcement, collected by The Guardian, for the 1st half of 2015.  Although interesting to look at, I am not sure that the map tells us anything.  The data frame looks like this:

image

My first thought was to look at killings relative to population by US state.  Step #1 was to sum up the number of rows by state code:

the.counted.state <- data.frame(table(the.counted$state))
colnames(the.counted.state) <- c("StateCode","NumberKilled")
summary(the.counted.state)

image

I then brought in the latest population figures by state from the US Census:

state.population <- read.csv("http://www.census.gov/popest/data/state/asrh/2014/files/SCPRC-EST2014-18+POP-RES.csv")
state.population

image

And finally I brought in a crosswalk table of US state codes (what The Counted data is in) and US state names (which is what the US Census data is in):

state.crosswalk <- read.csv("http://www.fonz.net/blog/wp-content/uploads/2008/04/states.csv")
state.crosswalk

image

 

I then merged all three data frames together using the state name and state code as the common key:

state.population.2 <- state.population[c(5,6)]
state.population.3 <- merge(x=state.population.2,
                            y=state.crosswalk,
                            by.x="NAME",
                            by.y="State")
#The Counted with population
the.counted.state <- merge(x=the.counted.state,
                           y=state.population.3,
                           by.x="StateCode",
                           by.y="Abbreviation")

I then tried to add in a column that divided the total number of killed individuals by the number of people in the state.

the.counted.state.2 <- the.counted.state
the.counted.state.2$KilledRatio <- the.counted.state.2$NumberKilled/the.counted.state.2$POPESTIMATE2014

image

The problem quickly became obvious: there were not enough people in the numerator to make a meaningful straight division.  To compensate, I divided the number of people in each state by 10,000.  I also increased the kill ratio by a factor of 10 so that we have a scale between 0 and 1, in steps of .1, which is easily digestible.  Finally, I renamed the variable “NAME” to “StateName” because my OCD couldn’t let such an affront to the naming gods go unpunished.

the.counted.state.3 <- the.counted.state
the.counted.state.3$AdjustedPopulation <- the.counted.state.2$POPESTIMATE2014/10000
the.counted.state.3$KilledRatio <- the.counted.state.3$NumberKilled/the.counted.state.3$AdjustedPopulation
the.counted.state.3$AdjKilledRatio <- the.counted.state.3$KilledRatio * 10

names(the.counted.state.3)[names(the.counted.state.3)=="NAME"] <- "StateName"
the.counted.state.3$StateName <- tolower(the.counted.state.3$StateName)

 

image

With the data prepped, I created a choropleth to show the kill ratio by state using a gradient scale:

choropleth <- merge(x=all.states,
                    y=the.counted.state.3,
                    sort = FALSE,
                    by.x = "region",
                    by.y = "StateName",
                    all.x = TRUE)
choropleth <- choropleth[order(choropleth$order), ]
summary(choropleth)

qplot(long, lat, data = choropleth, group = group, fill = AdjKilledRatio,
      geom = "polygon")

image

Note that I had to use all.x=TRUE to account for the fact that South Dakota and Vermont did not have any killings so far in 2015.  This is equivalent to a left outer join for you SQL folks.  On a side note, what’s up with Oklahoma? 
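(If you wanted those two states to show up as part of the gradient instead of as NA holes, one option, just a sketch, is to treat a missing state as zero killings before plotting:)

# states with no rows in The Counted come through the left join as NA
choropleth$AdjKilledRatio[is.na(choropleth$AdjKilledRatio)] <- 0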

I then decided to bin the data into high, medium, and low categories.  Looking at the detail of the AdjKilledRatio, there seem to be some natural breaks around 10% and 20%:

#carry the prepped frame forward
the.counted.state.4 <- the.counted.state.3
the.counted.state.4$AdjKilledRatio
summary(the.counted.state.4$AdjKilledRatio)

image

So I binned like that:

the.counted.state.4$KilledBin <- cut(the.counted.state.4$AdjKilledRatio,
                                     breaks=seq(0,1,.1))
summary(the.counted.state.4$KilledBin)

image

The problem with my code is that this gives me 10 bins and I only really need 3.  Fortunately, this Stack Overflow post helped me rewrite the binning into 3 factor levels.  Note the Inf on the high side and the labels.

the.counted.state.4$KilledBin <- cut(the.counted.state.4$AdjKilledRatio,
                                     breaks=c(seq(0,.2,.1),Inf),
                                     labels=c("low","med","high"))

And this gives me a pretty good distribution of bins:

image

With things binned up, I created another choropleth and map:

choropleth.2 <- merge(x=all.states,
                      y=the.counted.state.4,
                      sort = FALSE,
                      by.x = "region",
                      by.y = "StateName",
                      all.x = TRUE)
choropleth.2 <- choropleth.2[order(choropleth.2$order), ]
summary(choropleth.2)

qplot(long,
      lat,
      data = choropleth.2,
      group = group,
      fill = KilledBin,
      geom = "polygon")

image

If you squint, it almost looks like a map of the Civil War, no?

The Counted: Initial Analysis Using FSharp and R

(Note: this is post one of three.  Next week is a deeper dive into the data and the following week is an analysis of law enforcement officers killed in the line of duty)

Andrew Oliver hit me up on Twitter with a new dataset that he stumbled across.  The dataset is called “The Counted” and it is an attempt to count all of the deaths at the hands of police in America in 2015.  Apparently, this data is not collected systematically by the US government, which is kind of puzzling.  You can read about and download the data here.  A sample looks like:

image

John asked what we could do with the dataset –> especially when comparing to other variables like socio-economic status.  Step #1 in my mind was to geolocate the data.  Since this is a .csv, the very first thing was to remove extra commas and replace them with semicolons or blank spaces (for example, “US Marshals Service, Pennsylvania State Police, Allegheny County Sheriff’s Office” became “US Marshals Service; Pennsylvania State Police; Allegheny County Sheriff’s Office” and “Corrections Department, 1400 E 4th Ave” became “Corrections Department 1400 E 4th Ave”).

Adding Geolocations

Drawing on my code that I wrote using Texas A&M’s Geoservice found here, I converted the json type provider script into a function that takes address info and returns a geolocation:

let getGeoCoordinates(streetAddress:string, city:string, state:string) =
    let apiKey = "xxxxx"
    let stringBuilder = new StringBuilder()
    stringBuilder.Append("https://geoservices.tamu.edu/Services/Geocode/WebService/GeocoderWebServiceHttpNonParsed_V04_01.aspx") |> ignore
    stringBuilder.Append("?streetAddress=") |> ignore
    stringBuilder.Append(streetAddress) |> ignore
    stringBuilder.Append("&city=") |> ignore
    stringBuilder.Append(city) |> ignore
    stringBuilder.Append("&state=") |> ignore
    stringBuilder.Append(state) |> ignore
    stringBuilder.Append("&apiKey=") |> ignore
    stringBuilder.Append(apiKey) |> ignore
    stringBuilder.Append("&version=4.01") |> ignore
    stringBuilder.Append("&format=json") |> ignore

    let searchUri = stringBuilder.ToString()
    let searchResult = GeoLocationServiceContext.Load(searchUri)

    let firstResult = searchResult.OutputGeocodes |> Seq.head
    firstResult.OutputGeocode.Latitude, firstResult.OutputGeocode.Longitude, firstResult.OutputGeocode.MatchScore

I then loaded in the dataset via the .csv type provider:

[<Literal>]
let theCountedSample = "..\Data\TheCounted.csv"
type TheCountedContext = CsvProvider<theCountedSample>
let theCountedData = TheCountedContext.Load(theCountedSample)

I then mapped the geofunction to the imported dataset:

let theCountedGeoLocated =
    theCountedData.Rows
    |> Seq.map(fun r -> r, getGeoCoordinates(r.Streetaddress, r.City, r.State))
    |> Seq.toList
    |> Seq.map(fun (r,(lat,lon,ms)) ->
        String.Format("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15}",
            r.Name,r.Age,r.Gender,r.Raceethnicity,r.Month,r.Day,r.Year, r.Streetaddress, r.City,r.State,r.Cause,r.Lawenforcementagency,r.Armed,lat,lon,ms))

And then finally exported the data

let baseDirectory = __SOURCE_DIRECTORY__
let baseDirectory' = Directory.GetParent(baseDirectory)
let filePath = "Data\TheCountedWithGeo.csv"
let fullPath = Path.Combine(baseDirectory'.FullName, filePath)
File.WriteAllLines(fullPath,theCountedGeoLocated)

image

The gist is here.  Using the csv and json type providers made the analysis a snap –> the majority of the code is just building up the string for the service call.  +1 for simplicity.

Analyzing The Results

After adding geolocations to the dataset, I opened R studio and imported the dataset.

theCounted <- read.csv("./Data/TheCountedWithGeo.csv")
summary(theCounted)

 

image

So this is good news: we have good confidence on all of the observations, so we don’t have to drop any records (making The Counted un-counted, as it were).

I then googled how to create a US map and put some data points on it, and ran across this post.  I copied and pasted the code, changed the variable names, said “there is no way it is this easy” out loud, and hit CTRL+ENTER.

library(ggplot2)
library(maps)

all.states <- map_data("state")
plot <- ggplot()
plot <- plot + geom_polygon(data=all.states, aes(x=long, y=lat, group = group),
                            colour="grey", fill="white")
plot <- plot + geom_point(data=theCounted, aes(x=lon, y=lat),
                          colour="#FF0040")
plot <- plot + guides(size=guide_legend(title="Homicides"))
plot

 

image

The gist is here.

R for the .NET Developer

I spent some time over the last week putting my ideas down for a new speaking topic: “R for the .NET Developer.” With Microsoft acquiring Revolution Analytics and making a concerted push into analytics tooling and platforms, it makes sense that .NET developers have some exposure to the most common language in the data science space – R.

I started the presentation using Prezi (thanks David Green) and set up the major points I wanted to cover:

  • R Overview
  • R Language Features
  • R In Action
  • R Lessons Learned

You can see the Prezi here.

I worked through and then borrowed from several different books:

image

image

this great YouTube clip

image

and this Pluralsight course

image

I then jumped into R Studio to work through some of the code ideas that the Prezi illustrates. The entire set of code is found on GitHub here, but I wanted to show a couple of the cooler things that I did.

First, I implemented the automotive example from the Data Mining and Business Analytics with R book. This is pretty much a straight port of that exercise, with the exception that I convert some vectors to factors to demonstrate how/when to do it:

setwd("C:\\Git\\R4DotNet")

#y = x1 + x2 + x3 + E
#y is what you are trying to explain
#x1, x2, x3 are the variables that cause/influence y
#E is things that we are not measuring/using for calculations

fuel.efficiency <- read.csv("C:/Git/R4DotNet/Data/FuelEfficiency.csv")
summary(fuel.efficiency)

#MPG = Miles per gallon
#GPM = Gallons per 100 miles
#WT  = Weight of car in 1000 lbs
#DIS = Displacement in cubic inches
#NC  = Number of cylinders
#HP  = Horsepower
#ACC = Acceleration in seconds from 0-60
#ET  = Engine type: 0 = V, 1 = Straight

plot(GPM~WT,data=fuel.efficiency)
plot(GPM~DIS,data=fuel.efficiency)

fuel.efficiency$NC <- factor(fuel.efficiency$NC)
fuel.efficiency$ET <- factor(fuel.efficiency$ET)
summary(fuel.efficiency)

plot(GPM~NC,data=fuel.efficiency)

model <- lm(GPM~.,data=fuel.efficiency)
summary(model)

# Multiple R-squared: 0.9804
# means that we can explain 98% of GPM with the variables we have; E = 2%
# That is pretty friggen good

# turning back to numeric so we can do cor across the data frame
fuel.efficiency$NC <- as.integer(fuel.efficiency$NC)
fuel.efficiency$ET <- as.integer(fuel.efficiency$ET)
cor(fuel.efficiency)

#DIS -> WT = 0.9507647

library(leaps)
x = fuel.efficiency[,3:7]
y = fuel.efficiency[,2]
out = summary(regsubsets(x,y,nbest=2,nvmax=ncol(x)))
tab = cbind(out$which,out$rsq,out$adjr2,out$cp)
tab

# trade off between model size and model fit
# just weight on its own:

model2 = lm(GPM~WT,data=fuel.efficiency)
summary(model2)

Here are the plots (as continuous and as a factor):

image image

Then, I implemented this k-means example from AzureML to show the difference between the two implementations. The AzureML experiment is found here, and my code looks like this (note that I did not do a regression):

# the UCI iris data file has no header row
flowers <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                    header = FALSE)
summary(flowers)

colnames(flowers) <- c("F1", "F2", "F3", "F4", "Label")
summary(flowers)

# 60% training sample, 40% held out
indexes = sample(1:nrow(flowers), size=0.6*nrow(flowers))
flowers.train <- flowers[indexes,]
flowers.test <- flowers[-indexes,]

fit <- kmeans(flowers.train[,1:4], 5)
fit

plot(flowers.train[c("F1", "F2")], col=fit$cluster)
points(fit$centers[,c("F1", "F2")], col=1:3, pch=8, cex=2)

With a plot example like this:

image

So I think I am ready for the presentation.  It is really true, the best way to learn about something is to teach it…

Global Azure Bootcamp Racing Game: More Analytics Using R and AzureML

Alan Smith, the creator and keeper of the Global Azure Bootcamp Racing Game, was kind enough to put the telemetry data from the races out on Azure Blob Storage.  The data was already available as XML from Table Storage, but AzureML was choking on the format, so Alan turned it into .csv and put the files out here:

https://alanazuredemos.blob.core.windows.net/alan/TelemetryData0.csv
https://alanazuredemos.blob.core.windows.net/alan/TelemetryData1.csv
https://alanazuredemos.blob.core.windows.net/alan/TelemetryData2.csv
https://alanazuredemos.blob.core.windows.net/alan/PlayerLapTimes0.csv
https://alanazuredemos.blob.core.windows.net/alan/PlayerLapTimes1.csv
https://alanazuredemos.blob.core.windows.net/alan/PlayerLapTimes2.csv

Note that there are 3 races (race0, race1, and race2), each having 2 datasets.  The TelemetryData is a reading for each car in the race every 10 ms or so, and the PlayerLapTimes is a summary of the demographics of the player as well as some final results.

I decided to do some unsupervised learning using Chapter 8 of Practical Data Science With R as my guide.  I pulled down all 972,780 observations from the race0 telemetry data into R Studio.  It took a bit :-)  I then ran the following script to do a cluster dendrogram.  Alas, I killed the job after several minutes (actually, the job killed my machine and I got an out-of-memory exception):

summary(TelemetryData0)
pmatrix <- scale(TelemetryData0[,])
d <- dist(pmatrix, method="euclidean")
pfit <- hclust(d, method="ward")
plot(pfit)

I then tried to narrow my search down to damage and speed:

damage <- TelemetryData0$Damage
speed <- TelemetryData0$Speed

plot(damage, speed, main="Damage and Speed",
     xlab="Damage", ylab="Speed", pch=20)

abline(lm(speed~damage), col="red")      # regression line (y~x)
lines(lowess(damage,speed), col="blue")  # lowess line (x,y)

(I added the red line manually)

image

So that is interesting.  It looks like there is a slight downhill slope –> more damage at the lower speeds.  So perhaps more speed does not automatically mean more damage to the car.  Anyone who drives in San Francisco can attest to that 🙂
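One quick way to put a number on that slope would be a simple correlation over the same two columns (a sketch, not something I ran against the full dataset):

# direction and strength of the damage/speed relationship
cor(TelemetryData0$Damage, TelemetryData0$Speed, use = "complete.obs")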

I then went back and took a sample of the telemetry data

telemetry <- TelemetryData0[sample(1:nrow(TelemetryData0),10000),]
telemetry <- telemetry[0:10000,c("Damage","Speed")]
summary(telemetry)
pmatrix <- scale(telemetry[,])
d <- dist(pmatrix, method="euclidean")
pfit <- hclust(d, method="ward")
plot(pfit)

And I got this:

image

And the fact that it is not showing me anything made me think of this clip:

image

In any event, I decided to try a similar analysis using AzureML to see if AzureML can handle the 975K records better than my desktop.

I fired up AzureML and added a data reader to the original file and then added some cleaning:

image

The problem is that these steps would take 10-12 minutes to complete.  I decided to give up and bring a copy of the data locally via the “Save As Dataset” context menu.  This sped things up significantly.  I added in a k-means module for speed and damage and ran the model:

image 

The first ten times or so I ran this, I got this:

image

After I added in the “Clean Missing Data” module before the normalization step,

image

I got some results.  Note that removing the entire row is also the default way R handles missing values when fitting a model, so I thought I would keep the two approaches matching.  In any event, the results look like this:

image

So I am not sure what this shows, other than there is overlap of speed and damage and there seems to be a relationship.
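For comparison, roughly the same clustering can be run locally in R on the 10,000-row sample from earlier.  A sketch, with the choice of 3 clusters being arbitrary:

# drop rows with missing values first (the same "remove entire row" cleanup)
clean <- na.omit(telemetry[, c("Damage", "Speed")])

# k-means on normalized damage and speed, mirroring the AzureML modules
fit <- kmeans(scale(clean), centers = 3)
plot(clean$Damage, clean$Speed, col = fit$cluster, pch = 20,
     xlab = "Damage", ylab = "Speed")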

So there are some other questions I want to answer, like:

1) After a player sustains some damage, do they have a generic response (like braking, turning right, etc.)?

2) Are there certain “lines” that winning players take going through individual curves?

3) Do you really have to avoid damage to win?

I plan to try and answer these questions and more in the coming weeks.

Global Azure Bootcamp: Car Lab Analysis

As part of the Global Azure Bootcamp, the organizers created a hands-on lab where individuals could install a racing game and compete against other drivers.  The cool thing was the amount of telemetry that the game pushed to Azure (I assume using Event Hubs to Azure Tables).  The lab also had a basic “hello world” web app that could read data from the Azure Table REST endpoints, so newcomers could see how easy it is to create and then deploy a website on Azure.

I decided to take a bit of a jaunt through the data endpoint to see what analytics I could run on it using Azure ML.  I went to the initial endpoint here and, sure enough, the data comes down in the browser.  Unfortunately, when I set it up in Azure ML using a data reader:

image

I got 0 records returned.  I think this has something to do with how the datareader deals with XML.  I quickly used F# in Visual Studio with the XML type provider:

#r "../packages/FSharp.Data.2.2.0/lib/net40/FSharp.Data.dll"

open FSharp.Data

[<Literal>]
let uri = "https://reddoggabtest-secondary.table.core.windows.net/TestTelemetryData0?tn=TestTelemetryData0&sv=2014-02-14&si=GabLab&sig=GGc%2BHEa9wJYDoOGNE3BhaAeduVOA4MH8Pgss5kWEIW4%3D"

type CarTelemetry = XmlProvider<uri>
let carTelemetry = CarTelemetry.Load(uri)

I reached out to the creator of the lab and he put a summary file on Azure Blob Storage that was very easy to consume with AzureML; you can find it here.  I created a regression to predict the amount of damage a car will sustain based on the country and car type:

image

This was great, but I wanted to work on my R chops some, so I decided to play around with the data in R Studio.  I imported the data and then fired up the scripting window.  The first question I wanted to answer was “how do the countries stack up against each other in terms of car crashes?”

I did some basic data exploration like so:

summary(PlayerLapTimes)

aggregate(Damage ~ Country, PlayerLapTimes, sum)
aggregate(Damage ~ Country, PlayerLapTimes, FUN=length)

image

And then getting down to the business of answering the question:

dfSum <- aggregate(Damage ~ Country, PlayerLapTimes, sum)
dfCount <- aggregate(Damage ~ Country, PlayerLapTimes, FUN=length)

dfDamage <- merge(x=dfSum, y=dfCount, by.x="Country", by.y="Country")
names(dfDamage)[2] <- "Sum"
names(dfDamage)[3] <- "Count"
dfDamage$Avg <- dfDamage$Sum/dfDamage$Count
dfDamage2 <- dfDamage[order(dfDamage$Avg),]

image

So that is kinda interesting that France has the most damage per race.  I have to ask Mathias Brandewinder about that.

In any event, I then wanted to ask “which country finished first?”  I decided to apply some R charting to the same boilerplate that I created earlier:

dfSum <- aggregate(LapTimeMs ~ Country, PlayerLapTimes, sum)
dfCount <- aggregate(LapTimeMs ~ Country, PlayerLapTimes, FUN=length)
dfSpeed <- merge(x=dfSum, y=dfCount, by.x="Country", by.y="Country")
names(dfSpeed)[2] <- "Sum"
names(dfSpeed)[3] <- "Count"
dfSpeed$Avg <- dfSpeed$Sum/dfSpeed$Count
dfSpeed2 <- dfSpeed[order(dfSpeed$Avg),]
plot(PlayerLapTimes$Country,PlayerLapTimes$Damage)

image

 

image

So even though France appears to have the slowest drivers, the average is skewed by 2 pretty bad races –> perhaps those drivers never finished.
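One cheap way to check the “skewed by a couple of bad races” theory would be to aggregate on the median instead of the mean (a sketch using the same columns):

# medians are far less sensitive to a couple of unfinished or awful laps
aggregate(LapTimeMs ~ Country, PlayerLapTimes, FUN = median)
aggregate(Damage ~ Country, PlayerLapTimes, FUN = median)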

In any event, this was a fun exercise and I hope to continue with the data to show the awesomeness of Azure, F#, and R…


WCPSS Scores and Property Tax Valuations Using R

With all of the data gathered and organized I was ready to do some analytics using R.  The first thing I did was to load the four major datasets into R.

image

  • NCScores is the original dataset that has the school score.  I had already done an analysis on it here.
  • SchoolValuation is the aggregate property values for each school, as determined by scraping the Wake County Tax website and the Wake County School Assignment website.  You can read how it was created here and here.
  • SchoolNameMatch is a crosswalk table between the school name as found in the NCScores dataframe and the SchoolValuation dataframe.  You can read how it was created here.
  • WakeCountySchoolInfo is an export from WCPSS that was tossed around at open data day.

Step one was to reduce the North Carolina Scores data to only Wake County

#Create Wake County scores from NC state scores
WakeCountyScores <- NCScores[NCScores$District == 'Wake County Schools',]

The next step was to add in the SchoolNameMatch so that we have the Tax Valuation School Name

#Join SchoolNameMatch to Wake County scores
WakeCountyScores <- merge(x=WakeCountyScores, y=SchoolNameMatch, by.x="School", by.y="WCPSS")

Interestingly, R is smart enough that the common field is not duplicated; just the additional field(s) are added:

image

The next step was to add in the Wake County Property Values, remove the Property field as it is no longer needed, and convert the TaxBase field from string to numeric

#Join property values
WakeCountyScores <- merge(x=WakeCountyScores, y=SchoolValuation, by.x="Property", by.y="SchooName")

#Remove Property column
WakeCountyScores$Property = NULL

#Turn tax base to numeric
WakeCountyScores$TaxBase <- as.numeric(WakeCountyScores$TaxBase)

Eager to do an analysis, I pumped the data into a correlation

#Do a correlation
cor(WakeCountyScores$TaxBase, WakeCountyScores$SchoolScore, use="complete")

image
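(For comparison, the free-and-reduced-lunch number mentioned below can be checked the same way; a sketch, assuming that column survived the merges under the name FreeAndReducedLunch:)

cor(WakeCountyScores$FreeAndReducedLunch, WakeCountyScores$SchoolScore,
    use = "complete.obs")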

So clearly my expectation that property values would track the school score the way FreeAndReducedLunch does (a .85 correlation) was not met.  I decided to use Practical Data Science with R, Chapter 3 (Exploring Data), as a guide to better understand the dataset.

#Practical Data Science With R, Chapter 3
summary(WakeCountyScores)
summary(WakeCountyScores$TaxBase)

image

image

So there is quite a range in tax base!  The next task was to use some graphs to explore the data.  I added in ggplot2

image

and followed the book’s example for a histogram.  I started with the score, and it comes out as expected.  I then tried a histogram on TaxBase and had to tinker with the binwidth to make a meaningful chart:

#Histograms
ggplot(WakeCountyScores) + geom_histogram(aes(x=SchoolScore),binwidth=5,fill="gray")
ggplot(WakeCountyScores) + geom_histogram(aes(x=TaxBase),binwidth=10000,fill="gray")
#Ooops
ggplot(WakeCountyScores) + geom_histogram(aes(x=TaxBase),binwidth=5000000,fill="gray")

 

image

image

The book then moves to an example studying income, which is directly analogous to the TaxBase, so I followed it very closely.  The next graphs were some density plots.  Note that the second one uses a logarithmic scale:

#Density
library(scales)
ggplot(WakeCountyScores) + geom_density(aes(x=TaxBase)) + scale_x_continuous(labels=dollar)
ggplot(WakeCountyScores) + geom_density(aes(x=TaxBase)) + scale_x_log10(labels=dollar) + annotation_logticks(sides="bt")

 

image

 

image

So it is kinda interesting that most schools cluster in terms of their tax base; because there is such a wide range, with the majority clustered at the low end, the logarithmic curve is much more revealing.

The book then moved into showing the relationship between two variables.  In this case, SchoolScore as the Y variable and TaxBase as the X variable:

#Relationship between TaxBase and scores
ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_point()
ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_point() + stat_smooth(method="lm")
ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_point() + geom_smooth()

image

image

image

So what is interesting is that there does not seem to be a strong relationship between scores and tax base.  There looks to be an equal number of schools below the score curve as above it.  Note that the smoothing curve is much better than the linear fit at showing the relationship of scores to tax base: you can see the dip in the lower quartile and the increase at the tail.  It makes sense that a higher tax base shows an increase in scores, but what’s up with that dip?

Finally, the same data is shown using a hex chart:

library(hexbin)
ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_hex(binwidth=c(100000000,5)) + geom_smooth(color="white",se=F)

image

So taking a step back, it is clear that there is a weakness in this analysis.  Some schools have thousands of students, some schools have a couple hundred (high schools versus elementary schools).  Using the absolute dollars from the tax valuation is misleading; what we really need is tax base per student.  Going back to the WakeCountySchoolInfo dataframe, I added it in and pulled a student count column:

WakeCountyScores <- merge(x=WakeCountyScores, y=WakeCountySchoolInfo, by.x="School", by.y="School.Name")
names(WakeCountyScores)[names(WakeCountyScores)=="School.Membership.2013.14..ADM..Mo2."] <- "StudentCount"
WakeCountyScores$StudentCount <- as.numeric(WakeCountyScores$StudentCount)

WakeCountyScores["TaxBasePerStudent"] <- WakeCountyScores$TaxBase/WakeCountyScores$StudentCount
summary(WakeCountyScores$TaxBasePerStudent)

Interestingly, the number of records in the base frame dropped from 166 to 152, which means that perhaps we need a second mapping table.  In any event, you can see that the average tax base per student is $6.5 million, with a max of $114 million.  Quite a range!

image
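(Tracking down exactly which schools fell out of the merge would be easy if a copy of the pre-merge frame were kept around; a sketch, with WakeCountyScores.preInfo being a hypothetical saved copy taken just before the info merge:)

# schools present before the info merge but missing afterwards
setdiff(WakeCountyScores.preInfo$School, WakeCountyScores$School)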

Going back to the point and hex graphs

ggplot(WakeCountyScores, aes(x=TaxBasePerStudent, y=SchoolScore)) + geom_point() + geom_smooth()
ggplot(WakeCountyScores, aes(x=TaxBasePerStudent, y=SchoolScore)) + geom_hex(binwidth=c(25000000,5)) + geom_smooth(color="white",se=F)

 

image

image

There is something interesting going on here.  First, the initial conclusion that a higher tax base leads to a gradual increase in scores falls apart once you move from total tax base to tax base per student.

Also, note the significant drop in school scores once you move away from the lowest tax-base schools, the recovery, and then the drop again.  From a real estate perspective, these charts suggest that the marginal value of a really expensive or really inexpensive house in Wake County is not worth it (at least in terms of where you send your kids), and there is a sweet spot of value above a certain price point. 

You can find the gist here and the repo is here.

Some lessons I learned in doing this exercise:

  • Some records got dropped between the scores dataframe and the info dataframe -> so there needs to be another mapping table
  • Make the tax base in millions
  • What’s up with that school with 114 million per student?
  • An interesting question is the allocation of dollars to each school compared to its tax base.  I wonder if that is on WCPSS somewhere.  Hmmm…
  • You can’t use the tick (‘) notation like in F#, which means you end up overwriting dataframes a lot.  This can be a costly and error-prone way to work.  It is much better to assume immutability, even if you clutter up your data window.

As a final note, I was using the Console window b/c that is what the intro books do.  This is a huge mistake in R Studio.  It is much better to create a script and send the results to the console

image

That way you can make changes and run things again.  It is a cheap way of avoiding head-scratching bugs…

Wake County School Report Cards Using R

Recently the Wake County school system released a “school report card” that can be used to compare how well a school is doing relative to the other schools in the state. As expected, it made front-page news in our local newspaper here.  The key theme was that schools that have kids from poorer families have worse results than schools with kids from more affluent families.  Though this shouldn’t come as a surprise, the follow-up op-eds were equally predictable: more money for poorer schools, changing the rating scale, etc.

I thought it would be an interesting dataset to analyze to see if the conclusion that the N&O came up with was, in fact, the only conclusion you can get out of the dataset.  Since they did a simple crosstab analysis, perhaps there was some other analysis that could be done?  I know that newspaper articles are written at a pretty low reading level, and perhaps they are written at a low analytical level too.  I went to the website to download the data here and quickly ran into two items:

1) The dataset is very limited –> there are only 3 creditable variables in the dataset (county, free and reduced percent, and the school score).  It is almost as if the dataset was purposely limited to only support the conclusion.

2) The dataset is shown in a way that makes alternative analysis very hard.  You have to install Tableau if you want to look at the data yourself.  Parsing Tableau was a pain because, even with Fiddler, the results are not rendered as HTML with some tags but as images.

Side Note –> My guess is that Tableau is trying to be the Flash of the analytics space.  I find it curious when companies/organizations think they are just “one tool away” from good analytics.   Even in the age of Watson, it is never the tooling – it is always the analyst that determines the usefulness of a dataset.  It would be much better if WCPSS embraced open data and had higher expectations of the people using the datasets.

In any event, with the 14-day trial of Tableau, I could download the data into Access.  I then exported the data into a .txt file (where it should have been in the 1st place).  I then pumped it into R Studio like so:

image

I then created 2 variables from the FreeAndReducedLunch and SchoolScores vectors.  When I ran the correlation the 1st time, I got an NA, meaning that there was some malformed data. 

image

I re-ran the correlation using only complete data and, sure enough, there is a credible correlation –> the higher the percent of free and reduced lunch, the lower the score.  The N&O is right. 

image

I then added a filter to look only at Wake County, and there is an even stronger correlation in Wake County than in the state as a whole:

image
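Since the R in this post only shows up as screenshots, here is roughly what those two correlations look like in script form.  A sketch; the FreeAndReducedLunch and SchoolScore column names are my guesses based on the description above:

# statewide correlation, skipping malformed/missing rows
cor(NCScores$FreeAndReducedLunch, NCScores$SchoolScore, use = "complete.obs")

# Wake County only
wake <- NCScores[NCScores$District == 'Wake County Schools', ]
cor(wake$FreeAndReducedLunch, wake$SchoolScore, use = "complete.obs")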

As I mentioned earlier, the dataset was set up for a pre-decided conclusion by limiting the number of independent variables and by the choice of using Tableau as the reporting mechanism.  I decided to augment the dataset with additional information.  My son plays in TYO, and I unsuccessfully tried to set up an orchestra at our local elementary school 8 years ago.  I also thought of this article where some families tried to get more orchestras into Wake County schools.  Fortunately, the list of schools with an orchestra can be found here, and it did not take very long to add a “HasAnStringsProgram” field to the dataset.

image

Running a correlation for just the WCPSS schools shows that there is no relationship between a school having an orchestra and its performance grade. 

image

So the statement by the parents in the N&O like this

… that music students have higher graduation rates, grades and test scores …

might be true for music programs in general, but a specialized strings program does not seem to impact the school’s score –> at least with this data.

Multiple Linear Regression Using R and F#

Following up on my previous post, I decided to test calling R from F# for a multiple linear regression.  I decided to use the dataset from chapter 1 of Machine Learning For Hackers (ufo sightings).

Step #1 was to open R from F#

#r @"C:\TFS\Tff.RDotNetExample_Solution\packages\R.NET.1.5.3\lib\net40\RDotNet.dll"
#r @"C:\TFS\Tff.RDotNetExample_Solution\packages\R.NET.1.5.3\lib\net40\RDotNet.NativeLibrary.dll"

open System.IO
open RDotNet

//open R
let environmentPath = System.Environment.GetEnvironmentVariable("PATH")
let binaryPath = @"C:\Program Files\R\R-3.0.1\bin\x64"
System.Environment.SetEnvironmentVariable("PATH",environmentPath+System.IO.Path.PathSeparator.ToString()+binaryPath)

let engine = RDotNet.REngine.CreateInstance("RDotNet")
engine.Initialize()

 

Step #2 was to import the ufo dataset and clean it:

//open dataset
let path = @"C:\TFS\Tff.RDotNetExample_Solution\Tff.RDotNetExample\ufo_awesome.txt"
let fileStream = new FileStream(path,FileMode.Open,FileAccess.Read)
let streamReader = new StreamReader(fileStream)
let contents = streamReader.ReadToEnd()
let usStates = [|"AL";"AK";"AZ";"AR";"CA";"CO";"CT";"DE";"DC";"FL";"GA";"HI";"ID";"IL";"IN";"IA";
                 "KS";"KY";"LA";"ME";"MD";"MA";"MI";"MN";"MS";"MO";"MT";"NE";"NV";"NH";"NJ";"NM";
                 "NY";"NC";"ND";"OH";"OK";"OR";"PA";"RI";"SC";"SD";"TN";"TX";"UT";"VT";"VA";"WA";
                 "WV";"WI";"WY"|]
let cleanContents =
    contents.Split([|'\n'|])
    |> Seq.map(fun line -> line.Split([|'\t'|]))
    |> Seq.filter(fun values -> values |> Seq.length = 6)
    |> Seq.filter(fun values -> values.[0].Length = 8)
    |> Seq.filter(fun values -> values.[1].Length = 8)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(0,4)) > 1900)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(0,4)) > 1900)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(0,4)) < 2100)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(0,4)) < 2100)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(4,2)) > 0)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(4,2)) > 0)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(4,2)) <= 12)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(4,2)) <= 12)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(6,2)) > 0)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(6,2)) > 0)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(6,2)) <= 31)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(6,2)) <= 31)
    |> Seq.filter(fun values -> values.[2].Split(',').[1].Trim().Length = 2)
    |> Seq.filter(fun values -> Seq.exists(fun elem -> elem = values.[2].Split(',').[1].Trim().ToUpperInvariant()) usStates)
    |> Seq.map(fun values ->
        // note: capital MM for month in the .NET date format string
        System.DateTime.ParseExact(values.[0],"yyyyMMdd",System.Globalization.CultureInfo.InvariantCulture),
        System.DateTime.ParseExact(values.[1],"yyyyMMdd",System.Globalization.CultureInfo.InvariantCulture),
        values.[2].Split(',').[0].Trim(),
        values.[2].Split(',').[1].Trim().ToUpperInvariant(),
        values.[3],
        values.[4],
        values.[5])
cleanContents

let relevantContents =
    cleanContents
    |> Seq.map(fun (a,b,c,d,e,f,g) -> a.Year,d,g.Length)
 

Step #3 was to run the regression using the dataset.  You will notice that I made the length of the report the Y (dependent) variable – not that I think I will find any causality, but it was good enough to use.  Also, notice the Seq.map of each column of the larger seq<int*string*int> into an R vector.

let reportLength = engine.CreateIntegerVector(relevantContents |> Seq.map (fun (a,b,c) -> c))
engine.SetSymbol("reportLength", reportLength)
let year = engine.CreateIntegerVector(relevantContents |> Seq.map (fun (a,b,c) -> a))
engine.SetSymbol("year", year)
let state = engine.CreateCharacterVector(relevantContents |> Seq.map (fun (a,b,c) -> b))
engine.SetSymbol("state", state)

let calcExpression = "lm(formula = reportLength ~ year + state)"
let testResult = engine.Evaluate(calcExpression).AsList()

 

Sure enough, you can get the results of the regression.  The challenge is teasing out the values that are interesting from the data structure that is returned (testResult in this example):

> testResult.Item(0).AsCharacter();;
val it : CharacterVector =
  seq
    ["31775.5599180962"; "-15.2760355122386"; "37.8028841898059";
     "-91.2309146099364"; …]

Item(0), I think, holds the coefficients, with the intercept first.
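For comparison, the same model fit directly in R makes it easier to see what Item(0) maps to.  A sketch, assuming reportLength, year, and state vectors like the ones pushed into the engine above:

# same model, fit natively in R
model <- lm(reportLength ~ year + state)

# named coefficients instead of positional items; "(Intercept)" comes first
coef(model)["(Intercept)"]
summary(model)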