Wake County Voter Analysis Using FSharp, AzureML, and R

One of the real strengths of FSharp its ability to plow through and transform data in a very intuitive way,  I was recently looking at Wake Country Voter Data found here to do some basic voter analysis.  My first thought was to download the data into R Studio.  Easy?  Not really.  The data is available as a ginormous Excel spreadsheet of database of about 154 MB in size.  I wanted to slim the dataset down and make it a .csv for easy import into R but using Excel to export the data as a .csv kept screwing up the formatting and importing it directly into R Studio from Excel resulting in out of memory crashes.  Also, the results of the different election dates were not consistent –> sometimes null, sometimes not.   I managed to get the data into R Studio without a crash and wrote a function of either voted “1” or not “0” for each election

1 #V = voted in-person on Election Day 2 #A = voted absentee by mail or early voting (through May 2006) 3 #M = voted absentee by mail (November 2006 - present) 4 5 #O = voted One-Stop early voting (November 2006 - present) 6 #T = voted at a transfer precinct on Election Day 7 #P = voted a provisional ballot 8 #L = Legacy data (prior to 2006) 9 #D = Did not show 10 11 votedIndicated <- function(votedCode) { 12 switch(votedCode, 13 "V" = 1, 14 "A" = 1, 15 "M" = 1, 16 "O" = 1, 17 "T" = 1, 18 "P" = 1, 19 "L" = 1, 20 "D" = 0) 21 } 22

However, every time I tried to run it, the IDE would crash with an out of memory issue. 

 Stepping back, I decided to transform the data in Visual Studio using FSharp. I created a sample from the ginormous excel spreadsheet and then imported the data using a type provider.  No memory crashes!

1 #r "../packages/ExcelProvider.0.1.2/lib/net40/ExcelProvider.dll" 2 open FSharp.ExcelProvider 3 4 [<Literal>] 5 let samplePath = "../../Data/vrdb-Sample.xlsx" 6 7 open System.IO 8 let baseDirectory = __SOURCE_DIRECTORY__ 9 let baseDirectory' = Directory.GetParent(baseDirectory) 10 let baseDirectory'' = Directory.GetParent(baseDirectory'.FullName) 11 let inputFilePath = @"Data\vrdb.xlsx" 12 let fullInputPath = Path.Combine(baseDirectory''.FullName, inputFilePath) 13 14 type WakeCountyVoterContext = ExcelFile<samplePath> 15 let context = new WakeCountyVoterContext(fullInputPath) 16 let row = context.Data |> Seq.head

I then applied a similar function for voted or not and then exported the data as a .csv

1 let voted (voteCode:obj) = 2 match voteCode = null with 3 | true -> "0" 4 | false -> "1" 5 6 open System 7 let header = "Id,Race,Party,Gender,Age,20080506,20080624,20081104,20091006,20091103,20100504,20100622,20101102,20111011,20111108,20120508,20120717,20121106,20130312,20131008,20131105,20140506,20140715,20141104" 8 9 let createOutputRow (row:WakeCountyVoterContext.Row) = 10 String.Format("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23}", 11 row.voter_reg_num, 12 row.race_lbl, 13 row.party_lbl, 14 row.gender_lbl, 15 row.eoy_age, 16 voted(row.``05/06/2008``), 17 voted(row.``06/24/2008``), 18 voted(row.``11/04/2008``), 19 voted(row.``10/06/2009``), 20 voted(row.``11/03/2009``), 21 voted(row.``05/04/2010``), 22 voted(row.``06/22/2010``), 23 voted(row.``11/02/2010``), 24 voted(row.``10/11/2011``), 25 voted(row.``11/08/2011``), 26 voted(row.``05/08/2012``), 27 voted(row.``07/17/2012``), 28 voted(row.``11/06/2012``), 29 voted(row.``03/12/2013``), 30 voted(row.``10/08/2013``), 31 voted(row.``11/05/2013``), 32 voted(row.``05/06/2014``), 33 voted(row.``07/15/2014``), 34 voted(row.``11/04/2014``) 35 ) 36 37 let outputFilePath = @"Data\vrdb.csv" 38 39 let data = context.Data |> Seq.map(fun row -> createOutputRow(row)) 40 let fullOutputPath = Path.Combine(baseDirectory''.FullName, outputFilePath) 41 42 let file = new StreamWriter(fullOutputPath,true) 43 44 file.WriteLine(header) 45 context.Data |> Seq.map(fun row -> createOutputRow(row)) 46 |> Seq.iter(fun r -> file.WriteLine(r)) 47

The really great thing is that I could write and then dispose of each line so I could do it without any crashes.  Once the data was into a a .csv (10% the size of Excel), I could then import it into R Studio without a problem.  It is a common lesson but really shows that using the right tool for the job saves tons of headaches.

I knew from a previous analysis of voter data that the #1 determinate of a person from wake county voting in a off-cycle election was their age:




So then in R, I created a decision tree for just age to see what the split was:

1 library(rpart) 2 temp <- rpart(all.voters$X20131008 ~ all.voters$Age) 3 plot(temp) 4 text(temp)

Thanks to Placidia for answering my question on stats.stackoverflow


So basically politicians should be targeting people 50 years or older or perhaps emphasizing issues that appeal to the over 50 crowd.





Kaggle and R

Following up on last week’s post on doing a Kaggle competition, I then decided to see if I could explore the data more in R on my local desktop.  The competition is about analyzing a large group of house claims to give them a risk score.

I started the R studio to take a look at the initial data:

1 train <- read.csv("../Data/train.csv") 2 head(train) 3 summary(train) 4 5 plot(train$Hazard)


A couple of things popped out.  All of the X variables look to be categorical.  Even the result “Hazard” is an integer with most of the values falling between 1 and 9.

With that in mind, I decided to split the dataset into two sections: the majority and the minority.

1 train.low <- subset(train, Hazard < 9) 2 train.high <- subset(train, Hazard >= 9) 3 4 plot(train.low$Hazard) 5 plot(train.high$Hazard)

With the under as:


And the over 9 is like this


But I want to look at the Hazard score from a distribution point of view:

1 hazard.frame <- as.data.frame(table(train$Hazard)) 2 colnames(hazard.frame) <- c("hazard","freq") 3 hist(hazard.frame$freq) 4 plot(x=hazard.frame$hazard, y=hazard.frame$freq) 5 plot(x=hazard.frame$hazard, log(y=hazard.frame$freq)) 6

The hist shows the left skew




and the log plot really shows the distribution


So there is clearly a diminishing return going on.   As of this writing, the leader is at 40%, which is about 20,400 of the 51,000 entries.   So if you could identify all of the ones correctly, you should get 37% of the way there.  To test it out, I submitted to Kaggle only ones:


LOL, so they must take away for incorrect answers as it is same as “all 0” benchmark.  So going back, I know that if I can predict the ones correctly and make a reasonable guess at the rest, I might be OK.   I went back and tuned my model some to get me out of the bottom 25% and then let it be.  I assume that there is something obvious/industry standard that I am missing because there are so many people between my position and the top 25%.

Kaggle and AzureML

If you are not familiar with Kaggle, it is probably the de-facto standard for data science competitions.  The competitions can be hosted by a private company with cash prizes or it can be a general competition with bragging rights on the line.  The Titanic Kaggle competition is one of the more popular “hello world” data science projects that is a must-try for aspiring data scientists.

Recently, Kaggle hosted a competition sponsored by Liberty Mutual to help predict the insurance risk of houses.  I decided to see how well AzureML could stack up against the best data scientists that Kaggle could offer.

My first step was to get the mechanics down (I am a big believer in getting dev ops done first).  I imported the train and test datasets from Kaggle into AzureML.  I visualized the data and was struck that all of the vectors were categorical, even the Y variable (“Hazard”) –> it is an int with a range between 1 and 70.


I created a quick categorical model and ran it.  Note I did a 60/40 train/test split of the data


Once I had a trained model, I hit the “Set Up Web Service” button.


I then went into that “web service” and changed the input from a web service input to the test dataset that Kaggle provided.  I then outputted the data to azure blob storage.  I also added a transform to only export the data that Kaggle wants to evaluate the results: ID and Hazard:


Once the data was in blob storage, I could download it to my desktop and then upload it to Kaggle to get an evaluation and a ranking.


With the mechanics out of the way, I decided to try a series of out of box models to see what gives the best result.  Since the result was categorical, I stuck to the classification models and this is what I found:



The OOB Two Class Bayes Point Machine is good for 1,278 place, out of about 1,200 competitors.

Stepping back, the hazard is definitely left-skewed so perhaps I need two models.  If I can predict if the hazard is between low and high group, I should be able to be right with most of the predictions and then let the fewer outlier predictions use a different model.  To test that hypotheses, I went back to AzureML and added a filter module for Hazard < 9


The problem is that the AUC dropped 3%.  So it looks like the outliers are not really skewing the analysis.  The next thought is that perhaps AzureML can help me identify the x variables that have the greatest predictive power.  I dragged in a Filter Based Feature Selection module and ran that with the model

image image

The results are kinda interesting.  There is a significant drop-off after these top 9 columns


So I recreated the model with only these top 9 X variables


And the AUC moved to .60, so I am not doing better.

I then thought of treating the Hazard score not as a factor but as a continuous variable.   I rejiggered the experiment to use a boosted decision tree regression


So then sending that over to Kaggle, I moved up.  I then rounded the decimal but that did this:


So Kaggle rounds to an int anyway.  Interestingly, I am at 32% and the leader is at 39%. 

I then used all of the OOB models for regression in AzureML and got the following results:


Submitting the Poisson Regression, I got this:


I then realized that I could mike my model <slightly> more accurate by not including the 60/40 split when doing the predictive effort.  Rather, I would put all 100% of the training data to the model:


Which moved me up another 10 spots…


So that is a good way to stop with the out of the box modeling in AzureML. 

There are a couple of notes here

1) Kaggle knows how to run a competition.  I love how easy it is to set up a team, submit an entry, and get immediate feedback.

2) AzureML OOB is a good place to start and explore different ideas.  However, it is obvious that stacked against more traditional teams, it does not do well

3) Speaking of which.  You are allowed to submit 5 entries a day and the competition lasts 90 days or so.  With 450 entries, I am imagine a scenario where a person can spend their time gaming their submissions.  There are 51,000 entries so and the leading entry (as of this writing) is around 39% so there are 20,000 correct answers.  That is about 200 correct answers a day or 40 each submission.

The Counted Part 2: Analysis Using R

Following up on this post last week on analyzing The Counted using F# and R, I decided to look a bit closer at the data.  In last week’s post, I had a data frame of all of the people killed by law enforcement collected by The Guardian for the 1st half of 2015.  Although interesting to looks at, I am not sure that the map tells us anything.  The data frame looks like this:


My first thought is to look at killing by population by US State.  Step #1 was to sum up the number of rows by state code:

1 the.counted.state <- data.frame(table(the.counted$state)) 2 colnames(the.counted.state ) <- c("StateCode","NumberKilled") 3 summary(the.counted.state) 4


I then brought in the latest population figures by state from the US Census:

1 state.population <- read.csv("http://www.census.gov/popest/data/state/asrh/2014/files/SCPRC-EST2014-18+POP-RES.csv") 2 state.population 3


And finally I bought in a cross walk table of US State Codes (what The.Counted data is in and the US State Names, which is what US Census data is in)

1 state.crosswalk <- read.csv("http://www.fonz.net/blog/wp-content/uploads/2008/04/states.csv") 2 state.crosswalk 3



I then merged all three data frames together using the state name and state code as the common key:

1 state.population.2 <- state.population[c(5,6)] 2 state.population.3 <- merge(x=state.population.2, 3 y=state.crosswalk, 4 by.x="NAME", 5 by.y="State") 6 #The Counted With Population 7 the.counted.state <- merge(x=the.counted.state, 8 y=state.population.3, 9 by.x="StateCode", 10 by.y="Abbreviation") 11

I then tried to add in a column that took the total number of killed individuals by the number of people in the state.

1 the.counted.state.2 <- the.counted.state 2 the.counted.state.2$KilledRatio <- the.counted.state.2$NumberKilled/the.counted.state.2$POPESTIMATE2014 3


The problem became quickly obvious: there were not enough people in the numerator to make a meaningful straight division.  To compensate for this issue, I divided the number of people in each state by 10,000.  I also increased the kill ratio by a factor of 10 so that we have a scale between of 0 and 1 of .1 which is easily digestible.  Finally, I renamed the variable “Name” to “StateName” because my OCD couldn’t let such an affront to the naming gods go unpunished.

1 the.counted.state.3 <- the.counted.state 2 the.counted.state.3$AdjustedPopulation <- the.counted.state.2$POPESTIMATE2014/10000 3 the.counted.state.3$KilledRatio <- the.counted.state.3$NumberKilled/the.counted.state.3$AdjustedPopulation 4 the.counted.state.3$AdjKilledRatio <- the.counted.state.3$KilledRatio * 10 5 6 names(the.counted.state.3)[names(the.counted.state.3)=="NAME"] <- "StateName" 7 the.counted.state.3$StateName <- tolower(the.counted.state.3$StateName)



With the data prepped, I created a choropleth to show the kill ratio by state using a gradient scale:

1 choropleth <- merge(x=all.states, 2 y=the.counted.state.3, 3 sort = FALSE, 4 by.x = "region", 5 by.y = "StateName", 6 all.x=TRUE) 7 choropleth <- choropleth[order(choropleth$order), ] 8 summary(choropleth) 9 10 qplot(long, lat, data = choropleth, group = group, fill = AdjKilledRatio, 11 geom = "polygon")


Note that I had to use the all.x=TRUE to account for the fact that South Dakota and Vermont did not have any killings so far in 2015.  This is equiv to a left outer join for you Sql folks.  On a side note, what’s up with Oklahoma? 

I then decided to bin the data into high,medium, and low categories.  Looking at the detail of the adjustedKillRatio, there seems to be some natural breaks around 10% and 20%:

1 the.counted.state.4$AdjKilledRatio 2 summary(the.counted.state.4$AdjKilledRatio) 3


So I binned like that:

1 the.counted.state.4$KilledBin <- cut(the.counted.state.4$AdjKilledRatio, 2 breaks=seq(0,1,.1)) 3 summary(the.counted.state.4$KilledBin) 4


The problem with my code is that this gives me 10 bins and I only really need 3.  Fortunately, this stack overflow post helped me re-write the bin into 3 factors.  Note the Inf on the high side and the labels.

1 the.counted.state.4$KilledBin <- cut(the.counted.state.4$AdjKilledRatio, 2 breaks=c(seq(0,.2,.1),Inf), 3 labels=c("low","med","high")) 4

And this gives me a pretty good distribution of bins:


With things binned up, I added another chiropleth and map:

1 choropleth.2 <- merge(x=all.states, 2 y=the.counted.state.4, 3 sort = FALSE, 4 by.x = "region", 5 by.y = "StateName", 6 all.x=TRUE) 7 choropleth.2 <- choropleth.2[order(choropleth.2$order), ] 8 summary(choropleth.2) 9 10 qplot(long, 11 lat, 12 data = choropleth.2, 13 group = group, 14 fill = KilledBin, 15 geom = "polygon")


If your squint, it almost looks like a map of the civil war, no?

The Counted: Initial Analysis Using FSharp and R

(Note: this is post one of three.  Next week is a deeper dive into the data and the following week is an analysis of law enforcement officers killed in the line of duty)

Andrew Oliver hit me up on Twitter with a new dataset that he stumbled across.  The dataset is called “The Counted” and it is an attempt to count all of the deaths at the hand of police in America in 2015.  Apparently, this data is not collected systematically by the US government, which is kind of puzzling.  You can read about and download the data here.  A sample looks like:


John asked what we could do with the dataset –> esp when comparing to other variables like socio-economic status.  Step #1 in my mind was to geo-locate the data.  Since this is a .csv, the first-first thing was to remove extra commas and replace them with semi-colons or blank spaces (for example, US Marshals Service, Pennsylvania State Police, Allegheny County Sheriff’s Office became US Marshals Service; Pennsylvania State Police; Allegheny County Sheriff’s Office and Corrections Department, 1400 E 4th Ave became Corrections Department 1400 E 4th Ave)

Adding Geolocations

Drawing on my code that I wrote using Texas A&M’s Geoservice found here, I converted the json type provider script into a function that takes address info and returns a geolocation:

1 let getGeoCoordinates(streetAddress:string, city:string, state:string) = 2 let apiKey = "xxxxx" 3 let stringBuilder = new StringBuilder() 4 stringBuilder.Append("https://geoservices.tamu.edu/Services/Geocode/WebService/GeocoderWebServiceHttpNonParsed_V04_01.aspx") |> ignore 5 stringBuilder.Append("?streetAddress=") |> ignore 6 stringBuilder.Append(streetAddress) |> ignore 7 stringBuilder.Append("&city=") |> ignore 8 stringBuilder.Append(city) |> ignore 9 stringBuilder.Append("&state=") |> ignore 10 stringBuilder.Append(state) |> ignore 11 stringBuilder.Append("&apiKey=") |> ignore 12 stringBuilder.Append(apiKey) |> ignore 13 stringBuilder.Append("&version=4.01") |> ignore 14 stringBuilder.Append("&format=json") |> ignore 15 16 let searchUri = stringBuilder.ToString() 17 let searchResult = GeoLocationServiceContext.Load(searchUri) 18 19 let firstResult = searchResult.OutputGeocodes |> Seq.head 20 firstResult.OutputGeocode.Latitude, firstResult.OutputGeocode.Longitude, firstResult.OutputGeocode.MatchScore

I then loaded in the dataset via the .csv type provider:

1 [<Literal>] 2 let theCountedSample = "..\Data\TheCounted.csv" 3 type TheCountedContext = CsvProvider<theCountedSample> 4 let theCountedData = TheCountedContext.Load(theCountedSample) 5

I then mapped the geofunction to the imported dataset:

1 let theCountedGeoLocated = theCountedData.Rows 2 |> Seq.map(fun r -> r, getGeoCoordinates(r.Streetaddress, r.City, r.State)) 3 |> Seq.toList 4 |> Seq.map(fun (r,(lat,lon,ms)) -> String.Format("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15}", 5 r.Name,r.Age,r.Gender,r.Raceethnicity,r.Month,r.Day,r.Year, r.Streetaddress, r.City,r.State,r.Cause,r.Lawenforcementagency,r.Armed,lat,lon,ms)) 6

And then finally exported the data

1 let baseDirectory = __SOURCE_DIRECTORY__ 2 let baseDirectory' = Directory.GetParent(baseDirectory) 3 let filePath = "Data\TheCountedWithGeo.csv" 4 let fullPath = Path.Combine(baseDirectory'.FullName, filePath) 5 File.WriteAllLines(fullPath,theCountedGeoLocated)


The gist is here.  Using the csv and json type providers made the analysis a snap –> a majority code is just building up the string for the service call.  +1 for simplicity.

Analyzing The Results

After adding geolocations to the dataset, I opened R studio and imported the dataset.

1 theCounted <- read.csv("./Data/TheCountedWithGeo.csv") 2 summary(theCounted) 3



So this is good news that we have good confidence on all of the observations so we don’t have to drop any records (making the counted, un-counted, as it were).

I then googled how to create a US map and put some data points on them and ran across this post.  I copied and pasted the code, changed the variable names, said “there is no way it is this easy” out loud, and hit CTRL+ENTER.

1 library(ggplot2) 2 library(maps) 3 4 all.states <- map_data("state") 5 plot <- ggplot() 6 plot <- plot + geom_polygon(data=all.states, aes(x=long, y=lat, group = group), 7 colour="grey", fill="white" ) 8 plot <- plot + geom_point(data=theCounted, aes(x=lon, y=lat), 9 colour="#FF0040") 10 plot <- plot + guides(size=guide_legend(title="Homicides")) 11 plot



The gist is here.

Global Azure Bootcamp: Car Lab Analysis

As part of the Global Azure Bootcamp, the organizers created a hand-on lab where individuals could install a racing game and compete against other drivers.  The cool thing was the amount of telemetry that the game pushed to Azure (I assume using Event Hubs to Azure Tables).  The lab also had a basic “hello world” web app that could read data from the Azure Table REST endpoints so newcomers could see how easy it was to create and then deploy a website on Azure.

I decided to take a bit of a jaunt though the data endpoint to see what analytics I could run on it using Azure ML.  I went to the initial endpoint here and sure enough, the data comes down in the browser.  Unfortunately, when I set it up in Azure ML using a data reader:


I got 0 records returned.  I think this has something to do with how the datareader deals with XML.  I quickly used F# in Visual Studio with the XML type provider:

1 #r "../packages/FSharp.Data.2.2.0/lib/net40/FSharp.Data.dll" 2 3 open FSharp.Data 4 5 [<Literal>] 6 let uri = "https://reddoggabtest-secondary.table.core.windows.net/TestTelemetryData0?tn=TestTelemetryData0&sv=2014-02-14&si=GabLab&sig=GGc%2BHEa9wJYDoOGNE3BhaAeduVOA4MH8Pgss5kWEIW4%3D" 7 8 type CarTelemetry = XmlProvider<uri> 9 let carTelemetry = CarTelemetry.Load(uri) 10 11

I reached out to the creator of the lab and he put a summary file on Azure Blob Storage that was very easy to consume with AzureML, you can find it herehere.  I created Regression to predict the amount of damage a car will sustain based on the country and car type:


This was great, but I wanted to working on my R chops some so I decided to play around with the data in R Studio.  I imported the data into R Studio and then fired up the scripting window.  The first question I wanted to answer was “how does each country stack up against each other in terms of car crashes?”

I did some basic data exploration like so:

1 summary(PlayerLapTimes) 2 3 aggregate(Damage ~ Country, PlayerLapTimes, sum) 4 aggregate(Damage ~ Country, PlayerLapTimes, FUN=length) 5


And then getting down to the business of answering the question:

1 2 dfSum <- aggregate(Damage ~ Country, PlayerLapTimes, sum) 3 dfCount <- aggregate(Damage ~ Country, PlayerLapTimes, FUN=length) 4 5 dfDamage <- merge(x=dfSum, y=dfCount, by.x="Country", by.y="Country") 6 names(dfDamage)[2] <- "Sum" 7 names(dfDamage)[3] <- "Count" 8 dfDamage$Avg <- dfDamage$Sum/dfDamage$Count 9 dfDamage2 <- dfDamage[order(dfDamage$Avg),] 10


So that is kinda interesting that France has the most damage per race.  I have to ask Mathias Brandewinder about that.

In any event, I then wanted to ask “what county finished first”.  I decided to apply some R charting to the same biolerplate that I created earlier

1 dfSum <- aggregate(LapTimeMs ~ Country, PlayerLapTimes, sum) 2 dfCount <- aggregate(LapTimeMs ~ Country, PlayerLapTimes, FUN=length) 3 dfSpeed <- merge(x=dfSum, y=dfCount, by.x="Country", by.y="Country") 4 names(dfSpeed)[2] <- "Sum" 5 names(dfSpeed)[3] <- "Count" 6 dfSpeed$Avg <- dfSpeed$Sum/dfSpeed$Count 7 dfSpeed2 <- dfSpeed[order(dfSpeed$Avg),] 8 plot(PlayerLapTimes$Country,PlayerLapTimes$Damage) 9




So even though France appears to have the slowest drivers, the average is skewed by 2 pretty bad races –> perhaps the person never finished.

In any event, this was a fun exercise and I hope to continue with the data to show the awesomeness of Azure, F#, and R…




Refactoring McCaffrey’s Regression to F#

James McCaffrey’s most recent MSDN article is about multi-class regression article is a great starting place for folks interested in the ins and outs of creating a regression.  You can find the article here.  He wrote the code in C# in a very much imperative style so the FSharp in me immediately wanted to rewrite it in F#.

Interestingly, Mathias Brandewinder also had the same idea and did a better (and more complete) job than me.  You can see his post here.

I decided to duck into McCaffrey’s code and see where I could rewrite part of the code.  My first step was to move his C# code to a more manageable format.


I changed the project from a console app to a .dll and then split the two classes into their own file.  I then added some unit tests so that I can verify that my reworking was correct:

1 [TestClass] 2 public class CSLogisticMultiTests 3 { 4 LogisticMulti _lc = null; 5 double[][] _trainData; 6 double[][] _testData; 7 8 public CSLogisticMultiTests() 9 { 10 int numFeatures = 4; 11 int numClasses = 3; 12 int numRows = 1000; 13 int seed = 42; 14 var data = LogisticMultiProgram.MakeDummyData(numFeatures, numClasses, numRows, seed); 15 LogisticMultiProgram.SplitTrainTest(data, 0.80, 7, out _trainData, out _testData); 16 _lc = new LogisticMulti(numFeatures, numClasses); 17 18 int maxEpochs = 100; 19 double learnRate = 0.01; 20 double decay = 0.10; 21 _lc.Train(_trainData, maxEpochs, learnRate, decay); 22 } 23 24 [TestMethod] 25 public void GetWeights_ReturnExpected() 26 { 27 double[][] bestWts = _lc.GetWeights(); 28 var expected = 13.939104508387803; 29 var actual = bestWts[0][0]; 30 Assert.AreEqual(expected, actual); 31 } 32 33 [TestMethod] 34 public void GetBiases_ReturnExpected() 35 { 36 double[] bestBiases = _lc.GetBiases(); 37 var expected = 11.795019237894717; 38 var actual = bestBiases[0]; 39 Assert.AreEqual(expected, actual); 40 } 41 42 [TestMethod] 43 public void GetTrainAccuracy_ReturnExpected() 44 { 45 var expected = 0.92125; 46 var actual = _lc.Accuracy(_trainData); 47 Assert.AreEqual(expected, actual); 48 } 49 50 [TestMethod] 51 public void GetTestAccuracy_ReturnExpected() 52 { 53 var expected = 0.895; 54 double actual = _lc.Accuracy(_testData); 55 Assert.AreEqual(expected, actual); 56 } 57 } 58

You will notice that this is the exact code that McCaffrey uses in his output for the Console app.  In any event, they were running all green


I then went into the F# Project and fired up the REPL.  I decided to start with the MakeDummyData method because it seemed beefy enough to demonstrate the language differences between the languages, it is fairly self-contained, and its data is already testable.  Here is the first 9 lines of code.

1 Random rnd = new Random(seed); 2 double[][] wts = new double[numFeatures][]; 3 for (int i = 0; i < numFeatures; ++i) 4 wts[i] = new double[numClasses]; 5 double hi = 10.0; 6 double lo = -10.0; 7 for (int i = 0; i < numFeatures; ++i) 8 for (int j = 0; j < numClasses; ++j) 9 wts[i][j] = (hi - lo) * rnd.NextDouble() + lo;

And here is the F# equivalent

1 let rnd = new Random(seed) 2 let hi = 10.0 3 let lo = -10.0 4 let wts = Array.create numFeatures (Array.create numClasses 1.) 5 let wts' = wts |> Array.map(fun row -> row |> Array.map(fun col -> (hi - lo) * rnd.NextDouble() + lo)) 6

There is one obvious difference and 1 subtle difference.  The obvious difference is that the F# code does not do any looping to create and populate the array of arrays data structure, rather it uses  the high-order Array.Map function.   This reduces the idiomatic line count from 9 to 5  – a 50% decrease (and a funny move from the 1980s).  (Note that I use the words “idiomatic line count” because you can reduce both examples to a single line of code but that makes in unworkable by humans.  Both examples show the typical way you would write code in the language.)  So with the fewer lines of code, which is more readable?  That is a subjective opinion.  A C#/Java/Javascript/Curly-Brace dev would say the C#.  Everyone else in the world would say F#.

The less obvious difference is that F# emphasizes immutability so that there are two variables (wts and wts’) and the C# has 1 variable that is mutated.  The implication is lost in such a small example, but if the numFeatures was large, you would want to take advantage of mutli-core processors and the F# code is ready for parallelism.  The C# code would have to be reworked to use an immutable collection.

The next lines create and populate the biases variable.  The C# Code:

1 double[] biases = new double[numClasses]; 2 for (int i = 0; i < numClasses; ++i) 3 biases[i] = (hi - lo) * rnd.NextDouble() + lo; 4

And the F# Code 

1 let biases = Array.create numClasses 1. 2 let biases' = biases |> Array.map(fun row -> (hi - lo) * rnd.NextDouble() + lo) 3

Same deal as before.  No loops or mutation.  Fewer lines of code and better readability.

The last set of code is a ball of string so it is very hard to separate out.


1 double[][] result = new double[numRows][]; // allocate result 2 for (int i = 0; i < numRows; ++i) 3 result[i] = new double[numFeatures + numClasses]; 4 5 for (int i = 0; i < numRows; ++i) // create one row at a time 6 { 7 double[] x = new double[numFeatures]; // generate random x-values 8 for (int j = 0; j < numFeatures; ++j) 9 x[j] = (hi - lo) * rnd.NextDouble() + lo; 10 11 double[] y = new double[numClasses]; // computed outputs storage 12 for (int j = 0; j < numClasses; ++j) // compute z-values 13 { 14 for (int f = 0; f < numFeatures; ++f) 15 y[j] += x[f] * wts[f][j]; 16 y[j] += biases[j]; 17 } 18 19 // determine loc. of max (no need for 1 / 1 + e^-z) 20 int maxIndex = 0; 21 double maxVal = y[0]; 22 for (int c = 0; c < numClasses; ++c) 23 { 24 if (y[c] > maxVal) 25 { 26 maxVal = y[c]; 27 maxIndex = c; 28 } 29 } 30 31 for (int c = 0; c < numClasses; ++c) // convert y to 0s or 1s 32 if (c == maxIndex) 33 y[c] = 1.0; 34 else 35 y[c] = 0.0; 36 37 int col = 0; // copy x and y into result 38 for (int f = 0; f < numFeatures; ++f) 39 result[i][col++] = x[f]; 40 for (int c = 0; c < numClasses; ++c) 41 result[i][col++] = y[c]; 42 } 43

Note the use of code comments, which is typically considered a code smell, even in demonstration code.

Here is the F# Code:

1 let x = Array.create numFeatures 1. 2 let x' = x |> Array.map(fun row -> (hi - lo) * rnd.NextDouble() + lo) 3 4 let xWts = Array.zip x' wts' 5 let xWts' = xWts |> Array.map(fun (x,wts) -> wts |> Array.sumBy(fun wt -> wt * x)) 6 7 let y = Array.create numClasses 1. 8 let yWts = Array.zip y xWts' 9 let y' = yWts |> Array.map(fun (y,xwt) -> y + xwt) 10 11 let yBias = Array.zip y' biases' 12 let y'' = yBias |> Array.map(fun (y,bias) -> y + bias) 13 14 let maxVal = y'' |> Array.max 15 16 let y''' = y'' |> Array.map(fun y -> if y = maxVal then 1. else 0.) 17 18 let xy = Array.append x' y''' 19 let result = Array.create numRows xy

This is pretty much the same as before,no loops, immutability, and a 50% reduction of code.  Also, notice that by using a more functional style breaks apart the ball of string.  Individual values are one their own line to be individual evaluated and manipulated.  Also, the if..then statement goes to a single line. 

So I had a lot of fun working through these examples.  The major differences were

  • Amount of Code and Code Readability
  • Immutability and ready for parallelism
    I am not planning to refactor the rest of the project, but you can too as the project is found here.  I am curious if using an array of arrays is the best way to represent the matric –> I guess it is standard for the curly-brace community?  I would think using Deedle would be better, but I don’t know enough about it (yet).


WCPSS Scores and Property Tax Valuations Using R

With all of the data gathered and organized I was ready to do some analytics using R.  The first thing I did was to load the four major datasets into R.


  • NCScores is the original dataset that has the school score.  I had already done did an analysis on it here.
  • SchoolValuation is the aggrgrate property values for each school as determined by scraping the Wake County Tax Website and Wake County School Assignment websites.  You can read how it was created here and here.
  • SchoolNameMatch is a crosswalk table between the school name as found in the NCScores dataframe and the School Valuation dataframe.  You can read how it was created here
  • WakeCountySchoolInfo is an export from WCPSS that was tossed around at open data day.

Step one was to reduce the North Carolina Scores data to only Wake County

1 #Create Wake County Scores From NC State Scores 2 WakeCountyScores <- NCScores[NCScores$District == 'Wake County Schools',] 3

The next step was to add in the SchoolNameMatch so that we have the Tax Valuation School Name

1 #Join SchoolNameMatch to Wake County Scores 2 WakeCountyScores <- merge(x=WakeCountyScores, y=SchoolNameMatch, by.x="School", by.y="WCPSS") 3

Interestingly, R is smart enough that the common field not duplicated, just the additional field(s) are added


The next step was to add in the Wake County Property Values, remove the Property field as it is no longer needed, and convert the TaxBase field from string to numeric

1 #Join Property Values 2 WakeCountyScores <- merge(x=WakeCountyScores, y=SchoolValuation, by.x="Property", by.y="SchooName") 3 4 #Remove Property column 5 WakeCountyScores$Property = NULL 6 7 #Turn tax base to numeric 8 WakeCountyScores$TaxBase <- as.numeric(WakeCountyScores$TaxBase) 9

Eager to do an analysis, I pumped the data into a correlation

1 #Do a Correlation 2 cor(WakeCountyScores$TaxBase,WakeCountyScores$SchoolScore,use="complete") 3


So clearly my expectations that property values track with FreeAndReducedLunch (.85 correlation) were not met.  I decided to use Practical Data Science with R Chapter 3 (Exploring Data)  as a guide to better understand the dataset.

1 #Practical Data Science With R, Chapter3 2 summary(WakeCountyScores) 3 summary(WakeCountyScores$TaxBase) 4



So there is quite a range in tax base!  The next task was to use some graphs to explore the data.  I added in ggplot2


and followed the books example for a histogram.  I started with score and it comes out as expected.  I then tried a historgram on TaxBase and had to tinker with the binwidth to make a meaningful chart:

1 #Historgrams 2 ggplot(WakeCountyScores) + geom_histogram(aes(x=SchoolScore),binwidth=5,fill="gray") 3 ggplot(WakeCountyScores) + geom_histogram(aes(x=TaxBase),binwidth=10000,fill="gray") 4 #Ooops 5 ggplot(WakeCountyScores) + geom_histogram(aes(x=TaxBase),binwidth=5000000,fill="gray") 6




The book then moves to an example studying income, which is directly analogous to the TaxBase so I followed it very closely.  The next graph were some density graphs.  Note the second one is a logarithmic one:

1 #Density 2 library(scales) 3 ggplot(WakeCountyScores) + geom_density(aes(x=TaxBase)) + scale_x_continuous(labels=dollar) 4 ggplot(WakeCountyScores) + geom_density(aes(x=TaxBase)) + scale_x_log10(labels=dollar) + annotation_logticks(sides="bt") 5





So kinda interesting that most schools cluster in terms of their tax base, but because there is such a wide range with a majority clustered to the low end, the logarithmic curve is much more revealing.

The book then moved into showing the relationship between two variables.  In this case, SchoolScore as the Y variable and TaxBase as the X variable:

1 #Relationship between TaxBase and Scores 2 ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_point() 3 ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_point() + stat_smooth(method="lm") 4 ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_point() + geom_smooth() 5




So what is interesting is that there does not seem to be a strong relationship between scores and tax base.  There looks like an equal number of schools both below the score curve than above it.  Note that using a smoothing curve is much better than the linear fit curve in showing the relationship of scores to tax base.  You can see the dip in the lower quartile and the increase at the tail.  It makes sense that the higher tax base shows an increase in scores, but what’s up with that dip?

Finally, the same data is shown using a hax chart

1 library(hexbin) 2 ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_hex(binwidth=c(100000000,5)) + geom_smooth(color="white",se=F) 3


So taking a step back, it is clear that there is a weakness in this analysis.  Some schools have thousands of students, some schools have a couple hundred.  (high schools versus elementary students). Using the absolute dollars from the tax valuation is misleading.  What we really need is revenue per student.  Going back to the SchoolInfo dataframe, I added it in and pulled a student count column.

1 WakeCountyScores <- merge(x=WakeCountyScores, y=WakeCountySchoolInfo, by.x="School", by.y="School.Name") 2 names(WakeCountyScores)[names(WakeCountyScores)=="School.Membership.2013.14..ADM..Mo2."] <- "StudentCount" 3 WakeCountyScores$StudentCount <- as.numeric(WakeCountyScores$StudentCount) 4 5 WakeCountyScores["TaxBasePerStudent"] <- WakeCountyScores$TaxBase/WakeCountyScores$StudentCount 6 summary(WakeCountyScores$TaxBasePerStudent) 7

Interestingly, the number of records in the base frame dropped from 166 to 152, which means that perhaps we need a second mapping table.  In any event, you can see that the average tax base is $6.5 million with a max of $114 Million.  Quite a range!


Going back to the point and hex graphs

1 ggplot(WakeCountyScores, aes(x=TaxBasePerStudent, y=SchoolScore)) + geom_point() + geom_smooth() 2 ggplot(WakeCountyScores, aes(x=TaxBasePerStudent, y=SchoolScore)) + geom_hex(binwidth=c(25000000,5)) + geom_smooth(color="white",se=F) 3




There is some interesting going on.  First, the initial conclusion that a higher tax base leads to a gradual increase in scores is wrong once you move from total tax to tax per student.

Also, note the significant drop in school scores once you move away from the lowest tax base schools, the recovery, and then the drop again.  From a real estate perspective, these charts suggest that the marginal value of a really expensive or really inexpensive house in Wake County is not worth it (at least in terms of where you send you kids), and there is a sweet spot of value above a certain price point. 

You can find the gist here and the repo is here.

Some lessons I learned in doing this exercise:

  • Some records got dropped between the scores dataframe and the info dataframe -> so there needs to be another mapping table
  • Make the tax base in millions
  • What’s up with that school with 114 million per student?
  • An interesting question is location of dollars to school compared to tax base.  I wonder if that is on WCPSS somewhere.  Hummm…
  • You can’t use the tick(‘) notation, which means you do a lot of overwriting of dataframes.  This can be a costly and expensive feature of the language.  It is much better to assume immutably, even if you clutter up your data window.

As a final note, I was using the Console window b/c that is what the intro books do.  This is a huge mistake in R Studio.  It is much better to create a script and send the results to the console


So that you can make changes and run things again.  It is a cheap way of avoid head-scratching bugs…

Predictive Analytics With Microsoft Azure Machine Learning

(On vacation this week)

Over the Christmas holiday, I had some time to look at some of the books that have been sitting on my bookshelf.  One of these was Predictive Analytics With Microsoft Machine Learning by Barga, Fontama, and Tok. 


This book is a great introduction to both analytics and Azure ML.  I really appreciated how the authors started off with a couple of basic experiments to get your feet wet, then moved over to some theory about different ML techniques, and then finished out the rest of the book with some hand-on labs.

I worked through all of the labs (except 1) in about 6 hours.  The labs follow a very nice step-by-step pattern with plenty of screen shots.  My only quibble with the book is that the most interesting lab was Building a Chun Model that relied on data from a third party.  When I went to the 3rd party’s website to download the data, the data had broken links and 404s.  I went to the book’s site at APress and its did not have the data either.  That was kinda frustrating and something that the authors should have considered.

In any event, if you have some time, working through Predictive Analytics With Microsoft Azure Machine Learning is well worth the time and is quite fun.

Introduction to (part of) IBM Watson

Recently, I joined the IBM Watson beta program (you can join too here) to see what it had to offer.  It looks like IBM is using the “Watson” word to cover a broad array of analytical and machine learning capabilities.  One area that Watson is used is to do statistical analysis without knowing any programming and/or statistics.  For example, I went into their portal and uploaded a new dataset that I just got from the Town Of Cary regarding traffic stops:

image image


I then hit the “New Exploration” button just to see what would happen and voila, I have graphs!





So this is kind interesting, they seem to use both modeling sweeping and parameter sweeping and then use natural language questions to explore the dataset.  This is quite impressive as it allows someone to who nothing about statistics to ask questions and get answers.  I am not sure if there is a way to drill down into the models to tweet the questions nor does there look to be a way to consume the results.  Instead, it looks like a management dashboard.  So it is a bit like when you view the results of a dataset, they have taken it to the n degree.

I then went back and hit the “Create a Prediction” button


I picked a random y variable (“disposition) with the default values and voila, graphs:


Interestingly, it does some sweeping and it picked up that the PrimaryKey is correlated with date – which would make sense since the date is part of the PK value 🙂


In any event, I think this is a cool entry into the machine learning space from IBM.  They really have done a good job in making data science accessible.  Now, if they could put their weight into “Open Data” so there are lots of really cool datasets to analyze available, they would really position themselves well in an emerging market.  I can’t wait to dig in even more with  Watson…