Wake County Restaurant Inspection Data with Azure ML and F#

With Azure ML now available, I was thinking about some of the analysis I did last year and how I could do even more with the same data sets.  One that came to mind was the restaurant inspection data that I analyzed last year.  You can see the prior analysis here.

I uploaded the restaurant data into Azure and thought of a simple question: can we predict inspection scores based on some easily available data?  This is an interesting dataset because there are some categorical data elements (zip code, restaurant type, etc.) and some continuous ones (priority foundation, etc.).

Here is the base dataset:


I created a new experiment, added a boosted decision tree regression and a neural network regression, and used a 70/30 train/test split.


After running the models and inspecting the model evaluation, I don’t have a very good model:


I then decided to go back, pull some of the X variables out of the dataset, and concentrate on only a couple of variables.  I added a Project Columns module and then selected Restaurant Type and Zip Code as the X variables, leaving Inspection Score as the Y variable.


With this done, I added a couple more models (a Bayesian Linear Regression and a Decision Forest Regression) and gave it a whirl:



Interestingly, adding these models did not give us a better prediction, and dropping down to two variables made for a less accurate model.  Without doing any more analysis, I picked the model with the lowest MAE (Boosted Decision Tree Regression) and published it as a web service:


With it published as a web service, I can now consume it from a client app.   I used the code from my voting analysis found here as a template and sure enough:



So restaurants in Cary, NC have a higher inspection score than the ones found in Northwest Raleigh.   However, before we start alerting the Cary Chamber of Commerce to create a marketing campaign (“Eat in Cary, we are safer”), note that the difference is within the MAE.
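As a quick refresher, MAE (mean absolute error) is just the average absolute gap between predicted and actual values, so two predictions closer together than the MAE are not meaningfully different.  A minimal F# sketch with made-up inspection scores (the numbers are illustrative, not from the model):

```fsharp
// Mean Absolute Error: the average of |predicted - actual| over the test set
let meanAbsoluteError (pairs: (float * float) seq) =
    pairs
    |> Seq.map (fun (predicted, actual) -> abs (predicted - actual))
    |> Seq.average

// hypothetical predicted/actual inspection scores
meanAbsoluteError [ (95.0, 97.0); (90.0, 86.0); (98.0, 99.0) ]  // (2 + 4 + 1) / 3
```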

In any event, it would be easy to create a phone app: if you don’t know a restaurant’s score, you can punch in the establishment type and the zip code and get a good idea of what the score will be.

This is an academic exercise because the establishments have to show you their card and Yelp has their scores, but a fun exercise nonetheless.  Happy eating.

Consuming an Azure ML Web API Endpoint from an Array

Last week, I blogged about creating an Azure ML experiment, publishing it as a web service, and then consuming it from F#.  I then wanted to consume the web service using an array – passing in several values and seeing the results.  I added the following code to my existing F# script:

let input1 = new Dictionary<string,string>()
input1.Add("Zip Code","27519")
input1.Add("Race","W")
input1.Add("Party","UNA")
input1.Add("Gender","M")
input1.Add("Age","45")
input1.Add("Voted Ind","1")

let input2 = new Dictionary<string,string>()
input2.Add("Zip Code","27519")
input2.Add("Race","W")
input2.Add("Party","D")
input2.Add("Gender","F")
input2.Add("Age","47")
input2.Add("Voted Ind","1")

let inputs = new List<Dictionary<string,string>>()
inputs.Add(input1)
inputs.Add(input2)

inputs
|> Seq.map(fun i -> invokeService(i))
|> Async.Parallel
|> Async.RunSynchronously

And sure enough, I can run the model using multiple inputs:


Consuming Azure ML With F#

(This post is a continuation of this one)

So with a model that works well enough, I selected only that model and saved it:




I created a new experiment and used that model with the base data.  I then marked the project columns as the input and the score as the output (green and blue circles, respectively):


After running it, I published it as a web service


And voila, an endpoint ready to go.  I then took the auto-generated script and opened a new Visual Studio F# project to use it.  The problem was that this is the data structure the model needs:

FeatureVector = new Dictionary<string, string>()
{
    { "Precinct", "0" }, { "VRN", "0" }, { "VRstatus", "0" }, { "VRlastname", "0" },
    { "VRfirstname", "0" }, { "VRmiddlename", "0" }, { "VRnamesufx", "0" }, { "VRstreetnum", "0" },
    { "VRstreethalfcode", "0" }, { "VRstreetdir", "0" }, { "VRstreetname", "0" }, { "VRstreettype", "0" },
    { "VRstreetsuff", "0" }, { "VRstreetunit", "0" }, { "VRrescity", "0" }, { "VRstate", "0" },
    { "Zip Code", "0" }, { "VRfullresstreet", "0" }, { "VRrescsz", "0" }, { "VRmail1", "0" },
    { "VRmail2", "0" }, { "VRmail3", "0" }, { "VRmail4", "0" }, { "VRmailcsz", "0" },
    { "Race", "0" }, { "Party", "0" }, { "Gender", "0" }, { "Age", "0" },
    { "VRregdate", "0" }, { "VRmuni", "0" }, { "VRmunidistrict", "0" }, { "VRcongressional", "0" },
    { "VRsuperiorct", "0" }, { "VRjudicialdistrict", "0" }, { "VRncsenate", "0" }, { "VRnchouse", "0" },
    { "VRcountycomm", "0" }, { "VRschooldistrict", "0" }, { "11/6/2012", "0" }, { "Voted Ind", "0" },
},
GlobalParameters = new Dictionary<string, string>() { }
};

And since I am only using 6 of the columns, it made sense to reload the Wake County voter data with just the needed columns, so I went back to the original CSV and did that.  Interestingly, I could not set the original dataset as the publish input, so I added a Project Columns module that does nothing:


With that in place, I republished the service and opened Visual Studio.  I decided to start with a script.  I was struggling through the async when Tomas P helped me on Stack Overflow here.  I’ll say it again: the F# community is tops.  In any event, here is the initial script:

#r @"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.5\System.Net.Http.dll"
#r @"..\packages\Microsoft.AspNet.WebApi.Client.5.2.2\lib\net45\System.Net.Http.Formatting.dll"

open System
open System.Net.Http
open System.Net.Http.Headers
open System.Net.Http.Formatting
open System.Collections.Generic

type scoreData = {FeatureVector:Dictionary<string,string>; GlobalParameters:Dictionary<string,string>}
type scoreRequest = {Id:string; Instance:scoreData}

let invokeService () = async {
    let apiKey = ""
    let uri = "https://ussouthcentral.services.azureml.net/workspaces/19a2e623b6a944a3a7f07c74b31c3b6d/services/f51945a42efa42a49f563a59561f5014/score"
    use client = new HttpClient()
    client.DefaultRequestHeaders.Authorization <- new AuthenticationHeaderValue("Bearer", apiKey)
    client.BaseAddress <- new Uri(uri)

    let input = new Dictionary<string,string>()
    input.Add("Zip Code","27519")
    input.Add("Race","W")
    input.Add("Party","UNA")
    input.Add("Gender","M")
    input.Add("Age","45")
    input.Add("Voted Ind","1")

    let instance = {FeatureVector=input; GlobalParameters=new Dictionary<string,string>()}
    let scoreRequest = {Id="score00001"; Instance=instance}

    let! response = client.PostAsJsonAsync("", scoreRequest) |> Async.AwaitTask
    let! result = response.Content.ReadAsStringAsync() |> Async.AwaitTask
    if response.IsSuccessStatusCode then
        printfn "%s" result
    else
        printfn "FAILED: %s" result
    response |> ignore }

invokeService() |> Async.RunSynchronously


Unfortunately, when I run it, it fails.  Below is the Fiddler trace:



So it looks like the JSON serializer is appending the “@” symbol to the field names.  I changed the records to class types and voila:


You can see the final script here.
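The “@” suffixes come from the serializer picking up the compiler-generated backing fields of the F# records.  Here is a minimal sketch of the class-based alternative with the same shape as the scoreData/scoreRequest records above (my reconstruction, not the exact code from the final script):

```fsharp
open System.Collections.Generic

// auto-properties serialize under their declared names, with no
// compiler-generated "@" backing fields leaking into the JSON payload
type ScoreData() =
    member val FeatureVector = Dictionary<string,string>() with get, set
    member val GlobalParameters = Dictionary<string,string>() with get, set

type ScoreRequest() =
    member val Id = "" with get, set
    member val Instance = ScoreData() with get, set
```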

So then I threw in some different numbers:

  • A millennial: ["27519","W","D","F","25","1","1","0.62500011920929"]
  • A senior citizen: ["27519","W","D","F","75","1","1","0.879632294178009"]

I wonder why social security never gets cut?

In any event, just to check the model:

  • A 15 year old: ["27519","W","D","F","15","1","0","0.00147285079583526"]

Azure ML and Wake County Election Data

I have been spending the last couple of weeks using Azure ML and I think it is one of the most exciting technologies for business developers and analysts since ODBC and F# type providers.   If you remember, when ODBC came out, every relational database in the world became accessible and therefore usable and analyzable.   When type providers came out, programming against, exploring, and analyzing data sources became much easier, and the reach expanded from RDBMSs to all formats (notably JSON).  So getting data was no longer a problem, but analyzing it still was.

Enter Azure ML. 

I downloaded the Wake County Voter History data from here.  I took the Excel spreadsheet and converted it to a .csv locally.  I then logged into Azure ML and imported the data


I then created an experiment and added the dataset to the canvas



And looked at the basic statistics of the data set


(Note that I find the F# REPL a better way to explore the data, as I can just dot into each element I am interested in and view the results.)

In any event, the first question I want to answer is

“Given a person’s Zip Code, Race, Party, Gender, and Age, can I predict if they will vote in November?”

To that end, I first narrowed down the columns using a Project Columns module and picked only the columns I care about.  I picked “11/6/2012” as the Y variable because that was the last national election, and that is what we are going to have in November.  I probably should have used 2010 because that is a national election without a President, but that can be analyzed at a later date.



I then ran my experiment so the data would be available in the Project Column step.



I then renamed the columns to make them a bit more readable by using a series of Metadata Editors.  (It does not look like you can do all of the renames in one step.  Equally annoying, you have to add each module, run it, then add the next.)


(one example)



I then added a Missing Values Scrubber for the voted column, so that instead of a null field, people who didn’t vote get an “N”:


The problem is that it doesn’t work: it looks like you can’t change the values per column.


I asked the question on the forum, but in the interest of time I decided to change the voted column from a categorical column to an indicator so that I could do binary analysis.  That also failed.  I went back to the original spreadsheet, added an indicator column, and also renamed the column headers so I am not cluttering up my canvas with those metadata transforms.  Finally, I realized I want only active voters, but there does not seem to be a filtering ability (remove rows only works for missing values), so I removed those from the original dataset as well.  I think the ability to scrub and munge data is an area for improvement, but since this is release 1, I understand.
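Since the filtering had to happen outside Azure ML anyway, the same scrub is trivial to do locally.  A sketch with a hypothetical row type (the real spreadsheet has many more columns, and I am assuming “A” marks an active voter):

```fsharp
// hypothetical voter row; keep only active voters before uploading the CSV
type VoterRow = { Status: string; ZipCode: string }

let activeOnly (rows: VoterRow list) =
    rows |> List.filter (fun r -> r.Status = "A")
```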

After re-importing the data, I changed my experiment like so


I then split the dataset into training, validation, and testing sets using a 60/20/20 split


So the left output of the second split is 60% of the original dataset, and the right output of the second split is 20% of the original dataset (that is, 75%/25% of the 80% from the first split).
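The arithmetic of the two chained splits can be sketched with a plain list (this is just the math, not the Azure ML Split module):

```fsharp
// two chained splits: 80/20 first, then 75/25 of the 80% -> 60/20/20 overall
let splitFraction fraction (rows: 'a list) =
    let n = int (float (List.length rows) * fraction)
    List.splitAt n rows

let rows = [ 1 .. 100 ]
let trainAndValidation, test = splitFraction 0.8 rows          // 80 rows / 20 rows
let train, validation = splitFraction 0.75 trainAndValidation  // 60 rows / 20 rows
```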

I then added an SVM with a Train module and a Score module.  Note that I am training with 60% of the original dataset and validating with 20%:



After it runs, there are two new columns in the dataset, scored labels and scored probabilities, so each row now has a score:



With the model in place, I can then evaluate it using an Evaluate Model module:


And we can see an AUC of .666, which immediately made me think of this


In any event, I added a Logistic Regression and a Boosted Decision Tree to the canvas and hooked them up to the training and validation sets


And this is what we have



SVM: .666 AUC

Regression: .689 AUC

Boosted Decision Tree: .713 AUC
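As a reminder of what those numbers mean, AUC is the probability that a randomly chosen positive row (a voter) gets a higher score than a randomly chosen negative row (a non-voter).  A toy F# sketch of that rank-statistic definition (not the Azure ML implementation):

```fsharp
// AUC as a rank statistic: the fraction of (positive, negative) pairs
// where the positive row got the higher score (ties count as half)
let auc (scored: (float * bool) list) =
    let positives = scored |> List.filter snd |> List.map fst
    let negatives = scored |> List.filter (snd >> not) |> List.map fst
    let pairScores =
        [ for p in positives do
            for n in negatives do
                yield if p > n then 1.0 elif p = n then 0.5 else 0.0 ]
    List.sum pairScores / float (List.length pairScores)

// perfectly separated toy scores give an AUC of 1.0
auc [ (0.9, true); (0.8, true); (0.3, false); (0.2, false) ]
```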

So with the Boosted Decision Tree ahead, I added a Sweep Parameters module to see if I could tune it further, using AUC as the performance metric:



So the best AUC I am going to get is .7134 with the highlighted parameters.  I then added one more model that uses those parameters against the entire training dataset (80% of the total) and evaluates it against the remaining 20%.
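Conceptually, the sweep just evaluates each parameter combination and keeps the best one.  A toy sketch with a hypothetical grid and a fake scoring function (the real module trains and evaluates a model for each combination):

```fsharp
// try every parameter combination, keep the one with the highest metric (here, AUC)
let sweep (evaluate: 'p -> float) (grid: 'p list) =
    grid
    |> List.map (fun p -> p, evaluate p)
    |> List.maxBy snd

// hypothetical grid of (numberOfTrees, learningRate) and a fake AUC function
let grid = [ (20, 0.1); (100, 0.1); (100, 0.2); (500, 0.2) ]
let fakeAuc (trees, rate) = 0.6 + 0.0001 * float trees + 0.1 * rate
sweep fakeAuc grid  // picks (500, 0.2)
```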


With the final answer of


With that in hand, I can create a new experiment that will be the basis of a real-time voting app.

SQL Saturday and MVP Monday

Thanks to everyone who came to my session on F# Type Providers.  The code is found here.

Also, my article on the Eject-A-Bed was selected for MVP Mondays.  You can see a link here.


Fun with Statistics and Charts

I am preparing my Raleigh Code Camp submission “Nerd Dinner With Brains” this weekend.  If you are not familiar, Nerd Dinner is the canonical example of an MVC application and is very familiar to web devs who want to learn MVC the Microsoft way.  You can see the walkthrough here.   For everything that Nerd Dinner is, it is not … smart.  There are no business rules outside of some basic input validation, which is pretty representative of many “Boring Line Of Business Applications” (BLOBAs, according to Scott Wlaschin).  Not coincidentally, the lack of business logic is the biggest reason many BLOBAs don’t have many unit tests: if all you are doing is wireframing a database, what business logic needs to be tested?

The talk is going to take the Nerd Dinner wireframe and inject some analytics into the application.  To that end, I first considered the person who is attending the dinner.  All we know about them is their name and possibly their location.  So what can a name tell you?  Turns out, plenty.

As I showed in this post, there is a great source from the US Census of the number of names given, broken out by gender, year of birth, and state of birth.  Picking up where that post left off, I loaded the entire dataset into memory.

My first question was, “Given a name, can I tell what gender the person is?”  This is very straightforward to calculate:

// note: the source CSV has no header row, so the type provider named the
// columns after the first data row's values ("Mary", "F", "14", ...)
let genderSearch name =
    let nameFilter =
        usaData
        |> Seq.filter(fun r -> r.Mary = name)
        |> Seq.groupBy(fun r -> r.F)
        |> Seq.map(fun (n,a) -> n, a |> Seq.sumBy(fun r -> r.``14``))
    let nameSum = nameFilter |> Seq.sumBy(fun (n,c) -> c)
    nameFilter
    |> Seq.map(fun (n,c) -> n, c, float c / float nameSum)
    |> Seq.toArray

genderSearch "James"

And the REPL shows me that it is very likely that “James” is a male:


I can then set up a confidence point in the web.config file where a name is considered male or female; I am thinking 75%.  Once we have that, the app can respond differently.  Perhaps we have a product-placement advertisement that becomes male-focused if we are reasonably certain that the user is a male.  Perhaps we can be more subtle and change the theme of the site, or the page navigation, to induce the person to do additional things on the site.
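A sketch of that decision, assuming genderSearch’s (gender, count, proportion) tuples and a hypothetical threshold read from web.config:

```fsharp
// return the most likely gender, but only when its proportion clears the threshold
let decideGender threshold results =
    results
    |> Seq.sortByDescending (fun (_, _, proportion) -> proportion)
    |> Seq.tryHead
    |> Option.bind (fun (gender, _, proportion) ->
        if proportion >= threshold then Some gender else None)

// hypothetical output shaped like genderSearch "James"
decideGender 0.75 [ ("M", 900, 0.9); ("F", 100, 0.1) ]    // Some "M"
decideGender 0.75 [ ("M", 550, 0.55); ("F", 450, 0.45) ]  // None
```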

In any event, I then wanted to tackle age.  I spun up some code to isolate a person’s age:

let ageSearch name =
    let nameFilter =
        usaData
        |> Seq.filter(fun r -> r.Mary = name)
        |> Seq.groupBy(fun r -> r.``1910``)
        |> Seq.map(fun (n,a) -> n, a |> Seq.sumBy(fun r -> r.``14``))
        |> Seq.toArray
    let nameSum = nameFilter |> Seq.sumBy(fun (n,c) -> c)
    nameFilter
    |> Seq.map(fun (n,c) -> n, c, float c / float nameSum)
    |> Seq.toArray

I had no idea if names have a certain age connotation, so I decided to do some basic charting.  Isaac Abraham pointed me to FSharp.Charting, which is a great way to do some basic charting for discovery.

let chartData =
    ageSearch "James"
    |> Seq.map(fun (y,c,p) -> y, c)
    |> Seq.sortBy(fun (y,c) -> y)

Chart.Line(chartData).ShowChart()

And sure enough, the name “James” has a real ebb and flow for its popularity.


So if the user has the name “James”, you can make a reasonable assumption that they are male and probably born before 1975.  Cue up the Van Halen!

And yes, because I had to:

let chartData =
    ageSearch "Britney"
    |> Seq.map(fun (y,c,p) -> y, c)
    |> Seq.sortBy(fun (y,c) -> y)


Kinda does match her career, no?

Anyway, back to the task at hand.  In terms of analytics, I want to be a bit more precise than eyeballing a chart.  I started with the following code:

ageSearch "James"
|> Seq.map(fun (y,c,p) -> float c)
|> Seq.average

ageSearch "James"
|> Seq.map(fun (y,c,p) -> float c)
|> Seq.min

ageSearch "James"
|> Seq.map(fun (y,c,p) -> float c)
|> Seq.max


With these basic statistics out of the way, I then wanted to look at when the name was no longer popular.  I decided to use one standard deviation away from the average to determine an outlier.  First, the standard deviation:

let variance (source: float seq) =
    let mean = Seq.average source
    let deltas = Seq.map (fun x -> pown (x - mean) 2) source
    Seq.average deltas

let standardDeviation (values: float seq) =
    sqrt (variance values)

ageSearch "James"
|> Seq.map(fun (y,c,p) -> float c)
|> standardDeviation

let standardDeviation' =
    ageSearch "James"
    |> Seq.map(fun (y,c,p) -> float c)
    |> standardDeviation

let average =
    ageSearch "James"
    |> Seq.map(fun (y,c,p) -> float c)
    |> Seq.average

let attachmentPoint = average + standardDeviation'


And then I can get the last year that the name was more than one standard deviation above the average (greater than 71,180 names given):

let popularYears =
    ageSearch "James"
    |> Seq.map(fun (y,c,p) -> y, float c)
    |> Seq.filter(fun (y,c) -> c > attachmentPoint)
    |> Seq.sortBy(fun (y,c) -> y)
    |> Seq.last


So “James” is very likely a male and likely born before 1964.  Cue up the Pink Floyd!

The last piece was the state of birth: can I guess the state of birth for a user?  I first looked at the states on a plot:

let chartData' =
    stateSearch "James"
    |> Seq.map(fun (s,c,p) -> s, c)

Chart.Column(chartData').ShowChart()


Nothing really stands out to me: the states with the most births have the most names.  I could do an academic exercise of seeing which states favor certain names, but that does not help me with Nerd Dinner in guessing the state of birth when given a name.

I pressed on to look at the top 10 states:

let topTenStates =
    stateSearch "James"
    |> Seq.sortBy(fun (s,c,p) -> -c-1)
    |> Seq.take 10

let topTenTotal =
    topTenStates
    |> Seq.sumBy(fun (s,c,p) -> c)

let total =
    stateSearch "James"
    |> Seq.sumBy(fun (s,c,p) -> c)

float topTenTotal / float total


So 50% of all the people named “James” were born in 10 states.  Again, I am not sure there is any actionable information here.  For example, if a majority of them were born in MI, I might have something (cue up the Bob Seger).

Interestingly, there are certain names where the state of birth does matter.  For example, consider “Jose”:


Unsurprisingly, the two states are CA and TX.  Just using James and Jose as an example:

  • James is a male born before 1964
  • Jose is a male born before 2008 in either TX or CA

As an academic exercise, we could construct a random forest to find the names with the greatest state affinity.  However, that won’t help us on Nerd Dinner so I am leaving that out for another day.

This analysis does not account for a host of factors (people not born in the USA, nicknames, etc.), but it is still better than the nothing that Nerd Dinner currently has.  The analysis is not particularly sophisticated, but I often find that even the most basic statistics can be very powerful if used correctly.  That will be the next part of the talk…