Wake County Voter Analysis Using FSharp, AzureML, and R

One of the real strengths of FSharp is its ability to plow through and transform data in a very intuitive way. I was recently looking at the Wake County voter data found here to do some basic voter analysis. My first thought was to download the data into R Studio. Easy? Not really. The data is available as a ginormous Excel spreadsheet of about 154 MB. I wanted to slim the dataset down and make it a .csv for easy import into R, but using Excel to export the data as a .csv kept screwing up the formatting, and importing it directly into R Studio from Excel resulted in out-of-memory crashes. Also, the results of the different election dates were not consistent –> sometimes null, sometimes not. I managed to get the data into R Studio without a crash and wrote a function that returns either voted ("1") or not ("0") for each election:

#V = voted in-person on Election Day
#A = voted absentee by mail or early voting (through May 2006)
#M = voted absentee by mail (November 2006 - present)
#O = voted One-Stop early voting (November 2006 - present)
#T = voted at a transfer precinct on Election Day
#P = voted a provisional ballot
#L = Legacy data (prior to 2006)
#D = Did not show

votedIndicated <- function(votedCode) {
  switch(votedCode,
    "V" = 1,
    "A" = 1,
    "M" = 1,
    "O" = 1,
    "T" = 1,
    "P" = 1,
    "L" = 1,
    "D" = 0)
}

However, every time I tried to run it, the IDE would crash with an out of memory issue. 

Stepping back, I decided to transform the data in Visual Studio using FSharp. I created a sample from the ginormous Excel spreadsheet and then imported the data using a type provider. No memory crashes!

#r "../packages/ExcelProvider.0.1.2/lib/net40/ExcelProvider.dll"
open FSharp.ExcelProvider

[<Literal>]
let samplePath = "../../Data/vrdb-Sample.xlsx"

open System.IO
let baseDirectory = __SOURCE_DIRECTORY__
let baseDirectory' = Directory.GetParent(baseDirectory)
let baseDirectory'' = Directory.GetParent(baseDirectory'.FullName)
let inputFilePath = @"Data\vrdb.xlsx"
let fullInputPath = Path.Combine(baseDirectory''.FullName, inputFilePath)

type WakeCountyVoterContext = ExcelFile<samplePath>
let context = new WakeCountyVoterContext(fullInputPath)
let row = context.Data |> Seq.head

I then applied a similar voted/not-voted function and exported the data as a .csv:

let voted (voteCode:obj) =
    match voteCode = null with
    | true -> "0"
    | false -> "1"

open System
let header = "Id,Race,Party,Gender,Age,20080506,20080624,20081104,20091006,20091103,20100504,20100622,20101102,20111011,20111108,20120508,20120717,20121106,20130312,20131008,20131105,20140506,20140715,20141104"

let createOutputRow (row:WakeCountyVoterContext.Row) =
    String.Format("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23}",
        row.voter_reg_num,
        row.race_lbl,
        row.party_lbl,
        row.gender_lbl,
        row.eoy_age,
        voted(row.``05/06/2008``),
        voted(row.``06/24/2008``),
        voted(row.``11/04/2008``),
        voted(row.``10/06/2009``),
        voted(row.``11/03/2009``),
        voted(row.``05/04/2010``),
        voted(row.``06/22/2010``),
        voted(row.``11/02/2010``),
        voted(row.``10/11/2011``),
        voted(row.``11/08/2011``),
        voted(row.``05/08/2012``),
        voted(row.``07/17/2012``),
        voted(row.``11/06/2012``),
        voted(row.``03/12/2013``),
        voted(row.``10/08/2013``),
        voted(row.``11/05/2013``),
        voted(row.``05/06/2014``),
        voted(row.``07/15/2014``),
        voted(row.``11/04/2014``))

let outputFilePath = @"Data\vrdb.csv"
let fullOutputPath = Path.Combine(baseDirectory''.FullName, outputFilePath)

let file = new StreamWriter(fullOutputPath, true)

file.WriteLine(header)
context.Data
|> Seq.map (fun row -> createOutputRow row)
|> Seq.iter (fun r -> file.WriteLine r)
file.Close()   // flush and release the file

The really great thing is that each line is written and then released, so the whole export ran without any crashes. Once the data was in a .csv (10% the size of the Excel file), I could import it into R Studio without a problem. It is a common lesson, but it really shows that using the right tool for the job saves tons of headaches.
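
A quick note on why this stays memory-flat: Seq pipelines in F# are lazy, so each row is pulled, transformed, and written before the next one is read. A minimal sketch of the behavior (the data here is made up):

// Minimal sketch of the lazy Seq pipeline: only one element is in flight
// at a time, so memory stays flat regardless of the row count.
let rows = seq { for i in 1 .. 10000000 -> sprintf "row %d" i }

rows
|> Seq.map (fun r -> r.ToUpper())   // transform one row at a time
|> Seq.iter (fun r -> ignore r)     // consume without materializing a list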

I knew from a previous analysis of voter data that the #1 determinant of a person from Wake County voting in an off-cycle election was their age:

image

image

image

So then in R, I created a decision tree for just age to see what the split was:

library(rpart)
temp <- rpart(all.voters$X20131008 ~ all.voters$Age)
plot(temp)
text(temp)

Thanks to Placidia for answering my question on stats.stackexchange.

image

So basically politicians should be targeting people 50 years or older or perhaps emphasizing issues that appeal to the over 50 crowd.

Kaggle and AzureML

If you are not familiar with Kaggle, it is probably the de facto standard for data science competitions. The competitions can be hosted by a private company with cash prizes or it can be a general competition with bragging rights on the line. The Titanic Kaggle competition is one of the more popular "hello world" data science projects that is a must-try for aspiring data scientists.

Recently, Kaggle hosted a competition sponsored by Liberty Mutual to help predict the insurance risk of houses.  I decided to see how well AzureML could stack up against the best data scientists that Kaggle could offer.

My first step was to get the mechanics down (I am a big believer in getting dev ops done first).  I imported the train and test datasets from Kaggle into AzureML.  I visualized the data and was struck that all of the vectors were categorical, even the Y variable (“Hazard”) –> it is an int with a range between 1 and 70.

image

I created a quick categorical model and ran it. Note that I did a 60/40 train/test split of the data.

image

Once I had a trained model, I hit the “Set Up Web Service” button.

image

I then went into that "web service" and changed the input from a web service input to the test dataset that Kaggle provided. I then outputted the data to Azure Blob Storage. I also added a transform to export only the columns that Kaggle evaluates: Id and Hazard:

image

Once the data was in blob storage, I could download it to my desktop and then upload it to Kaggle to get an evaluation and a ranking.

image
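
As an aside, the submission file itself is just a two-column .csv. A hedged sketch of writing one locally (the writeSubmission helper is mine, not Kaggle's; only the Id,Hazard header comes from the competition):

open System.IO

// Hypothetical helper that writes predictions in Kaggle's Id,Hazard format.
let writeSubmission (path: string) (predictions: seq<int * float>) =
    use writer = new StreamWriter(path)
    writer.WriteLine("Id,Hazard")
    predictions
    |> Seq.iter (fun (id, hazard) -> writer.WriteLine(sprintf "%d,%f" id hazard))

// usage: writeSubmission "submission.csv" [ (1, 2.5); (2, 1.0) ]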

With the mechanics out of the way, I decided to try a series of out of box models to see what gives the best result.  Since the result was categorical, I stuck to the classification models and this is what I found:

image

image

The OOB Two Class Bayes Point Machine is good for 1,278th place, out of about 1,200 competitors.

Stepping back, the Hazard score is heavily skewed, so perhaps I need two models. If I can first classify a hazard into a low or a high group, I should be right on most of the predictions and can let a different model handle the rarer outliers. To test that hypothesis, I went back to AzureML and added a filter module for Hazard < 9.

image
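
To make the two-model idea concrete, here is a minimal F# sketch (the record shape and sample rows are illustrative; only the Hazard < 9 threshold comes from the experiment):

// Sketch of the two-model routing idea: partition the training data at a
// hazard threshold and fit a separate model to each side.
type TrainRow = { Id: int; Hazard: int }

let threshold = 9
let rows = [| { Id = 1; Hazard = 2 }; { Id = 2; Hazard = 40 }; { Id = 3; Hazard = 5 } |]

let lowHazard, highHazard =
    rows |> Array.partition (fun r -> r.Hazard < threshold)
// train one model on lowHazard and another on highHazard,
// then route each new row to the appropriate model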

The problem is that the AUC dropped 3%, so it looks like the outliers are not really skewing the analysis. My next thought was that perhaps AzureML could help me identify the X variables with the greatest predictive power. I dragged in a Filter Based Feature Selection module and ran that with the model.

image image

The results are kinda interesting. There is a significant drop-off after these top 9 columns:

image

So I recreated the model with only these top 9 X variables:

image

And the AUC moved to .60, so I am not doing better.

I then thought of treating the Hazard score not as a factor but as a continuous variable. I rejiggered the experiment to use a boosted decision tree regression:

image

So, sending that over to Kaggle, I moved up. I then rounded the decimals, but that did this:

image

So Kaggle rounds to an int anyway.  Interestingly, I am at 32% and the leader is at 39%. 

I then used all of the OOB models for regression in AzureML and got the following results:

image

Submitting the Poisson Regression, I got this:

image

I then realized that I could make my model slightly more accurate by not using the 60/40 split for the final predictive run. Rather, I would feed all 100% of the training data to the model:

image

Which moved me up another 10 spots…

image

That seemed like a good place to stop with the out-of-the-box modeling in AzureML.

There are a couple of notes here:

1) Kaggle knows how to run a competition.  I love how easy it is to set up a team, submit an entry, and get immediate feedback.

2) AzureML OOB is a good place to start and explore different ideas. However, it is obvious that, stacked against more traditional teams, it does not do well.

3) Speaking of which: you are allowed to submit 5 entries a day and the competition lasts 90 days or so, so with 450 total submissions I can imagine a scenario where a person spends their time gaming their submissions. The test set has 51,000 rows, and the leading entry (as of this writing) is around 39%, so there are roughly 20,000 correct answers to find. That is about 200 correct answers a day, or 40 each submission.
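
A quick back-of-the-envelope version of that arithmetic:

// Back-of-the-envelope math for the gaming scenario (all numbers approximate).
let totalSubmissions = 5 * 90                   // 450 allowed submissions
let correctAnswers = float 51000 * 0.39         // ~19,900 at the leader's score
let correctPerDay = correctAnswers / 90.0       // ~220 a day
let correctPerSubmission = correctPerDay / 5.0  // ~44 per submission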

Global Azure Bootcamp Racing Game: More Analytics Using R and AzureML

Alan Smith, the creator and keeper of the Global Azure Bootcamp Racing Game, was kind enough to put the telemetry data from the races out on Azure Blob Storage. The data was already available as XML from Table Storage, but AzureML was choking on that format, so Alan converted it to .csv and put the files out here:

https://alanazuredemos.blob.core.windows.net/alan/TelemetryData0.csv
https://alanazuredemos.blob.core.windows.net/alan/TelemetryData1.csv
https://alanazuredemos.blob.core.windows.net/alan/TelemetryData2.csv
https://alanazuredemos.blob.core.windows.net/alan/PlayerLapTimes0.csv
https://alanazuredemos.blob.core.windows.net/alan/PlayerLapTimes1.csv
https://alanazuredemos.blob.core.windows.net/alan/PlayerLapTimes2.csv

Note that there are 3 races (race0, race1, and race2), each having 2 datasets. The TelemetryData is a reading for each car in the race every 10 ms or so, and the PlayerLapTimes is a summary of the demographics of the player as well as some final results.
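
If you want to pull the files down in code rather than through AzureML, the F# CSV type provider works well; a sketch, assuming FSharp.Data is installed (the package path is an assumption):

#r @"..\packages\FSharp.Data.2.2.5\lib\net40\FSharp.Data.dll" // path is an assumption
open FSharp.Data

// Use the race0 file itself as the schema sample, then load it in full.
type Telemetry = CsvProvider<"https://alanazuredemos.blob.core.windows.net/alan/TelemetryData0.csv">
let race0 = Telemetry.Load("https://alanazuredemos.blob.core.windows.net/alan/TelemetryData0.csv")
printfn "%d rows" (race0.Rows |> Seq.length)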

I decided to do some unsupervised learning using Chapter 8 of Practical Data Science With R as my guide. I pulled down all 972,780 observations from the race0 telemetry data in R Studio. It took a bit :-) I then ran the following script to do a cluster dendrogram. Alas, I killed the job after several minutes (actually, the job killed my machine with an out-of-memory exception):

summary(TelemetryData0)
pmatrix <- scale(TelemetryData0[,])
d <- dist(pmatrix, method="euclidean")
pfit <- hclust(d, method="ward")
plot(pfit)

I then tried to narrow my search down to damage and speed:

damage <- TelemetryData0$Damage
speed <- TelemetryData0$Speed

plot(damage, speed, main="Damage and Speed",
     xlab="Damage", ylab="Speed", pch=20)

abline(lm(speed~damage), col="red")     # regression line (y~x)
lines(lowess(damage,speed), col="blue") # lowess line (x,y)

(I added the red line manually)

image

So that is interesting. It looks like there is a slight downward slope: more damage at the lower speeds. So perhaps more speed does not automatically mean more damage to the car. Anyone who drives in San Francisco can attest to that 🙂

I then went back and took a sample of the telemetry data:

telemetry <- TelemetryData0[sample(1:nrow(TelemetryData0),10000),]
telemetry <- telemetry[,c("Damage","Speed")]
summary(telemetry)
pmatrix <- scale(telemetry[,])
d <- dist(pmatrix, method="euclidean")
pfit <- hclust(d, method="ward")
plot(pfit)

And I got this:

image

And the fact that it is not showing me anything made me think of this clip:

image

In any event, I decided to try a similar analysis using AzureML to see if AzureML can handle the 975K records better than my desktop.

I fired up AzureML and added a data reader to the original file and then added some cleaning:

image

The problem is that these steps would take 10-12 minutes to complete. I decided to give up and bring a copy of the data local via the "Save As Dataset" context menu, which sped things up significantly. I then added in a k-means module for speed and damage and ran the model:

image 
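
For readers who have not used k-means before, here is a minimal F# sketch of what the module computes over the (damage, speed) pairs; the data points and starting centroids are made up:

// Minimal k-means: assign each point to its nearest centroid, move each
// centroid to the mean of its points, and repeat.
let points = [| (0.0, 120.0); (5.0, 110.0); (10.0, 100.0); (80.0, 40.0); (75.0, 35.0) |]

let dist (x1, y1) (x2, y2) =
    sqrt ((x1 - x2) ** 2.0 + (y1 - y2) ** 2.0)

let mean (pts: (float * float)[]) =
    let n = float pts.Length
    (pts |> Array.sumBy fst) / n, (pts |> Array.sumBy snd) / n

let rec kmeans iterations (centroids: (float * float)[]) =
    if iterations = 0 then centroids
    else
        points
        |> Array.groupBy (fun p -> centroids |> Array.minBy (dist p))
        |> Array.map (snd >> mean)  // note: a centroid that captures no points is dropped
        |> kmeans (iterations - 1)

let finalCentroids = kmeans 10 [| points.[0]; points.[3] |]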

The first ten times or so I ran this, I got this:

image

After I added in the “Clean Missing Data” module before the normalization step,

image

I got some results. Note that removing the entire row is what R does by default when cleaning data on import, so I thought I would keep the two consistent. In any event, the results look like this:

image

So I am not sure what this shows, other than there is overlap of speed and damage and there seems to be a relationship.

So there are some other questions I want to answer, like:

1) After a player sustains some damage, do they have a generic response (like braking, turning right, etc…)?

2) Are there certain "lines" that winning players take going through individual curves?

3) Do you really have to avoid damage to win?

I plan to try and answer these questions and more in the coming weeks.

Predicting Physician Gender Using AzureML and F#

I am working with a couple of friends in a two-week hackathon where the main subject is health care provider quality. One of the datasets that we are using is the national registry of physician information found here. One of the team members loaded it into Azure SQL Server and it is a dog: it is about 1 GB of data and takes a couple of minutes to scan the entire dataset. I decided to take a small slice of the data (Connecticut physicians) and do some analysis on it.
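
For reference, the slice itself is a simple query; if you wanted to pull it in code, a plain ADO.NET sketch would look something like this (the table name, column name, and connection string are all assumptions):

open System.Data.SqlClient

// Hypothetical pull of the Connecticut physicians from Azure SQL Server.
let connectionString = "Server=<server>;Database=<db>;User Id=<user>;Password=<password>;"
let query = "SELECT * FROM PhysicianCompare WHERE State = 'CT'"  // table/column assumed

let fetchConnecticutPhysicians () =
    use connection = new SqlConnection(connectionString)
    use command = new SqlCommand(query, connection)
    connection.Open()
    use reader = command.ExecuteReader()
    while reader.Read() do
        printfn "%O" (reader.GetValue 0)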

My first step was to bring the data into AzureML via the Data Reader

image

Note that it took about 3 minutes to bring the data down.  I then saved this data as a local dataset to do my experiments:

image

I then fired up another experiment using the dataset as the base. I first dragged in a Project Column module to grab only the columns I was interested in:

image image

I then pulled in a Missing Values Scrubber module to drop any row with a missing value:

image image

I then brought in a Metadata Editor module to change all of the fields to categorical data types:

image image

With the data ready to go, I created a 70/30 train/test split of the data and added a Multiclass Decision Forest model with Gender as the dependent variable:

image image

I then added a Score Model module and fed in the 30%. I finally added an Evaluate Model module:

 image

And the results were interesting, if unsurprising:

image

Basically, if we know your age, your specialty, and your medical school, we can predict whether you are a man 85% of the time. Encouragingly, we can only do it 62% of the time for a woman. I then published the experiment and created a quick script to consume it:

#r @"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.5\System.Net.Http.dll"
#r @"..\packages\Microsoft.AspNet.WebApi.Client.5.2.2\lib\net45\System.Net.Http.Formatting.dll"

open System
open System.Net.Http
open System.Net.Http.Headers
open System.Net.Http.Formatting
open System.Collections.Generic

type scoreData = {FeatureVector:Dictionary<string,string>; GlobalParameters:Dictionary<string,string>}
type scoreRequest = {Id:string; Instance:scoreData}

let invokeService () = async {
    let apiKey = ""
    let uri = "https://ussouthcentral.services.azureml.net/workspaces/19a2e623b6a944a3a7f07c74b31c3b6d/services/6c4bbb43456e4d7e8a9196f2899f717d/score"
    use client = new HttpClient()
    client.DefaultRequestHeaders.Authorization <- new AuthenticationHeaderValue("Bearer", apiKey)
    client.BaseAddress <- new Uri(uri)

    let input = new Dictionary<string,string>()
    input.Add("Gender","U")
    input.Add("MedicalSchoolName","OTHER")
    input.Add("GraduationYear","1995")
    input.Add("PrimarySpecialty","INTERNAL MEDICINE")

    let instance = {FeatureVector=input; GlobalParameters=new Dictionary<string,string>()}
    let scoreRequest = {Id="score00001"; Instance=instance}

    let! response = client.PostAsJsonAsync("", scoreRequest) |> Async.AwaitTask
    let! result = response.Content.ReadAsStringAsync() |> Async.AwaitTask

    if response.IsSuccessStatusCode then
        printfn "%s" result
    else
        printfn "FAILED: %s" result
    }

invokeService() |> Async.RunSynchronously

And I have a way of predicting genders:

U,OTHER,1995,INTERNAL MEDICINE,0.651031798112075,0.348968201887925,0,F

Wake County Restaurant Inspection Data with Azure ML and F#

With Azure ML now available, I was thinking about some of the analysis I did last year and how I could do even more with the same data sets. One such analysis that came to mind was the restaurant inspection data; you can see the prior analysis here.

I uploaded the restaurant data into Azure and thought of a simple question –> can we predict inspection scores based on some easily available data?  This is an interesting dataset because there are some categorical data elements (zip code, restaurant type, etc…) and there are some continuous ones (priority foundation, etc…).

Here is the base dataset:

image

I created a new experiment, used a boosted regression model and a neural network regression, and did a 70/30 train/test split:

image

After running the models and inspecting the model evaluation, I don't have a very good model:

image

I then decided to go back and pull some of the X variables out of the dataset and concentrate on only a couple of variables.  I added a project column module and then selected Restaurant Type and Zip Code as the X variables and left the Inspection Score as the Y variable. 

image

With this done, I added a couple more models (a Bayesian Linear Regression and a Decision Forest Regression) and gave it a whirl:

image

image

Interestingly, adding these models did not give us any better of a prediction, and dropping the variables to two made a less accurate model. Without doing any more analysis, I picked the model with the lowest MAE (Boosted Decision Tree Regression) and published it as a web service:

image

With the web service published, I can now consume it from a client app. I used the code from the voting analysis found here as a template, and sure enough:

["27519","Restaurant","0","96.0897827148438"]

["27612","Restaurant","0","95.5728530883789"]

So restaurants in Cary, NC have a higher inspection score than the ones found in Northwest Raleigh. However, before we start alerting the Cary Chamber of Commerce to create a marketing campaign ("Eat in Cary, we are safer"), note that the difference is within the MAE.
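
For context, MAE (mean absolute error) is just the average absolute miss, so a gap between two predictions that is smaller than the MAE sits inside the model's typical error. A small sketch:

// MAE is the mean of |actual - predicted| over the test set.
let mae (actual: float[]) (predicted: float[]) =
    Array.map2 (fun a p -> abs (a - p)) actual predicted
    |> Array.average

// The two zip-code predictions above differ by only about half a point:
let zipGap = 96.0897827148438 - 95.5728530883789   // ~0.52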

In any event, it would be easy to create a phone app so that if you don't know a restaurant's score, you can punch in the establishment type and the zip code and get a good idea of what the score will be.

This is an academic exercise because the establishments have to show you their card and Yelp has the scores as well, but it is a fun exercise nonetheless. Happy eating.

Consuming Azure ML With F#

(This post is a continuation of this one)

So with a model that works well enough, I selected only that model and saved it:

image

image

I created a new experiment and used that model with the base data. I then marked the project columns as the input and the score as the output (the green and blue circles, respectively):

image

After running it, I published it as a web service:

image

And voila, an endpoint ready to go. I then took the auto-generated script and opened up a new Visual Studio F# project to use it. The problem was that this is the data structure that the model needs:

FeatureVector = new Dictionary<string, string>() {
    { "Precinct", "0" }, { "VRN", "0" }, { "VRstatus", "0" },
    { "VRlastname", "0" }, { "VRfirstname", "0" }, { "VRmiddlename", "0" },
    { "VRnamesufx", "0" }, { "VRstreetnum", "0" }, { "VRstreethalfcode", "0" },
    { "VRstreetdir", "0" }, { "VRstreetname", "0" }, { "VRstreettype", "0" },
    { "VRstreetsuff", "0" }, { "VRstreetunit", "0" }, { "VRrescity", "0" },
    { "VRstate", "0" }, { "Zip Code", "0" }, { "VRfullresstreet", "0" },
    { "VRrescsz", "0" }, { "VRmail1", "0" }, { "VRmail2", "0" },
    { "VRmail3", "0" }, { "VRmail4", "0" }, { "VRmailcsz", "0" },
    { "Race", "0" }, { "Party", "0" }, { "Gender", "0" },
    { "Age", "0" }, { "VRregdate", "0" }, { "VRmuni", "0" },
    { "VRmunidistrict", "0" }, { "VRcongressional", "0" }, { "VRsuperiorct", "0" },
    { "VRjudicialdistrict", "0" }, { "VRncsenate", "0" }, { "VRnchouse", "0" },
    { "VRcountycomm", "0" }, { "VRschooldistrict", "0" }, { "11/6/2012", "0" },
    { "Voted Ind", "0" },
},
GlobalParameters = new Dictionary<string, string>() { }
};

And since I am only using 6 of the columns, it made sense to reload the Wake County voter data with just the needed columns, so I went back to the original CSV and did that. Interestingly, I could not set the original dataset as the publish input, so I added a project column module that does nothing:

image

With that in place, I republished the service and opened Visual Studio. I decided to start with a script. I was struggling with the async when Tomas P helped me on Stack Overflow here. I'll say it again: the F# community is tops. In any event, here is the initial script:

#r @"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.5\System.Net.Http.dll"
#r @"..\packages\Microsoft.AspNet.WebApi.Client.5.2.2\lib\net45\System.Net.Http.Formatting.dll"

open System
open System.Net.Http
open System.Net.Http.Headers
open System.Net.Http.Formatting
open System.Collections.Generic

type scoreData = {FeatureVector:Dictionary<string,string>; GlobalParameters:Dictionary<string,string>}
type scoreRequest = {Id:string; Instance:scoreData}

let invokeService () = async {
    let apiKey = ""
    let uri = "https://ussouthcentral.services.azureml.net/workspaces/19a2e623b6a944a3a7f07c74b31c3b6d/services/f51945a42efa42a49f563a59561f5014/score"
    use client = new HttpClient()
    client.DefaultRequestHeaders.Authorization <- new AuthenticationHeaderValue("Bearer", apiKey)
    client.BaseAddress <- new Uri(uri)

    let input = new Dictionary<string,string>()
    input.Add("Zip Code","27519")
    input.Add("Race","W")
    input.Add("Party","UNA")
    input.Add("Gender","M")
    input.Add("Age","45")
    input.Add("Voted Ind","1")

    let instance = {FeatureVector=input; GlobalParameters=new Dictionary<string,string>()}
    let scoreRequest = {Id="score00001"; Instance=instance}

    let! response = client.PostAsJsonAsync("", scoreRequest) |> Async.AwaitTask
    let! result = response.Content.ReadAsStringAsync() |> Async.AwaitTask

    if response.IsSuccessStatusCode then
        printfn "%s" result
    else
        printfn "FAILED: %s" result
    }

invokeService() |> Async.RunSynchronously

Unfortunately, when I run it, it fails.  Below is the Fiddler trace:

image

So it looks like the JSON serializer is appending the "@" symbol to the record field names. I changed the records to types and voila:

image
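
For the curious, here is a sketch of what the record-to-type change might look like; the class shapes are my assumption based on the fix described above (F# records compile to backing fields with "@"-suffixed names, which some serializers pick up):

open System.Collections.Generic

// Hypothetical class versions of the two records: plain properties expose
// clean names to the serializer, avoiding the "@"-suffixed backing fields.
type ScoreData(featureVector: Dictionary<string,string>,
               globalParameters: Dictionary<string,string>) =
    member __.FeatureVector = featureVector
    member __.GlobalParameters = globalParameters

type ScoreRequest(id: string, instance: ScoreData) =
    member __.Id = id
    member __.Instance = instance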

You can see the final script here.

So then, throwing in some different numbers:

  • A millennial: ["27519","W","D","F","25","1","1","0.62500011920929"]
  • A senior citizen: ["27519","W","D","F","75","1","1","0.879632294178009"]

I wonder why Social Security never gets cut?

In any event, just to check the model:

  • A 15 year old: ["27519","W","D","F","15","1","0","0.00147285079583526"]

Azure ML and Wake County Election Data

I have been spending the last couple of weeks using Azure ML and I think it is one of the most exciting technologies for business developers and analysts since ODBC and FSharp type providers. If you remember, when ODBC came out, every relational database in the world became accessible and therefore usable/analyzable. When type providers came out, programming, exploring, and analyzing data sources became much easier, and the reach expanded from RDBMSs to all formats (notably JSON). So getting data was no longer a problem, but analyzing it still was.

Enter Azure ML. 

I downloaded the Wake County Voter History data from here. I took the Excel spreadsheet and converted it to a .csv locally. I then logged into Azure ML and imported the data:

image

I then created an experiment and added the dataset to the canvas:

image

And looked at the basic statistics of the data set:

image

(Note that I find the FSharp REPL a better way to explore the data, as I can just dot into each element I am interested in and view the results.)
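
For example, with the Excel type provider shown earlier loaded into FSI, every column is discoverable by dotting into a row (paths as in the earlier script):

#r "../packages/ExcelProvider.0.1.2/lib/net40/ExcelProvider.dll"
open FSharp.ExcelProvider

// Same type provider setup as earlier; in FSI each column is one dot away.
type WakeCountyVoterContext = ExcelFile<"../../Data/vrdb-Sample.xlsx">
let context = WakeCountyVoterContext("../../Data/vrdb.xlsx")
let firstRow = context.Data |> Seq.head
firstRow.race_lbl   // FSI prints the value immediately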

In any event, the first question I want to answer is

"given a person's ZipCode, Race, Party, Gender, and Age, can I predict if they will vote in November?"

To that end, I first narrowed down the columns using a Column Projection and picked only the columns I care about. I picked "11/6/2012" as the Y variable because that was the last national election and that is what we are going to have in November. I probably should have used 2010, because that was a national election without a President, but that can be analyzed at a later date.

image

image

I then ran my experiment so the data would be available in the Project Column step.

image

I then renamed the columns to make them a bit more readable by using a series of Metadata Editors. (It does not look like you can do all renames in one step. Equally annoying is that you have to add each module, run it, then add the next.)

image

(one example)

image

I then added a Missing Values Scrubber for the voted column. So instead of a null field, people who didn't vote get an "N":

image

The problem is that it doesn’t work –> looks like we can’t change the values per column.

image

I asked the question on the forum, but in the interest of time I decided to change the voted column from a categorical column to an indicator so that I could do binary analysis. That also failed. I went back to the original spreadsheet, added an indicator column, and also renamed the column headers so I am not cluttering up my canvas with those metadata transforms. Finally, I realized I want only active voters, but there does not seem to be a filtering ability (Remove Rows only works for missing values), so I removed the inactive voters from the original dataset as well. I think the ability to scrub and munge data is an area for improvement, but since this is release 1, I understand.

After re-importing the data, I changed my experiment like so:

image

I then split the dataset into Training/Validation/Testing using a 60/20/20 split:

image

So the left output of the second split is 60% of the original dataset and the right output is 20% of the original dataset (a 75/25 split of the 80% that came out of the first split).
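
The arithmetic of the chained splits:

// How two chained splits yield 60/20/20 of the original dataset.
let intoSecondSplit = 0.80              // first split: 80% onward, 20% held out for test
let train = intoSecondSplit * 0.75      // 0.60 of the original
let validation = intoSecondSplit * 0.25 // 0.20 of the original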

I then added an SVM with a train and score module. Note that I am training with 60% of the original dataset and validating with 20%.

image

After it runs, there are two new columns in the dataset –> Scored Labels and Scored Probabilities, so each row now has a score.

image

With the model in place, I can then evaluate it using an Evaluate Model module:

image

And we can see an AUC of .666, which immediately made me think of this

image

In any event, I added a Logistic Regression and a Boosted Decision Tree to the canvas and hooked them up to the training and validation sets:

image

And this is what we have:

image image

SVM: .666 AUC

Logistic Regression: .689 AUC

Boosted Decision Tree: .713 AUC

So with the Boosted Decision Tree ahead, I added a Sweep Parameters module to see if I could tune it more. I am using AUC as the performance metric:

image

image

So the best AUC I am going to get is .7134, with the highlighted parameters. I then added one more model that uses those parameters against the entire training dataset (80% of the total) and then evaluates it against the remaining 20%.

image

With the final answer of

image

With that in hand, I can create a new experiment that will be the basis of a real-time voting app.