Kaggle and AzureML

If you are not familiar with Kaggle, it is probably the de facto standard for data science competitions.  Competitions can be hosted by a private company with cash prizes, or they can be general competitions with bragging rights on the line.  The Titanic Kaggle competition is one of the more popular "hello world" data science projects and a must-try for aspiring data scientists.

Recently, Kaggle hosted a competition sponsored by Liberty Mutual to help predict the insurance risk of houses.  I decided to see how well AzureML could stack up against the best data scientists that Kaggle could offer.

My first step was to get the mechanics down (I am a big believer in getting dev ops done first).  I imported the train and test datasets from Kaggle into AzureML.  When I visualized the data, I was struck that all of the predictor variables were categorical, and even the Y variable ("Hazard") is an integer with a range between 1 and 70.

image
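For readers who want to poke at the same thing locally, the inspection amounts to checking the target's dtype and range in pandas. The frame below is a small synthetic stand-in, not the real train.csv, and the column names are hypothetical:

```python
import pandas as pd

# Synthetic stand-in for the Kaggle train.csv (column names hypothetical)
train = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Hazard": [1, 2, 9, 70],        # integer target, observed range 1-70
    "T1_V1": ["B", "A", "B", "C"],  # categorical predictor
    "T2_V1": ["N", "Y", "Y", "N"],  # categorical predictor
})

# The same facts the AzureML visualization surfaced
hazard_is_int = train["Hazard"].dtype.kind == "i"
hazard_range = (train["Hazard"].min(), train["Hazard"].max())
```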

I created a quick categorical model and ran it.  Note that I did a 60/40 train/test split of the data.

image
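The Split Data module's 60/40 split has a direct scikit-learn analog. This is a sketch on made-up data, not the actual AzureML pipeline:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix / target standing in for the Kaggle data
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 3))
y = rng.integers(1, 71, size=100)

# Roughly equivalent to AzureML's Split Data module with a 0.6 fraction
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=42
)
```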

Once I had a trained model, I hit the “Set Up Web Service” button.

image

I then went into that web service and changed the input from a web service input to the test dataset that Kaggle provided, and I output the data to Azure blob storage.  I also added a transform to export only the two columns Kaggle uses to evaluate the results: Id and Hazard:

image

Once the data was in blob storage, I could download it to my desktop and then upload it to Kaggle to get an evaluation and a ranking.

image
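Outside AzureML, the whole export step boils down to writing a two-column CSV. A minimal sketch with hypothetical predictions:

```python
import io
import pandas as pd

# Hypothetical predictions keyed by the test-set Id column
submission = pd.DataFrame({
    "Id": [1, 2, 5],
    "Hazard": [3, 1, 7],
})

# Kaggle only scores the Id and Hazard columns, so export exactly those
buf = io.StringIO()
submission.to_csv(buf, index=False)
csv_text = buf.getvalue()
```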

With the mechanics out of the way, I decided to try a series of out-of-the-box models to see what gives the best result.  Since the result was categorical, I stuck to the classification models, and this is what I found:

image

image

The OOB Two Class Bayes Point Machine was good for 1,278th place, out of about 1,200 competitors.
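For readers without AzureML, the "treat Hazard as a class label" experiment can be sketched in scikit-learn. There is no Bayes Point Machine there, so logistic regression stands in as a rough analog, on synthetic data with a deliberately small label set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: treat the integer hazard score as a class label.
# (Logistic regression substitutes for the Bayes Point Machine here.)
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
y = rng.integers(1, 6, size=300)   # small label set for the sketch

clf = LogisticRegression(max_iter=1000).fit(X, y)
preds = clf.predict(X)             # predictions are drawn from the label set
```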

Stepping back, the Hazard distribution is heavily skewed toward low values, so perhaps I need two models.  If I can predict whether the hazard falls into the low group or the high group, I should be right on most of the predictions, and I can then let the fewer outlier predictions use a different model.  To test that hypothesis, I went back to AzureML and added a filter module for Hazard < 9.

image
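The two-model idea amounts to partitioning rows on the target. The equivalent of that filter module, sketched on toy data:

```python
import pandas as pd

# Synthetic stand-in: most hazard scores are small, a few are large
df = pd.DataFrame({"Hazard": [1, 1, 2, 3, 5, 8, 12, 40], "x": range(8)})

# Equivalent of the AzureML filter module: split the rows at Hazard < 9
low = df[df["Hazard"] < 9]    # bulk of the data -> first model
high = df[df["Hazard"] >= 9]  # outliers -> a separate model
```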

The problem is that the AUC dropped 3%, so it looks like the outliers are not really skewing the analysis.  My next thought was that perhaps AzureML could help me identify the X variables that have the greatest predictive power.  I dragged in a Filter Based Feature Selection module and ran it with the model.

image image

The results are kinda interesting.  There is a significant drop-off after the top 9 columns.

image
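A rough scikit-learn analog of the Filter Based Feature Selection module is `SelectKBest`, which scores each column against the target and keeps the top k. A sketch on synthetic data where only two columns carry signal:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical encoded feature matrix with 15 columns;
# only columns 0 and 4 actually drive the target
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 15))
y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(size=200)

# Keep the top-9 columns by univariate F-score, akin to the AzureML module
selector = SelectKBest(score_func=f_regression, k=9)
X_top = selector.fit_transform(X, y)
top_columns = np.flatnonzero(selector.get_support())
```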

So I recreated the model with only these top 9 X variables.

image

And the AUC moved to .60, so I am not doing better.

I then thought of treating the Hazard score not as a factor but as a continuous variable.   I rejiggered the experiment to use a boosted decision tree regression.

image
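The regression reframing can be sketched with scikit-learn's gradient boosting, a rough analog of AzureML's Boosted Decision Tree Regression module, again on synthetic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for treating Hazard as a continuous target
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = np.clip(np.rint(5 + 4 * X[:, 0] + rng.normal(size=300)), 1, 70)

# Rough analog of AzureML's Boosted Decision Tree Regression module
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1)
model.fit(X, y)
preds = model.predict(X)   # continuous predictions, not integer classes
```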

Sending that over to Kaggle, I moved up.  I then rounded the decimal predictions, but that did this:

image

So Kaggle rounds to an int anyway.  Interestingly, I am at 32% and the leader is at 39%. 

I then used all of the OOB models for regression in AzureML and got the following results:

image

Submitting the Poisson Regression, I got this:

image
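Poisson regression is a natural fit here since Hazard is a small positive integer. Scikit-learn's `PoissonRegressor` is a rough stand-in for the AzureML module; the data below is synthetic:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Synthetic count-like target drawn from a Poisson with a log-linear rate
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = rng.poisson(lam=np.exp(0.5 + 0.8 * X[:, 0]))

# Rough analog of AzureML's Poisson Regression module
model = PoissonRegressor(alpha=1e-3, max_iter=300)
model.fit(X, y)
preds = model.predict(X)   # the exp link keeps every prediction positive
```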

I then realized that I could make my model slightly more accurate by not including the 60/40 split when doing the predictive run.  Rather, I would feed 100% of the training data to the model:

image

Which moved me up another 10 spots…

image
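That last step, dropping the split once the model choice is settled, sketched generically (linear regression is just a placeholder model here):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic training data
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 0.5]) + rng.normal(size=100)

# While comparing models: fit on 60%, score on the held-out 40%
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, random_state=0)
dev_model = LinearRegression().fit(X_tr, y_tr)

# Once the model choice is fixed: refit on all 100% of the training rows
final_model = LinearRegression().fit(X, y)
```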

That seems like a good place to stop with the out-of-the-box modeling in AzureML.

There are a few notes here:

1) Kaggle knows how to run a competition.  I love how easy it is to set up a team, submit an entry, and get immediate feedback.

2) AzureML OOB is a good place to start and explore different ideas.  However, it is obvious that, stacked against more traditional teams, it does not do well.

3) Speaking of which: you are allowed to submit 5 entries a day, and the competition lasts 90 days or so.  With up to 450 submissions, I can imagine a scenario where a person spends their time gaming their submissions.  There are 51,000 entries to predict, and the leading entry (as of this writing) is around 39%, so that is roughly 20,000 correct answers.  That works out to about 200 correct answers a day, or 40 per submission.
