Kaggle and R
August 18, 2015
Following up on last week’s post about doing a Kaggle competition, I decided to see whether I could explore the data further in R on my local desktop. The competition is about analyzing a large set of house claims and giving each one a risk score.
I started up RStudio to take a look at the initial data:
train <- read.csv("../Data/train.csv")
head(train)
summary(train)

plot(train$Hazard)
A couple of things popped out. All of the X variables look to be categorical, and even the response, “Hazard”, is an integer, with most of the values falling between 1 and 9.
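A quick way to confirm that (just an exploratory sketch against the same train data frame, not part of the original workflow):

# Check the column types and the spread of the response
str(train)                       # class of every column
table(sapply(train, class))      # count of columns by type
summary(train$Hazard)            # Hazard is an integer response
mean(train$Hazard <= 9)          # share of rows with Hazard in the 1-9 range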
With that in mind, I decided to split the dataset into two sections: the majority and the minority.
train.low <- subset(train, Hazard < 9)
train.high <- subset(train, Hazard >= 9)

plot(train.low$Hazard)
plot(train.high$Hazard)
The under-9 subset looks like this:
And the 9-and-over subset looks like this:
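To put numbers on how lopsided the split is (again just a quick check, using the two subsets from above):

nrow(train.low)                       # the majority: Hazard < 9
nrow(train.high)                      # the minority: Hazard >= 9
prop.table(table(train$Hazard >= 9))  # the same split as proportions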
But I also wanted to look at the Hazard score from a distribution point of view:
hazard.frame <- as.data.frame(table(train$Hazard))
colnames(hazard.frame) <- c("hazard", "freq")
hist(hazard.frame$freq)
plot(x = hazard.frame$hazard, y = hazard.frame$freq)
plot(x = hazard.frame$hazard, y = log(hazard.frame$freq))
The histogram shows the skew, with most of the mass piled up at the low end.
And the log plot really shows the shape of the distribution.
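If the log plot looks roughly linear, the counts are falling off close to geometrically with the hazard level. A quick way to eyeball that (an exploratory sketch, not something from the original analysis):

# Fit a line to log(freq) vs. hazard level; the slope is the per-level decay rate
hazard.frame$hazard <- as.numeric(as.character(hazard.frame$hazard))
plot(hazard.frame$hazard, log(hazard.frame$freq))
fit <- lm(log(freq) ~ hazard, data = hazard.frame)
abline(fit, col = "red")
summary(fit)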
So there is clearly a diminishing return going on. As of this writing, the leader is at 40%, and 40% of the 51,000 entries is about 20,400. So if you could identify all of the ones correctly, you should get about 37% of the way there. To test it out, I submitted all ones to Kaggle:
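A sketch of how that all-ones submission can be put together (the Id and Hazard column names are assumptions based on the usual sample-submission format, not verified here):

# Build an "all ones" submission file
test <- read.csv("../Data/test.csv")
submission <- data.frame(Id = test$Id, Hazard = 1)
write.csv(submission, "all_ones.csv", row.names = FALSE)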
LOL, so they must take away points for incorrect answers, since it scores the same as the “all 0” benchmark. Going back, I know that if I can predict the ones correctly and make a reasonable guess at the rest, I might be OK. I went back and tuned my model a bit to get out of the bottom 25% and then let it be. I assume there is something obvious or industry-standard that I am missing, because there are so many people between my position and the top 25%.
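For what it’s worth, my understanding is that this competition is scored with a normalized Gini coefficient, which is rank-based, so any constant submission scores essentially zero no matter which constant you pick. A rough sketch of the metric (my own implementation, not the official scorer):

# Normalized Gini: compares the Lorentz curve of the actuals ordered by the
# prediction against the best possible ordering (actuals ordered by themselves)
normalized.gini <- function(actual, predicted) {
  gini <- function(a, p) {
    a <- a[order(p, decreasing = TRUE)]
    n <- length(a)
    sum(cumsum(a) / sum(a)) / n - (n + 1) / (2 * n)
  }
  gini(actual, predicted) / gini(actual, actual)
}

normalized.gini(train$Hazard, rep(1, nrow(train)))  # a constant guess carries no ranking information, so it scores ~0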