Kaggle and R
August 18, 2015
Following up on last week’s post about doing a Kaggle competition, I decided to see whether I could explore the data further in R on my local desktop. The competition is about analyzing a large set of house claims and giving each one a risk score.
I started up RStudio to take a look at the initial data:
train <- read.csv("../Data/train.csv")
head(train)
summary(train)

plot(train$Hazard)
A couple of things popped out. All of the X variables look to be categorical, and even the response, “Hazard”, is an integer, with most of the values falling between 1 and 9.
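A quick way to confirm that (just an exploratory sketch against the same train data frame, not part of the original workflow):

# Check the column types and the spread of the response
str(train)                       # class of every column
table(sapply(train, class))      # count of columns by type
summary(train$Hazard)            # Hazard is an integer response
mean(train$Hazard <= 9)          # share of rows with Hazard in the 1-9 range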
With that in mind, I decided to split the dataset into two sections: the majority and the minority.
train.low <- subset(train, Hazard < 9)
train.high <- subset(train, Hazard >= 9)

plot(train.low$Hazard)
plot(train.high$Hazard)
The under-9 subset looks like this:
And the 9-and-over subset looks like this:
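To put numbers on how lopsided the split is (again just a quick check, using the two subsets from above):

nrow(train.low)                       # the majority: Hazard < 9
nrow(train.high)                      # the minority: Hazard >= 9
prop.table(table(train$Hazard >= 9))  # the same split as proportions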
But I also wanted to look at the Hazard score from a distribution point of view:
hazard.frame <- as.data.frame(table(train$Hazard))
colnames(hazard.frame) <- c("hazard", "freq")
hist(hazard.frame$freq)
plot(x = hazard.frame$hazard, y = hazard.frame$freq)
plot(x = hazard.frame$hazard, y = log(hazard.frame$freq))
The histogram shows the skew, with most of the mass piled up at the low end.
And the log plot really shows the shape of the distribution.
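If the log plot looks roughly linear, the counts are falling off close to geometrically with the hazard level. A quick way to eyeball that (an exploratory sketch, not something from the original analysis):

# Fit a line to log(freq) vs. hazard level; the slope is the per-level decay rate
hazard.frame$hazard <- as.numeric(as.character(hazard.frame$hazard))
plot(hazard.frame$hazard, log(hazard.frame$freq))
fit <- lm(log(freq) ~ hazard, data = hazard.frame)
abline(fit, col = "red")
summary(fit)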
So there is clearly a diminishing return going on. As of this writing, the leader is at 40%, and 40% of the 51,000 entries is about 20,400. So if you could identify all of the ones correctly, you should get about 37% of the way there. To test it out, I submitted all ones to Kaggle:
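A sketch of how that all-ones submission can be put together (the Id and Hazard column names are assumptions based on the usual sample-submission format, not verified here):

# Build an "all ones" submission file
test <- read.csv("../Data/test.csv")
submission <- data.frame(Id = test$Id, Hazard = 1)
write.csv(submission, "all_ones.csv", row.names = FALSE)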
LOL, so they must take away points for incorrect answers, since it scores the same as the “all 0” benchmark. Going back, I know that if I can predict the ones correctly and make a reasonable guess at the rest, I might be OK. I went back and tuned my model a bit to get out of the bottom 25% and then let it be. I assume there is something obvious or industry-standard that I am missing, because there are so many people between my position and the top 25%.
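For what it’s worth, my understanding is that this competition is scored with a normalized Gini coefficient, which is rank-based, so any constant submission scores essentially zero no matter which constant you pick. A rough sketch of the metric (my own implementation, not the official scorer):

# Normalized Gini: compares the Lorentz curve of the actuals ordered by the
# prediction against the best possible ordering (actuals ordered by themselves)
normalized.gini <- function(actual, predicted) {
  gini <- function(a, p) {
    a <- a[order(p, decreasing = TRUE)]
    n <- length(a)
    sum(cumsum(a) / sum(a)) / n - (n + 1) / (2 * n)
  }
  gini(actual, predicted) / gini(actual, actual)
}

normalized.gini(train$Hazard, rep(1, nrow(train)))  # a constant guess carries no ranking information, so it scores ~0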