# Kaggle and R

Following up on last week’s post on doing a Kaggle competition, I decided to see if I could explore the data more in R on my local desktop. The competition is about analyzing a large group of house claims to give each of them a risk score.

I started RStudio to take a look at the initial data:

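```
# Load the training data and get a quick overview of every column
train <- read.csv("../Data/train.csv")
summary(train)

# Plot the target variable against row order
plot(train$Hazard)
```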

A couple of things popped out.  All of the X variables look to be categorical.  Even the result “Hazard” is an integer with most of the values falling between 1 and 9.
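One way to back up that impression is to count the distinct values in each column. This is only a minimal sketch: the `toy` data frame and its `X1`/`X2` column names are hypothetical stand-ins for `train`, and `column_overview` is a helper made up for illustration.

```r
# Toy stand-in for `train`; the column names X1 and X2 are hypothetical.
toy <- data.frame(
  Id     = 1:6,
  Hazard = c(1L, 1L, 4L, 1L, 9L, 2L),
  X1     = c(15L, 3L, 15L, 7L, 3L, 15L),
  X2     = factor(c("N", "Y", "N", "N", "Y", "N"))
)

# Summarise each column: its class and how many distinct values it takes.
column_overview <- function(df) {
  data.frame(
    column   = names(df),
    class    = vapply(df, function(col) class(col)[1], character(1)),
    n_unique = vapply(df, function(col) length(unique(col)), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

overview <- column_overview(toy)
print(overview)
```

A column with only a handful of distinct values relative to the number of rows is a good candidate for being treated as categorical, even when R reads it in as an integer.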

With that in mind, I decided to split the dataset into two sections: the majority and the minority.

```
# Split on the Hazard score: the bulk of the data vs. the long tail
train.low <- subset(train, Hazard < 9)
train.high <- subset(train, Hazard >= 9)

plot(train.low$Hazard)
plot(train.high$Hazard)
```

The Hazard values under 9 look like this:

And the values of 9 and over look like this:

But I wanted to look at the Hazard score from a distribution point of view:

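```
# Count how often each Hazard value occurs
hazard.frame <- as.data.frame(table(train$Hazard))
colnames(hazard.frame) <- c("hazard", "freq")
# table() turns the Hazard values into a factor; convert back to numbers for plotting
hazard.frame$hazard <- as.numeric(as.character(hazard.frame$hazard))
hist(hazard.frame$freq)
plot(x = hazard.frame$hazard, y = hazard.frame$freq)
plot(x = hazard.frame$hazard, y = log(hazard.frame$freq))
```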

The histogram shows the counts bunched up on the left with a long tail to the right

and the log plot really shows the distribution
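The same frequency table can also be turned into proportions, which makes it easier to see how much of the data a single value such as 1 covers. A minimal sketch, using a made-up `hazard` vector in place of `train$Hazard`:

```r
# Toy stand-in for train$Hazard; the values are made up for illustration.
hazard <- c(1, 1, 1, 2, 1, 3, 2, 1, 5, 9)

# Proportion of rows taking each Hazard value.
hazard.share <- prop.table(table(hazard))
print(round(hazard.share, 2))

# Running total: how much of the data is covered by values up to each level.
hazard.cumulative <- cumsum(hazard.share)
print(round(hazard.cumulative, 2))

# Share of rows where Hazard is exactly 1.
share.of.ones <- mean(hazard == 1)
print(share.of.ones)
```

On the real data, swapping `hazard` for `train$Hazard` should give the actual share of ones.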

So there are clearly diminishing returns going on: the higher the Hazard value, the fewer entries have it. As of this writing, the leader is at 40%, which is about 20,400 of the 51,000 entries. So if you could identify all of the ones correctly, you should get 37% of the way there. To test it out, I submitted only ones to Kaggle:
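A rough sketch of how such an all-ones submission file might be built. The two-column `Id`/`Hazard` layout is an assumption about the submission format, and the `test` data frame here is a toy stand-in for the real test set:

```r
# Toy stand-in for the test set; only an Id column is assumed here.
test <- data.frame(Id = c(101L, 102L, 103L))

# Constant prediction: every row gets a Hazard of 1.
submission <- data.frame(Id = test$Id, Hazard = 1)

# Write to a temporary file; a real run would pick its own path.
submission.path <- tempfile(fileext = ".csv")
write.csv(submission, submission.path, row.names = FALSE)

# Read it back to confirm the layout.
print(read.csv(submission.path))
```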

LOL, so they must take away for incorrect answers, as it scored the same as the “all 0” benchmark. So going back, I know that if I can predict the ones correctly and make a reasonable guess at the rest, I might be OK. I went back and tuned my model some to get out of the bottom 25% and then let it be. I assume that there is something obvious or industry standard that I am missing, because there are so many people between my position and the top 25%.