R for the .NET Developer
June 23, 2015
I spent some time over the last week putting my ideas down for a new speaking topic: “R for the .NET Developer.” With Microsoft acquiring Revolution Analytics and making a concerted push into analytics tooling and platforms, it makes sense that .NET developers have some exposure to the most common language in the data science space – R.
I started the presentation using Prezi (thanks David Green) and set up the major points I wanted to cover:
- R Overview
- R Language Features
- R In Action
- R Lessons Learned
You can see the Prezi here.
I worked through and then borrowed from several different books:
this great YouTube clip
and this Pluralsight course
I then jumped into RStudio to work through some of the code ideas that the Prezi illustrates. The entire set of code is on GitHub here, but I wanted to show a couple of the cooler things that I did.
First, I implemented the automotive example in R from the Data Mining and Business Analytics book. This is pretty much a straight port of the author's exercise, with the exception that I convert some vectors to factors to demonstrate how/when to do it:
    setwd("C:\\Git\\R4DotNet")

    # y = x1 + x2 + x3 + E
    # y is what you are trying to explain
    # x1, x2, x3 are the variables that cause/influence y
    # E is the things that we are not measuring/using for calculations

    fuel.efficiency <- read.csv("C:/Git/R4DotNet/Data/FuelEfficiency.csv")
    summary(fuel.efficiency)

    # MPG = Miles per gallon
    # GPM = Gallons per 100 miles
    # WT = Weight of car in 1000 lbs
    # DIS = Displacement in cubic inches
    # NC = Number of cylinders
    # HP = Horsepower
    # ACC = Acceleration in seconds from 0-60
    # ET = Engine Type 0 = V, 1 = Straight

    plot(GPM ~ WT, data = fuel.efficiency)
    plot(GPM ~ DIS, data = fuel.efficiency)

    # treat cylinder count and engine type as categories, not quantities
    fuel.efficiency$NC <- factor(fuel.efficiency$NC)
    fuel.efficiency$ET <- factor(fuel.efficiency$ET)
    summary(fuel.efficiency)

    plot(GPM ~ NC, data = fuel.efficiency)

    model <- lm(GPM ~ ., data = fuel.efficiency)
    summary(model)

    # Multiple R-squared: 0.9804
    # means that we can explain 98% of the GPM with the variables we have, E = 2%
    # That is pretty friggen good

    # turning back to numeric so we can do cor across the data frame
    # (go through as.character: as.integer() on a factor returns level codes, not the original values)
    fuel.efficiency$NC <- as.integer(as.character(fuel.efficiency$NC))
    fuel.efficiency$ET <- as.integer(as.character(fuel.efficiency$ET))
    cor(fuel.efficiency)

    # DIS -> WT = 0.9507647

    library(leaps)
    x <- fuel.efficiency[, 3:7]
    y <- fuel.efficiency[, 2]
    out <- summary(regsubsets(x, y, nbest = 2, nvmax = ncol(x)))
    # summary.regsubsets exposes R-squared as rsq
    tab <- cbind(out$which, out$rsq, out$adjr2, out$cp)
    tab

    # trade off between model size and model fit
    #just weight is

    model2 <- lm(GPM ~ WT, data = fuel.efficiency)
    summary(model2)
Here are the plots (as continuous and as a factor):
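To make the how/when of factors a little more concrete, here is a minimal sketch. It uses R's built-in mtcars data as a stand-in for FuelEfficiency.csv (that substitution is my assumption, as are the variable names). It shows how the same column changes a linear model depending on whether it is numeric or a factor, and a pitfall to watch for when converting a factor back to a number.

```r
# Sketch only: mtcars is a stand-in for FuelEfficiency.csv (assumption).
cars <- mtcars[, c("mpg", "wt", "cyl")]

# As a number, cyl contributes a single slope coefficient.
numeric.model <- lm(mpg ~ wt + cyl, data = cars)

# As a factor, each cylinder count beyond the baseline gets its own coefficient.
cars$cyl.factor <- factor(cars$cyl)
factor.model <- lm(mpg ~ wt + cyl.factor, data = cars)

length(coef(numeric.model))  # intercept + wt + cyl
length(coef(factor.model))   # intercept + wt + one term per non-baseline level

# Pitfall: as.integer() on a factor returns the internal level codes,
# so convert through as.character() to recover the original values.
level.codes <- as.integer(cars$cyl.factor)
original.values <- as.integer(as.character(cars$cyl.factor))

sort(unique(level.codes))
sort(unique(original.values))
```

A rough rule of thumb: use a factor when the numbers are really labels (engine type) or when you do not want to assume that each extra unit has the same effect (cylinders).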
Then, I implemented this K-Means from Azure ML in R to show the difference between the two implementations. The Azure ML experiment is found here, and my code looks like this. Note that I did not do a regression.
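    # iris.data has no header row, so do not let read.csv consume the first record as one
    flowers <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", header = FALSE)
    summary(flowers)

    colnames(flowers) <- c("F1", "F2", "F3", "F4", "Label")
    summary(flowers)

    # 60% of the rows for training, the remaining 40% for testing
    indexes <- sample(1:nrow(flowers), size = 0.6 * nrow(flowers))
    flowers.train <- flowers[indexes, ]
    flowers.test <- flowers[-indexes, ]

    # cluster on the four measurements only, asking for 5 clusters
    fit <- kmeans(flowers.train[, 1:4], 5)
    fit

    plot(flowers.train[c("F1", "F2")], col = fit$cluster)
    # one color per cluster center, matching the 5 clusters above
    points(fit$centers[, c("F1", "F2")], col = 1:5, pch = 8, cex = 2)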
With a plot example like this:
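One way to sanity-check a clustering like this is to cross-tabulate the cluster assignments against the known labels. The sketch below is not part of the original script. It assumes R's built-in iris data in place of the UCI download, picks 3 clusters to match the three species, and fixes a seed so that the run is repeatable.

```r
# Sketch only: built-in iris stands in for the UCI download (assumption).
set.seed(42)  # k-means starts from random centers; fix the seed for repeatability

features <- iris[, 1:4]

# Three clusters to match the three species; nstart tries several random
# starts and keeps the best one.
fit <- kmeans(features, centers = 3, nstart = 20)

# Rows are cluster numbers, columns are the true species.
confusion <- table(cluster = fit$cluster, species = iris$Species)
confusion
```

Cluster numbers are arbitrary, so read the table by looking for one dominant species per row rather than expecting cluster 1 to line up with the first species.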
So I think I am ready for the presentation. It is really true, the best way to learn about something is to teach it…