R for the .NET Developer

June 23, 2015

I spent some time over the last week putting my ideas down for a new speaking topic: “R for the .NET Developer.” With Microsoft acquiring Revolution Analytics and making a concerted push into analytics tooling and platforms, it makes sense that .NET developers have some exposure to the most common language in the data science space – R.

I started the presentation using Prezi (thanks David Green) and set up the major points I wanted to cover:

· R Overview
· R Language Features
· R In Action
· R Lessons Learned

You can see the Prezi here.

I worked through and then borrowed from several different books, this great YouTube clip, and this Pluralsight course.

I then jumped into RStudio to work through some of the code ideas that the Prezi illustrates. The entire set of code is found on GitHub here, but I wanted to show a couple of the cooler things that I did.

First, I implemented the automotive example in R from the Data Mining and Business Analytics book. This is pretty much a straight port of the author's exercise, with the exception that I convert some vectors to factors to demonstrate how/when to do it:

setwd("C:\\Git\\R4DotNet")

#y = x1 + x2 + x3 + E
#y is what you are trying to explain
#x1, x2, x3 are the variables that cause/influence y
#E is things that we are not measuring/using for calculations

fuel.efficiency <- read.csv("C:/Git/R4DotNet/Data/FuelEfficiency.csv")
summary(fuel.efficiency)

#MPG = Miles per gallon
#GPM = Gallons per 100 miles
#WT = Weight of car in 1000 lbs
#DIS = Displacement in cubic inches
#NC = number of cylinders
#HP = Horsepower
#ACC = Acceleration in seconds from 0-60
#ET = Engine Type 0 = V, 1 = Straight

plot(GPM~WT,data=fuel.efficiency)
plot(GPM~DIS,data=fuel.efficiency)

#treat cylinder count and engine type as categories, not continuous numbers
fuel.efficiency$NC <- factor(fuel.efficiency$NC)
fuel.efficiency$ET <- factor(fuel.efficiency$ET)
summary(fuel.efficiency)

plot(GPM~NC,data=fuel.efficiency)

model <- lm(GPM~.,data=fuel.efficiency)
summary(model)

# Multiple R-squared:  0.9804 
# means that we can explain 98% of the GPM with the variables we have E = 2%
# That is pretty friggen good

# turning back to numeric so we can do cor across data frame
# go through as.character first: as.integer on a factor returns level codes, not the original values
fuel.efficiency$NC <- as.integer(as.character(fuel.efficiency$NC))
fuel.efficiency$ET <- as.integer(as.character(fuel.efficiency$ET))
cor(fuel.efficiency)

#DIS -> WT = 0.9507647

library(leaps)
x=fuel.efficiency[,3:7]
y=fuel.efficiency[,2]
out = summary(regsubsets(x,y,nbest=2,nvmax=ncol(x)))
#the R-squared component of the regsubsets summary is named rsq
tab=cbind(out$which,out$rsq,out$adjr2,out$cp)
tab

#trade off between model size and model fit
#just weight is 

model2 = lm(GPM~WT,data=fuel.efficiency)
summary(model2)
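One thing to watch when moving a column between numeric and factor is what each conversion actually returns, and what the choice does to a regression. The snippet below is a minimal sketch of both points. The cylinder values are made up for illustration, and it uses R's built-in mtcars dataset rather than FuelEfficiency.csv so that it runs anywhere.

```r
# Hypothetical cylinder counts, made up for illustration (not from FuelEfficiency.csv)
cylinders <- c(4, 6, 8, 4, 8)

# As a factor, each distinct value becomes a category (level)
cyl.factor <- factor(cylinders)

# as.integer() on a factor returns the internal level codes, not the labels
codes <- as.integer(cyl.factor)

# Going through as.character() first recovers the original numbers
values <- as.integer(as.character(cyl.factor))

print(codes)
print(values)

# Effect on a model, using the built-in mtcars data (cyl is 4, 6 or 8):
# numeric cyl gets a single slope, factor cyl gets one coefficient per extra level
fit.numeric <- lm(mpg ~ cyl, data = mtcars)
fit.factor <- lm(mpg ~ factor(cyl), data = mtcars)

print(coef(fit.numeric))
print(coef(fit.factor))
```

Roughly speaking, a factor makes sense when the values are labels (engine type) or when you do not want to assume the effect is linear in the number (cylinders), and a numeric column is needed for things like cor().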

Here are the plots (as continuous and as a factor):

Then, I implemented this K-Means example from Azure ML to show the difference between the two implementations. The Azure ML experiment is found here, and my code looks like this. Note that I did not do a regression.

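# iris.data has no header row, so do not let read.csv treat the first record as column names
flowers <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", header=FALSE)
summary(flowers)

colnames(flowers) <- c("F1", "F2", "F3", "F4", "Label")
summary(flowers)

# 60% of the rows go to training, the remaining 40% to test
indexes = sample(1:nrow(flowers), size=0.6*nrow(flowers))
flowers.train <- flowers[indexes,]
flowers.test <- flowers[-indexes,]

# cluster on the four numeric features only
fit <- kmeans(flowers.train[,1:4],5)
fit

plot(flowers.train[c("F1", "F2")], col=fit$cluster)
# one color per center so the centers match the cluster colors above
points(fit$centers[,c("F1", "F2")], col=1:nrow(fit$centers), pch=8, cex=2)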
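Since the iris data comes with known labels, one way to sanity-check a clustering is to cross-tabulate the cluster assignments against the species. This is only a sketch: it uses R's built-in iris dataset instead of the UCI download so it runs offline, and the choice of three centers, the seed, and nstart = 20 are my assumptions rather than anything from the Azure ML experiment.

```r
# Built-in iris data: four numeric measurements plus a Species label
set.seed(42)  # arbitrary seed so the random starts are repeatable

# Assumption: 3 centers, one per species; nstart tries several random starts
# and keeps the solution with the lowest within-cluster sum of squares
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# Rows are the known species, columns are the cluster assignments
confusion <- table(iris$Species, fit$cluster)
print(confusion)
```

Cluster numbers are arbitrary, so what matters in the table is whether each species lands mostly in a single column, not which column it is.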

With a plot example like this:

So I think I am ready for the presentation. It is really true: the best way to learn about something is to teach it…

Filed under R

Jamie Dixon's Home
