Global Azure Bootcamp Racing Game: More Analytics Using R and AzureML
May 19, 2015
Alan Smith, the creator and keeper of the Global Azure Bootcamp Racing Game, was kind enough to put the telemetry data from the races out on Azure Blob Storage. The data was already available as XML from Table Storage, but AzureML was choking on the format, so Alan converted it to CSV and put the files out here:
https://alanazuredemos.blob.core.windows.net/alan/TelemetryData0.csv
https://alanazuredemos.blob.core.windows.net/alan/TelemetryData1.csv
https://alanazuredemos.blob.core.windows.net/alan/TelemetryData2.csv
https://alanazuredemos.blob.core.windows.net/alan/PlayerLapTimes0.csv
https://alanazuredemos.blob.core.windows.net/alan/PlayerLapTimes1.csv
https://alanazuredemos.blob.core.windows.net/alan/PlayerLapTimes2.csv
Note that there are 3 races (race0, race1, and race2), each with 2 datasets. The TelemetryData is a reading for each car in the race every 10 ms or so, and the PlayerLapTimes is a summary of the demographics of the player as well as some final results.
I decided to do some unsupervised learning using Chapter 8 of Practical Data Science With R as my guide. I pulled down all 972,780 observations from the Race0 telemetry data in RStudio. It took a bit :-) I then ran the following script to do a cluster dendrogram. Alas, I killed the job after several minutes (actually, the job killed my machine and I got an out-of-memory exception):
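For anyone following along, here is a minimal sketch of pulling one of these files into R. The base URL comes from the links above; the helper names `race_file_url` and `load_race_file` are made up for illustration, and the assumption that each CSV has a header row is not confirmed by the files themselves.

```r
# Base URL taken from the links listed above.
base_url <- "https://alanazuredemos.blob.core.windows.net/alan"

# Hypothetical helper: build the URL for one dataset of one race.
race_file_url <- function(dataset, race) {
  stopifnot(dataset %in% c("TelemetryData", "PlayerLapTimes"),
            race %in% 0:2)
  sprintf("%s/%s%d.csv", base_url, dataset, race)
}

# Hypothetical helper: download and parse one file.
# Assumes the CSV has a header row; adjust header = FALSE if it does not.
load_race_file <- function(dataset, race) {
  read.csv(race_file_url(dataset, race), header = TRUE,
           stringsAsFactors = FALSE)
}

# Example (needs network access, and the telemetry files are large):
# TelemetryData0  <- load_race_file("TelemetryData", 0)
# PlayerLapTimes0 <- load_race_file("PlayerLapTimes", 0)
```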
    summary(TelemetryData0)
    # Scale every column, then build the full pairwise distance matrix
    pmatrix <- scale(TelemetryData0[,])
    d <- dist(pmatrix, method="euclidean")
    # "ward" was renamed "ward.D" in R 3.1.0
    pfit <- hclust(d, method="ward.D")
    plot(pfit)
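The out-of-memory failure is not too surprising once you count what dist() has to hold: one distance for every pair of rows, which is n * (n - 1) / 2 values. A rough back-of-the-envelope sketch, assuming 8 bytes per stored distance and ignoring any other overhead (the function names are made up for illustration):

```r
# Number of pairwise distances stored for n rows: n * (n - 1) / 2.
# as.numeric() avoids integer overflow for large n.
dist_entries <- function(n) {
  n <- as.numeric(n)
  n * (n - 1) / 2
}

# Approximate size in GiB, assuming 8 bytes per double.
dist_size_gib <- function(n) dist_entries(n) * 8 / 1024^3

dist_size_gib(972780)  # all Race0 telemetry rows: thousands of GiB
dist_size_gib(10000)   # a 10,000-row sample: well under 1 GiB
```

This is why hierarchical clustering on the full telemetry set is out of reach for a desktop, while a sample of a few thousand rows is manageable.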
I then tried to narrow my search down to damage and speed:
    damage <- TelemetryData0$Damage
    speed <- TelemetryData0$Speed

    plot(damage, speed, main="Damage and Speed",
         xlab="Damage ", ylab="Speed ", pch=20)

    abline(lm(speed~damage), col="red")        # regression line (y~x)
    lines(lowess(damage,speed), col="blue")    # lowess line (x,y)
(I added the red line manually)
So that is interesting. It looks like there is a slight downhill trend: the lower the speed, the more damage. So perhaps more speed does not automatically mean more damage to the car. Anyone who drives in San Francisco can attest to that 🙂
I then went back and took a sample of the telemetry data:
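A trend read off a scatter plot by eye can be checked numerically with the slope of the fitted line and the correlation coefficient. The sketch below runs on synthetic stand-in data, not the race telemetry: the numbers, the built-in negative relationship, and the injected missing values are all invented so the example is self-contained.

```r
set.seed(42)

# Synthetic stand-in data (NOT the race telemetry): speed falls as damage rises.
n <- 1000
damage <- runif(n, 0, 100)
speed  <- 200 - 0.5 * damage + rnorm(n, sd = 20)
damage[sample(n, 50)] <- NA   # pretend some readings are missing

# lm() drops incomplete rows by default (na.action = na.omit).
fit   <- lm(speed ~ damage)
slope <- coef(fit)[["damage"]]

# cor() needs to be told how to handle missing values.
r <- cor(damage, speed, use = "complete.obs")

slope  # change in speed per unit of damage
r      # strength and direction of the linear relationship
```

On the real data, swapping in TelemetryData0$Damage and TelemetryData0$Speed for the synthetic vectors would give the equivalent numbers.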
    # Draw 10,000 rows at random, then keep only the two columns of interest
    telemetry <- TelemetryData0[sample(1:nrow(TelemetryData0),10000),]
    telemetry <- telemetry[,c("Damage","Speed")]
    summary(telemetry)
    pmatrix <- scale(telemetry[,])
    d <- dist(pmatrix, method="euclidean")
    pfit <- hclust(d, method="ward.D")
    plot(pfit)
And I got this:
And the fact that it is not showing me anything made me think of this clip:
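A dendrogram with thousands of leaves is hard to read on its own. One common follow-up is to cut the tree into a fixed number of groups with cutree() and summarize each group. The sketch below uses synthetic stand-in data with three made-up blobs, and the choice of k = 3 is an assumption for the demo, not something the telemetry suggests.

```r
set.seed(1)

# Synthetic stand-in for the Damage/Speed sample (NOT the race telemetry).
centers <- data.frame(Damage = c(10, 50, 90), Speed = c(180, 120, 60))
telemetry_demo <- do.call(rbind, lapply(1:3, function(i) {
  data.frame(Damage = rnorm(100, centers$Damage[i], 5),
             Speed  = rnorm(100, centers$Speed[i], 10))
}))

# Same pipeline as the script above: scale, distance matrix, Ward clustering.
pmatrix <- scale(telemetry_demo)
d <- dist(pmatrix, method = "euclidean")
pfit <- hclust(d, method = "ward.D")

k <- 3                          # number of groups: an assumption for the demo
groups <- cutree(pfit, k = k)   # cluster label for every row

plot(pfit, labels = FALSE)      # dendrogram without the unreadable leaf labels
rect.hclust(pfit, k = k)        # draw a box around each of the k clusters

table(groups)                   # rows per cluster
aggregate(telemetry_demo, by = list(cluster = groups), FUN = mean)
```

The per-cluster means from aggregate() are usually easier to interpret than the tree itself.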
In any event, I decided to try a similar analysis using AzureML to see if AzureML can handle the roughly 973K records better than my desktop.
I fired up AzureML and added a data reader to the original file and then added some cleaning:
The problem is that these steps would take 10-12 minutes to complete. I decided to give up and bring a copy of the data locally via the “Save As Dataset” context menu. This sped things up significantly. I added in a k-means module for speed and damage and ran the model.
The first ten times or so I ran this, I got this:
After I added in the “Clean Missing Data” module before the normalization step, I got some results. Note that removing the entire row is what R does by default when cleaning the data via import, so I thought I would keep it matching. In any event, the results look like this:
So I am not sure what this shows, other than there is overlap of speed and damage and there seems to be a relationship.
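For comparison, the same sequence of steps (drop rows with missing values, normalize, run k-means) can be sketched in plain R. This is not a reproduction of the AzureML experiment: the data is synthetic, the z-score scaling is an assumption about the normalization step, and two clusters is an arbitrary choice.

```r
set.seed(7)

# Synthetic stand-in data (NOT the race telemetry), with some missing speeds.
telemetry_demo <- data.frame(
  Damage = c(rnorm(200, 20, 5),   rnorm(200, 70, 5)),
  Speed  = c(rnorm(200, 160, 10), rnorm(200, 80, 10))
)
telemetry_demo$Speed[sample(nrow(telemetry_demo), 20)] <- NA

clean <- na.omit(telemetry_demo)   # remove every row that has a missing value
normalized <- scale(clean)         # z-score each column (assumed normalization)

# nstart = 10 reruns k-means from several random starts and keeps the best fit.
km <- kmeans(normalized, centers = 2, nstart = 10)

plot(clean$Damage, clean$Speed, col = km$cluster, pch = 20,
     main = "k-means on Damage and Speed", xlab = "Damage", ylab = "Speed")
km$size                            # rows assigned to each cluster
```

Skipping the na.omit() step makes kmeans() fail with an error about missing values.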
So there are some other questions I want to answer, like:
1) After a player sustains some damage, do they have a generic response (like braking, turning right, etc…)?
2) Are there certain “lines” that winning players take going through individual curves?
3) Do you really have to avoid damage to win?
I plan to try and answer these questions and more in the coming weeks.