Machine Learning For Hackers: Using MSFT technologies
April 16, 2013
So I picked up Machine Learning for Hackers and started going through the 1st couple of chapters last night. I love the premise that developers can learn something about machine learning using some easily-understood examples. What I didn’t love was that it used R. After reading the 1st chapter, I said to myself, “self, everything they do in this chapter you can do in MSFT Office.” Now I get that this is supposed to be a simple example to get the reader up and running with R, but I thought, “Hey, I wonder how much of this book I can do using the MSFT stack?”
Chapter 1 was all about cleaning up data to get it ready for analysis. This brought me back to my 1st job out of school (marketing analyst), when I spent more time collecting and cleaning the data than actually analyzing it.
My 1st stop was to download the sample UFO dataset from the publisher site and save it locally with a .txt extension. I then imported the data into Microsoft Access. Note that I assigned column names and made sure that the Long Description field is a Memo type:
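If you would rather script the import than click through the wizard, the table can also be created with a DDL query in Access’s SQL view. This is just a sketch – the table and column names are my assumptions based on the book’s dataset, not necessarily what the import wizard produces:

    CREATE TABLE UfoSightings (
        DateOccured      TEXT(75),
        DateReported     TEXT(75),
        Location         TEXT(100),
        ShortDescription TEXT(150),
        Duration         TEXT(75),
        LongDescription  MEMO
    );

The MEMO type is the important part – plain TEXT columns cap out at 255 characters, which would truncate the longer descriptions.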
With that base data imported, I then added an additional field to the table called valid. The book follows the process of analyzing each column’s data and removing rows that are not analyzable. I learned the hard way many years ago that you should never permanently remove data because you need to be able to trace the data set back to its initial state for audit purposes. I like the ability to make queries upon queries in Access to layer up data validation steps.
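Adding the flag is a one-line DDL query – again assuming my table name from above, with the flag as a Yes/No field:

    ALTER TABLE UfoSightings
        ADD COLUMN Valid YESNO;

Because the flag lives alongside the raw data, every validation step below just updates it; the original rows are never deleted.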
For example, in the book, the first validation is to parse the DateOccured field into year and month columns – the problem is that some rows don’t have valid data. This is a 2-step process: the first step is to aggregate the values in the column and see what the valid values look like:
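One way to do that aggregation is to group on the length of the field – valid entries should all be 8 characters (YYYYMMDD), so anything else is suspect. A sketch, using my assumed names from above:

    SELECT Len(DateOccured) AS DateLength,
           Count(*)         AS Occurrences
    FROM UfoSightings
    GROUP BY Len(DateOccured);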
Sure enough, there are over 250 rows of data that cannot be used. I created a table of the aggregated dates and then inner joined that table back to the base data table. Rows with invalid data were flagged as invalid:
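Expressed in SQL, the join-and-flag step looks something like the sketch below. ValidDates is my name for the lookup table of good DateOccured values; rather than flagging the bad rows directly, this marks the rows the join can reach and leaves everything else at the Yes/No default of No:

    UPDATE UfoSightings
    INNER JOIN ValidDates
        ON UfoSightings.DateOccured = ValidDates.DateOccured
    SET UfoSightings.Valid = True;

The bad rows are then excluded from later queries without ever being deleted.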
I then divided the Location field into City, State:
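Access’s string functions work inside queries, so the split doesn’t need VBA. A sketch, assuming Location is formatted as “City, ST” – the WHERE clause skips rows without a comma, which need separate handling:

    SELECT Trim(Left(Location, InStr(Location, ",") - 1)) AS City,
           Trim(Mid(Location, InStr(Location, ",") + 1))  AS State
    FROM UfoSightings
    WHERE InStr(Location, ",") > 0;

Saving this as a query of its own fits the queries-upon-queries approach – the next validation step can select straight from it.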
And then created a table for valid states – the same way I did for the valid DateOccured values. I had to remove about 10% of the dataset – because of both malformed data and the fact that the dataset is world-wide and the book’s example is only interested in the USA. The fact that 90% of the world’s UFO sightings are in America probably says something, though I am not sure what.
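The state check is the same lookup-table pattern again. A sketch, assuming the split query above was saved as LocationsSplit and the two-letter lookup table is ValidStates (both names are mine):

    SELECT l.City, l.State
    FROM LocationsSplit AS l
    INNER JOIN ValidStates AS s
        ON l.State = s.StateCode;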
In any event, I then exported the data into Excel, threw a pivot table on the data, and added some conditional formatting:
Note that some malformed data slipped in (“NT”, “NE”, etc…) but I was too lazy to go back and clean it up.
Note that this is total sightings, not per-capita, so you would expect the states with larger populations to have more sightings (assuming that UFOs are targeting the USA evenly). I think the color presentation is more effective and really draws your eye to CA and WA much more than the 50-histogram display that is found in the book.
I then filtered the data to only years after 1990, like the book did, and changed the pivot table to filter on state. Here is California:
The color really shows how the number of sightings is increasing in CA and seems to be more consistent in WA. I do wonder what happened in 1995 in WA to cause that spike.
The next chapter is about regression analysis, so I assume that I can use the built-in functions of Excel for that too…