Machine Learning for Hackers: Using F#
May 21, 2013
I decided I wanted to learn more about F# for my Road Alert project. I started by watching this great video. After reviewing it a couple of times, I realized that I could try chapter 1 of Machine Learning for Hackers using F#.
Since I already had the data from this blog post, I just had to follow Luca’s example. I wrote the following code in an F# project in Visual Studio 2012.
- open System.IO
- type UFOLibrary() =
-     member this.GetDetailData() =
-         // Verbatim string (@) so the backslashes in the path are taken literally
-         let path = @"C:\Users\Jamie\Documents\Visual Studio 2012\Projects\MachineLearningWithFSharp_Solution\Tff.MachineLearningWithFSharp.Chapter01\ufo_awesome.txt"
-         // 'use' disposes the stream and the reader when the method returns
-         use fileStream = new FileStream(path,FileMode.Open,FileAccess.Read)
-         use streamReader = new StreamReader(fileStream)
-         let contents = streamReader.ReadToEnd()
-         let usStates =
-             [|"AL";"AK";"AZ";"AR";"CA";"CO";"CT";"DE";"DC";"FL";"GA";"HI";"ID";"IL";"IN";"IA";
-               "KS";"KY";"LA";"ME";"MD";"MA";"MI";"MN";"MS";"MO";"MT";"NE";"NV";"NH";"NJ";"NM";
-               "NY";"NC";"ND";"OH";"OK";"OR";"PA";"RI";"SC";"SD";"TN";"TX";"UT";"VT";"VA";"WA";
-               "WV";"WI";"WY"|]
-         let cleanContents =
-             contents.Split([|'\n'|])
-             |> Seq.map(fun line -> line.Split([|'\t'|]))
-             |> Seq.head
-         cleanContents
I then added a C# console project to the solution and added the following code:
- static void Main(string[] args)
- {
- Console.WriteLine("Start");
- UFOLibrary ufoLibrary = new UFOLibrary();
- foreach (String currentString in ufoLibrary.GetDetailData())
- {
- Console.WriteLine(currentString);
- }
- Console.WriteLine("End");
- Console.ReadKey();
- }
Sure enough, when I hit F5, the fields of the first row printed to the console.
How cool is it to call F# code from a C# project and have it just work? I feel like a whole new world of possibilities just opened to me.
I then went back to the book and saw that they used the head function in R, which returns the first several rows of data (six by default). The F# Seq.head returns only the first element, so I made the following change to my F# to return the first 10 rows:
- let cleanContents =
- contents.Split([|'\n'|])
- |> Seq.map(fun line -> line.Split([|'\t'|]))
- |> Seq.take(10)
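To make the difference between these functions concrete, here is a minimal sketch. The `rows` list is hypothetical sample data standing in for the lines of the UFO file, and `Seq.truncate` is an alternative I am suggesting rather than something the original code uses.

```fsharp
// Hypothetical sample data standing in for the lines of ufo_awesome.txt.
let rows = [ "row1"; "row2"; "row3" ]

// Seq.head returns a single element (the first one), not a sequence.
let first = rows |> Seq.head

// Seq.take n returns the first n elements, but throws during enumeration
// if the sequence holds fewer than n elements.
let firstTwo = rows |> Seq.take 2 |> Seq.toList

// Seq.truncate n returns at most n elements and does not throw on short input.
let upToTen = rows |> Seq.truncate 10 |> Seq.toList

printfn "%s" first
printfn "%A" firstTwo
printfn "%A" upToTen
```

For a file that is known to have far more than 10 rows, `Seq.take` is fine; `Seq.truncate` is only the safer choice when the input might be shorter than expected.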
I then had to remove the defective rows that had malformed data. To get a baseline row count first, I went back to the F# code and dropped the Seq.take call so that every row comes back:
- let cleanContents =
- contents.Split([|'\n'|])
- |> Seq.map(fun line -> line.Split([|'\t'|]))
I then went back to the Console app to change it like this:
- Console.WriteLine("Start");
- UFOLibrary ufoLibrary = new UFOLibrary();
- IEnumerable<String[]> rows = ufoLibrary.GetDetailData();
- Console.WriteLine(String.Format("Number of rows: {0}", rows.Count()));
- Console.WriteLine("End");
- Console.ReadKey();
And I see this when I hit F5
So now I have a baseline of 61,394 rows.
My 1st step is to remove rows that do not have 6 columns. To do that, I added this filter to the F# code and left the Console app unchanged:
- |> Seq.filter(fun values -> values |> Seq.length = 6)
and when I hit F5, I can see that the number of records has dropped:
I then wanted to remove the bad date fields the way they did it in the book: all dates have to be exactly 8 characters in length.
Going back to the F# code, I added this line
- |> Seq.filter(fun values -> values.[0].Length = 8)
and sure enough, fewer records in my dataset:
Finally, I applied the same logic to the second column, which is also a date:
- |> Seq.filter(fun values -> values.[1].Length = 8)
The count did not change, which raises eyebrows: I assumed there would be some malformed data in the 2nd column independent of the 1st column, but I guess not.
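The three structural filters so far (column count, then the length of each date column) can be tried on a handful of rows. This is only a sketch: the `lines` below are made-up stand-ins for the real file, and the intermediate names (`sixColumns`, `goodFirstDate`, `goodBothDates`) are mine.

```fsharp
// Hypothetical tab-separated sample rows; the real file has six columns per row.
let lines =
    [ "19951009\t19951011\tIowa City, IA\t\t\tfirst description"   // well-formed
      "19951010\t19951011\tMilwaukee, WI"                          // only three columns
      "0000\t19951011\tShelton, WA\t\t\tsecond description"        // first date too short
      "19950101\t1995\tColumbia, MO\t\t\tthird description" ]      // second date too short

// Split one line into its tab-separated fields (empty fields are kept).
let split (line: string) = line.Split([| '\t' |])

let parsed = lines |> Seq.map split
let sixColumns = parsed |> Seq.filter (fun values -> values.Length = 6)
let goodFirstDate = sixColumns |> Seq.filter (fun values -> values.[0].Length = 8)
let goodBothDates = goodFirstDate |> Seq.filter (fun values -> values.[1].Length = 8)

// Print the row count after each stage to see what every filter removes.
printfn "all rows:         %d" (Seq.length parsed)
printfn "six columns:      %d" (Seq.length sixColumns)
printfn "good first date:  %d" (Seq.length goodFirstDate)
printfn "good both dates:  %d" (Seq.length goodBothDates)
```

The order matters: the column-count filter has to run first, otherwise indexing `values.[1]` on a short row would throw.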
I then wanted to convert the 1st two columns from strings into DateTimes. Going back to Luca’s examples, I did this:
- |> Seq.map(fun values ->
- System.DateTime.Parse(values.[0]),
- System.DateTime.Parse(values.[1]),
- values.[2],
- values.[3],
- values.[4],
- values.[5])
Interestingly, I then went back to my Console application and got this
Error 1 Cannot implicitly convert type ‘System.Collections.Generic.IEnumerable<System.Tuple<System.DateTime,System.DateTime,string,string,string,string>>’ to ‘System.Collections.Generic.IEnumerable<string[]>’. An explicit conversion exists (are you missing a cast?)
So I then did this:
- var rows = ufoLibrary.GetDetailData();
so I could compile again. When I ran it, I got this exception:
So it looks like R can handle YYYYMMDD while F# DateTime.Parse() cannot. So I looked into the different ways to parse dates in .NET and changed the parsing to this:
- // "MM" is the month specifier; lowercase "mm" would be read as minutes
- System.DateTime.ParseExact(values.[0],"yyyyMMdd",System.Globalization.CultureInfo.InvariantCulture),
- System.DateTime.ParseExact(values.[1],"yyyyMMdd",System.Globalization.CultureInfo.InvariantCulture),
When I ran it, I got this:
Which I am not sure is progress. Then it hit me that the data in the strings might be out of bounds, for example a month of “13”. So I added the following filters to the dataset:
- |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(0,4)) > 1900)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(0,4)) > 1900)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(0,4)) < 2100)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(0,4)) < 2100)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(4,2)) > 0)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(4,2)) > 0)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(4,2)) <= 12)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(4,2)) <= 12)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(6,2)) > 0)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(6,2)) > 0)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(6,2)) <= 31)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(6,2)) <= 31)
Sure enough, now when I run it:
Which matches the book’s R example.
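As an alternative to a dozen range filters, the validation and the parsing could be done in one step with `DateTime.TryParseExact`. The sketch below is not what the code above does; `tryParseDate` and the sample strings are hypothetical names and data of mine. It also shows why the case of the format string matters: in .NET custom date formats, "MM" is the month and "mm" is minutes.

```fsharp
open System
open System.Globalization

// Returns Some date when the text is a real calendar date in yyyyMMdd form,
// and None otherwise (wrong length, month 13, February 31, and so on).
let tryParseDate (text: string) =
    match DateTime.TryParseExact(text, "yyyyMMdd", CultureInfo.InvariantCulture, DateTimeStyles.None) with
    | true, date -> Some date
    | false, _ -> None

// Hypothetical sample values: one valid date and three invalid ones.
let samples = [ "19951009"; "19951309"; "20040231"; "0000" ]

// List.choose keeps only the values that parsed successfully.
let valid = samples |> List.choose tryParseDate

// With lowercase "mm" the middle two digits are parsed as minutes,
// so the month is never read from the text.
let wrong = DateTime.ParseExact("19951009", "yyyymmdd", CultureInfo.InvariantCulture)

printfn "valid dates: %d" (List.length valid)
printfn "minute parsed by the lowercase format: %d" wrong.Minute
```

In the pipeline, a `Seq.choose` over a function that returns `None` for any row with an unparseable date would replace both the range filters and the final `ParseExact` calls. It also rejects dates such as February 31, which pass a simple "day <= 31" check.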
I then wanted to match what the book does in terms of cleaning the city,state field (column). We are only interested in data from the United States that follows the “City,State” pattern. The R example does some conditional logic to clean this data up, which I didn’t want to do in F#.
So I added this filter that splits the City,State column and checks that the state value is only 2 characters in length. The book's R code strips the white space; in F# I used “Trim()”:
- |> Seq.filter(fun values -> values.[2].Split(',').[1].Trim().Length = 2)
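One thing to watch with this filter is that indexing `.[1]` throws if a location has no comma at all. A more defensive version could pattern match on the split result. This is a sketch under my own assumptions: `tryCityState` is a hypothetical helper, and it is stricter than the filter above because it requires exactly one comma.

```fsharp
// Splits "City, ST" into (city, state). Returns None unless the text has
// exactly one comma and the state part is two characters long after trimming.
let tryCityState (location: string) =
    match location.Split(',') with
    | [| city; state |] when state.Trim().Length = 2 ->
        Some (city.Trim(), state.Trim().ToUpperInvariant())
    | _ -> None

// Hypothetical sample locations.
printfn "%A" (tryCityState "Iowa City, IA")
printfn "%A" (tryCityState "milwaukee, wi")
printfn "%A" (tryCityState "Paris, France")
printfn "%A" (tryCityState "Somewhere")
```

Because the helper returns both parts already trimmed, the same call can feed the filter and the final projection, instead of splitting the column again at each step.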
Next, the book limits the location values to only the United States. To do that, it creates a list of all 50 postal codes (lower case) and compares it against the state portion of the location field. To that end, I added a string array (upper case, and including DC) like so:
- let usStates = [|"AL";"AK";"AZ";"AR";"CA";"CO";"CT";"DE";"DC";"FL";"GA";"HI";"ID";"IL";"IN";"IA";
- "KS";"KY";"LA";"ME";"MD";"MA";"MI";"MN";"MS";"MO";"MT";"NE";"NV";"NH";"NJ";"NM";
- "NY";"NC";"ND";"OH";"OK";"OR";"PA";"RI";"SC";"SD";"TN";"TX";"UT";"VT";"VA";"WA";
- "WV";"WI";"WY"|]
I then added this filter (it took me about 45 minutes to figure out):
- |> Seq.filter(fun values -> Seq.exists(fun elem -> elem = values.[2].Split(',').[1].Trim().ToUpperInvariant()) usStates)
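The membership test can also be written against an F# `Set`, which reads a little more directly than `Seq.exists` with a lambda. This is a sketch, not the code used above: the state list is deliberately shortened to three entries, and `isUsState` is a hypothetical helper name.

```fsharp
// Shortened for the sketch; the real list holds every state code plus DC.
let usStates = set [ "IA"; "WI"; "WA" ]

// True when the text after the first comma, trimmed and upper-cased,
// is one of the known state codes. Locations without a comma return false.
let isUsState (location: string) =
    match location.Split(',') with
    | parts when parts.Length >= 2 ->
        usStates.Contains(parts.[1].Trim().ToUpperInvariant())
    | _ -> false

// Hypothetical sample locations.
printfn "%b" (isUsState "Iowa City, IA")
printfn "%b" (isUsState "milwaukee, wi")
printfn "%b" (isUsState "Toronto, ON")
printfn "%b" (isUsState "Nowhere")
```

With this helper, the filter line would shrink to something like `Seq.filter(fun values -> isUsState values.[2])`.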
So now I am halfway done with Chapter 1: the data has been cleaned and is ready to be analyzed. Here is the code that I have so far:
- member this.GetDetailData() =
-     // Verbatim string (@) so the backslashes in the path are taken literally
-     let path = @"C:\Users\Jamie\Documents\Visual Studio 2012\Projects\MachineLearningWithFSharp_Solution\Tff.MachineLearningWithFSharp.Chapter01\ufo_awesome.txt"
-     // 'use' disposes the stream and the reader when the method returns
-     use fileStream = new FileStream(path,FileMode.Open,FileAccess.Read)
-     use streamReader = new StreamReader(fileStream)
-     let contents = streamReader.ReadToEnd()
-     let usStates =
-         [|"AL";"AK";"AZ";"AR";"CA";"CO";"CT";"DE";"DC";"FL";"GA";"HI";"ID";"IL";"IN";"IA";
-           "KS";"KY";"LA";"ME";"MD";"MA";"MI";"MN";"MS";"MO";"MT";"NE";"NV";"NH";"NJ";"NM";
-           "NY";"NC";"ND";"OH";"OK";"OR";"PA";"RI";"SC";"SD";"TN";"TX";"UT";"VT";"VA";"WA";
-           "WV";"WI";"WY"|]
-     let cleanContents =
-         contents.Split([|'\n'|])
-         |> Seq.map(fun line -> line.Split([|'\t'|]))
-         |> Seq.filter(fun values -> values |> Seq.length = 6)
-         |> Seq.filter(fun values -> values.[0].Length = 8)
-         |> Seq.filter(fun values -> values.[1].Length = 8)
-         |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(0,4)) > 1900)
-         |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(0,4)) > 1900)
-         |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(0,4)) < 2100)
-         |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(0,4)) < 2100)
-         |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(4,2)) > 0)
-         |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(4,2)) > 0)
-         |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(4,2)) <= 12)
-         |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(4,2)) <= 12)
-         |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(6,2)) > 0)
-         |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(6,2)) > 0)
-         |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(6,2)) <= 31)
-         |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(6,2)) <= 31)
-         |> Seq.filter(fun values -> values.[2].Split(',').[1].Trim().Length = 2)
-         |> Seq.filter(fun values -> Seq.exists(fun elem -> elem = values.[2].Split(',').[1].Trim().ToUpperInvariant()) usStates)
-         |> Seq.map(fun values ->
-             // "MM" is the month specifier; lowercase "mm" would be read as minutes
-             System.DateTime.ParseExact(values.[0],"yyyyMMdd",System.Globalization.CultureInfo.InvariantCulture),
-             System.DateTime.ParseExact(values.[1],"yyyyMMdd",System.Globalization.CultureInfo.InvariantCulture),
-             values.[2].Split(',').[0].Trim(),
-             values.[2].Split(',').[1].Trim().ToUpperInvariant(),
-             values.[3],
-             values.[4],
-             values.[5])
-     cleanContents
I now want to finish up the chapter where the analysis happens. The book uses R's ggplot2 plotting library. Following Luca’s example, I went to the Flying Frog libraries and, alas, there is no longer a free edition.
So I am a bit stuck. I’ll continue to work on it for next week’s blog…