Multiple Linear Regression Using R and F#
September 3, 2013
Following up on my previous post, I decided to test calling R from F# for a multiple linear regression, using the dataset from chapter 1 of Machine Learning for Hackers (UFO sightings).
Step #1 was to open R from F#:
- #r @"C:\TFS\Tff.RDotNetExample_Solution\packages\R.NET.1.5.3\lib\net40\RDotNet.dll"
- #r @"C:\TFS\Tff.RDotNetExample_Solution\packages\R.NET.1.5.3\lib\net40\RDotNet.NativeLibrary.dll"
- open System.IO
- open RDotNet
- //open R
- let environmentPath = System.Environment.GetEnvironmentVariable("PATH")
- let binaryPath = @"C:\Program Files\R\R-3.0.1\bin\x64"
- System.Environment.SetEnvironmentVariable("PATH",environmentPath+System.IO.Path.PathSeparator.ToString()+binaryPath)
- let engine = RDotNet.REngine.CreateInstance("RDotNet")
- engine.Initialize()
Step #2 was to import the ufo dataset and clean it:
- //open dataset
- let path = @"C:\TFS\Tff.RDotNetExample_Solution\Tff.RDotNetExample\ufo_awesome.txt"
- // File.ReadAllText opens, reads, and closes the file in one call
- let contents = File.ReadAllText(path)
- let usStates = [|"AL";"AK";"AZ";"AR";"CA";"CO";"CT";"DE";"DC";"FL";"GA";"HI";"ID";"IL";"IN";"IA";
- "KS";"KY";"LA";"ME";"MD";"MA";"MI";"MN";"MS";"MO";"MT";"NE";"NV";"NH";"NJ";"NM";
- "NY";"NC";"ND";"OH";"OK";"OR";"PA";"RI";"SC";"SD";"TN";"TX";"UT";"VT";"VA";"WA";
- "WV";"WI";"WY"|]
- let cleanContents =
- contents.Split([|'\n'|])
- |> Seq.map(fun line -> line.Split([|'\t'|]))
- |> Seq.filter(fun values -> values |> Seq.length = 6)
- |> Seq.filter(fun values -> values.[0].Length = 8)
- |> Seq.filter(fun values -> values.[1].Length = 8)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(0,4)) > 1900)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(0,4)) > 1900)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(0,4)) < 2100)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(0,4)) < 2100)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(4,2)) > 0)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(4,2)) > 0)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(4,2)) <= 12)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(4,2)) <= 12)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(6,2)) > 0)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(6,2)) > 0)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(6,2)) <= 31)
- |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(6,2)) <= 31)
- |> Seq.filter(fun values -> values.[2].Split(',').[1].Trim().Length = 2)
- |> Seq.filter(fun values -> Seq.exists(fun elem -> elem = values.[2].Split(',').[1].Trim().ToUpperInvariant()) usStates)
- |> Seq.map(fun values ->
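- // "MM" is the month specifier; lowercase "mm" means minutes and would leave the month at January
- // Note: ParseExact throws on impossible dates (e.g. 20010230) that the range filters above let through
- System.DateTime.ParseExact(values.[0],"yyyyMMdd",System.Globalization.CultureInfo.InvariantCulture),
- System.DateTime.ParseExact(values.[1],"yyyyMMdd",System.Globalization.CultureInfo.InvariantCulture),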
- values.[2].Split(',').[0].Trim(),
- values.[2].Split(',').[1].Trim().ToUpperInvariant(),
- values.[3],
- values.[4],
- values.[5])
- cleanContents
- let relevantContents =
- cleanContents
- |> Seq.map(fun (a,b,c,d,e,f,g) -> a.Year,d,g.Length)
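The range filters in Step #2 check year, month, and day separately, so an impossible date such as February 30 still gets through. As a minimal sketch of an alternative, assuming the dates are always eight digits in year-month-day order, DateTime.TryParseExact can validate the whole value in one go. The helper name tryParseDate and the sample strings are made up for illustration; they are not rows from the ufo file.

```fsharp
open System
open System.Globalization

// Hypothetical helper: Some date when the text is a valid yyyyMMdd date, otherwise None.
let tryParseDate (text: string) =
    match DateTime.TryParseExact(text, "yyyyMMdd", CultureInfo.InvariantCulture, DateTimeStyles.None) with
    | true, date -> Some date
    | false, _ -> None

// Made-up sample values: one valid date, February 30, a short string, and month 13.
let samples = [ "19951009"; "20010230"; "0000"; "19951309" ]

// List.choose keeps only the values that parsed successfully.
let parsed = samples |> List.choose tryParseDate
```

This does not enforce the 1900 to 2100 year window, so that filter would still be needed alongside it.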
Step #3 was to run the regression using the dataset. You will notice that I made the length of the report the Y (dependent) variable. I don't expect to find any causality, but it was good enough to use. Also, notice the Seq.map of each column in the larger seq<int * string * int> into its own vector.
- let reportLength = engine.CreateIntegerVector(relevantContents |> Seq.map (fun (a,b,c) -> c))
- engine.SetSymbol("reportLength", reportLength)
- let year = engine.CreateIntegerVector(relevantContents |> Seq.map (fun (a,b,c) -> a))
- engine.SetSymbol("year", year)
- let state = engine.CreateCharacterVector(relevantContents |> Seq.map (fun (a,b,c) -> b))
- engine.SetSymbol("state", state)
- let calcExpression = "lm(formula = reportLength ~ year + state)"
- let testResult = engine.Evaluate(calcExpression).AsList()
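One thing to watch with the column projections above: relevantContents is a lazy seq, so every Seq.map over it re-runs the whole cleaning pipeline. The sketch below shows the same projection on a small made-up sequence with the same (year, state, reportLength) shape, materialized once with Seq.toArray. The rows and the evaluations counter are illustrative only.

```fsharp
// Counts how many times the lazy pipeline touches a row (illustration only).
let evaluations = ref 0

// Made-up rows in the same (year, state, reportLength) shape as relevantContents.
let lazyRows =
    [ (1995, "IA", 120); (2001, "WA", 87); (2008, "TX", 240) ]
    |> Seq.map (fun row ->
        evaluations.Value <- evaluations.Value + 1
        row)

// Materialize once so the three projections below do not re-run the pipeline.
let rows = lazyRows |> Seq.toArray

// One array per column, ready to be handed to a vector constructor.
let years = rows |> Array.map (fun (year, _, _) -> year)
let states = rows |> Array.map (fun (_, state, _) -> state)
let lengths = rows |> Array.map (fun (_, _, length) -> length)
```

Without the Seq.toArray step, each of the three projections would enumerate lazyRows again, which for the full ufo file means splitting and filtering every line three times.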
Sure enough, you can get the results of the regression. The challenge is teasing out the interesting values from the data structure that R returns (testResult in this example):
> testResult.Item(0).AsCharacter();;
val it : CharacterVector =
seq
["31775.5599180962"; "-15.2760355122386"; "37.8028841898059";
"-91.2309146099364"; …]
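Item(0) holds the model coefficients: the first value is the intercept, the second is the coefficient for year, and the remaining values are the coefficients for the state dummy variables.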
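To make the numbers a little more concrete, here is a rough sketch that turns the first two strings from the output above into a fitted value. It assumes they are the intercept and the year coefficient, which is how R normally orders lm coefficients, and that the baseline state is whichever one R left out of the dummy variables. The function name fittedForBaselineState is made up.

```fsharp
// First two values copied from the FSI output; treating them as
// (Intercept) and the year coefficient is an assumption.
let coefficientText = [| "31775.5599180962"; "-15.2760355122386" |]

// F#'s float conversion parses strings using the invariant culture.
let coefficients = coefficientText |> Array.map float
let intercept = coefficients.[0]
let yearSlope = coefficients.[1]

// Hypothetical helper: fitted report length for the baseline state in a given year.
let fittedForBaselineState (year: int) = intercept + yearSlope * float year

let fitted2000 = fittedForBaselineState 2000
```

The negative year coefficient means the fitted report length drops by about 15 characters for each later year, holding the state fixed.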