Traffic Stop Disposition: Classification Using F# and KNN

I have already looked at the summary statistics of the traffic stop data I received from the town here.  My next step was to try a machine learning exercise with the data.  One of the more interesting questions I want to answer is what factors into whether a person gets a warning or a ticket (called the disposition)?  Of all of the factors that may be involved, the dataset that I have is fairly limited:

[image: available dataset fields]

Using dispositionId as the result variable, the remaining fields are StopDateTime and Location (Latitude/Longitude).  Fortunately, the DateTime can be decomposed into several input variables.  For this exercise, I wanted to use the following:

  • TimeOfDay
  • DayOfWeek
  • DayOfMonth
  • MonthOfYear
  • Location (Latitude:Longitude)

And the resulting variable is the disposition.  To make the analysis easier, I limited the analysis set to records whose final disposition was either “verbal warning” or “citation”.  I decided to use K-Nearest Neighbors (KNN) because it is regarded as an easy machine learning algorithm to learn and the question does seem to be a classification problem.

My first step was to decide whether to write or borrow the KNN algorithm.  After looking at what kind of code would be needed to write my own and then looking at some other libraries, I decided to use Accord.Net.
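
For context, the core of a hand-rolled KNN is not much code.  Here is a minimal sketch of what I would have had to write and maintain myself (the list-based shape and the distance parameter are my own illustration, not Accord.Net’s API):

    // A minimal KNN sketch: classify an input by taking the majority label
    // among the k nearest training points.
    let classify k distance (trainingSet: ('a * int) list) input =
        trainingSet
        |> List.sortBy (fun (features, _) -> distance features input)  // nearest first
        |> Seq.truncate k                                               // keep the k closest
        |> Seq.countBy snd                                              // tally the labels
        |> Seq.maxBy snd                                                // majority label wins
        |> fst

The real work is in choosing k and the distance function, which is exactly where a tested library earns its keep.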

My next step was to get the data via the web service I spun up here.

  1. namespace ChickenSoftware.RoadAlert.Analysis
  2.  
  3. open FSharp.Data
  4. open Microsoft.FSharp.Data.TypeProviders
  5. open Accord.MachineLearning
  6.  
  7. type roadAlert2 = JsonProvider<"http://chickensoftware.com/roadalert/api/trafficstopsearch/Sample">
  8. type MachineLearningEngine =
  9.     static member RoadAlertDoc = roadAlert2.Load("http://chickensoftware.com/roadalert/api/trafficstopsearch")

My next step was to filter the data to only verbal warnings (7) or citations (15). 

  1.   static member BaseDataSet =
  2.       MachineLearningEngine.RoadAlertDoc
  3.           |> Seq.filter(fun x -> x.DispositionId = 7 || x.DispositionId = 15)
  4.           |> Seq.map(fun x -> x.Id, x.StopDateTime, x.Latitude, x.Longitude, x.DispositionId)
  5.           |> Seq.map(fun (a,b,c,d,e) -> a, b, System.Math.Round(c,3), System.Math.Round(d,3), e)
  6.           |> Seq.map(fun (a,b,c,d,e) -> a, b, c.ToString() + ":" + d.ToString(), e)
  7.           |> Seq.map(fun (a,b,c,d) -> a,b,c, match d with
  8.                                               |7 -> 0
  9.                                               |15 -> 1
  10.                                               |_ -> 1)
  11.           |> Seq.map(fun (a,b,c,d) -> a, b.Hour, b.DayOfWeek.GetHashCode(), b.Day, b.Month, c, d)
  12.           |> Seq.toList

You will notice that I had to transform the dispositionIds from 7 and 15 to 0 and 1.  The reason is that the KNN method in Accord.Net assumes that the class labels match the index positions in the output array.  I had to dig into the source code of Accord.Net to figure that one out.

My next step was to divide the dataset in half: one half being the training sample and the other the validation sample:

  1. static member TrainingSample =
  2.     let midNumber = MachineLearningEngine.NumberOfRecords/ 2
  3.     MachineLearningEngine.BaseDataSet
  4.         |> Seq.filter(fun (a,b,c,d,e,f,g) -> a < midNumber)
  5.         |> Seq.toList
  6.  
  7. static member ValidationSample =
  8.     let midNumber = MachineLearningEngine.NumberOfRecords/ 2
  9.     MachineLearningEngine.BaseDataSet
  10.         |> Seq.filter(fun (a,b,c,d,e,f,g) -> a > midNumber)
  11.         |> Seq.toList

The next step was to actually run the KNN.  Before I could do that, though, I had to create the distance function.  Since this was my first time, I dropped the geocoordinates and focused only on the date/time-derived values.

  1. static member RunKNN inputs outputs input =
  2.     let distanceFunction (a:int,b:int,c:int,d:int) (e:int,f:int,g:int,h:int) =  
  3.       let b1 = b * 4
  4.       let f1 = f * 4
  5.       let d1 = d * 2
  6.       let h1 = h * 2
  7.       float((pown(a-e) 2) + (pown(b1-f1) 2) + (pown(c-g) 2) + (pown(d1-h1) 2))
  8.  
  9.     let distanceDelegate =
  10.           System.Func<(int * int * int * int),(int * int * int * int),float>(distanceFunction)
  11.     
  12.     let knn = new KNearestNeighbors<int*int*int*int>(10,2,inputs,outputs,distanceDelegate)
  13.     knn.Compute(input)

You will notice I tried to normalize the values so that they all had the same basis.  They are not exact, but they are close.  You will also notice that I had to create a delegate from the distanceFunction (thanks to Mimo on SO).  This is because Accord.Net was written in C# with C# consumers in mind, and F# has a couple of places where the interfaces are not as seamless as one would hope.
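
For the record, the exact scale factors are easy to derive: the hour spans 24 values, the day of week 7, the day of month 31, and the month 12, so day-of-week needs a multiplier of roughly 24/7 ≈ 3.4 (I used 4) and month needs 24/12 = 2 (which I used).  A sketch of a fully normalized distance function, assuming the tuple is (hour, dayOfWeek, dayOfMonth, month) as built above:

    // Divide each component's difference by its range so every feature
    // contributes on the same basis before squaring.
    let normalizedDistance (a:int,b:int,c:int,d:int) (e:int,f:int,g:int,h:int) =
        let scaled (range:float) (x:int) (y:int) = float (x - y) / range
        let hour  = scaled 24. a e   // hour of day, 0-23
        let dow   = scaled 7.  b f   // day of week, 0-6
        let day   = scaled 31. c g   // day of month, 1-31
        let month = scaled 12. d h   // month of year, 1-12
        hour * hour + dow * dow + day * day + month * month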

In any event, once the KNN function was written, I wrote a function that took the validation sample, made a guess via KNN, and then reported the result:

  1. static member GetValidationsViaKNN =
  2.     let inputs = MachineLearningEngine.TrainingInputClass
  3.     let outputs = MachineLearningEngine.TrainingOutputClass
  4.     let validations = MachineLearningEngine.ValidationClass
  5.  
  6.     validations
  7.         |> Seq.map(fun (a,b,c,d,e) -> e, MachineLearningEngine.RunKNN inputs outputs (a,b,c,d))
  8.         |> Seq.toList
  9.  
  10. static member GetSuccessPercentageOfValidations =
  11.     let validations = MachineLearningEngine.GetValidationsViaKNN
  12.     let matches = validations
  13.                     |> Seq.map(fun (a,b) -> match (a=b) with
  14.                                                 | true -> 1
  15.                                                 | false -> 0)
  16.  
  17.     let recordCount =  validations |> Seq.length
  18.     let numberCorrect = matches |> Seq.sum
  19.     let successPercentage = double(numberCorrect) / double(recordCount)
  20.     recordCount, numberCorrect, successPercentage

I then hopped over to my UI console app and looked at the success percentage.

 

  1. private static void GetSuccessPercentageOfValidations()
  2. {
  3.     var output = MachineLearningEngine.GetSuccessPercentageOfValidations;
  4.     Console.WriteLine(output.Item1.ToString() + ":" + output.Item2.ToString() + ":" + output.Item3.ToString());
  5. }

[image: console output]

So there are 12,837 records in the validation sample, and the classifier guessed the correct disposition 9,001 times, a success percentage of 70%.

So it looks like there is something there.  However, it is not clear that this is a good classifier without further tests, specifically comparing it against the baseline of simply guessing the most common disposition for every record.  Also, I would assume that to make this a true ‘machine learning’ exercise I would have to feed the results back into the distance function to see if I can alter it to get the success percentage higher.
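
A minimal sketch of that baseline check, reusing the (actual, predicted) pairs from GetValidationsViaKNN above (this member is my addition for illustration):

    // What success percentage would we get by always guessing the most
    // common disposition in the validation sample?
    static member GetBaselineSuccessPercentage =
        let actuals = MachineLearningEngine.GetValidationsViaKNN
                        |> List.map fst
        let majorityCount = actuals
                            |> Seq.countBy id
                            |> Seq.map snd
                            |> Seq.max
        float majorityCount / float actuals.Length

If that number comes back close to 70%, the classifier is not adding much over always guessing the majority class.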

One quick note about methodology: I used unit tests pretty extensively to understand how the KNN works.  I created a series of tests with some sample data to see how the function reacted. 

  1. [TestMethod]
  2. public void TestKNN_ReturnsExpected()
  3. {
  4.  
  5.     Tuple<int, int, int, int>[] inputs = {
  6.         new Tuple<int, int, int, int>(1, 0, 15, 1),
  7.         new Tuple<int,int,int,int>(1,0,11,1)};
  8.     int[] outputs = { 1, 1 };
  9.  
  10.     var input = new Tuple<int, int, int, int>(1, 1, 1, 1);
  11.  
  12.     var output = MachineLearningEngine.RunKNN(inputs, outputs, input);
  13.  
  14. }

This was a big help to get me up and running (walking, really..)…

Traffic Stop Analysis Using F#

Now that I have the traffic stop services up and running, it is time to actually do something with the data.  The data set is all traffic stops in my town for 2012 with some limited information: date/time of the stop, the geolocation of the stop, and the final disposition of the stop.  The data looks like this:

[image: sample data]

My first step was to look at the date/time and see if there are any patterns in DayOfMonth, MonthOfYear, and TimeOfDay.  To that end, I spun up an F# project and added my first method, which determines the total number of records in the dataset:

  1. type roadAlert = JsonProvider<"http://chickensoftware.com/roadalert/api/trafficstopsearch/Sample">
  2. type AnalysisEngine =
  3.     static member RoadAlertDoc = roadAlert.Load("http://chickensoftware.com/roadalert/api/trafficstopsearch")
  4.  
  5.     static member NumberOfRecords =
  6.         AnalysisEngine.RoadAlertDoc
  7.             |> Seq.length

Since I am a TDDer more than a REPLer, I went and wrote a covering unit test.

  1. [TestMethod]
  2. public void NumberOfRecords_ReturnsExpected()
  3. {
  4.     Int32 notExpected = 0;
  5.     Int32 actual = AnalysisEngine.NumberOfRecords;
  6.     Assert.AreNotEqual(notExpected, actual);
  7. }

A couple of things to note about this:

1) This is really an integration test, not a unit test.  I could have written the test like this:

  1. [TestMethod]
  2. public void NumberOfRecordsFor2012DataSet_ReturnsExpected()
  3. {
  4.     Int32 expected = 27778;
  5.     Int32 actual = AnalysisEngine.NumberOfRecords;
  6.     Assert.AreEqual(expected, actual);
  7. }

But that means I am tying the test to the specific data sample (in its current state) – and I don’t want to do that.

2) I am finding that my F# code has many more functions than the code written by other people, especially data scientists.  I think it has to do with contrasting methodologies.  Instead of spending time in the REPL getting a small piece of code right and then adding it to the larger code base, I write a very small piece of code in the class and then use unit tests to get it right.  The upshot is that there are lots of small, independently testable pieces of code; I think this stems from my background of writing production apps for business problems rather than for academic papers.  Also, I use classes in source files versus script files because I plan to plug the code into larger .NET applications that will be written in C# and/or VB.NET.

In any event, once I had the total number of records, I went to see how they broke down by month:

  1. static member ActualTrafficStopsByMonth =
  2.     AnalysisEngine.RoadAlertDoc
  3.         |> Seq.map(fun x -> x.StopDateTime.Month)
  4.         |> Seq.countBy(fun x-> x)
  5.         |> Seq.toList

  1. [TestMethod]
  2. public void ActualTrafficStopsByMonth_ReturnsExpected()
  3. {
  4.     Int32 notExpected = 0;
  5.     var stops = AnalysisEngine.ActualTrafficStopsByMonth;
  6.     Assert.AreNotEqual(notExpected, stops.Length);
  7.  
  8. }

 

I then created a function that shows the expected number of stops by month.  Pattern matching with F# makes creating the month list a snap.  Note that this is a true unit test because I am not dependent on external data:

  1. static member Months =
  2.     let monthList = [1..12]
  3.     Seq.map (fun x ->
  4.             match x with
  5.                 | 1 | 3 | 5 | 7 | 8 | 10 | 12 -> x,31,31./365.
  6.                 | 2 -> x,28,28./365.
  7.                 | 4 | 6 | 9 | 11 -> x,30, 30./365.
  8.                 | _ -> x,0,0.                    
  9.         ) monthList
  10.     |> Seq.toList   

  1. static member ExpectedTrafficStopsByMonth numberOfStops =
  2.     AnalysisEngine.Months
  3.         |> Seq.map(fun (x,y,z) ->
  4.             x, int(z*numberOfStops))
  5.         |> Seq.toList

  1. [TestMethod]
  2. public void ExpectedTrafficStopsByMonth_ReturnsExpected()
  3. {
  4.     var stops = AnalysisEngine.ExpectedTrafficStopsByMonth(27778);
  5.     double expected = 2359;
  6.     double actual =stops[0].Item2;
  7.  
  8.     Assert.AreEqual(expected, actual);
  9. }

With the actual and expected ready to go, I then put the two side by side:

  1. static member TrafficStopsByMonth =
  2.     let numberOfStops = float(AnalysisEngine.NumberOfRecords)
  3.     let monthlyExpected = AnalysisEngine.ExpectedTrafficStopsByMonth numberOfStops
  4.     let monthlyActual = AnalysisEngine.ActualTrafficStopsByMonth
  5.     Seq.zip monthlyExpected monthlyActual
  6.         |> Seq.map(fun (x,y) -> fst x, snd x, snd y, snd y - snd x, (float(snd y) - float(snd x))/float(snd x))
  7.         |> Seq.toList

  1. [TestMethod]
  2. public void TrafficStopsByMonth_ReturnsExpected()
  3. {
  4.     var output = AnalysisEngine.TrafficStopsByMonth;
  5.     Assert.IsNotNull(output);
  6.  
  7. }

All of my unit tests ran green

[image: green unit tests]

so now I am ready to roll.  I created a quick console UI:

  1. static void Main(string[] args)
  2. {
  3.     Console.WriteLine("Start");
  4.  
  5.     foreach (var tuple in AnalysisEngine.TrafficStopsByMonth)
  6.     {
  7.         Console.WriteLine(tuple.Item1 + ":" + tuple.Item2 + ":" + tuple.Item3 + ":" + tuple.Item4 + ":" + tuple.Item5);
  8.     }
  9.  
  10.     Console.WriteLine("End");
  11.     Console.ReadKey();
  12. }

[image: console output]

And there is the output.  Obviously, a UX person could put some real pizzazz in front of this data, but that is something to do another day.  If you didn’t see it in the code above, the tuple is constructed as Month, ExpectedStops, ActualStops, Difference, %Difference.  So the really interesting thing is that September was 47% higher than expected while December was 26% lower.  That kind of wide variation begs for more analysis.

I then did a similar analysis by DayOfMonth:

  1. static member ActualTrafficStopsByDay =
  2.     AnalysisEngine.RoadAlertDoc
  3.         |> Seq.map(fun x -> x.StopDateTime.Day)
  4.         |> Seq.countBy(fun x-> x)
  5.         |> Seq.toList
  6.  
  7. static member Days =
  8.     let dayList = [1..31]
  9.     Seq.map (fun x ->
  10.             match x with
  11.                 | x when x < 29 -> x, 12, 12./365.
  12.                 | 29 | 30 -> x, 11, 11./365.
  13.                 | 31 -> x, 7, 7./365.
  14.                 | _ -> x, 0, 0.                 
  15.         ) dayList
  16.     |> Seq.toList     
  17.  
  18. static member ExpectedTrafficStopsByDay numberOfStops =
  19.     AnalysisEngine.Days
  20.         |> Seq.map(fun (x,y,z) ->
  21.             x, int(z*numberOfStops))
  22.         |> Seq.toList    
  23.  
  24. static member TrafficStopsByDay =
  25.     let numberOfStops = float(AnalysisEngine.NumberOfRecords)
  26.     let dailyExpected = AnalysisEngine.ExpectedTrafficStopsByDay numberOfStops
  27.     let dailyActual = AnalysisEngine.ActualTrafficStopsByDay
  28.     Seq.zip dailyExpected dailyActual
  29.         |> Seq.map(fun (x,y) -> fst x, snd x, snd y, snd y - snd x, (float(snd y) - float(snd x))/float(snd x))
  30.         |> Seq.toList

[image: console output]

The interesting thing is that there are higher than expected traffic stops in the last half of the month (especially the 25th and 26th) and much lower in the first part of the month.

And by TimeOfDay

  1. static member ActualTrafficStopsByHour =
  2.     AnalysisEngine.RoadAlertDoc
  3.         |> Seq.map(fun x -> x.StopDateTime.Hour)
  4.         |> Seq.countBy(fun x-> x)
  5.         |> Seq.toList
  6.  
  7. static member Hours =
  8.     let hourList = [1..24]
  9.     Seq.map (fun x ->
  10.                 x,1, 1./24.
  11.         ) hourList
  12.     |> Seq.toList     
  13.  
  14. static member ExpectedTrafficStopsByHour numberOfStops =
  15.     AnalysisEngine.Hours
  16.         |> Seq.map(fun (x,y,z) ->
  17.             x, int(z*numberOfStops))
  18.         |> Seq.toList    
  19.  
  20. static member TrafficStopsByHour =
  21.     let numberOfStops = float(AnalysisEngine.NumberOfRecords)
  22.     let hourlyExpected = AnalysisEngine.ExpectedTrafficStopsByHour numberOfStops
  23.     let hourlyActual = AnalysisEngine.ActualTrafficStopsByHour
  24.     Seq.zip hourlyExpected hourlyActual
  25.         |> Seq.map(fun (x,y) -> fst x, snd x, snd y, snd y - snd x, (float(snd y) - float(snd x))/float(snd x))
  26.         |> Seq.toList

[image: console output]

 

The interesting thing here is that there is a much higher than expected number of traffic stops between 1 and 2 AM (61% and 123% above expected) and significantly fewer between 8 PM and midnight.  Finally, I looked at the GPS locations of the stops.

  1. static member ActualTrafficStopsByGPS =  
  2.     AnalysisEngine.RoadAlertDoc
  3.         |> Seq.map(fun x -> System.Math.Round(x.Latitude,3).ToString() + ":" + System.Math.Round(x.Longitude,3).ToString())
  4.         |> Seq.countBy(fun x-> x)
  5.         |> Seq.sortBy snd
  6.         |> Seq.toList
  7.         |> List.rev
  8.  
  9. static member GetVarianceOfTrafficStopsByGPS =
  10.     let trafficStopList = AnalysisEngine.ActualTrafficStopsByGPS
  11.                             |> Seq.map(fun x -> double(snd x))
  12.                             |> Seq.toList
  13.     AnalysisEngine.Variance(trafficStopList)
  14.  
  15. static member GetAverageOfTrafficStopsByGPS =
  16.     AnalysisEngine.ActualTrafficStopsByGPS
  17.         |> Seq.map(fun x -> double(snd x))
  18.         |> Seq.average

 

You can see that I rounded the Latitude and Longitude to 3 decimal places.  Using Wikipedia’s decimal-degrees table, which says the 4th decimal place is worth 10.24 m of longitude at 23°N and 7.87 m at 45°N, I interpolated a value of about 8.94 m at 35°N.  With 1 m = 3.28 feet, that means 4 decimals is within about 30 feet, 3 decimals within about 300 feet, and 2 decimals within about 3,000 feet.  300 feet seemed like a good compromise, so I ran with that.
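
The interpolation itself is just a weighted average between the two quoted figures; a quick check of the arithmetic:

    // Linearly interpolate meters-per-0.0001-degree between 23N (10.24m)
    // and 45N (7.87m) to estimate the value at 35N, then convert to feet.
    let metersAt35 = 10.24 + (35. - 23.) / (45. - 23.) * (7.87 - 10.24)  // ~8.95m
    let feetAtFourDecimals  = metersAt35 * 3.28                          // ~29 feet
    let feetAtThreeDecimals = feetAtFourDecimals * 10.                   // ~293 feet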

So running the average and variance and the top GPS locations:

[image: console output]

With an average of 11 stops per GPS location (less than 1 a month) and a variance of 725, there does not seem to be a strong relationship between GPS location and traffic stops.
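
For completeness, a minimal population-variance member that would satisfy the AnalysisEngine.Variance call above (a sketch, not the original code):

    // Population variance: the average squared deviation from the mean.
    static member Variance (values: double list) =
        let mean = List.average values
        values |> List.averageBy (fun v -> (v - mean) * (v - mean))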

The upshot of all of this analysis seems to be that to avoid getting stopped, it is less important where you are than when you are.  This is confirmed anecdotally too: the town actually broadcasts when it will have heightened traffic surveillance on Twitter and the like.  Ignore open data at your own risk.  

In any event, my next step is to run this data through a machine-learning algorithm to see if there is anything else to uncover.

Setting up an OData Service on WebAPI2 to be used by F# Type Providers

I am prepping for the F#/Data Analytics workshop on January 8th and wanted to get the data that I used for the Road Alert application back into better shape.  By better, I mean out of the crusty WCF SOAP service it has been in for the last 2 years.  To that end, I jumped over to Mike Wasson’s Creating an OData tutorial.  All in all, it is a good step-by-step guide, but I had to make some changes to get it working for me.

Change #1 is that I am not using a local database; I am using a database located on WinHost.  Therefore, I had to swap out the EF connection string.

[image: EF connection string in the config]

One of the things I can appreciate about the template is the comments for getting the routing set up (I guess there is no attribute-based routing for OData?):

  1.         /*
  2. To add a route for this controller, merge these statements into the Register method of the WebApiConfig class. Note that OData URLs are case sensitive.
  3.  
  4. using System.Web.Http.OData.Builder;
  5. using ChickenSoftware.RoadAlertServices.Models;
  6. ODataConventionModelBuilder builder = new ODataConventionModelBuilder();
  7. builder.EntitySet<TrafficStop>("TrafficStop");
  8. config.Routes.MapODataRoute("odata", "odata", builder.GetEdmModel());
  9. */

Things were looking good when I took a departure from the tutorial and added a couple of unit tests (Change #2).  The first one was fairly benign:

  1. [TestClass]
  2. public class TrafficStopControllerIntegrationTests
  3. {
  4.     [TestMethod]
  5.     public void GetTrafficStopUsingKey_ReturnsExpected()
  6.     {
  7.         TrafficStopController controller = new TrafficStopController();
  8.         var trafficStop = controller.GetTrafficStop(1);
  9.         Assert.IsNotNull(trafficStop);
  10.     }
  11. }

Note that I had to add an app.config to the test project because this is an integration test making a real database call; a unit test would use a mocking framework.  In any event, when I went to run the test, I got a compile error: I needed to add a reference to System.Web.Http.OData to resolve the return value from the controller.  Not a big thing, though I wish I could install packages from NuGet via their .dll name and not just their package name:

[image: NuGet package manager]

In any event, I then ran the test and I got this exception:

[image: exception details]

So this is another reason why EF drives me nuts.  I had to add a reference to Entity Framework (and throw some crap into the .config file)

  1. <entityFramework>
  2.   <defaultConnectionFactory type="System.Data.Entity.Infrastructure.SqlConnectionFactory, EntityFramework" />
  3.   <providers>
  4.     <provider invariantName="System.Data.SqlClient" type="System.Data.Entity.SqlServer.SqlProviderServices, EntityFramework.SqlServer" />
  5.   </providers>
  6. </entityFramework>

– even though the calling application has nothing to do with EF.  In 2014, we still have this kind of dependency drip?  Really?  In any event, once I added a reference to EF and updated the .config file, my unit/integration test ran green, so I was on the right track.

I then went to fiddler and tried to call the controller:

[image: Fiddler response]

Yikes, it looks like my model has to match the database schema exactly.

The database:

[image: database schema]

And the model:

  1. public class TrafficStop
  2. {
  3.     public Int32 Id { get; set; }
  4.     public double CadCallId { get; set; }
  5.     public DateTime StopDateTime { get; set; }
  6.     public Int32 DispositionId { get; set; }
  7.     public String DispositionDesc { get; set; }
  8.     public double Latitude { get; set; }
  9.     public double Longitude { get; set; }
  10. }

 

– I assume that I should be able to override this behavior; that is another thing to research.

So after matching up field names, I ran fiddler and sure enough:

[image: Fiddler response with the traffic stop data]

So that was pretty painless to get an OData service up and running.  I then removed everything but the read methods and added an auth header (you can see the value in the screenshot above).  Feel free to hit up the service now that it is deployed to WinHost.

One of the coolest things about OData is that it has WSDL-style discovery:

http://chickensoftware.com/RoadAlert/odata/$metadata

I was really missing that when we went from SOAP services to REST.

Note that I had to do a couple more things in T-SQL (remember that?) to get the original data ready for general consumption (and analytics).  I had to create a real date/time from the two varchar fields (for example, a Date of 20120415 and a Time of 153045 becomes 04/15/2012 15:30:45):

Update [XXXX].[dbo].[TrafficStops]
Set StopDateTime = Convert(DateTime, right(left([Date],6),2) + '/' + right([Date],2) + '/' + left([Date],4) + ' ' + left(Time,2) + ':' + Right(left(Time,4),2) + ':' + Right(left(Time,6),2))

 

I also had to add an integral value for when we do statistical analysis:

Update [XXXXX].[dbo].[TrafficStops]
Set dispositionId =
CASE 
     WHEN dispositionDesc = 'FURTHER ACTION NECESSARY' THEN 1
     WHEN dispositionDesc = 'UNABLE TO LOCATE' THEN 2
     WHEN dispositionDesc = 'FALSE ALARM' THEN 3
     WHEN dispositionDesc = 'WRITTEN WARNING' THEN 4
     WHEN dispositionDesc = 'OTHER    SEE NOTES' THEN 5
     WHEN dispositionDesc = 'REFERRED TO PROPER AGENCY' THEN 6
     WHEN dispositionDesc = 'VERBAL WARNING' THEN 7
     WHEN dispositionDesc = 'NULL' THEN 8
     WHEN dispositionDesc = 'ARREST' THEN 9
     WHEN dispositionDesc = 'NO FURTHER ACTION NECESSARY' THEN 10
     WHEN dispositionDesc = 'CIVIL PROBLEM' THEN 11
     WHEN dispositionDesc = 'COMPLETED AS REQUESTED' THEN 12
     WHEN dispositionDesc = 'INCIDENT REPORT' THEN 13
     WHEN dispositionDesc = 'UNFOUNDED' THEN 14
     WHEN dispositionDesc = 'CITATION' THEN 15
     WHEN dispositionDesc = 'FIELD CONTACT' THEN 16
     WHEN dispositionDesc = 'BACK UP UNIT' THEN 17
     WHEN dispositionDesc = 'CITY ORDINANCE VIOLATION' THEN 18
END

So now I am ready to roll with doing the analytics.

So when I said “ready to roll”, I really meant “ready to flail.”  When we last left the show, I was ready to start consuming the data from OData using the F# type providers.  Using Fiddler, I can see the data coming out of the OData service:

[image: Fiddler showing the OData response]

The problem started when I went to consume the data using the F# OData Type Provider as documented here.  I got the red squiggly line of approbation when I went to create the type:

[image: the type provider error in Visual Studio]

with the following message:

Error    1    The type provider ‘Microsoft.FSharp.Data.TypeProviders.DesignTime.DataProviders’ reported an error: error 7001: The element ‘DataService’ has an attribute ‘DataServiceVersion’ with an unrecognized version ‘3.0’.   

 

I went over to the F# open source Google group to seek help, and Isaac Abraham had this response:

WebAPI 2 now pushes out OData 3 endpoints by default, which are actually not even backwards compatible with the OData 2 standard. OData 3 was (AFAIK) released some time after the OData Type Provider was written, so I suspect it doesn’t support OData 3.

So I am stuck.  I really want to use type providers, but they are behind the OData version curve.  I wondered whether I could downgrade my WebAPI2 OData endpoint to the OData 2 standard (whatever that is).

My first thought was to trick the client by removing the DataServiceVersion header like so:

  1. public class HeadersHandler : DelegatingHandler
  2. {
  3.     async protected override Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
  4.     {
  5.         HttpResponseMessage response = await base.SendAsync(request, cancellationToken);
  6.        
  7.         response.Content.Headers.Remove("DataServiceVersion");
  8.         return response;
  9.     }
  10.  
  11. }

The header was removed, but alas, the RSLA was still with me, with the same message.  I then thought that perhaps I could go back to the old version of Json, so I modified the header like so:

  1. public class HeadersHandler : DelegatingHandler
  2. {
  3.     async protected override Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
  4.     {
  5.         request.Headers.Add("Accept", "application/json;odata=verbose");
  6.  
  7.         HttpResponseMessage response = await base.SendAsync(request, cancellationToken);
  8.        
  9.         response.Content.Headers.Remove("DataServiceVersion");
  10.         return response;
  11.     }
  12.  
  13. }

So the Json is now the “old” verbose format, but I am still getting the RSLA.  I then ran Fiddler while creating the type provider and saw this:

[image: Fiddler trace]

Crap.  I need to have Entity Framework emit a lower OData version (I am using EF 6.0).  I guess?  My first thought was to remove EF from the situation entirely, which is always a good idea.  My next, and more time-efficient, thought was to ask Stack Overflow, which is what I did here.  While I waited for Stack Overflow to come to the rescue, I decided to press on and just exposed the data via a normal controller like so:

  1. public class TrafficStopSearchController : ApiController
  2. {
  3.     public List<TrafficStop> Get()
  4.     {
  5.         DataContext context = new DataContext();
  6.         return context.TrafficStops.ToList<TrafficStop>();
  7.     }
  8.     public TrafficStop Get(int id)
  9.     {
  10.         DataContext context = new DataContext();
  11.         return context.TrafficStops.Where(ts => ts.Id == id).FirstOrDefault();
  12.     }
  13.  
  14.     [HttpGet]
  15.     [Route("api/TrafficStopSearch/Sample/")]
  16.     public List<TrafficStop> Sample()
  17.     {
  18.         DataContext context = new DataContext();
  19.         return context.TrafficStops.Where(ts => ts.Id < 100).ToList();
  20.     }
  21. }

The reason I threw in the Sample method is that the F# JSON type provider uses a sample to infer types, and I didn’t want to send the entire data set across the wire just for that.  Once that was done, the traffic stop data was consumable in my F# application like so:

  1. type roadAlert = JsonProvider<"http://chickensoftware.com/roadalert/api/trafficstopsearch/Sample">
  2. type AnalysisEngine =
  3.     static member RoadAlertDoc = roadAlert.Load("http://chickensoftware.com/roadalert/api/trafficstopsearch")

Once/if I get the OData working, I will swap this out, but it is good enough for now.  After all, the interesting piece is not getting the data but doing something with it!

 

Correlation Between Recruit Rankings and Final Standings in Big Ten Football

Following up on my last post about screen scraping college football statistics in F#, I took the next step and analyzed the data that I scraped.  I am a big believer in domain-specific language, so ‘Rankings’ means the ranking assigned by Rivals for how well a school recruits players, and ‘Standings’ means the final position in the Big Ten after the games have been played.  Rankings are for recruiting; standings are for actually playing the games.

Going back to the code, the first thing I did was separate the standings call from the search for a given school, so that the XmlDocument is loaded once and searched several times instead of being loaded for each search.  This improved performance dramatically:

  1. static member getAnnualConferenceStandings(year:int)=
  2.     let url = "http://espn.go.com/college-football/conferences/standings/_/id/5/year/"+year.ToString()+"/big-ten-conference";
  3.     let request = WebRequest.Create(Uri(url))
  4.     use response = request.GetResponse()
  5.     use stream = response.GetResponseStream()
  6.     use reader = new IO.StreamReader(stream)
  7.     let htmlString = reader.ReadToEnd()
  8.     let divMarkerStartPosition = htmlString.IndexOf("my-teams-table");
  9.     let tableStartPosition = htmlString.IndexOf("<table",divMarkerStartPosition);
  10.     let tableEndPosition = htmlString.IndexOf("</table",tableStartPosition);
  11.     let data = htmlString.Substring(tableStartPosition, tableEndPosition- tableStartPosition+8)
  12.     let xmlDocument = new XmlDocument();
  13.     xmlDocument.LoadXml(data);
  14.     xmlDocument        
  15.  
  16. static member getSchoolStanding(xmlDocument: XmlDocument,school) =
  17.     let keyNode = xmlDocument.GetElementsByTagName("td")
  18.                         |> Seq.cast<XmlNode>
  19.                         |> Seq.find (fun node -> node.InnerText = school)
  20.     let valueNode = keyNode.NextSibling
  21.     let returnValue = (keyNode.InnerText, valueNode.InnerText)
  22.     returnValue
  23.  
  24. static member getConferenceStandings(year:int) =
  25.     let xmlDocument = RankingProvider.getAnnualConferenceStandings(year)
  26.     Seq.map(fun school -> RankingProvider.getSchoolStanding(xmlDocument,school)) RankingProvider.schools
  27.         |> Seq.sortBy snd
  28.         |> Seq.toList
  29.         |> List.rev
  30.         |> Seq.mapi(fun index (school,ranking) -> school, index+1)
  31.         |> Seq.sortBy fst
  32.         |> Seq.toList

Thanks to Valera Kolupaev for showing me how to use mapi in getConferenceStandings() to turn the sorted list of schools into (school, rank) tuples.
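
The mapi trick deserves a standalone example: it hands each element its zero-based index, so adding one turns an already-sorted list into standings (the school names here are just placeholders):

    // Seq.mapi pairs each element with its position in the sequence.
    ["Ohio State"; "Michigan State"; "Michigan"]
    |> Seq.mapi (fun index school -> school, index + 1)
    |> Seq.toList
    // [("Ohio State", 1); ("Michigan State", 2); ("Michigan", 3)]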

I then went to the rankings call and added a way to pare it down to only the schools I am interested in.  That way I can compare individual schools, groups of schools, or the entire conference:

  1. static member getConferenceRankings(year) =
  2.     RankingProvider.schools
  3.             |> Seq.map(fun schoolName -> RankingProvider.getSchoolInSequence(year, schoolName))
  4.             |> Seq.toList
  5.     
  6.  
  7. static member getSchoolInSequence(year, schoolName) =
  8.     RankingProvider.getRecrutRankings(year)
  9.                     |> Seq.find(fun (school,rank) -> school = schoolName)

After these two refactorings, my unit tests still ran green so I was ready to do the analysis.

[image: green unit tests]

I went out to my correlation project from a couple of weeks ago and copied in the module.  The Correlation function takes in two lists of doubles: the first list holds the schools’ rankings and the second holds their standings:

  1. static member getCorrelationBetweenRankingsAndStandings(year, rankings, standings ) =
  2.     let ranks = Seq.map(fun (school,rank) -> rank) rankings
  3.     let stands = Seq.map(fun (school,standing) -> standing) standings
  4.     Calculations.Correlation(ranks,stands)
  5.  
  6. static member getCorrelation(year:int) =
  7.     let rankings = RankingProvider.getConferenceRankings year
  8.                     |> Seq.map(fun (school,rank) -> school,Convert.ToDouble(rank))
  9.     let standings = RankingProvider.getConferenceStandings(year+RankingProvider.yearDifferenceBetwenRankingsAndStandings)
  10.                     |> Seq.map(fun (school, standing) -> school, Convert.ToDouble(standing))
  11.     let correlation = RankingProvider.getCorrelationBetweenRankingsAndStandings(year,rankings, standings)
  12.     (year, correlation)

A couple of things to note:

1) This function assumes that both the rankings and the standings are the same length and are in order by school name.  A production application would check this as part of standard argument validation.

2) I used Convert.ToDouble() to change the Int32 rankings into the Doubles that the correlation function expects.  Having these .NET assemblies available at key points in the application really moved things along.

In any event, all that was left was to list the Big Ten schools to analyze, the number of years to analyze, and the year difference between the recruit rankings and the standings from the games they played in.

As a first step, I ran all of the original Big Ten schools with 7 years of recruiting classes and lags of 1, 2, 3, and 4 years (the 2002 rankings compared to the 2003, 2004, 2005, and 2006 standings, etc.):
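
The runs themselves can be driven from a single expression.  A sketch of the loop I effectively ran, assuming a hypothetical getCorrelationWithLag variant that takes the lag as a parameter instead of reading the static yearDifferenceBetwenRankingsAndStandings member (it returns the same (year, correlation) shape as getCorrelation):

    // For each lag (1-4 years), average the ranking/standing correlation
    // across the seven recruiting classes.
    let averageCorrelationByLag =
        [1..4]
        |> List.map (fun lag ->
            let correlations =
                [2002..2008]
                |> List.map (fun year ->
                    snd (RankingProvider.getCorrelationWithLag (year, lag)))
            lag, List.average correlations)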

[images: correlation results for lags of 1-4 years]

The averages are .3303/.2650/.5138/.6065 for lags of 1/2/3/4 years.

And so yeah, there is a really strong correlation between a recruiting ranking and the outcome on the field.  Also, a class seems to have its biggest impact in its senior year, which makes sense.  I don’t have a hypothesis on why it drops in the sophomore year; perhaps the ‘impact freshmen’ leave after one year?

Also of interest, the correlation does not hold evenly across the conference.  If you only look at the schools that have an emphasis on academics, the correlation drops significantly, all the way to a negative correlation! 

[images: correlation results for the academically-focused schools]

The averages are .1485/-.1446/-.2817/-.0381.

So there is another great reason to create the new Big Ten: sometimes a really good recruiting class does not do well on the field, and other times a poorly-ranked recruiting class does well.  This kind of unpredictability is both exciting and probably much more likely to bring in casual fans.

Based on this analysis, here is what is going to happen in the Big Ten next year:

  • Michigan State and Ohio State will be the leaders
  • Michigan and Penn State are in the best position to beat Michigan State and Ohio State

But you didn’t need a statistical analysis to tell you that.  The key surprises that this analysis surfaces are that

  • Nebraska will have a significant improvement in the standings in 2014
  • Indiana will have a significant improvement in the standings in 2015 and 2016

As a final note, I got this after doing a bunch of requests to Yahoo:

[images: Yahoo responses after repeated requests]

 

So I wonder if I hit the page too many times and my IP was flagged as a bot?  I waited a day for the block to reset and then finished my analysis.  Perhaps this is a case where I should get the data while the getting is good and bring their pages down locally?

 

Screen Scraping College Football Statistics

As a follow-up to my post on the correlation between academic rankings and football rankings in the Big Ten, I thought I would look at the relationship between two different kinds of football rankings: the recruiting ranking assigned by Rivals and the actual results on the field.  To that end, I went to collect the data programmatically, because I am doing a time-series analysis and I didn’t want to do data entry.

My first stop was to find a free service that exposes this data on the web.  No luck: either the data was behind a service that cost money or it was presented as a web page.  Since I have never screen-scraped using F# (and I am cheap), I chose option #2.

My first data point was the recruiting ranking found here.  When I inspected the source of the page, I caught a break: the data is actually stored as Json on the page.

[image: page source showing the Json]

So firing up Visual Studio, I created a solution with 1 F# project and 2 C# projects:

[image: solution structure]

I then wrote a unit test to check that something is being returned:

  1. [TestMethod]
  2. public void getRecrutRankings_ReturnsExpected()
  3. {
  4.     var rankings = RankingProvider.getRecrutRankings("2012");
  5.     Assert.AreNotEqual(0, rankings.Length);
  6. }

I then went over to the F#.  I created the RankingProvider type and then added a function that pulls in the rankings for a given year:

  1. static member getRecrutRankings(year) =
  2.     let url = "http://sports.yahoo.com/footballrecruiting/football/recruiting/teamrank/"+year+"/BIG10/all";
  3.     let request = WebRequest.Create(Uri(url))
  4.     use response = request.GetResponse()
  5.     use stream = response.GetResponseStream()
  6.     use reader = new IO.StreamReader(stream)
  7.     let htmlString = reader.ReadToEnd()
  8.     let startPosition = htmlString.IndexOf("var rankingsTableData =")
  9.     let headerLength = 23
  10.     let endPosition = htmlString.IndexOf(";",startPosition)
  11.     let data = htmlString.Substring(startPosition+headerLength,endPosition-startPosition-headerLength).Trim()
  12.     let results = JsonConvert.DeserializeObject(data)
  13.     results :?> Newtonsoft.Json.Linq.JArray
  14.         |> Seq.map(fun x -> (x.Value<string>("name"), Int32.Parse(x.Value<string>("rank"))))
  15.         |> Seq.toList

 

A couple of things to note.

  • Lines 2 through 12 are language-agnostic.  You would write the exact same code in C#/VB.NET with a slightly different syntax.
  • Line 13 is where things get interesting.  I used the :?> operator to downcast the Json to a typed structure (see the short example after this list).  :?> wins as the weirdest symbol I have ever used in computer programming.  I guess I haven’t been programming long enough?
  • Lines 14 and 15 are where you can see why F# is better than C#.  I created a function that takes the Json and pushes it into a tuple.  With no iteration, the code is both easier to read and less likely to have bugs.
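
For anyone else new to the operator family, here is the whole cast trio in one standalone snippet:

    //   :>  upcast   (safe, checked at compile time)
    //   :?> downcast (unchecked, throws InvalidCastException if wrong)
    //   :?  type test (returns a bool)
    let asObject = "hello" :> obj             // upcast a string to obj
    let backToString = asObject :?> string    // downcast obj back to string
    let isString = asObject :? string         // true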

Hoping to press my luck, I went over to the other page (the one that holds the standings from the actual games) to see if it used Json too.  No dice, so it was back to mid-2000s screen scraping.  I created a function that loads the table into an XML document and then searches it for a given school:

  1. static member getConferenceStanding(year, school) =
  2.     let url = "http://espn.go.com/college-football/conferences/standings/_/id/5/year/"+year+"/big-ten-conference";
  3.     let request = WebRequest.Create(Uri(url))
  4.     use response = request.GetResponse()
  5.     use stream = response.GetResponseStream()
  6.     use reader = new IO.StreamReader(stream)
  7.     let htmlString = reader.ReadToEnd()
  8.     let divMarkerStartPosition = htmlString.IndexOf("my-teams-table");
  9.     let tableStartPosition = htmlString.IndexOf("<table",divMarkerStartPosition);
  10.     let tableEndPosition = htmlString.IndexOf("</table",tableStartPosition);
  11.     let data = htmlString.Substring(tableStartPosition, tableEndPosition- tableStartPosition+8)
  12.     let xmlDocument = new XmlDocument();
  13.     xmlDocument.LoadXml(data);
  14.     let keyNode = xmlDocument.GetElementsByTagName("td")
  15.                     |> Seq.cast<XmlNode>
  16.                     |> Seq.find (fun node -> node.InnerText = school)
  17.     let valueNode = keyNode.NextSibling
  18.     (keyNode.InnerText, valueNode.InnerText)

A couple of things to note:

  • Lines 2-7 are identical to the prior function so they should be combined into a single function that can be independently testable.
  • Lines 8-13 are language-agnostic.  You would write the exact same code in C#/VB.NET with a slightly different syntax.
  • Lines 14-18 are where F# really shines.  Like the prior function, by using functional programming techniques in F#, I saved myself time, avoided bugs, and made the code much more intuitive.
  • I am making a web call for each function call; this should be optimized so the call is made once and the xmlDocument is passed in.  That would also make the function much more testable (even without a mocking framework)

Next up, I needed to call this function for each of the Big Ten Schools:

  1. static member getConferenceStandings(year)=
  2.     let schools =[|"Nebraska";"Michigan";"Northwestern";"Michigan State";"Iowa";
  3.         "Minnesota";"Ohio State";"Penn State";"Wisconsin"; "Purdue"; "Indiana"; "Illinois"|]
  4.     Seq.map(fun school -> RankingProvider.getConferenceStanding(year,school)) schools
  5.         |> Seq.sortBy snd
  6.         |> Seq.toList
  7.         |> List.rev

 

This is purely F# and was a pure joy to write (and took the least amount of time).  Note that the sort is on the second element of the tuple and that the list is reversed because the second element is the win-loss record, so F# sorts ascending on the number of wins.  Since Seq does not have a rev function, I turned the sequence into a List, which does.
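
A descending sort avoids the rev step entirely; one way to express it (a small sketch using a flipped comparer, since Seq/List.sortByDescending only arrived in later F# versions):

    // Sort descending on the second tuple element, no List.rev needed.
    let standingsDescending standings =
        standings |> List.sortWith (fun a b -> compare (snd b) (snd a))

One thing to watch with either version: the standing is the raw win-loss string, so the sort is lexicographic, and a “10-2” record will sort before “9-3”.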

Some might ask “Why didn’t you use type-providers?”  My answer is “I tried, but I couldn’t get them to work.”  For example, here is the code that I used for the type provider when parsing the xmlDocument:

  1. xmlDocument.LoadXml(data);
  2. let document = XmlProvider<xmlDocument>

The problem is that the type provider expects a uri (and I can’t find an overload to pass in the document).  It looks like type providers are designed for sources that are ready to, well, provide (web services, databases, etc.) versus jerry-rigged data (like screen scraping).

In any event, with these two functions ready, I went to the UI project and decided to see how the teams did on the field in 2012 compared to how they did in recruiting 2 years before:

  1. static void Main(string[] args)
  2. {
  3.     Console.WriteLine("Start");
  4.  
  5.     Console.WriteLine("-------Rankings");
  6.     var rankings = RankingProvider.getRecrutRankings("2010");
  7.     foreach (var school in rankings)
  8.     {
  9.         Console.WriteLine(school.Item1 + ":" + school.Item2);
  10.     }
  11.  
  12.     Console.WriteLine("-------Standings");
  13.     var standings = RankingProvider.getConferenceStandings("2012");
  14.     foreach (var school in standings)
  15.     {
  16.         Console.WriteLine(school.Item1 + ":" + school.Item2);
  17.     }
  18.  
  19.     Console.WriteLine("End");
  20.     Console.ReadKey();
  21. }

And the results:

[image: console output]

I have no idea if a 2-year lag between recruiting and standings is the right number; perhaps an analysis of the correct lag will come later.  After all, between red-shirt freshmen, transfer rules, and attrition, there are plenty of variables that determine when a recruiting class has its biggest impact.  Also, the standings are a blend of recruiting classes, and since I am not evaluating individual players, I can’t go to that level of detail.  2 years out seems reasonable, but as Bluto famously once said:

  1. static member getBlutoQuote() =
  2.     "Seven years of college down the drain.";


The average might be different.  In any event, I now have the data I want, so the next step is to analyze it to see if there is any correlation.  At first glance, there might be something: the top 4 schools in recruiting all finished in the top 4 of the standings, but the bottom 4 is more muddled, with only Illinois doing poorly in both recruiting and the standings.

More to come…

F# and Monopoly Simulation Redux

Now that I am 4 months into my F# adventure, I thought I would revisit the Monopoly simulation that I wrote in August.  There are some pretty big differences:

1) I am not using the ‘if…then’ construct at all; rather, I am using pattern matching.  For example, consider the original communityChest function:

  1. let communityChest x y =
  2.     if y = 1 then
  3.         0
  4.     else if y = 2 then
  5.         10
  6.      else
  7.         x

and the most recent one:

  1. let communityChest (tile, randomNumber) =
  2.     match randomNumber with
  3.         | 1 -> 0
  4.         | 2 -> 10
  5.         | _ -> tile

“Big deal”, you are saying to yourself (or at least I did).  But the power of pattern matching is put on display in the revised chance function.  The code is much more readable and understandable.

Original:

  1. let chance x y =
  2.     if y = 1 then
  3.         0
  4.     else if y = 2 then
  5.         10
  6.     else if y = 3 then
  7.         11
  8.     else if y = 4 then
  9.         39
  10.     else if y = 5 then
  11.         x - 3
  12.     else if y = 6 then
  13.         5
  14.     else if y = 7 then
  15.         24
  16.     else if y = 8 then
  17.         if x < 5 then
  18.             5
  19.         else if x < 15 then
  20.             15
  21.         else if x < 25 then
  22.             25
  23.         else if x < 35 then
  24.             35
  25.         else
  26.             5
  27.     else if y = 9 then
  28.         if x < 12 then
  29.             12
  30.         else if x < 28 then
  31.             28
  32.         else
  33.             12
  34.     else
  35.         x

 

Revised:

  1. let goToNearestRailroad tile =
  2.     match tile with
  3.         | 36|2 -> 5
  4.         | 7 -> 15
  5.         | 17|22 -> 25
  6.         | 33 -> 35
  7.         | _ -> failwith "not on chance"
  8.  
  9. let goToNearestUtility tile =
  10.     match tile with
  11.         | 36|2|7 -> 12
  12.         | 12|22|33-> 28
  13.         | _ -> failwith "not on chance"
  14.  
  15. let chance (tile, randomNumber) =
  16.     match randomNumber with
  17.         | 1 -> 0
  18.         | 2 -> 10
  19.         | 3 -> 11
  20.         | 4 -> 39
  21.         | 5 -> tile - 3
  22.         | 6 -> 5
  23.         | 7 -> 24
  24.         | 8 -> goToNearestRailroad tile
  25.         | 9 -> goToNearestUtility tile
  26.         | _ -> tile

 

As a side note, I ditched the x and y names because they are unreadable.  When I went back to the code after 3 months, I spent way too long trying to figure out what the heck ‘x’ was.  I know that scientific code uses cryptic names, but clean code does not.  I changed them and the code became much better.

I then took a look at the move() function.  The original:

  1. let move x y z =
  2.     if x + y > 39 then
  3.         x + y – 40
  4.     else if x + y = 30 then
  5.         10
  6.     else if x + y = 2 then
  7.         communityChest 2 z
  8.     else if x + y = 7 then
  9.         chance 7 z
  10.     else if x + y = 17 then
  11.         communityChest 17 z
  12.     else if x + y = 22 then
  13.         chance 22 z
  14.     else if x + y = 33 then
  15.         communityChest 33 z
  16.     else if x + y = 36 then
  17.         chance 36 z
  18.     else
  19.         x + y  

 

and the revised:

  1. let getBoardMove (currentTile, dieTotal) =
  2.     let initialTile = currentTile + dieTotal
  3.     match initialTile with
  4.         | 2 -> communityChest (2, random.Next(1,17))
  5.         | 7 -> chance (7, random.Next(1,17))
  6.         | 17 -> communityChest (17, random.Next(1,17))
  7.         | 22 -> chance (22, random.Next(1,17))
  8.         | 30 -> 10
  9.         | 33 -> communityChest (33, random.Next(1,17))
  10.         | 36 -> chance (36, random.Next(1,17))
  11.         | 40|41|42|43|44|45|46|47|48|49|50|51 -> initialTile - 40
  12.         | _ -> initialTile

 

I am not happy with line 11 above.  At the time I didn’t think there was a way to express ‘> 39’ or a range like ‘[40 .. 51]’ on the left-hand side of a pattern match, but a when guard handles exactly this case, as the sketch below shows.
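
Here is the same function with a guard in place of the enumerated tiles (a sketch; the card-draw bounds follow the random.Next(1,17) call used elsewhere in the simulation):

    let getBoardMove (currentTile, dieTotal) =
        let initialTile = currentTile + dieTotal
        match initialTile with
            | 2  -> communityChest (2, random.Next(1,17))
            | 7  -> chance (7, random.Next(1,17))
            | 17 -> communityChest (17, random.Next(1,17))
            | 22 -> chance (22, random.Next(1,17))
            | 30 -> 10
            | 33 -> communityChest (33, random.Next(1,17))
            | 36 -> chance (36, random.Next(1,17))
            | t when t > 39 -> t - 40   // wrapped around Go
            | _ -> initialTile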

So far, the biggest changes were to make the names more understandable and to replace the if…then statements with pattern matching.  Both of these techniques make the code more readable and understandable.  The next big change came with the actual game play itself.  The original version:

  1. let simulation =
  2.     let mutable startingTile = 0
  3.     let mutable endingTile = 0
  4.     let mutable doublesCount = 0
  5.     let mutable inJail = false
  6.     let mutable jailRolls = 0
  7.     for diceRoll in 1 .. 10000 do
  8.         let dieOneValue = random.Next(1,7)
  9.         let dieTwoValue = random.Next(1,7)
  10.         let cardDraw = random.Next(1,17)
  11.         let numberOfMoves = dieOneValue + dieTwoValue
  12.         
  13.         if dieOneValue = dieTwoValue then
  14.             doublesCount <- doublesCount + 1
  15.         else
  16.             doublesCount <- 0
  17.         if inJail = true then
  18.             if doublesCount > 1 then
  19.                 inJail <- false
  20.                 jailRolls <- 0
  21.                 endingTile <- move 10 numberOfMoves cardDraw
  22.             else
  23.                 if jailRolls = 3 then
  24.                     inJail <- false
  25.                     jailRolls <- 0
  26.                     endingTile <- move 10 numberOfMoves cardDraw
  27.                 else
  28.                     inJail <- true
  29.                     jailRolls <- jailRolls + 1
  30.         else
  31.             if doublesCount = 3 then
  32.                 inJail <- true
  33.                 endingTile <- 10
  34.             else
  35.                 endingTile <- move startingTile numberOfMoves cardDraw
  36.          
  37.         printfn "die1: %A + die2: %A = %A FROM %A TO %A"
  38.             dieOneValue dieTwoValue numberOfMoves startingTile endingTile
  39.         startingTile <- endingTile
  40.         tiles.[endingTile] <- tiles.[endingTile] + 1

You will notice that the word ‘mutable’ shows up five times.  Using mutable in F# is a code smell, so I refactored it out like so:

  1. let rec rollDice (currentTile, rollCount, doublesCount, inJail, jailRollCount)=
  2.     let dieOneValue = random.Next(1,7)
  3.     let dieTwoValue = random.Next(1,7)
  4.     let dieTotal = dieOneValue + dieTwoValue
  5.     let newRollCount = rollCount + 1
  6.     
  7.     let newDoublesCount =
  8.         if dieOneValue = dieTwoValue then doublesCount + 1
  9.         else 0
  10.  
  11.     let newTile = getTileMove(currentTile,dieTotal,newDoublesCount,inJail,jailRollCount)
  12.     
  13.     let newInJail =
  14.         if newTile = 10 then true
  15.         else false
  16.  
  17.     let newJailRollCount =
  18.         if newInJail = inJail then jailRollCount + 1
  19.         else 0
  20.  
  21.     let targetTuple = scorecard.[newTile]
  22.     let newTuple = (fst targetTuple, snd targetTuple + 1)
  23.     scorecard.[newTile] <- newTuple
  24.  
  25.     if rollCount < 10000 then
  26.         rollDice (newTile, newRollCount, newDoublesCount, newInJail, newJailRollCount)
  27.     else
  28.         scorecard

No “mutable” (thanks to recursion) and only one assignment.  I also wanted to get rid of that one ‘<-’, and Tomas Petricek was kind enough to demonstrate the correct way to do this on Stack Overflow.  Finally, I had to throw in a supporting function to make the decision logic account for rolling doubles that may put you in jail or get you out of jail depending on prior state (were you in jail when you rolled doubles, were you out of jail when you rolled doubles for the 3rd time, etc.).  I spent way too much time monkeying around with a series of nested if…then statements before it hit me that I should be using tuples and pattern matching:

  1. let getTileMove (currentTile, dieTotal, doublesCount, inJail, jailRollCount) =
  2.     match (inJail,jailRollCount, doublesCount) with
  3.         | (true,3,_) -> getBoardMove(10,dieTotal)
  4.         | (true,_,_) -> 10
  5.         | (false,_,3) -> 10
  6.         | (false,_,_) -> getBoardMove(currentTile,dieTotal)

So here is the real power of F# on display.  I can think of hundreds of applications that I have seen in C#/VB.NET with high cyclomatic complexity and hidden bugs that reared their heads at the most inopportune times because of complex business logic written as a series of switch/case and/or if…then statements.  Even putting each step into its own function only helps partially, because the complexity is still there; it is just labeled better.

By using tupled pattern matching, all of that complexity goes away and we have a succinct series of statements that actually reflect how the brain thinks about the problem.  By using F#, there are fewer lines of code (and therefore fewer unit tests to maintain) and you can write code that better represents how the wetware is approaching the problem.

The Big Ten and F#

I was talking to fellow TRINUGer David Green about football schools a couple of weeks ago.  He went to Northwestern and I went to Michigan, and we were discussing the relative merits of universities doing well in football.  Assuming Goro was counting: on one hand, it is great to have a sport that can bring in tons of money to the school to fund non-football sports and activities; on the second hand, it keeps alumni interested in their school; on the third hand, it can give locals a source of pride in the school; and on the last hand, it can take the focus away from the other parts of the academic institution.

I then was talking to a professor at Ohio State University; she cares absolutely zero about the football team.  I made the comment that the smartest kids in Ohio don’t go to OSU.  They will go and root for their gladiators on Saturday, but when it comes down to their academic and subsequent professional success, they look elsewhere.  She agreed.

Putting those two conversations together put OSU and MSU’s continued success in the Big Ten in context, along with the inevitable bellyaching that those teams get the short stick when compared to the SEC.  For example, OSU and MSU would both have gone undefeated in the Ivy League in 2013.  Does that mean they should be in the same conversation as Alabama and Auburn for the national championship?  I think the biggest problem that OSU and MSU have is that they are in the Big Ten, which historically has been about geography, academic success, and athletic competition (in that order). 

Looking at the Big Ten schools, I pulled their most recent academic ranking from US News and World Report and their BCS ranking.  I then went over to MathIsFun to get the recipe for correlation:

[image: the correlation recipe from MathIsFun]
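
That recipe is the standard Pearson correlation coefficient; in symbols:

    r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}

This is exactly what the Calculations.Correlation function below computes, one step at a time.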

I then went over to Visual Studio and created a solution like so:

[image: solution structure]

Learning from my last project, I created my unit test first to verify that the calculation is correct:

  1. [TestMethod]
  2. public void FindCorrelationUsingStandardInput_ReturnsExpectedValue()
  3. {
  4.     Double[] temperatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
  5.     Double[] sales = new Double[12] { 215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408 };
  6.  
  7.     Double expected = .9575;
  8.     Double actual = Calculations.Correlation(temperatures, sales);
  9.     Assert.AreEqual(expected, actual);
  10. }

I then hopped over to my working code and started coding:

  1. type Calculations() =
  2.     static member Correlation(x:IEnumerable<double>, y:IEnumerable<double>) =
  3.         let meanX = Seq.average x
  4.         let meanY = Seq.average y
  5.         
  6.         let a = Seq.map(fun x -> x-meanX) x
  7.         let b = Seq.map(fun y -> y-meanY) y
  8.  
  9.         let ab = Seq.zip a b
  10.         let abProduct = Seq.map(fun (a,b) -> a * b) ab
  11.  
  12.         let aSquare = Seq.map(fun a -> a * a) a
  13.         let bSquare = Seq.map(fun b -> b * b) b
  14.         
  15.         let abSum = Seq.sum abProduct
  16.         let aSquareSum = Seq.sum aSquare
  17.         let bSquareSum = Seq.sum bSquare
  18.  
  19.         let sums = aSquareSum * bSquareSum
  20.         let squareRootOfSums = sqrt(sums)
  21.  
  22.         abSum/squareRootOfSums

What I noticed is that the intermediate variables make the code wordier than it needs to be, so a mathematician might think the code is too verbose, but a developer might appreciate that each step is laid out.  In fact, I would argue that a better component design would be to break out each of the steps into its own function that can be independently testable (and perhaps reused by other functions):

[TestMethod]
public void GetMeanUsingStandardInputReturnsExpectedValue()
{
    Double[] temperatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
    Double expected = 18.675;
    Double actual = Calculations.Mean(temperatures);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void GetBothMeansProductUsingStandardInputReturnsExpectedValue()
{
    Double[] temperatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
    Double[] sales = new Double[12] { 215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408 };

    Double expected = 5325;
    Double actual = Calculations.MeanProduct(temperatures, sales);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void GetMeanSquareUsingStandardInputReturnsExpectedValue()
{
    Double[] temperatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };

    Double expected = 177;
    Double actual = Calculations.MeanSquared(temperatures);
    Assert.AreEqual(expected, actual);
}
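
A minimal sketch of what those broken-out members might look like in F# – the member names come from the tests above, but the bodies (and the assumption that MeanProduct takes both series) are mine:

open System.Collections.Generic

type Calculations() =
    // Mean: just the average of the series
    static member Mean (x: IEnumerable<double>) =
        Seq.average x

    // MeanProduct: the sum of (x - meanX) * (y - meanY) across the pairs
    static member MeanProduct (x: IEnumerable<double>, y: IEnumerable<double>) =
        let meanX = Seq.average x
        let meanY = Seq.average y
        Seq.zip x y |> Seq.sumBy (fun (a, b) -> (a - meanX) * (b - meanY))

    // MeanSquared: the sum of the squared deviations from the mean
    static member MeanSquared (x: IEnumerable<double>) =
        let meanX = Seq.average x
        x |> Seq.sumBy (fun a -> (a - meanX) * (a - meanX))

With those in place, Correlation itself would collapse to MeanProduct(x, y) / sqrt(MeanSquared(x) * MeanSquared(y)).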

I’ll leave wiring that into the solution for another day as it is already getting late.  In any event, I ran the original Correlation unit test and got red (pink, really):

image

The spreadsheet rounds and my calculation does not.  I adjusted the unit test appropriately:

[TestMethod]
public void FindCorrelationUsingStandardInput_ReturnsExpectedValue()
{
    Double[] temperatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
    Double[] sales = new Double[12] { 215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408 };

    Double correlation = Calculations.Correlation(temperatures, sales);
    Double expected = .9575;

    Double actual = Math.Round(correlation, 4);
    Assert.AreEqual(expected, actual);
}

And now I am green:

image

So going back to the original question, I took the current Big Ten Schools and put their academic rankings and football rankings side by side:

image

 

I then made a revised Big Ten that had a much higher academic ranking based on schools that play in a power football conference but still maintain high academics.

image

Note that I left Penn State out of both of these lists because they have a NaN for their football ranking – but they certainly have a high enough academic score to be part of the revised Big Ten.

And then when I put those values through the correlation function via a Console UI:

static void Main(string[] args)
{
    Console.WriteLine("Start");

    Double[] academicRanking = new Double[12] { 12,28,41,41,52,62,68,69,73,73,75,101 };
    Double[] footballRanking = new Double[12] { 65,41,82,19,7,61,105,36,4,34,63,37 };

    Double originalCorrelation = Calculations.Correlation(academicRanking, footballRanking);
    Console.WriteLine("Original BigTen Correlation {0}", originalCorrelation);

    academicRanking = new Double[10] { 7,12,17,18,23,23,28,30,41,41 };
    footballRanking = new Double[10] { 24, 65, 32, 26, 94, 84, 41, 58, 82, 19 };
    Double revisedCorrelation = Calculations.Correlation(academicRanking, footballRanking);
    Console.WriteLine("Revised BigTen Correlation {0}", revisedCorrelation);

    Console.WriteLine("End");
    Console.ReadKey();
}

I get:

image

And just looking at the data seems to support this.  There is a negative correlation between academics and football success in the current Big Ten – the higher the academics, the lower the football ranking, and vice versa.  In the revised Big Ten, there is a positive correlation of the same magnitude – higher academics go with higher (relative) football rankings.  Put another way, the new Big Ten has a much stronger academic ranking and pretty much the same football ranking.

Looking at a map, this new conference is like a doughnut with Ohio, West Virginia, and Kentucky in the middle.  Perhaps they can have a football championship sponsored by Krispy Kreme?  In any event, OSU and MSU are much closer academically and football-wise to the Alabamas and Auburns than to the Northwesterns and Michigans of the world.  In terms of geographic proximity, Columbus, OH is closer to Tuscaloosa, AL than to Lincoln, NE.  So perhaps OSU and MSU fans would be better served by a conference that is more aligned with their university’s priorities?  If they went undefeated, or even had one loss, they would still be in the national championship discussion.

F# > C# when doing math

My friend/coworker Rob Seder sent me this Code Project link and said it might be an interesting exercise to duplicate what he had done in F#.  Interesting indeed!  Challenge accepted!

I first created a solution like so:

image

I then copied the Variance calculation from the post to the C# implementation:

public class Calculations
{
    public static Double Variance(IEnumerable<Double> source)
    {
        int n = 0;
        double mean = 0;
        double M2 = 0;

        foreach (double x in source)
        {
            n = n + 1;
            double delta = x - mean;
            mean = mean + delta / n;
            M2 += delta * (x - mean);
        }
        // note: dividing by (n - 1) makes this the *sample* variance
        return M2 / (n - 1);
    }
}

I then created a couple of unit tests for the method and made sure that the results ran green:

[TestClass]
public class CSharpCalculationsTests
{
    [TestMethod]
    public void VarianceOfSameNumberReturnsZero()
    {
        Collection<Double> source = new Collection<double>();
        source.Add(2.0);
        source.Add(2.0);
        source.Add(2.0);

        double expected = 0;
        double actual = Calculations.Variance(source);
        Assert.AreEqual(expected, actual);
    }

    [TestMethod]
    public void VarianceOfOneAwayNumbersReturnsOne()
    {
        Collection<Double> source = new Collection<double>();
        source.Add(1.0);
        source.Add(2.0);
        source.Add(3.0);

        double expected = 1;
        double actual = Calculations.Variance(source);
        Assert.AreEqual(expected, actual);
    }
}

 

image

 

I then spun up the same unit tests to test the F# implementation and went over to the F# project.  My first attempt started along these lines:

namespace Tff.BasicStats.FSharp

open System
open System.Collections.Generic

type Calculations() =
    static member Variance (source:IEnumerable<double>) =
        let mean = Seq.average(source)
        let deltas = Seq.map(fun x -> x - mean) source
        let deltasSum = Seq.sum deltas
        let deltasLength = Seq.length deltas
        deltasSum/(double)deltasLength

 

I then realized that I was writing procedural code in F# – I was not taking advantage of the expressiveness that the language provides.  I also realized that looking at the C# code to understand how to calculate Variance was useless – I was getting lost in the loop and the poorly-named variables.  I went over to Wikipedia’s definition to see if that could help me understand Variance better, but I got lost in all of the formulas.  I then binged Variance on Google and one of the first links was MathIsFun with this explanation.  This was more like it!  Cool dog pictures and a stupid-simple recipe for calculating Variance.  The steps are:

image
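
In case the image doesn't render, the recipe is: take the mean, subtract the mean from each number and square the result, then take the average of those squared differences – in symbols, Variance = Σ(x - mean)² ÷ n.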

I hopped over to Visual Studio and wrote one line of code for each step of the recipe:

namespace Tff.BasicStats.FSharp

open System
open System.Collections.Generic

type Calculations() =
    static member Variance (source:IEnumerable<double>) =
        let mean = Seq.average source
        let deltas = Seq.map(fun x -> sqrt(x - mean)) source
        Seq.average deltas

 

I ran the unit tests but they were running red!  I was getting a NaN. 

image

Hearing my cursing, my 7th-grade son came over and said, “Dad, that is wrong.  You don’t take the square root of (x - mean), you square it.  Also, you can’t take the square root of a negative number, and any item in that list that is less than the average will return NaN.”  Let me repeat that – a 7th grader with no coding experience, but who knows about Variance from his math class, just read the code and found the problem.

I then changed the code to square the value like so:

namespace Tff.BasicStats.FSharp

open System
open System.Collections.Generic

type Calculations() =
    static member Variance (source:IEnumerable<double>) =
        let mean = Seq.average source
        let deltas = Seq.map(fun x -> pown (x - mean) 2) source
        Seq.average deltas

 

And now my unit test… runs…. Red!

image

Not understanding why, I turned to the REPL (F# Interactive Window).  I first entered my test set:

image

I then entered the calculation from each line against the test set:

image
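
For anyone following along without the screenshots, the session went roughly like this (my reconstruction):

> let source = [1.0; 2.0; 3.0];;
val source : float list = [1.0; 2.0; 3.0]

> let mean = Seq.average source;;
val mean : float = 2.0

> let deltas = Seq.map(fun x -> pown (x - mean) 2) source;;
val deltas : seq<float> = seq [1.0; 0.0; 1.0]

> Seq.average deltas;;
val it : float = 0.6666666667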

Staring at the resulting array, it hit me that perhaps the original unit test’s expected value was wrong!  I went over to TutorVista and entered my array.  Would you believe it?

image

The calculation on the code project site is incorrect!  The correct way to do the unit test is:

[TestMethod]
public void VarianceOfOneAwayNumbersReturnsOne()
{
    Collection<Double> source = new Collection<double>();
    source.Add(1.0);
    source.Add(2.0);
    source.Add(3.0);

    //double expected = .6666666667;
    double expected = 2.0 / 3.0;
    double actual = Calculations.Variance(source);
    Assert.AreEqual(expected, actual);
}

 

(Note that 2.0 / 3.0 was the easiest way I could come up with .6 repeating without getting all crazy on the formatting.)  Now both of my unit tests run green and one of the C# ones runs red.  To be fair, the C# version is not so much broken as computing a different statistic: dividing by n - 1 gives the sample variance (1 for this data set), while my F# version divides by n for the population variance (2/3) – which is the value TutorVista returned.

image

I have no interest in trying to figure out how to fix that C# code – I care less about how the code solves the problem and more about just solving it.  The real power of F# is on display here.  The coolest parts of this exercise were:

  • One-for-one correspondence between the steps to solve a problem and the code
  • The code is much more readable to non-developers
  • By concentrating on how to solve the problem in C#, the original developer lost sight of what he was trying to accomplish.  F# focuses you on the result, not the code.
  • Unit tests can be wrong – if you let your code’s result drive the expected value instead of an external source.

F# and SignalR Stock Ticker: Part 2

Following up on my prior post found here about using F# to write the Stock Ticker example found on SignalR’s website, I went to implement the heart of the application – the stock ticker class.

The original C# class suffers from a violation of command/query separation and does several things at once.  Breaking out the functionality: the class creates a list of random stocks in its constructor.

image

Then there is a timer that loops and periodically updates the current stock price. 

image

Finally, it broadcasts the new stock price to any connected clients.

image

Because the class depends on the clients for its creation and lifetime, it implements the singleton pattern – you access the class via its Instance property.  This is a very common pattern:

//Singleton instance
private readonly static Lazy<StockTicker> _instance =
    new Lazy<StockTicker>(() =>
        new StockTicker(GlobalHost.ConnectionManager.GetHubContext<StockTickerHub>().Clients));

public static StockTicker Instance
{
    get
    {
        return _instance.Value;
    }
}

Attacking the class from an F# point of view, I first addressed the singleton pattern.  I checked out the singleton pattern in Liu’s F# for C# Developers.  The sentence that caught my eye was “An F# value is immutable by default, and this guarantees there is only one instance.” (p149)  Liu then builds an example using a private class and shows how to reference it via an Instance member.  My take-away from the example is that you don’t need a Singleton pattern in F# – everything is a singleton by default.  Another way to look at it is that the Singleton pattern is a well-accepted workaround for the limitations that mutability brings to C#.
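
To make that concrete, here is a sketch of what it could look like for this class (the StockTickerHost module name and the seedStocks value are my inventions; the GetHubContext call comes from the C# above).  A module-level let binding is initialized exactly once, on first use – which is everything the Lazy<StockTicker> plumbing was buying us:

open Microsoft.AspNet.SignalR

module StockTickerHost =
    let seedStocks : Stock list = []   // hypothetical seed data for the ticker
    // initialized once, on first access - no Lazy<T> or Instance property needed
    let instance =
        StockTicker(GlobalHost.ConnectionManager.GetHubContext<StockTickerHub>().Clients, seedStocks)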

I then jumped over to updating the stock prices – after all, how can you send out a list of new stock prices if you can’t mutate the list or the individual stocks within it?  Quite easily, in fact.

The first thing I did was to create a StockTicker class that takes in a SignalR HubContext and a list of stocks.

type StockTicker(clients: IHubConnectionContext, stocks: IEnumerable<Stock>) = class

I then added the logic to update the list and stocks.

let rangePercent = 0.002
let updateInterval = TimeSpan.FromMilliseconds(250.)
let updateOrNotRandom = new Random()
let updateStockPrice (stock: Stock) =
    let r = updateOrNotRandom.NextDouble()
    match r with
    | r when r > 0.1 -> stock    // only touch the price roughly 10% of the time
    | _ ->
        let random = new Random(int(Math.Floor(stock.Price)))
        let percentChange = random.NextDouble() * rangePercent
        let pos = random.NextDouble() > 0.51
        let change = Math.Round(stock.Price * decimal(percentChange), 2)
        let newPrice = if pos then stock.Price + change else stock.Price - change
        new Stock(stock.Symbol, stock.DayOpen, newPrice)
let updatedStocks = stocks
                        |> Seq.map(fun stock -> updateStockPrice(stock))

Looking at the code, the word “update” in the prior sentence is wrong.  I am not updating anything – I am replacing the list and the stocks with the new price (if one was determined).  Who needs a singleton?  F# doesn’t.

I then attempted to notify the clients like so:

member x.Clients = clients
member x.Stocks = stocks
member x.BroadcastStockPrice (stock: Stock) =
    x.Clients.All.updateStockPrice(stock)

 

But I got a red squiggly line of approbation (RSLA) on the updateStockPrice method.  The compiler complained:

Error    1    The field, constructor or member ‘updateStockPrice’ is not defined   

And reading the SignalR explanation here:

The updateStockPrice method that you are calling in BroadcastStockPrice doesn’t exist yet; you’ll add it later when you write code that runs on the client. You can refer to updateStockPrice here because Clients.All is dynamic, which means the expression will be evaluated at runtime

So how does F# accommodate the dynamic nature of Clients.All?  I don’t know, so off to Stack Overflow I go…
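
While I wait on an answer, one workaround that I believe would compile (my assumption – it is not in the tutorial): the dynamic client proxies in SignalR implement IClientProxy, so you can cast Clients.All and invoke the client-side method by name:

open Microsoft.AspNet.SignalR.Hubs

member x.BroadcastStockPrice (stock: Stock) =
    (x.Clients.All :?> IClientProxy).Invoke("updateStockPrice", stock) |> ignore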

In any event, I can then wire up a method that broadcasts the new stock prices like so:

member x.BroadcastAllPrices =
    x.Clients.All.updateAllStockPrices(updatedStocks)

And then write a method that calls this broadcast method every quarter second:

member x.Start =
    async {
        while true do
            do! Async.Sleep (int updateInterval.TotalMilliseconds)
            x.BroadcastAllPrices |> ignore
    } |> Async.StartImmediate

Note that I tried to figure out the Timer class and its corresponding event, but I couldn’t get it working.  I stumbled upon this post, ditched the timer in favor of the code above, and since it works, I am all for it.  Figuring out events in F# is a battle for another day…
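
For the record, I think the timer version would have looked roughly like this – an unverified sketch (the StartWithTimer name is mine) using System.Timers.Timer and its Elapsed event:

member x.StartWithTimer =
    let timer = new System.Timers.Timer(updateInterval.TotalMilliseconds)
    // Elapsed is a plain .NET event; Add subscribes a callback
    timer.Elapsed.Add(fun _ -> x.BroadcastAllPrices |> ignore)
    timer.Start()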

 

 

 

F# and SignalR Stock Ticker Example

I was looking at the server broadcast SignalR tutorial found here for a current project when I got to the StockTicker class.  In this class, the interesting code surrounds making a singleton instance because SignalR hubs are transient.  Here is the full text:

You’ll use the SignalR Hub API to handle server-to-client interaction. A StockTickerHub class that derives from the SignalR Hub class will handle receiving connections and method calls from clients. You also need to maintain stock data and run a Timer object to periodically trigger price updates, independently of client connections. You can’t put these functions in a Hub class, because Hub instances are transient. A Hub class instance is created for each operation on the hub, such as connections and calls from the client to the server. So the mechanism that keeps stock data, updates prices, and broadcasts the price updates has to run in a separate class, which you’ll name StockTicker.

So then it hit me – we need an immutable class that can handle multiple requests.  This sounds like a job for Captain F#!  Unfortunately, Captain F# is on vacation, so I went with Private First Class F#.  Here is what I did.

I created an empty solution.  I then added a C# Empty Web Application, just to have a point of comparison to the F# project.  Then I added a new F# MVC4 project from the on-line template that Daniel Mohl created:

image

The problem is that the C# project is the web app and the F# project is just the controllers.  Since I want a full-on double rainbow, F#-only MVC application, I tossed the template and created a bare-bones F# project.  I then opened the .fsproj file and added a web ProjectType GUID (basically parroting what was in the .csproj file):

image

I posted this to Stack Overflow here.  So I am back to using C# as the web application and F# as the plug-in code.  I re-started with a couple of skeleton projects like so:

image

I then added a class for Stock like so:

namespace Tff.SignalRServerBroadcast.FS

open System

type Stock() =
    member val Symbol = String.Empty with get, set

 

I then added some unit tests to verify that I could create the Stock class and that I could assign the Symbol property:

[TestClass]
public class StockTests
{
    [TestMethod]
    public void CreateStock_ReturnsValidInstance()
    {
        Stock stock = new Stock();
        Assert.IsNotNull(stock);
    }

    [TestMethod]
    public void VerifyStockSymbolCanBeMutated()
    {
        Stock stock = new Stock();
        stock.Symbol = "TEST";

        String notExpected = String.Empty;
        String actual = stock.Symbol;

        Assert.AreNotEqual(notExpected, actual);
    }
}

 

And sure enough, they run green:

image

Just then, Captain F# swooped in from vacation and exclaimed, “What are you doing?  One of the tenets of functional programming is immutability.  What happens if you change the Symbol after the object is created – does it really represent the same thing?  In fact, allowing the Symbol to be changed after creation will lead to bugs and potentially unexpected behaviors in your system!”  With that, he left for a 10-day tour of the eastern Mediterranean.

I then changed the Stock class to be immutable like so:

namespace Tff.SignalRServerBroadcast.FS

open System

type Stock =
    {Symbol: String;
     Price: Decimal;
     DayOpen: Decimal;}

    member x.GetChange () =
        x.Price - x.DayOpen
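
A quick aside on how this record surfaces to callers: from F# you build it with named fields (so the order doesn't matter), while to C# it compiles to a class whose constructor takes the fields in declaration order – Symbol, then Price, then DayOpen.  That ordering detail matters in a minute:

let stock = { Symbol = "TEST"; Price = 10.25M; DayOpen = 10M }
stock.GetChange()   // 0.25M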

 

I then updated my unit tests like so:

[TestClass]
public class StockTests
{
    [TestMethod]
    public void CreateStock_ReturnsValidInstance()
    {
        Stock stock = new Stock("TEST", 10, 10.25M);
        Assert.IsNotNull(stock);
    }

    [TestMethod]
    public void PriceChangeUsingValidNumbers_ReturnsCorrectChange()
    {
        Stock stock = new Stock("TEST", 10, 10.25M);
        Decimal expected = .25M;
        Decimal actual = stock.GetChange();
        Assert.AreEqual(expected, actual);
    }
}

And then I ran my tests and got red

image

Ugh – I reversed the parameters.  I intended the stock to go up $.25, but the constructor expects the Price to come before the DayOpen.  This is not intuitive – there is implicit temporal coupling in these parameters, and since DayOpen occurs sooner in the space-time continuum, it should come first.  I reordered the record’s fields and the tests ran green:

type Stock =
    {Symbol: String;
     DayOpen: Decimal;
     Price: Decimal;}

 

image

With that done, I looked at the last calculation, PercentChange.  The only remarkable thing about it is that the code in the on-line tutorial is incorrect.  The tutorial uses Price as the denominator, but my unit test shows that is wrong:

[TestMethod]
public void PercentChangeUsingValidNumbers_ReturnsCorrectChange()
{
    Stock stock = new Stock("TEST", 10, 11M);
    Decimal expected = .1M;
    Decimal actual = stock.GetPercentChange();
    Assert.AreEqual(expected, actual);
}

 

image

If a stock goes from $10.00 to $11.00, it increases $1.00 and $1.00 divided by $10.00 is 10% – the stock increased 10%.

So I went back and changed the implementation to get the test to run green.

member x.GetPercentChange() =
    Math.Round(x.GetChange()/x.DayOpen, 4)

 

image

 

So looking at this class, why is F# better than C#?

1) Less noise.  Compare the code between C# and F#

imageimage

All of the code in the C# Price setter is irrelevant noise.  In the F# implementation, you don’t need to worry about assigning DayOpen the first time the Price is set.

 

2) Fewer Bugs: 

What happens if you are looking at Pets.com (IPET) on November 6, 2000, when it opened at $.16 and then went to $0.00 at noon when they finalized their liquidation?  You would need to change your code because the C# implementation is wrong – the price was $0.00, and it was not the opening price.

Also, what prevents me from changing the Symbol?  I could create a ticker class for AAPL at $519 and then change the ticker to MSFT – voila, MSFT’s price goes from $37.57 to $519.00!  And all of the unit tests for the Stock still run green.
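
With the record version, that sleight of hand cannot happen silently – there is no setter to call.  The closest you can get is F#'s copy-and-update expression, which makes the new value explicit (a small sketch using the prices above):

let apple = { Symbol = "AAPL"; DayOpen = 519M; Price = 519M }
// apple.Symbol <- "MSFT" will not compile; you have to build a *new* value
let msft = { apple with Symbol = "MSFT" }   // apple itself is untouched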

3) More readable. 

Less noise – more signal (SignalR in fact)…

 

This blog post is getting a bit long so I will continue this project on another post.

Thanks to the RHCP, I listened to this 2-3 times when doing this blog post…