Basic Insert Operation Using F#

So the more I use F#,  the more I come to understand the benefits and limitations of the language.  Since I spend a majority of my day job in C# and JavaScript, it is a natural comparison between these two languages and F#.  One of the tenants of F# is ‘less noise, more signal’.  After looking at some projects, I am coming to the conclusion that Entity Framework, LinqToSql, <Any other ORM> is just noise.  It is expensive noise at that – if you have worked on a production app using EF and you to do anything outside of the examples on MSDN, you know what I mean.

So can EF type providers replace the overhead, code bloat, and additional costs of Entity Framework?  I decided to do a small test to see.  I needed to load into SqlServer 27,481 records of crash data that I get from the North Carolina DOT.  The records came to me in an Excel Spreadsheet which I pumped into MSAccess.  Then, instead of using SQl Server SSIS/Bulk Data Load functions, I decided to create an app that pulls that data from the Access database and load it into the SqlServer database via a type provider.

My first step was to look for a MSAccess type provider.  No luck.  I then hit up Stack Overflow and found this and this article for working with Access.  I coded up a solution to get the data into a DataReader like so

  1. static member GetCrashData =
  2.     let connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=E:\Documents\Road Alert\WakeCountyCrashes.accdb; Persist Security Info=False;"
  3.     use connection = new OleDbConnection(connectionString)
  4.     let commandText = "Select * from Data"
  5.     use command = new OleDbCommand(commandText,connection)
  6.     connection.Open()
  7.     use reader = command.ExecuteReader()

My first attempt to get the data from the read was a tuple like so:

  1. [while reader.Read() do
  2.      yield reader.GetInt32(0),
  3.      reader.GetFieldValue(1).ToString(),
  4.      reader.GetFieldValue(2).ToString(),
  5.      reader.GetDouble(3),
  6.      reader.GetFieldValue(4).ToString(),
  7.      reader.GetFieldValue(5).ToString(),
  8.      reader.GetFieldValue(6).ToString(),
  9.      reader.GetDateTime(7),
  10.      reader.GetDateTime(8),
  11.      reader.GetFieldValue(9).ToString(),
  12.      reader.GetFieldValue(10).ToString()
  13.  ]

Sure enough, it works like a champ from my C# UI (once I ran AccessDatabaseEngine.exe on my machine – sigh)

  1. private static void GetCrashData()
  2. {
  3.     var results = CrashDataLoader.GetCrashData;
  4.     Console.WriteLine(results.Count().ToString());
  5. }

Gives

image

The next thing I did was to create a Type Provider

image

And then create the method to insert the data into the database:

  1. static member LoadCrashData  =
  2.     let targetDatabase = targetSchema.GetDataContext()
  3.     let rows = CrashDataLoader.GetCrashData
  4.     targetDatabase.TrafficCrashes.InsertAllOnSubmit(rows)
  5.     targetDatabase.DataContext.SubmitChanges()
  6.     true

The problem is that I ran into was that the GetCrashData was returning a Tuple and the LoadCrashData was expecting a Typed CrashData element.  I searched for a bit and then gave up trying to figure out how to map the two without explicitly assigning each field.  So then I did it the old fashion way like so:

  1. static member TrafficCrashFromReader(reader: OleDbDataReader) =
  2.     let trafficCrash = new targetSchema.ServiceTypes.TrafficCrashes()
  3.     trafficCrash.NCDmvCrashId <- System.Nullable<float> (float (reader.GetFieldValue(0).ToString()))
  4.     trafficCrash.Municipality <- reader.GetFieldValue(1).ToString()  
  5.     trafficCrash.OnRoad <- reader.GetFieldValue(2).ToString()  
  6.     trafficCrash.Miles <- System.Nullable<double> (double (reader.GetFieldValue(3).ToString()))
  7.     trafficCrash.Direction <- reader.GetFieldValue(4).ToString()
  8.     trafficCrash.FromRoad <- reader.GetFieldValue(5).ToString()
  9.     trafficCrash.TowardRoad <- reader.GetFieldValue(6).ToString()
  10.     trafficCrash.DateOfCrash <- System.Nullable<DateTime> (reader.GetDateTime(7))
  11.     trafficCrash.TimeOfCrash <- System.Nullable<DateTime> (reader.GetDateTime(8))
  12.     trafficCrash.CrashType <- reader.GetFieldValue(9).ToString()
  13.     trafficCrash.CrashSeverity <- reader.GetFieldValue(10).ToString()
  14.     trafficCrash

The fact that I am using the <- symbol is a code smell, but I am not sure how to get around it.

In any event, once I ran it in my Console app:

  1. private static void LoadCrashData()
  2. {
  3.     Stopwatch stopWatch = new Stopwatch();
  4.     stopWatch.Start();
  5.     CrashDataLoader.LoadCrashData();
  6.     stopWatch.Stop();
  7.     Console.WriteLine("The load took " + stopWatch.Elapsed.TotalSeconds + " seconds.");
  8. }

I got nothing after 30m minutes!  Yikes.

I then went back and wrote a function to insert 1 row at a time:

  1. static member LoadCrashDataRow dataRow =
  2.     let targetDatabase = targetSchema.GetDataContext()
  3.     targetDatabase.TrafficCrashes.InsertOnSubmit(dataRow)
  4.     targetDatabase.DataContext.SubmitChanges()
  5.     true

And the consuming app:

  1. private static void LoadCrashData()
  2. {
  3.  
  4.     var crashRows = CrashDataLoader.GetCrashData;
  5.     Stopwatch stopWatch = new Stopwatch();
  6.     stopWatch.Start();
  7.     foreach (var crashRow in crashRows)
  8.     {
  9.         CrashDataLoader.LoadCrashDataRow(crashRow);
  10.         Console.WriteLine(crashRow.NCDmvCrashId + " loaded.");
  11.     }
  12.     stopWatch.Stop();
  13.     Console.WriteLine("The load took " + stopWatch.Elapsed.TotalSeconds + " seconds.");
  14. }

Sure enough, it works like a champ.

image

So it is slow – though I am not sure EF is any faster.  But not having to deal with that .edmx files, the .tt files, the whatever-else-we-throw-in files, I think further research is definitely warranted.  Also, there are some other things I think F# Type Providers need to have:

1) Ability to handle proxies

2) Making Plural tables singular (The table name is Crashes, the type should be Crash)

3) An MS Access TP would be great

4) An Azure Sql Database TP would be doubly great 

Traffic Stop Visualization Using D3

One of the comments I got from the TRINUG Data SIG was that the data and analysis were exciting but the results were, well, boring.  So I went back to the traffic stop data and thought about how I could sexy it up.  Since lots of people have been using D3 to present the data. I thought it would be a good place to start.

My 1st step was to look at their samples page and they have a simple bar chart that seems like a good introduction to the library.  I created an endpoint on my webapi for the summary data like so:

  1. [HttpGet]
  2. [Route("api/TrafficStopSearch/StopsByMonth/")]
  3. public dynamic StopsByMonth()
  4. {
  5.     return ChickenSoftware.RoadAlert.Analysis.AnalysisEngine.TrafficStopsByMonth;
  6.  
  7. }

I then spun up an empty asp.net website and created an index page based on their sample.  I then added an ajax call to the controller and replaced the reference to the data.tsv file:

  1. $.ajax({
  2.    url: "http://localhost:17680/api/TrafficStopSearch/StopsByMonth/&quot;,
  3.    dataType: "json",
  4.    success: function (data) {
  5.        x.domain(data.map(function (d) { return d.m_Item1; }));
  6.        y.domain([0, d3.max(data, function (d) { return d.m_Item6; })]);
  7.  
  8.        svg.append("g")
  9.            .attr("class", "x axis")
  10.            .attr("transform", "translate(0," + height + ")")
  11.            .call(xAxis);
  12.  
  13.        svg.append("g")
  14.            .attr("class", "y axis")
  15.            .call(yAxis)
  16.          .append("text")
  17.            .attr("transform", "rotate(-90)")
  18.            .attr("y", 6)
  19.            .attr("dy", ".71em")
  20.            .style("text-anchor", "end")
  21.            .text("Frequency");
  22.  
  23.        svg.selectAll(".bar")
  24.            .data(data)
  25.          .enter().append("rect")
  26.            .attr("class", "bar")
  27.            .attr("x", function (d) { return x(d.m_Item1); })
  28.            .attr("width", x.rangeBand())
  29.            .attr("y", function (d) { return y(d.m_Item6); })
  30.            .attr("height", function (d) { return height – y(d.m_Item6); });
  31.        
  32.    },
  33.    error: function(e){
  34.        alert("error");
  35.    }
  36. });

On thing to note is that the tuple that was created in F# and then passed though via the C# controller had its name changed.  Specially, Tuple.Item1 became Tuple.m_Item1.  I think that passing out tuple.anything is a horrible idea, so I created a POCO that actually lets the consumer know what each field means:

  1. public class OutputValue
  2. {
  3.     public Int32 Month { get; set; }
  4.     public Int32 ExpectedStops { get; set; }
  5.     public Int32 ActualStops { get; set; }
  6.     public Double DifferenceBetweenExpectedAndActual { get; set; }
  7.     public Double PercentDifferenceBetweenExpectedAndActual { get; set; }
  8.     public Double Frequency { get; set; }
  9. }

and then I adjusted the controller like so:

  1.  
  2. [HttpGet]
  3. [Route("api/TrafficStopSearch/StopsByMonth/")]
  4. public dynamic StopsByMonth()
  5. {
  6.     var outputs = new List<OutputValue>();
  7.     var resultSet= ChickenSoftware.RoadAlert.Analysis.AnalysisEngine.TrafficStopsByMonth;
  8.     foreach (var tuple in resultSet)
  9.     {
  10.         var outputValue = new OutputValue()
  11.         {
  12.             Month = tuple.Item1,
  13.             ExpectedStops = tuple.Item2,
  14.             ActualStops = tuple.Item3,
  15.             DifferenceBetweenExpectedAndActual = tuple.Item4,
  16.             PercentDifferenceBetweenExpectedAndActual = tuple.Item5,
  17.             Frequency = tuple.Item6
  18.         };
  19.         outputs.Add(outputValue);
  20.     }
  21.  
  22.     return outputs;
  23. }

So I adjusted the javascript and voila: a bar chart:

image

Up next – some real charts…

 

Screen Scraping The Department Of Health

As part of TRINUG’s Analytics SIG, some of the people were interested in health inspections found here.  I created a Public Records Request (PRR) on their website’s intact form:

image

And in the comments, I said this:

I would like to make a Public Records Request on this data.  I would like the following fields for all inspections 1/1/2013 to 12/31/2013: 

InspectionId
InspectorId
EstablishmentId
EstablishmenName
EstablishmentAddress
EstablishmentCity
EstablishmentZip
EstablishmentLat
EstablishmentLong
EstablishmentTypeId
EstablishmentTypeDesc
InspectionScore
NumberOfNonCriticalViolations
NumberOfCriticalViolations
InspectionDate

 

After a week, I did not hear back (their response is supposed to be 48 hours) so I emailed the director:

image

Not wanting to wait any longer for the data, I decided to do some screen scraping.  To that end, I did an on-line report and go something like this:

image

with the pages of the report down here:

image

I then went into source and checked out the pagination.  Fortunately, it was uri-based so there are 144 different uris like this one:

image

I pulled down all of the uris and put them into Excel.  I then trimmed off the <a and the text after the word “class”:

image

Fortunately, there was no restaurant with the word “class” in its name.  I know have 144 uris ready to go.  I then saved the uris into a .csv.

My next step is to suck these uris into a a database.  I have learned the hard way that screen scraping is fraught with mal-formed data and unstable connections.  Therefore, I will make a request and pull down a page and store it when the getting is good.  I will then parse that page separately.  Since the data appears in the past, I am less concerned about the data changing after I pull it local.

When I spun up an instance of Sql Server and tried to import the data, I kept getting things like this. 

image

 

So it is pretty obvious that Sql Server doesn’t make it easy to import text like uris (and I can image HTML).  I decided to spin up an instance of MongoDb.  Also, because the F# MongoDb driver is not in NuGet, I decided to go with C#. 

image

I then fired up a C# and read all of the uris into a List

  1. static List<String> GetUrisFromFileSystem()
  2. {
  3.     var path = @"C:\HealthDepartment\Inspections.csv";
  4.     var contents = File.ReadAllText(path);
  5.     var splitContents = contents.Split('\n');
  6.     var contentList = splitContents.ToList<String>();
  7.     contentList.RemoveAt(0);
  8.     return contentList;
  9. }

I then wrote the list into MongoDb.

  1. static void LoadDataIntoMongo(List<String> uris)
  2. {
  3.     var connectionString = "mongodb://localhost";
  4.     var client = new MongoClient(connectionString);
  5.     var server = client.GetServer();
  6.     var database = server.GetDatabase("HealthDepartment");
  7.     var collection = database.GetCollection<String>("UriEntities");
  8.     foreach (String uri in uris)
  9.     {
  10.         var entity = new UriEntity { Uri = uri };
  11.         collection.Insert(entity);
  12.         var id = entity.Id;
  13.         Console.WriteLine(id);
  14.     }
  15.  
  16.     
  17. }

 

The 1 gotcha is that I made my UriEntity class have a string as the Id.  This is not idomatic to Mongo and I got this:

image

It needs to be a a type of ObjectId.  Once I made that switch, I got this:

image

The fact that MongoDb makes things so much easier than SqlServer is really impressive.

With the Uris in Mongo, I then wanted to make a request to the individual pages.  I created a method that got to the contents of the page:

  1. static String GetHtmlForAUri(String uri)
  2. {
  3.     var fullyQualifiedUri = "http://wake.digitalhealthdepartment.com/&quot; + uri;
  4.     var request = WebRequest.Create(fullyQualifiedUri);
  5.     var response = request.GetResponse();
  6.     using(var stream = response.GetResponseStream())
  7.     {
  8.         using(var reader = new StreamReader(stream))
  9.         {
  10.             return reader.ReadToEnd();
  11.         }
  12.     }
  13.  
  14. }

I then fired up class to take the contents that are associated with the Uri:

  1. public class PageContentEntity
  2. {
  3.     public ObjectId Id { get; set; }
  4.     public ObjectId UriId { get; set; }
  5.     public String PageContent { get; set; }
  6. }

And created a method to persist the contents:

  1. static void LoadPageContentIntoMongo(PageContentEntity entity)
  2. {
  3.     var connectionString = "mongodb://localhost";
  4.     var client = new MongoClient(connectionString);
  5.     var server = client.GetServer();
  6.     var database = server.GetDatabase("HealthDepartment");
  7.     var collection = database.GetCollection<PageContentEntity>("PageContentEntities");
  8.     collection.Insert(entity);
  9.     Console.WriteLine(entity.Id);
  10. }

And then a method to put everything together

  1. static void LoadAllPageContentIntoMongo()
  2. {
  3.     var connectionString = "mongodb://localhost";
  4.     var client = new MongoClient(connectionString);
  5.     var server = client.GetServer();
  6.     var database = server.GetDatabase("HealthDepartment");
  7.     var collection = database.GetCollection<UriEntity>("UriEntities");
  8.     foreach(var uriEntity in collection.FindAllAs<UriEntity>())
  9.     {
  10.         String pageContent =  GetHtmlForAUri(uriEntity.TargetUri);
  11.         var pageEntity = new PageContentEntity()
  12.         {
  13.             UriId = uriEntity.Id,
  14.             PageContent =pageContent
  15.         };
  16.         LoadPageContentIntoMongo(pageEntity);
  17.     }
  18. }

So sure enough, I know have all of the pages local.  Doing something with it – that is the next trick…

Note that as soon as I finished up this piece, I got a note from the director of the department saying that they are looking at my request and will get back to me soon.

Traffic Stop Disposition: Classification Using F# and KNN

I have already looked at the summary statistics of the traffic stop data I received from the town here.  My next stop was to try and do a machine learning exercise with the data.  One of the more interesting questions I want to answer is what factors into weather a person gets a warning or a ticket (called disposition)?  Of all of the factors that may be involved, the dataset that I have is fairly limited:

image_thumb1

Using dispositionId as the result variable, there is StopDateTime and Location (Latitude/Longitude).  Fortunately, DateTime can be decomposed into several input variables.  For this exercise, I wanted to use the following:

  • TimeOfDay
  • DayOfWeek
  • DayOfMonth
  • MonthOfYear
  • Location (Latitude:Longitude)

And the resulting variable being disposition.  To make it easier for analysis, I limited the analysis set to finalDisposition as either “verbal warning” or “citation”  I decided to do a K-Nearest Neighbor because it is regarded as an easy machine learning algorithm to learn and the question does seem to be a classification problem.

My first step was to decide weather to write or borrow the KNN algorithm.  After looking at what kind of code would be needed to write my own and then looking at some other libraries, I decided to use Accord.Net.

My next first step was to get the data via the web service I spun up here.

  1. namespace ChickenSoftware.RoadAlert.Analysis
  2.  
  3. open FSharp.Data
  4. open Microsoft.FSharp.Data.TypeProviders
  5. open Accord.MachineLearning
  6.  
  7. type roadAlert2 = JsonProvider<"http://chickensoftware.com/roadalert/api/trafficstopsearch/Sample&quot;>
  8. type MachineLearningEngine =
  9.     static member RoadAlertDoc = roadAlert2.Load("http://chickensoftware.com/roadalert/api/trafficstopsearch&quot;)

My next first step was to filter the data to only verbal warnings (7) or citations (15). 

  1.   static member BaseDataSet =
  2.       MachineLearningEngine.RoadAlertDoc
  3.             |> Seq.filter(funx -> x.DispositionId = 7 || x.DispositionId = 15)
  4.           |> Seq.map(fun x -> x.Id, x.StopDateTime, x.Latitude, x.Longitude, x.DispositionId)
  5.           |> Seq.map(fun (a,b,c,d,e) -> a, b, System.Math.Round(c,3), System.Math.Round(d,3), e)
  6.           |> Seq.map(fun (a,b,c,d,e) -> a, b, c.ToString() + ":" + d.ToString(), e)
  7.           |> Seq.map(fun (a,b,c,d) -> a,b,c, match d with
  8.                                               |7 -> 0
  9.                                               |15 -> 1
  10.                                               |_ -> 1)
  11.           |> Seq.map(fun (a,b,c,d) -> a, b.Hour, b.DayOfWeek.GetHashCode(), b.Day, b.Month, c, d)
  12.           |> Seq.toList

You will notice that I had to transform the dispositionIds from 7 and 15 to 1 and 0.  The reason why is that the KNN method in Accord.Net assumes that the values match the index position in the array.  I had to dig into the source code of Accord.Net to figure that one out.

My next step was to divide the dataset in half: one half being the training sample and the other the validation sample:

  1. static member TrainingSample =
  2.     let midNumber = MachineLearningEngine.NumberOfRecords/ 2
  3.     MachineLearningEngine.BaseDataSet
  4.         |> Seq.filter(fun (a,b,c,d,e,f,g) -> a < midNumber)
  5.         |> Seq.toList
  6.  
  7. static member ValidationSample =
  8.     let midNumber = MachineLearningEngine.NumberOfRecords/ 2
  9.     MachineLearningEngine.BaseDataSet
  10.         |> Seq.filter(fun (a,b,c,d,e,f,g) -> a > midNumber)
  11.         |> Seq.toList

The next step was to actually run the KKN.  Before I could do that though, I had to create the distance function.  Since this was my 1st time, I dropped the geocoordinates and focused only on the time of day derivatives.

  1. static member RunKNN inputs outputs input =
  2.     let distanceFunction (a:int,b:int,c:int,d:int) (e:int,f:int,g:int,h:int) =  
  3.       let b1 = b * 4
  4.       let f1 = f * 4
  5.       let d1 = d * 2
  6.       let h1 = h * 2
  7.       float((pown(a-e) 2) + (pown(b1-f1) 2) + (pown(c-g) 2) + (pown(d1-h1) 2))
  8.  
  9.     let distanceDelegate =
  10.           System.Func<(int * int * int * int),(int * int * int * int),float>(distanceFunction)
  11.     
  12.     let knn = new KNearestNeighbors<int*int*int*int>(10,2,inputs,outputs,distanceDelegate)
  13.     knn.Compute(input)

You will notice I  tried to normalize the values so that they all had the same basis.  They are not exact, but they are close.  You will also notice that I had to create a delegate from for the distanceFunction (thanks to Mimo on SO).  This is because Accord.NET was written in C# with C# consumers in mind and F# has a couple of places where the interfaces are not as seemless as one would hope.

In any event, once the KKN function was written, I wrote a function that to the validation sample, made a guess via KKN, and then reported the result:

  1. static member GetValidationsViaKKN  =
  2.     let inputs = MachineLearningEngine.TrainingInputClass
  3.     let outputs = MachineLearningEngine.TrainingOutputClass
  4.     let validations = MachineLearningEngine.ValidationClass
  5.  
  6.     validations
  7.         |> Seq.map(fun (a,b,c,d,e) -> e, MachineLearningEngine.RunKNN inputs outputs (a,b,c,d))
  8.         |> Seq.toList
  9.  
  10. static member GetSuccessPercentageOfValidations =
  11.     let validations = MachineLearningEngine.GetValidationsViaKKN
  12.     let matches = validations
  13.                     |> Seq.map(fun (a,b) -> match (a=b) with
  14.                                                 | true -> 1
  15.                                                 | false -> 0)
  16.  
  17.     let recordCount =  validations |> Seq.length
  18.     let numberCorrect = matches |> Seq.sum
  19.     let successPercentage = double(numberCorrect) / double(recordCount)
  20.     recordCount, numberCorrect, successPercentage

I then hopped over to my UI console app and looked that the success percentage.

 

  1. private static void GetSuccessPercentageOfValidations()
  2. {
  3.     var output = MachineLearningEngine.GetSuccessPercentageOfValidations;
  4.     Console.WriteLine(output.Item1.ToString() + ":" + output.Item2.ToString() + ":" + output.Item3.ToString());
  5. }

image

So there are 12,837 records in the validation sample and the classifier guessed the correct disposition 9,001 times – a success percentage of 70%

So it looks like there is something there.  However, it is not clear that this is a good classifier without further tests – specifically seeing if the how to most common case results when pushing though the classifier.  Also, I would assume to make this a true ‘machine learning’ algorithm I would have to feed the results back to the distance function to see if I can alter it to get the success percentage higher.

One quick note about methodology – I used unit tests pretty extensively to understand how the KKN works.  I created a series of tests with some sample data to see who the function reacted. 

  1. [TestMethod]
  2. public void TestKKN_ReturnsExpected()
  3. {
  4.  
  5.     Tuple<int, int, int, int>[] inputs = {
  6.         new Tuple<int, int, int, int>(1, 0, 15, 1),
  7.         new Tuple<int,int,int,int>(1,0,11,1)};
  8.     int[] outputs = { 1, 1 };
  9.  
  10.     var input = new Tuple<int, int, int, int>(1, 1, 1, 1);
  11.  
  12.     var output = MachineLearningEngine.RunKNN(inputs, outputs, input);
  13.  
  14. }

This was a big help to get me up and running (walking, really..)…

Traffic Stop Analysis Using F#

Now that I have the traffic stop services up and running, it is time to actually do something with the data.  The data set is all traffic stops in my town for 2012 with some limited information: date/time of the stop, the geolocation of the stop, and the final disposition of the stop.  The data looks like this:

image

My 1st step was to look at the Date/Time and see if there are any patterns in DayOfMonth, MonthOfYear, And TimeOfDay.  To that end, I spun up a F# project and added my 1st method that determines the total number of records in the dataset:

  1. type roadAlert = JsonProvider<"http://chickensoftware.com/roadalert/api/trafficstopsearch/Sample&quot;>
  2. type AnalysisEngine =
  3.     static member RoadAlertDoc = roadAlert.Load("http://chickensoftware.com/roadalert/api/trafficstopsearch&quot;)
  4.  
  5.     static member NumberOfRecords =
  6.         AnalysisEngine.RoadAlertDoc
  7.             |> Seq.length

Since I am a TDDer more than a REPLer, I went and wrote a covering unit test.

  1. [TestMethod]
  2. public void NumberOfRecords_ReturnsExpected()
  3. {
  4.     Int32 notEpected = 0;
  5.     Int32 actual = AnalysisEngine.NumberOfRecords;
  6.     Assert.AreNotEqual(notEpected, actual);
  7. }

A couple of things to note about this:

1) This is really an integration test, not a unit test.  I could have written the test like this:

  1. [TestMethod]
  2. public void NumberOfRecordsFor2012DataSet_ReturnsExpected()
  3. {
  4.     Int32 expected = 27778;
  5.     Int32 actual = AnalysisEngine.NumberOfRecords;
  6.     Assert.AreEqual(expected, actual);
  7. }

But that means I am tying the test to the specific data sample (in its current state) – and I don’t want to do that.

2) I am finding that my F# code has many more functions than the code written by other people – esp data scientists.  I think it has to do with contrasting methodologies.  Instead of spending time in the REPL with a small piece of code to get it right and then adding the code into the larger code base, I am writing very small piece of code in the class and then using unit tests to get it right.  The upshot of that is that there are lots of small, independently testable pieces of code – I think this stems from my background of writing production apps that are for business problems and not for academic papers.  Also, I use classes in source files versus script files because I plan to plug the code into larger .NET applications that will be written in C# and/or VB.NET.

In any event, once I has the total number of records, I went to see how they broke down into month:

  1. static member ActualTrafficStopsByMonth =
  2.     AnalysisEngine.RoadAlertDoc
  3.         |> Seq.map(fun x -> x.StopDateTime.Month)
  4.         |> Seq.countBy(fun x-> x)
  5.         |> Seq.toList

  1. [TestMethod]
  2. public void ActualTrafficStopsByMonth_ReturnsExpected()
  3. {
  4.     Int32 notExpected = 0;
  5.     var stops = AnalysisEngine.ActualTrafficStopsByMonth;
  6.     Assert.AreNotEqual(notExpected, stops.Length);
  7.  
  8. }

 

I then created a function that shows the expected number of stops by month.  Pattern matching with F# makes creating the month list a snap.  Note that is is a true unit test because I am not dependent on external data:

  1. static member Months =
  2.     let monthList = [1..12]
  3.     Seq.map (fun x ->
  4.             match x with
  5.                 | 1 | 3 | 5 | 7 | 8 | 10 | 12 -> x,31,31./365.
  6.                 | 2 -> x,28,28./365.
  7.                 | 4 | 6 | 9 | 11 -> x,30, 30./365.
  8.                 | _ -> x,0,0.                    
  9.         ) monthList
  10.     |> Seq.toList   

  1. static member ExpectedTrafficStopsByMonth numberOfStops =
  2.     AnalysisEngine.Months
  3.         |> Seq.map(fun (x,y,z) ->
  4.             x, int(z*numberOfStops))
  5.         |> Seq.toList

  1. [TestMethod]
  2. public void ExpectedTrafficStopsByMonth_ReturnsExpected()
  3. {
  4.     var stops = AnalysisEngine.ExpectedTrafficStopsByMonth(27778);
  5.     double expected = 2359;
  6.     double actual =stops[0].Item2;
  7.  
  8.     Assert.AreEqual(expected, actual);
  9. }

With the actual and expected ready to go, I then put the two side by side:

  1. static member TrafficStopsByMonth =
  2.     let numberOfStops = float(AnalysisEngine.NumberOfRecords)
  3.     let monthlyExpected = AnalysisEngine.ExpectedTrafficStopsByMonth numberOfStops
  4.     let monthlyActual = AnalysisEngine.ActualTrafficStopsByMonth
  5.     Seq.zip monthlyExpected monthlyActual
  6.         |> Seq.map(fun (x,y) -> fst x, snd x, snd y, snd y – snd x, (float(snd y) – float(snd x))/float(snd x))
  7.         |> Seq.toList

  1. [TestMethod]
  2. public void TrafficStopsByMonth_ReturnsExpected()
  3. {
  4.     var output = AnalysisEngine.TrafficStopsByMonth;
  5.     Assert.IsNotNull(output);
  6.  
  7. }

All of my unit tests ran green

image

so now I am ready to roll.  I created a quick console UI

  1. static void Main(string[] args)
  2. {
  3.     Console.WriteLine("Start");
  4.  
  5.     foreach (var tuple in AnalysisEngine.TrafficStopsByMonth)
  6.     {
  7.         Console.WriteLine(tuple.Item1 + ":" + tuple.Item2 + ":" + tuple.Item3 + ":" + tuple.Item4 + ":" + tuple.Item5);
  8.     }
  9.  
  10.     Console.WriteLine("End");
  11.     Console.ReadKey();
  12. }

image

With the output.  Obviously, a UX person could put some real pizzaz front of this data, but that is something to do another day.  If you didn’t see it in the code above, the tuple is constructed as: Month,ExpectedStops,ActualStops,Difference,%Difference.  So the real interesting thing is that September was 47% higher than expected with December 26% less.  That kind of wide variation begs for more analysis.

I then did a similar analysis by DayOfMonth:

  1. static member ActualTrafficStopsByDay =
  2.     AnalysisEngine.RoadAlertDoc
  3.         |> Seq.map(fun x -> x.StopDateTime.Day)
  4.         |> Seq.countBy(fun x-> x)
  5.         |> Seq.toList
  6.  
  7. static member Days =
  8.     let dayList = [1..31]
  9.     Seq.map (fun x ->
  10.             match x with
  11.                 | x when x < 29 -> x, 12, 12./365.
  12.                 | 29 | 30 -> x, 11, 11./365.
  13.                 | 31 -> x, 7, 7./365.
  14.                 | _ -> x, 0, 0.                 
  15.         ) dayList
  16.     |> Seq.toList     
  17.  
  18. static member ExpectedTrafficStopsByDay numberOfStops =
  19.     AnalysisEngine.Days
  20.         |> Seq.map(fun (x,y,z) ->
  21.             x, int(z*numberOfStops))
  22.         |> Seq.toList    
  23.  
  24. static member TrafficStopsByDay =
  25.     let numberOfStops = float(AnalysisEngine.NumberOfRecords)
  26.     let dailyExpected = AnalysisEngine.ExpectedTrafficStopsByDay numberOfStops
  27.     let dailyActual = AnalysisEngine.ActualTrafficStopsByDay
  28.     Seq.zip dailyExpected dailyActual
  29.         |> Seq.map(fun (x,y) -> fst x, snd x, snd y, snd y – snd x, (float(snd y) – float(snd x))/float(snd x))
  30.         |> Seq.toList

image

The interesting thing is that there are higher than expected traffic stops in the last half of the month (esp the 25th and 26th) and much lower in the 1st part of the month.

And by TimeOfDay

  1. static member ActualTrafficStopsByHour =
  2.     AnalysisEngine.RoadAlertDoc
  3.         |> Seq.map(fun x -> x.StopDateTime.Hour)
  4.         |> Seq.countBy(fun x-> x)
  5.         |> Seq.toList
  6.  
  7. static member Hours =
  8.     let hourList = [1..24]
  9.     Seq.map (fun x ->
  10.                 x,1, 1./24.
  11.         ) hourList
  12.     |> Seq.toList     
  13.  
  14. static member ExpectedTrafficStopsByHour numberOfStops =
  15.     AnalysisEngine.Hours
  16.         |> Seq.map(fun (x,y,z) ->
  17.             x, int(z*numberOfStops))
  18.         |> Seq.toList    
  19.  
  20. static member TrafficStopsByHour =
  21.     let numberOfStops = float(AnalysisEngine.NumberOfRecords)
  22.     let hourlyExpected = AnalysisEngine.ExpectedTrafficStopsByHour numberOfStops
  23.     let hourlyActual = AnalysisEngine.ActualTrafficStopsByHour
  24.     Seq.zip hourlyExpected hourlyActual
  25.         |> Seq.map(fun (x,y) -> fst x, snd x, snd y, snd y – snd x, (float(snd y) – float(snd x))/float(snd x))
  26.         |> Seq.toList

image

 

The interesting thing here is that there are much higher than expected number of traffic stops from 1-2 AM (61% and 123%) with significantly less between 8PM and midnight.  Finally, I looked at GPS location for the stops.

  1. static member ActualTrafficStopsByGPS =  
  2.     AnalysisEngine.RoadAlertDoc
  3.         |> Seq.map(fun x -> System.Math.Round(x.Latitude,3).ToString() + ":" + System.Math.Round(x.Longitude,3).ToString())
  4.         |> Seq.countBy(fun x-> x)
  5.         |> Seq.sortBy snd
  6.         |> Seq.toList
  7.         |> List.rev
  8.  
  9. static member GetVarianceOfTrafficStopsByGPS =
  10.     let trafficStopList = AnalysisEngine.ActualTrafficStopsByGPS
  11.                             |> Seq.map(fun x -> double(snd x))
  12.                             |> Seq.toList
  13.     AnalysisEngine.Variance(trafficStopList)
  14.  
  15. static member GetAverageOfTrafficStopsByGPS =
  16.     AnalysisEngine.ActualTrafficStopsByGPS
  17.         |> Seq.map(fun x -> double(snd x))
  18.         |> Seq.average

 

You can see that I rounded the Latitude and Longitude to 3 decimal places.  Using Wikipedia, saying that 4 decimals at 23N is 10.24M and 45N it is 7.87M for latitude, I imputed that 35 is 8.94M.  With 1 M = 3.28 feet, that means that 4 decimals is with 30 feet and 3 decimals is within 300 feet and 2 decimals is within 3,000 feet.  300 feet seems like a good compromise so I ran with that.

So running the average and variance and the top GPS locations:

image

With an average of 11 stops per GPS location (less than 1 a month) and a variance of 725, there does not seem be a strong relationship between GPS location and traffic stops.

The upshot of all of this analysis seems to point to avoid getting stopped it is less important where you are than when you are.  This is confirmed anecdotally too – the Town actually broadcasts when they will have heightened traffic surveillance on Twitter and the like.  Ignore open data at your own risk.  

In any event, I my next step is to run this data though a machine-learning algorithm to see if there is anything else to uncover.

Setting up an OData Service on WebAPI2 to be used by F# Type Providers

I am prepping for the F#/Data Analytics workshop on January 8th and wanted to get the data that I used for the Road Alert application beck to better shape.  By better, I mean out of that crusty WCF SOAP that I have had it in for the last 2 years.  To that end, I jumped over to Mike Wasson’s Creating an OData tutorial.  All in all, it is a good step by step guide, but I had to make some changes to get it working for me.

Change #1 is that I am not using a local database, I am using a database located on WinHost.  Therefore, I had to swap out the EF connection string.

image

One of the things I can appreciate about the template is the comments to get the routing set up (I guess there is not attribute-based routing for O-Data?)

  1.         /*
  2. To add a route for this controller, merge these statements into the Register method of the WebApiConfig class. Note that OData URLs are case sensitive.
  3.  
  4. using System.Web.Http.OData.Builder;
  5. using ChkickenSoftware.RoadAlertServices.Models;
  6. ODataConventionModelBuilder builder = new ODataConventionModelBuilder();
  7. builder.EntitySet<TrafficStop>("TrafficStop");
  8. config.Routes.MapODataRoute("odata", "odata", builder.GetEdmModel());
  9. */

Things were looking good when I took a departure from the tutorial and added in a couple of unit tests (Change #2).  The 1st one was fairly benign:

  1. [TestClass]
  2. public class TrafficStopControllerIntegrationTests
  3. {
  4.     [TestMethod]
  5.     public void GetTrafficStopUsingKey_ReturnsExpected()
  6.     {
  7.         TrafficStopController controller = new TrafficStopController();
  8.         var trafficStop = controller.GetTrafficStop(1);
  9.         Assert.IsNotNull(trafficStop);
  10.     }
  11. }

Note that I had to add an app.config to the test project b/c this is an integration test and I am making a real database call – a unit test would using a mocking framework.  In any event, when I went to run the test, I got a compile error – I needed to add a reference to System.Web.Http.OData to resolve the return value from the controller.  Not big thing, though I wish I could install packages from Nuget via their .dll name and not just their package name:

image

In any event, I then ran the test and I got this exception:

image

So this is another reason why EF drives me nuts.  I have to add a reference to Entity Framework (and throw some crap in the .config file)

  1. <entityFramework>
  2.   <defaultConnectionFactory type="System.Data.Entity.Infrastructure.SqlConnectionFactory, EntityFramework" />
  3.   <providers>
  4.     <provider invariantName="System.Data.SqlClient" type="System.Data.Entity.SqlServer.SqlProviderServices, EntityFramework.SqlServer" />
  5.   </providers>
  6. </entityFramework>

– even thought the calling application has nothing to do with EF.  In 2014, we have such dependency drip?  Really?  In any event, once I added a reference to EF and updated the .config file, my unit/integration test ran green so I was on the right track.

I then went to fiddler and tried to call the controller:

image

Yikes, it looks like my model has to match the EF exactly

The database:

 image

And the model:

  1. public class TrafficStop
  2. {
  3.     public Int32 Id { get; set; }
  4.     public double CadCallId { get; set; }
  5.     public DateTime StopDateTime { get; set; }
  6.     public Int32 DispositionId { get; set; }
  7.     public String DispositionDesc { get; set; }
  8.     public double Latitude { get; set; }
  9.     public double Longitude { get; set; }
  10. }

 

– I assume that I should be able to override this behavior – another thing to research.

So after matching up field names, I ran fiddler and sure enough:

image

So that was pretty painless to get an OData Service up and running.  I then removed everything but the read methods and I added an auth header (you can see the value in the screen shot above), feel free to hit up the service now that it is deployed to WinHost:

    One of the coolest things about OData is that it has a .WSDL type discovery:

http://chickensoftware.com/RoadAlert/odata/$metadata

I was really missing that when we went from SOAP Services to REST

Note that I had to do a couple of more things in Tsql (remember that?) to the original data to get it ready for general consumption (and analytics).  I had to create a real date/time from the 2 varchar fields:

Update [XXXX].[dbo].[TrafficStops]
Set StopDateTime = Convert(DateTime, right (left([Date],6),2) + ‘/’ + right([Date],2) + ‘/’ + left([Date],4) + ‘ ‘ + left(Time,2) + ‘:’ + Right(left(Time,4),2) + ‘:’ + Right(left(Time,6),2))

 

I also had to add an integral value for when we do statistical analysis:

Update [XXXXX].[dbo].[TrafficStops]
Set dispositionId =
CASE 
     WHEN dispositionDesc = ‘FURTHER ACTION NECESSARY’ THEN 1
     WHEN dispositionDesc = ‘UNABLE TO LOCATE’ THEN 2
     WHEN dispositionDesc = ‘FALSE ALARM’ THEN 3
     WHEN dispositionDesc = ‘WRITTEN WARNING’ THEN 4
     WHEN dispositionDesc = ‘OTHER    SEE NOTES’ THEN 5
     WHEN dispositionDesc = ‘REFERRED TO PROPER AGENCY’ THEN 6
     WHEN dispositionDesc = ‘VERBAL WARNING’ THEN 7
     WHEN dispositionDesc = ‘NULL’ THEN 8
     WHEN dispositionDesc = ‘ARREST’ THEN 9
     WHEN dispositionDesc = ‘NO FURTHER ACTION NECESSARY’ THEN 10
     WHEN dispositionDesc = ‘CIVIL PROBLEM’ THEN 11
     WHEN dispositionDesc = ‘COMPLETED AS REQUESTED’ THEN 12
     WHEN dispositionDesc = ‘INCIDENT REPORT’ THEN 13
     WHEN dispositionDesc = ‘UNFOUNDED’ THEN 14
     WHEN dispositionDesc = ‘CITATION’ THEN 15
     WHEN dispositionDesc = ‘FIELD CONTACT’ THEN 16
     WHEN dispositionDesc = ‘BACK UP UNIT’ THEN 17
     WHEN dispositionDesc = ‘CITY ORDINANCE VIOLATION’ THEN 18
END

So now I am ready to roll with doing the analytics.

So when I say “ready to roll”, I really meant to say “ready to flail.” When we last left the show, I was ready to start consuming the data from OData using the F# type providers.   Using Fiddler, I can see the data coming out of the OData service

image_thumb1

The problem started when I went to consume the data using the F# OData Type Provider as documented here.  I got the red squiggly line of approbation when I went to create the type:

image_thumb3

with the following message:

Error    1    The type provider ‘Microsoft.FSharp.Data.TypeProviders.DesignTime.DataProviders’ reported an error: error 7001: The element ‘DataService’ has an attribute ‘DataServiceVersion’ with an unrecognized version ‘3.0’.   

 

I went over to the F#-open source Google group to seek help and Isaac Abraham had this response:

WebAPI 2 now pushes out OData 3 endpoints by default, which are actually not even backwards compatible with the OData 2 standard. OData 3 was (AFAIK) released some time after the OData Type Provider was written, so I suspect it doesn’t support OData 3.

So I am stuck.  I really want to use type providers but they are behind.  I thought about if I could downgrade my WebAPI2 OData to go to OData2 standard (whatever that is).

My 1st thought was to trick out the client by removing the DataServiceVersion header like so:

  1. public class HeadersHandler : DelegatingHandler
  2. {
  3.     async protected override Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
  4.     {
  5.         HttpResponseMessage response = await base.SendAsync(request, cancellationToken);
  6.        
  7.         response.Content.Headers.Remove("DataServiceVersion");
  8.         return response;
  9.     }
  10.  
  11. }

The header was removed, but alas, the RSLA is still with me with the same message.  I then thought, perhaps I can go back to the old version of Json so I modified the header like so:

  1. public class HeadersHandler : DelegatingHandler
  2. {
  3.     async protected override Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
  4.     {
  5.         request.Headers.Add("Accept", "application/json;odata=verbose");
  6.  
  7.         HttpResponseMessage response = await base.SendAsync(request, cancellationToken);
  8.        
  9.         response.Content.Headers.Remove("DataServiceVersion");
  10.         return response;
  11.     }
  12.  
  13. }

So the Json is now the “old” version, but I am still getting the RSLA.  I then ran fiddler when creating the type provider and I see this:

image_thumb7

Crap.  I need to have Entity Framework use a lower version (I am using EF 6.0).  I guess?  My 1st thought was to remove EF from the situation entirely, which is always a good idea.  My next, and more time-efficient, thought was to ask Stack Overflow – which is what I did here.  While I wait for Stack Overflow to come to the rescue. I decided to press on.  I just exposed the data via a normal controller like so:

  1. public class TrafficStopSearchController : ApiController
  2. {
  3.     public List<TrafficStop> Get()
  4.     {
  5.         DataContext context = new DataContext();
  6.         return context.TrafficStops.ToList<TrafficStop>();
  7.     }
  8.     public TrafficStop Get(int id)
  9.     {
  10.         DataContext context = new DataContext();
  11.         return context.TrafficStops.Where(ts => ts.Id == id).FirstOrDefault();
  12.     }
  13.  
  14.     [HttpGet]
  15.     [Route("api/TrafficStopSearch/Sample/")]
  16.     public List<TrafficStop> Sample()
  17.     {
  18.         DataContext context = new DataContext();
  19.         return context.TrafficStops.Where(ts => ts.Id < 100).ToList();
  20.     }
  21. }

The reason I threw in the Sample method is that the F#  JSON type provider uses a sample to infer types and I didn’t want to send the entire set of data across the wire for that.  Once that was done, the traffic stop data was consumable in my F# application like so:

  1. type roadAlert = JsonProvider<"http://chickensoftware.com/roadalert/api/trafficstopsearch/Sample&quot;>
  2. type AnalysisEngine =
  3.     static member RoadAlertDoc = roadAlert.Load("http://chickensoftware.com/roadalert/api/trafficstopsearch&quot;)

Once/if I get the OData set up, I will swap this out but this is good enough for now – after all the interesting piece is not getting the data – but doing something with it!