Elevator App: Part 0

When I was young, I used to have window races with my siblings in the car. In those days, windows were manual, with a crank like this:

clip_image001

Each kid would get a door and on “go”, we would crank, crank, crank the window all the way up and all the way down. These were great time passers on a long trip – until dad noticed that it was getting alternately windy and calm in the car and put an end to it. With the advent of electric windows in cars, the races became much closer – we honestly thought that the harder you pushed on the button, the faster the window would move up and down. It was good fun until dad caught me pouring Crisco down the back left window of the Buick Skylark.  Greasing the wheels, if you will.

Fast forward 15 years when I was just out of school at my 1st job in San Francisco. One of the things we did when we didn’t have any money (often) on a Friday night was to ride elevators. Find a nice building, walk in like you knew what you were doing, and ride to the top and back down. Hotels were the best – ideally you could get free food in the bar for happy hour too – it was dinner theatre. Sometimes, if there were two elevators in the bank, we would turn it into a competition – pick your elevator wisely. If someone got on your elevator while you were racing, you were out of luck. This added a good deal of tension to the race. You were in the lead on the way back down but then someone stopped you on floor 2 to go down to G. “UGHHHHH. Take the stairs fat-ass!”

(BTW: how great would this be for the plot of a movie? The main characters pick different buildings to race in until one night, they pick the wrong one….)

The biggest problem with elevator races (outside of getting caught and spending a night in jail) was that if the building only had one elevator, you really couldn’t race. Even if you had a stopwatch, you didn’t really know if the other person made it to the top. Also, the chance of other waiting passengers stopping your elevator was now 100%.  Plus, do you really trust your competition with the stopwatch?  Maybe your friends, but not mine…

Fast forward 20 more years, and I started working with open data. One of the data sets that TRINUG wanted to look at was elevator inspection data. When I went to the website where you can run reports, there was a column called… speed. Holy smokes, we can see the speed of the different elevators in town! I then thought about a phone app that could measure the speed of the elevator versus the reported speed. Also, we finally have a solution to the one-elevator race problem! I then thought about how to tie this into coming up to speed with the latest JavaScript technologies (I am trying to get my MCSD in Web too).  Thus, I am going to create an elevator speed app using Angular and PhoneGap.  Let’s see how it goes…

Restaurant Classifier: Async For Faster Performance?

Going back to my restaurant classifier using F# from last week, I decided to speed things up some.  Each request to the Yellow Pages API takes about one second, so with the 5,682 records, I am looking at a little over 1.5 hours to pull down the data when running serially.
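As a back-of-the-envelope check (my own arithmetic, assuming one second per request):

```fsharp
// Rough estimate of the serial run time: one request per record,
// roughly one second per request.
let records = 5682
let minutes = float records / 60.0
printfn "%.1f minutes" minutes   // about 94.7 minutes
```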

I first thought about making my methods async so I changed the API call method to async and used the Http.AsyncRequest method like so (line 10 below):

  1. member this.GetCatagoriesAsync(restaurantName: string, restaurantAddress: string) =
  2.          async{
  3.              if(String.IsNullOrEmpty(restaurantName)) then
  4.                  failwith("restaurantName cannot be null or empty.")
  5.              if(String.IsNullOrEmpty(restaurantAddress)) then
  6.                  failwith("restaurantAddress cannot be null or empty.")
  7.              let cleanedName = this.CleanName(restaurantName)
  8.              let cleanedAddress = this.CleanAddress(restaurantAddress);
  9.              let uri = "http://pubapi.atti.com/search-api/search/devapi/search?term="+cleanedName+"&searchloc="+cleanedAddress+"&format=json&key=qj5l8pphj5"
  10.              let! response = FSharp.Net.Http.AsyncRequest(uri, headers=["user-agent", "None"])
  11.              let ypResult = ypProvider.Parse(response)
  12.              try
  13.                  return ypResult.SearchResult.SearchListings.SearchListing.[0].Categories
  14.              with
  15.                  | ex -> return String.Empty
  16.          }

I then made the covering function async also (line 11 below)

  1. member this.IsRestaurantInCatagoryAsync(restaurantName: string, restaurantAddress: string, restaurantCatagory: string) =
  2.     async {
  3.         if(String.IsNullOrEmpty(restaurantName)) then
  4.             failwith("restaurantName cannot be null or empty.")
  5.         if(String.IsNullOrEmpty(restaurantAddress)) then
  6.             failwith("restaurantAddress cannot be null or empty.")
  7.         if(String.IsNullOrEmpty(restaurantCatagory)) then
  8.             failwith("restaurantCatagory cannot be null or empty.")
  9.  
  10.         System.Threading.Thread.Sleep(new System.TimeSpan(0,0,1))
  11.         let! catagories = this.GetCatagoriesAsync(restaurantName, restaurantAddress)
  12.         if(String.IsNullOrEmpty(catagories)) then return false
  13.         else return this.IsCatagoryInCatagories(catagories,restaurantCatagory)
  14.     }

The problem is that invoking the covering function via an anonymous method did not work easily.

image

After screwing around with the syntax a bit, I went over to Stack Overflow where I found out two things:

  • There is not an easy way to do it (I was hoping for a Seq.filterAsync method)
  • Tomas Petricek is above my pay-grade. 
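Since there is no Seq.filterAsync in the box, here is one way such a function could be sketched on top of Async.Parallel (my own hedged sketch, not the Stack Overflow answer): evaluate the async predicate for every element, then keep the elements whose predicate came back true.

```fsharp
// A hand-rolled stand-in for the missing Seq.filterAsync: run the
// async predicate for every element in parallel, then keep the
// elements whose predicate returned true (order is preserved).
let filterAsync (predicate: 'a -> Async<bool>) (source: seq<'a>) =
    async {
        let items = Seq.toArray source
        let! flags = items |> Array.map predicate |> Async.Parallel
        return Array.zip items flags
               |> Array.filter snd
               |> Array.map fst
    }
```

For example, `[1..6] |> filterAsync (fun x -> async { return x % 2 = 0 }) |> Async.RunSynchronously` keeps only the even numbers.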

In any event, I decided to drop the async and just look at parallelism.  It turns out there is a parallel sequence module called PSeq; it is just not in the F# core library yet.  I created a PSeq file in my project, moved it to the top of the project, and dropped the code in.  I then changed the calling method to use PSeq to invoke the serial methods:

  1. member public this.GetChineseRestaurants () =
  2.     let catagoryRepository = new RestaurantCatagoryRepository()
  3.     let catagory = "Chinese"
  4.     this.GetRestaurants()
  5.             |> PSeq.filter(fun (name, address) -> catagoryRepository.IsRestaurantInCatagory(name, address,catagory))
  6.             |> Seq.toList    

When I first invoked it and looked at Fiddler (OT: did anyone notice that Fiddler’s new logo looks a lot like an F# one?  Probably just a coincidence), it was clear that things were running in parallel and that performance would improve.  I have two cores on this workstation, so my time should be cut roughly in half. 

image

With the parallel method in my back pocket, I decided to see the ultimate result of the restaurant classification.  I created a quick console app:

  1. class Program
  2. {
  3.     static void Main(string[] args)
  4.     {
  5.         Console.WriteLine("Start");
  6.  
  7.         Stopwatch stopwatch = new Stopwatch();
  8.         stopwatch.Start();
  9.         RestaurantBuilder builder = new RestaurantBuilder();
  10.         var restaurants = builder.GetChineseRestaurants();
  11.         
  12.         foreach (var restaurant in restaurants)
  13.         {
  14.             Console.WriteLine(restaurant.Item1 + ":" + restaurant.Item2);
  15.         }
  16.         
  17.         stopwatch.Stop();
  18.         Console.WriteLine("Number of Chinese Restaurants: " + restaurants.Count());
  19.         Console.WriteLine(stopwatch.Elapsed.ToString());
  20.         Console.WriteLine("End");
  21.         Console.ReadKey();
  22.     }
  23. }

I then ran the search on YP.com using my 4 core laptop and got the following results:

image

Compared to my original classifier based on name:

image

So the results make sense.  The YP serial search would take at least 94.7 minutes, the YP parallel search took 41 minutes, and the in-memory name search took 3 seconds.  The YP searches found restaurants that the name search did not (Wang’s Kitchen, Crazy Fire Mongolian Grill, etc.) – 275 versus 221, or 24% more restaurants.
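Checking those numbers (my own arithmetic, using only the figures above):

```fsharp
// Sanity-checking the timings and counts quoted above.
let serialMinutes = 5682.0 / 60.0          // about 94.7 minutes at one second per record
let speedup = serialMinutes / 41.0         // about 2.3x from the parallel run
let extraFound = (275.0 - 221.0) / 221.0   // about 0.24, i.e. 24% more restaurants
```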

I think the next step is to look at the classifier and see how many restaurants are in both datasets, and, for the ones that are not in the YP dataset, figure out where they are (did they even pay to be in the Yellow Pages?).  Perhaps there is another YP category that can be considered.  Also, it would be interesting to see which of the restaurants found by the name search are in the Yellow Pages but not classified as Chinese – the false positive rate.  Finally, I did see some 500s in Fiddler with “read time out”, so there is room for improvement to account for the transient faults…
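For those transient faults, a simple retry wrapper might be a starting point (my own hedged sketch; the attempt count and delay are made-up values, not anything from the API docs):

```fsharp
// Retry an operation a few times, pausing between attempts,
// to ride out transient faults like the "read time out" 500s.
let rec retry attemptsLeft (delay: System.TimeSpan) (operation: unit -> 'a) =
    try
        operation ()
    with _ when attemptsLeft > 1 ->
        System.Threading.Thread.Sleep(delay)
        retry (attemptsLeft - 1) delay operation
```

A call site might then look like `retry 3 (System.TimeSpan.FromSeconds 1.0) (fun () -> ...)`, where the wrapped lambda does the actual request.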

Restaurant Classification Via the Yellow Pages API Using F#

As part of the restaurant analysis I did for open data day, I built a crude classifier to identify Chinese restaurants.  The classifier looked at the name of the establishment and if certain key words were in the name, it was tagged as a Chinese restaurant.

  1. member public x.IsEstablishmentAChineseRestraurant (establishmentName:string) =
  2.     let upperCaseEstablishmentName = establishmentName.ToUpper()
  3.     let numberOfMatchedWords = upperCaseEstablishmentName.Split(' ')
  4.                                 |> Seq.map(fun x -> match x with
  5.                                                         | "ASIA" -> 1
  6.                                                         | "ASIAN" -> 1
  7.                                                         | "CHINA" -> 1
  8.                                                         | "CHINESE" -> 1
  9.                                                         | "PANDA" -> 1
  10.                                                         | "PEKING" -> 1
  11.                                                         | "WOK" -> 1
  12.                                                         | _ -> 0)
  13.                                 |> Seq.sum
  14.     match numberOfMatchedWords with
  15.         | 0 -> false
  16.         | _ -> true
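For what it’s worth, the same keyword rule can be written more compactly as a set lookup (my own behavior-equivalent sketch, not the code I used for the analysis):

```fsharp
// The same rule as above, as a set lookup: the establishment is
// tagged as Chinese if any word in the upper-cased name is a keyword.
let chineseKeywords =
    set [ "ASIA"; "ASIAN"; "CHINA"; "CHINESE"; "PANDA"; "PEKING"; "WOK" ]

let isChineseRestaurant (establishmentName: string) =
    establishmentName.ToUpper().Split(' ')
    |> Array.exists (fun word -> chineseKeywords.Contains word)
```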

Although this worked well enough for the analysis, I was interested in seeing if there was something more precise.  To that end, I thought of the Yellow Pages – they classify restaurants into categories, and assuming the restaurant is listed, that is a better way to determine its category than a name search.

The first thing I did was head over to the Yellow Pages (YP.com) website and sure enough, they have an API and a developers program.  I signed up and had an API key within a couple of minutes.

The next thing I did was try to search for a restaurant in the browser.  I picked the first restaurant I came across in the dataset – Jumbo China #5 – and created a request URI based on their API like so:

http://pubapi.atti.com/search-api/search/devapi/search?term=Jumbo+China+5&searchloc=6108+Falls+Of+Neuse+Rd+27609&format=json&key=XXXXXXXXXX

When I plugged the name into the browser, I got this:

image

After screwing around with the code for about ten minutes thinking it was my API key (Invalid Key would lead you to believe that, no?), Mike Thomas came over and told me that the URL encoding was messing with my request – specifically, the ‘#’ in Jumbo China #5. When I removed the ‘#’ symbol, I got JSON back:

image

Throwing the JSON into Json2CSharp, the results look great:

image

I then took this URL and tried to load it into an F# type provider, and I couldn’t understand why I was getting a red squiggly line of disapproval (for both the JSON and XML providers):

image

 

So I pulled out Fiddler to see what was happening – I was getting a 400.  Digging into the response, I found that “User-Agent” was a required header. 

image

The problem was then compounded because the F# JSON type provider does not allow you to pass a user agent into the constructor.  I headed over to Stack Overflow, where Tomas Petricek was kind enough to answer the question – basically, you have to use the F# Http class to make the request (where you can add the user agent) and then parse the response via the JsonProvider using the “Parse” method rather than “Load”.  So, spinning up the method like so:

image

This gave me back the results I wanted.  I then created a couple of methods to clean up any characters that might break the URL encoding, added some argument validation, and had a pretty good module to consume the YP.com listings:

  1. namespace ChickenSoftware.RestaurantClassifier
  2.  
  3. open System
  4. open FSharp.Data
  5. open FSharp.Net
  6.  
  7. type ypProvider = JsonProvider< @"YP.txt">
  8.  
  9. type RestaurantCatagoryRepository() =
  10.    member this.GetCatagories(restaurantName: string, restaurantAddress: string) =
  11.         if(String.IsNullOrEmpty(restaurantName)) then
  12.             failwith("restaurantName cannot be null or empty.")
  13.         if(String.IsNullOrEmpty(restaurantAddress)) then
  14.             failwith("restaurantAddress cannot be null or empty.")
  15.         let cleanedName = this.CleanName(restaurantName)
  16.         let cleanedAddress = this.CleanAddress(restaurantAddress);
  17.         let uri = "http://pubapi.atti.com/search-api/search/devapi/search?term="+cleanedName+"&searchloc="+cleanedAddress+"&format=json&key=XXXXXX"
  18.         let response = FSharp.Net.Http.Request(uri, headers=["user-agent", "None"])
  19.         let ypResult = ypProvider.Parse(response)
  20.         try
  21.             ypResult.SearchResult.SearchListings.SearchListing.[0].Categories
  22.         with
  23.             | ex -> String.Empty
  24.  
  25.     member this.CleanName(name: string) =
  26.                 name.Replace("#","").Replace(" ","+")
  27.  
  28.     member this.CleanAddress(address: string)=
  29.                 address.Replace("#","").Replace(" ","+")
  30.     
  31.     member this.IsCatagoryInCatagories(catagories: string, catagory: string) =
  32.         if(String.IsNullOrEmpty(catagories)) then false
  33.         else if (String.IsNullOrEmpty(catagory)) then false
  34.         else catagories.Contains(catagory)
  35.  
  36.     member this.IsRestaurantInCatagory(restaurantName: string, restaurantAddress: string, restaurantCatagory: string) =
  37.         if(String.IsNullOrEmpty(restaurantName)) then
  38.             failwith("restaurantName cannot be null or empty.")
  39.         if(String.IsNullOrEmpty(restaurantAddress)) then
  40.             failwith("restaurantAddress cannot be null or empty.")
  41.         if(String.IsNullOrEmpty(restaurantCatagory)) then
  42.             failwith("restaurantCatagory cannot be null or empty.")
  43.  
  44.         System.Threading.Thread.Sleep(new System.TimeSpan(0,0,1))
  45.         let catagories = this.GetCatagories(restaurantName, restaurantAddress)
  46.         if(String.IsNullOrEmpty(catagories)) then false
  47.         else this.IsCatagoryInCatagories(catagories,restaurantCatagory)
  48.  
  49.     member this.IsRestaurantInCatagoryAsync(restaurantName: string, restaurantAddress: string, restaurantCatagory: string) =
  50.         async {
  51.             if(String.IsNullOrEmpty(restaurantName)) then
  52.                 failwith("restaurantName cannot be null or empty.")
  53.             if(String.IsNullOrEmpty(restaurantAddress)) then
  54.                 failwith("restaurantAddress cannot be null or empty.")
  55.             if(String.IsNullOrEmpty(restaurantCatagory)) then
  56.                 failwith("restaurantCatagory cannot be null or empty.")
  57.  
  58.             let catagories = this.GetCatagories(restaurantName, restaurantAddress)
  59.             if(String.IsNullOrEmpty(catagories)) then return false
  60.             else return this.IsCatagoryInCatagories(catagories,restaurantCatagory)
  61.         }
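As an aside, the character stripping that CleanName and CleanAddress do could be avoided entirely by escaping the raw values when building the URI; here is a hedged sketch using the BCL’s System.Uri.EscapeDataString (buildSearchUri is my own hypothetical helper, not part of the module above):

```fsharp
// Escape each query-string value so characters like '#' survive
// the trip, instead of stripping them by hand.
let buildSearchUri (term: string) (location: string) (key: string) =
    sprintf "http://pubapi.atti.com/search-api/search/devapi/search?term=%s&searchloc=%s&format=json&key=%s"
        (System.Uri.EscapeDataString term)
        (System.Uri.EscapeDataString location)
        key
```

With this, “Jumbo China #5” becomes `Jumbo%20China%20%235` in the query string, so the ‘#’ no longer truncates the request.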

The associated unit and integration tests that I made in building this module look like this:

  1. [TestClass]
  2. public class CatagoryBuilderTests
  3. {
  4.  
  5.     [TestMethod]
  6.     public void CleanName_ReturnsExpectedValue()
  7.     {
  8.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  9.         String restaurantName = "Jumbo China #5";
  10.  
  11.         String expected = "Jumbo+China+5";
  12.         String actual = repository.CleanName(restaurantName);
  13.         Assert.AreEqual(expected, actual);
  14.     }
  15.  
  16.     [TestMethod]
  17.     public void CleanAddress_ReturnsExpectedValue()
  18.     {
  19.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  20.         String restaurantAddress = "6108 Falls Of Neuse Rd 27609";
  21.  
  22.         String expected = "6108+Falls+Of+Neuse+Rd+27609";
  23.         String actual = repository.CleanAddress(restaurantAddress);
  24.         Assert.AreEqual(expected, actual);
  25.     }
  26.  
  27.  
  28.     [TestMethod]
  29.     public void GetCatagories_ReturnsExpectedValue()
  30.     {
  31.         string restaurantName = "Jumbo China #5";
  32.         String restaurantAddress = "6108 Falls Of Neuse Rd 27609";
  33.  
  34.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  35.         var result = repository.GetCatagories(restaurantName, restaurantAddress);
  36.         Assert.IsNotNull(result);
  37.     }
  38.  
  39.     [TestMethod]
  40.     public void CatagoryIsContainedInCatagoriesUsingValidTrueData_ReturnsExpectedValue()
  41.     {
  42.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  43.  
  44.         String catagories = "Chinese Restaurants|Restaurants|";
  45.         String catagory = "Chinese";
  46.  
  47.         Boolean expected = true;
  48.         Boolean actual = repository.IsCatagoryInCatagories(catagories, catagory);
  49.  
  50.         Assert.AreEqual(expected, actual);
  51.     }
  52.  
  53.     [TestMethod]
  54.     public void CatagoryIsContainedInCatagoriesUsingValidFalseData_ReturnsExpectedValue()
  55.     {
  56.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  57.  
  58.         String catagories = "Chinese Restaurants|Restaurants|";
  59.         String catagory = "Seafood";
  60.  
  61.         Boolean expected = false;
  62.         Boolean actual = repository.IsCatagoryInCatagories(catagories, catagory);
  63.  
  64.         Assert.AreEqual(expected, actual);
  65.     }
  66.  
  67.     [TestMethod]
  68.     public void IsJumboChinaAChineseRestaurant_ReturnsTrue()
  69.     {
  70.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  71.  
  72.         string restaurantName = "Jumbo China #5";
  73.         String restaurantAddress = "6108 Falls Of Neuse Rd 27609";
  74.         String restaurantCatagory = "Chinese";
  75.  
  76.         Boolean expected = true;
  77.         Boolean actual = repository.IsRestaurantInCatagory(restaurantName, restaurantAddress, restaurantCatagory);
  78.  
  79.         Assert.AreEqual(expected, actual);
  80.     }
  81.  
  82.     [TestMethod]
  83.     public void IsJumboChinaAnItalianRestaurant_ReturnsFalse()
  84.     {
  85.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  86.  
  87.         string restaurantName = "Jumbo China #5";
  88.         String restaurantAddress = "6108 Falls Of Neuse Rd 27609";
  89.         String restaurantCatagory = "Italian";
  90.  
  91.         Boolean expected = false;
  92.         Boolean actual = repository.IsRestaurantInCatagory(restaurantName, restaurantAddress, restaurantCatagory);
  93.  
  94.         Assert.AreEqual(expected, actual);
  95.     }
  96.  
  97.     [TestMethod]
  98.     public void IsUnknownAnItalianRestaurant_ReturnsFalse()
  99.     {
  100.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  101.  
  102.         string restaurantName = "Some Unknown Restaurant";
  103.         String restaurantAddress = "Some Address";
  104.         String restaurantCatagory = "Italian";
  105.  
  106.         Boolean expected = false;
  107.         Boolean actual = repository.IsRestaurantInCatagory(restaurantName, restaurantAddress, restaurantCatagory);
  108.  
  109.         Assert.AreEqual(expected, actual);
  110.     }
  111.  
  112.  
  113.  
  114.     [TestMethod]
  115.     public void CatagoryIsContainedInCatagoriesUsingEmptyCatagory_ReturnsExpectedValue()
  116.     {
  117.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  118.  
  119.         String catagories = "Chinese Restaurants|Restaurants|";
  120.         String catagory = String.Empty;
  121.  
  122.         Boolean expected = false;
  123.         Boolean actual = repository.IsCatagoryInCatagories(catagories, catagory);
  124.  
  125.         Assert.AreEqual(expected, actual);
  126.     }
  127. }

The hardest test to get running green was the negative test – passing in a restaurant name that is not recognized:

  1. [TestMethod]
  2. public void IsUnknownAnItalianRestaurant_ReturnsFalse()
  3. {
  4.     RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  5.  
  6.     string restaurantName = "Some Unknown Restaurant";
  7.     String restaurantAddress = "Some Address";
  8.     String restaurantCatagory = "Italian";
  9.  
  10.     Boolean expected = false;
  11.     Boolean actual = repository.IsRestaurantInCatagory(restaurantName, restaurantAddress, restaurantCatagory);
  12.  
  13.     Assert.AreEqual(expected, actual);
  14. }

To code around the fact that a different set of JSON came back while the original code expects a specific structure, I finally resorted to a try…with:

  1. try
  2.     ypResult.SearchResult.SearchListings.SearchListing.[0].Categories
  3. with
  4.     | ex -> String.Empty

I feel dirty, but I don’t know how else to get around it.  In any event, I then coded up a module that pulled the list of restaurants from Azure and put them through the classifier.

  1. namespace ChickenSoftware.RestaurantClassifier
  2.  
  3. open FSharp.Data
  4. open System.Linq
  5. open System.Configuration
  6. open Microsoft.FSharp.Linq
  7. open Microsoft.FSharp.Data.TypeProviders
  8.  
  9. type internal SqlConnection = SqlEntityConnection<ConnectionStringName="azureData">
  10.  
  11. type public RestaurantBuilder () =
  12.     
  13.     let connectionString = ConfigurationManager.ConnectionStrings.["azureData"].ConnectionString;
  14.     
  15.     member public this.GetRestaurants () =
  16.         SqlConnection.GetDataContext(connectionString).Restaurants
  17.             |> Seq.map(fun x -> x.EstablishmentName, x.EstablishmentAddress + " " + x.EstablishmnetZipCode)
  18.             |> Seq.toArray
  19.             
  20.     member public this.GetChineseRestaurants () =
  21.         let catagoryRepository = new RestaurantCatagoryRepository()
  22.         let catagory = "Chinese"
  23.         this.GetRestaurants()
  24.                 |> Seq.filter(fun (name, address) -> catagoryRepository.IsRestaurantInCatagory(name, address,catagory))
  25.                 |> Seq.toList

This code is almost identical to the code I posted two weeks ago.  Sure enough, when I threw my integration tests at the functions, check out Fiddler:

image

I was getting responses.  I ran into a problem on the 50th request, though.

image

To get around this occasional timeout issue, I threw in a one-second delay between each request, which seemed to solve the problem.

  1. System.Threading.Thread.Sleep(new System.TimeSpan(0,0,1))
  2. let catagories = this.GetCatagories(restaurantName, restaurantAddress)
  3. if(String.IsNullOrEmpty(catagories)) then false
  4. else this.IsCatagoryInCatagories(catagories,restaurantCatagory)

However, this then introduced a new problem.  There are 4,000 or so restaurants, so that is over 66 minutes of running.  Not good.  Next week, I hope to add some parallelism to speed things up…


Trigrams and F#

Rob Seder wrote a great post on trigrams last week.  He then asked me how the same functionality would be implemented in F# – specifically, dropping the for..each.  Challenge accepted!

The first thing I did was hit Stack Overflow to see if there is a built-in function to parse a string into groups, and I had an answer within minutes for exactly what I was looking for (thanks MattNewport).

So to match Rob’s BuildTrigram function, I wrote this:

  1. type TrigramBuilder() =
  2.     member this.BuildTrigrams(inputString: string) =
  3.         inputString
  4.             |> Seq.windowed 3
  5.             |> Seq.map(fun a -> System.String a)
  6.             |> Seq.toArray

And I had a covering unit test already created:

  1. [TestMethod]
  2. public void GetTrigrams_ReturnsExpectedValue()
  3. {
  4.     var builder = new TrigramBuilder();
  5.     String inputString = "ABCDEFG";
  6.  
  7.     String[] expected = new String[] { "ABC", "BCD", "CDE", "DEF", "EFG" };
  8.     String[] actual = builder.BuildTrigrams(inputString);
  9.  
  10.     CollectionAssert.AreEqual(expected, actual);
  11. }

I then implemented a function that matches his double loops (I can’t tell the function name from the code snippet in the blog post):

  1. member this.GetMatchPercent(baseString: string, compareString: string) =
  2.     let trigrams = this.BuildTrigrams(compareString)
  3.     let matchCount = trigrams
  4.                         |> Seq.map(fun t -> match baseString.Contains(t) with
  5.                                                 | true -> 1
  6.                                                 | false -> 0)
  7.                         |> Seq.sum
  8.     let totalCount = trigrams.Length
  9.     float matchCount/float totalCount

And throwing in some covering unit tests:

  1. public void GetMatchPercentageOfExactMatch_ReturnsExpectedValue()
  2. {
  3.     var builder = new TrigramBuilder();
  4.     
  5.     String baseString = "ABCDEF";
  6.     String compareString = "ABCDEF";
  7.  
  8.     double expected = 1.0;
  9.     double actual = builder.GetMatchPercent(baseString, compareString);
  10.  
  11.     Assert.AreEqual(expected, actual);
  12. }
  13.  
  14. [TestMethod]
  15. public void GetMatchPercentageOf50PercentMatch_ReturnsExpectedValue()
  16. {
  17.     var builder = new TrigramBuilder();
  18.  
  19.     String baseString = "ABCD";
  20.     String compareString = "ABCDEF";
  21.  
  22.     double expected = 0.5;
  23.     double actual = builder.GetMatchPercent(baseString, compareString);
  24.  
  25.     Assert.AreEqual(expected, actual);
  26. }

Sure enough, green across the board:

image

 

Analysis of Health Inspection Data using F#

As part of the TRINUG F#/Analytics SIG, I made a public records request to Wake County for all of the restaurant inspections in 2013.  If you are not familiar, inspectors go out and give a score to each restaurant, which then has to display its score like this:

image

After some back and forth, I got the data as an Excel spreadsheet that looks like this:

image

I then loaded the spreadsheet into a SQL Server database and exposed it as some OData endpoints:

  1. // GET odata/Restaurant
  2. [Queryable]
  3. public IQueryable<Restaurant> GetRestaurant()
  4. {
  5.     return db.Restaurants;
  6. }
  7.  
  8. // GET odata/Restaurant(5)
  9. [Queryable]
  10. public SingleResult<Restaurant> GetRestaurant([FromODataUri] int key)
  11. {
  12.     return SingleResult.Create(db.Restaurants.Where(restaurant => restaurant.Id == key));
  13. }

I then dove into the data to see if there were any interesting conclusions to be found.  Following my pattern of doing analytics using F# and unit testing using C#, I created a project with the following code:

  1. namespace ChickenSoftware.RestraurantChicken.Analysis
  2.  
  3. open System.Linq
  4. open System.Configuration
  5. open Microsoft.FSharp.Linq
  6. open Microsoft.FSharp.Data.TypeProviders
  7.  
  8. type internal SqlConnection = SqlEntityConnection<ConnectionStringName="azureData">
  9.  
  10. type public RestaurantAnalysis () =
  11.     
  12.     let connectionString = ConfigurationManager.ConnectionStrings.["azureData"].ConnectionString;

Note that I am using the connection string in two places – the first for the type provider to do its magic at design time, and the second for actually accessing the data at run time.  With that set up, the first question I had was “is there seasonality in inspection scores like there is in traffic tickets?”  To that end, I created the following function:

  1. member public x.GetAverageScoreByMonth () =
  2.     SqlConnection.GetDataContext(connectionString).Restaurants
  3.         |> Seq.map(fun x -> x.InspectionDate.Value.Month, x.InspectionScore.Value)
  4.         |> Seq.groupBy(fun x -> fst x)
  5.         |> Seq.map(fun (x,y) -> (x,y |> Seq.averageBy snd))
  6.         |> Seq.map(fun (x,y) -> x, System.Math.Round(y,2))
  7.         |> Seq.toArray
  8.         |> Array.sort
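On a toy input, the grouping and averaging steps behave like this (my own worked sketch, separate from the function above):

```fsharp
// A toy version of the month/score pipeline: group (month, score)
// pairs by month, then average the scores within each month.
let toyScores = [ (1, 96.0); (1, 98.0); (2, 94.0) ]
let averages =
    toyScores
    |> Seq.groupBy fst
    |> Seq.map (fun (month, pairs) -> month, pairs |> Seq.averageBy snd)
    |> Seq.toList
// averages = [ (1, 97.0); (2, 94.0) ]
```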

This is pretty vanilla F# code, with the tricky part being the average by month (lines 4 and 5 here).  The code groups the 4,000 or so tuples created on line 3 by month – the fst of each group being the groupBy value (the month) and the snd being the sequence of (month, score) tuples that fall into it.  Averaging the scores in that inner sequence gives an average for each month.  I created a unit (really, integration) test like so:

  1. [TestMethod]
  2. public void GetAverageScoreByMonth_ReturnsTwelveItems()
  3. {
  4.     var analysis = new RestaurantAnalysis();
  5.     var scores = analysis.GetAverageScoreByMonth();
  6.     Int32 expected = 12;
  7.     Int32 actual = scores.Length;
  8.     Assert.AreEqual(expected, actual);
  9. }

And the result ran green. 

image

Putting a breakpoint on the Assert and a watch on scores, you can see the values:

image

A couple of things stand out:

1) The overall average is around 96 and change.

2) There does not seem to be any significant variance among the months.

Since I am also trying to teach myself D3, I then added an MVC5 project to my solution and added an analysis controller that calls the function in the analysis module and serves the results as JSON:

  1. public JsonResult AverageScoreByMonth()
  2. {
  3.     var analysis = new RestaurantAnalysis();
  4.     var scores = analysis.GetAverageScoreByMonth();
  5.     return Json(scores,JsonRequestBehavior.AllowGet);
  6. }

I then made a page with a simple D3 chart that calls this controller:

  1. @{
  2.     Layout = "~/Views/Shared/_Layout.cshtml";
  3. }
  4.  
  5. <svg class="chart"></svg>
  6.  
  7. <style>
  8.     .bar {
  9.         fill: steelblue;
  10.     }
  11.  
  12.         .bar:hover {
  13.             fill: brown;
  14.         }
  15.  
  16.     .axis {
  17.         font: 10px sans-serif;
  18.     }
  19.  
  20.         .axis path,
  21.         .axis line {
  22.             fill: none;
  23.             stroke: #000;
  24.             shape-rendering: crispEdges;
  25.         }
  26.  
  27.     .x.axis path {
  28.         display: none;
  29.     }
  30. </style>
  31.  
  32.  
  33.  
  34. <script>
  35.  
  36.     var margin = { top: 20, right: 20, bottom: 30, left: 40 },
  37.         width = 960 - margin.left - margin.right,
  38.         height = 500 - margin.top - margin.bottom;
  39.  
  40.     var x = d3.scale.ordinal()
  41.         .rangeRoundBands([0, width], .1);
  42.  
  43.     var y = d3.scale.linear()
  44.         .range([height, 0]);
  45.  
  46.     var xAxis = d3.svg.axis()
  47.         .scale(x)
  48.         .orient("bottom");
  49.  
  50.     var yAxis = d3.svg.axis()
  51.         .scale(y)
  52.         .orient("left")
  53.         .ticks(10, "%");
  54.  
  55.     var svg = d3.select("body").append("svg")
  56.         .attr("width", width + margin.left + margin.right)
  57.         .attr("height", height + margin.top + margin.bottom)
  58.       .append("g")
  59.         .attr("transform", "translate(" + margin.left + "," + margin.top + ")");
  60.  
  61.  
  62.  
  63.     $.ajax({
  64.         url: "http://localhost:3057/Analysis/AverageScoreByMonth/",
  65.         dataType: "json",
  66.         success: function (data) {
  67.             x.domain(data.map(function (d) { return d.Item1; }));
  68.             y.domain([0, d3.max(data, function (d) { return d.Item2; })]);
  69.  
  70.             svg.append("g")
  71.                 .attr("class", "x axis")
  72.                 .attr("transform", "translate(0," + height + ")")
  73.                 .call(xAxis);
  74.  
  75.             svg.append("g")
  76.                 .attr("class", "y axis")
  77.                 .call(yAxis)
  78.               .append("text")
  79.                 .attr("transform", "rotate(-90)")
  80.                 .attr("y", 6)
  81.                 .attr("dy", ".71em")
  82.                 .style("text-anchor", "end")
  83.                 .text("Frequency");
  84.  
  85.             svg.selectAll(".bar")
  86.                 .data(data)
  87.               .enter().append("rect")
  88.                 .attr("class", "bar")
  89.                 .attr("x", function (d) { return x(d.Item1); })
  90.                 .attr("width", x.rangeBand())
  91.                 .attr("y", function (d) { return y(d.Item2); })
  92.                 .attr("height", function (d) { return height - y(d.Item2); });
  93.  
  94.         },
  95.         error: function (e) {
  96.             alert("error");
  97.         }
  98.     });
  99.  
  100.     function type(d) {
  101.         d.Item2 = +d.Item2;
  102.         return d;
  103.     }
  104. </script>

And when I run it, I get a run-of-the-mill bar chart. (I did have to adjust the F# to shift the decimal two positions to the left to match the scale of the chart’s template.  For me, it is easier to alter the F# than the JavaScript.)

image

Following this pattern, I did some other seasonal analysis, like average by DayOfMonth:

image

and by DayOfWeek:

image

So there does not seem to be any seasonality in inspection scores.

I then did an average by inspector:

image

And there looks to be some variance, but it is getting lost in the scale of the chart.  The problem is that the range of the scores is not 0 to 100.

Here is a function that counts the number of scores (rounded to zero decimal places):

  1. member public x.CountOfRoundedScores () =
  2.     SqlConnection.GetDataContext(connectionString).Restaurants
  3.         |> Seq.map(fun x -> System.Math.Round(x.InspectionScore.Value,0), x.InspectionID)
  4.         |> Seq.groupBy(fun x -> fst x)
  5.         |> Seq.map(fun (x,y) -> (x,y |> Seq.countBy snd))
  6.         |> Seq.map(fun (x,y) -> (x,y |> Seq.sumBy snd))
  7.         |> Seq.toArray

That graphically looks like:

image

So back to inspectors: I needed to adjust the scale from 0 to 100 down to 80 to 100.  I also needed to remove the null inspection IDs, the records for the ‘test facility’, and the 6 records that were below 80.

  1. member public x.AverageScoreByInspector () =
  2.     SqlConnection.GetDataContext(connectionString).Restaurants
  3.         |> Seq.filter(fun x -> x.EstablishmentName <> "Test Facility")
  4.         |> Seq.filter(fun x -> x.InspectionScore.Value > 80.)
  5.         |> Seq.filter(fun x -> x.InspectionID <> null)
  6.         |> Seq.map(fun x -> x.InspectorID, x.InspectionScore.Value)
  7.         |> Seq.groupBy(fun x -> fst x)
  8.         |> Seq.map(fun (x,y) -> (x,y |> Seq.averageBy snd))
  9.         |> Seq.map(fun (x,y) -> x, y/100.)
  10.         |> Seq.map(fun (x,y) -> x, System.Math.Round(y,4))
  11.         |> Seq.toArray
  12.         |> Array.sort

I then adjusted the scale of the inspector graph to have a domain from 80 to 100 (versus 0 to 100), along with the scale of the y axis.  This was a good article explaining scales and domains in D3.

  1. var yAxis = d3.svg.axis()
  2.     .scale(y)
  3.     .orient("left")
  4.     .ticks(10);

  1. $.ajax({
  2.     url: "http://localhost:3057/Analysis/AverageScoreByInspector/",
  3.     dataType: "json",
  4.     success: function (data) {
  5.         x.domain(data.map(function (d) { return d.Item1; }));
  6.         y.domain([80, d3.max(data, function (d) { return d.Item2; })]);
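The domain change works because a D3 linear scale is just a straight-line map from the data domain onto the pixel range. Here is a hand-rolled JavaScript sketch of the same arithmetic (illustrative; d3.scale.linear() does this for you):

```javascript
// A linear scale maps a value in [d0, d1] proportionally into [r0, r1].
// With domain [80, 100] and an inverted range [height, 0], a score of 80
// lands at the bottom of the chart and 100 at the top.
function linearScale(domain, range) {
  var d0 = domain[0], d1 = domain[1], r0 = range[0], r1 = range[1];
  return function (v) {
    return r0 + ((v - d0) / (d1 - d0)) * (r1 - r0);
  };
}

var height = 450; // illustrative pixel height
var y = linearScale([80, 100], [height, 0]);
// y(80) === 450 (chart bottom), y(90) === 225, y(100) === 0 (chart top)
```

Shrinking the domain to 80–100 is what spreads the inspector-to-inspector variance across the full height of the chart.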

and now there is a pretty good graph showing the variance among inspectors:

image

So the interesting thing is that #1168 is 2 points below the average, which on a domain of 10 is pretty significant.  Interestingly, 1168 is also the inspector who has all of the “Test Facility” records, so they are probably the trainer and/or lead inspector.  With this analysis in my back pocket, I ran a function that did the inspection score by establishment type:

image

This is kinda interesting (especially that pushcarts got the highest scores), but I wanted to see if there was any truth to the common perception that Chinese restaurants are less sanitary than other kinds of restaurants.  To that end, I created a rudimentary classifier that searches the name of the establishment to see if it has a name typically associated with fast-food Chinese:

  1. member public x.IsEstablishmentAChineseRestraurant (establishmentName:string) =
  2.     let upperCaseEstablishmentName = establishmentName.ToUpper()
  3.     let numberOfMatchedWords = upperCaseEstablishmentName.Split(' ')
  4.                                 |> Seq.map(fun x -> match x with
  5.                                                         | "ASIA" -> 1
  6.                                                         | "ASIAN" -> 1
  7.                                                         | "CHINA" -> 1
  8.                                                         | "CHINESE" -> 1
  9.                                                         | "PANDA" -> 1
  10.                                                         | "PEKING" -> 1
  11.                                                         | "WOK" -> 1
  12.                                                         | _ -> 0)
  13.                                 |> Seq.sum
  14.     match numberOfMatchedWords with
  15.         | 0 -> false
  16.         | _ -> true

I then created a function that returned the average and ran my unit tests.

  1. [TestMethod]
  2. public void IsEstablishmentAChineseRestraurantUsingWOK_ReturnsTrue()
  3. {
  4.     var analysis = new RestaurantAnalysis();
  5.     String establishmentName = "JAMIE'S WOK";
  6.  
  7.     var expected = true;
  8.     var actual = analysis.IsEstablishmentAChineseRestraurant(establishmentName);
  9.     Assert.AreEqual(expected, actual);
  10. }
  11.  
  12. [TestMethod]
  13. public void IsEstablishmentAChineseRestraurantUsingWok_ReturnsTrue()
  14. {
  15.     var analysis = new RestaurantAnalysis();
  16.     String establishmentName = "Jamie's Wok";
  17.  
  18.     var expected = true;
  19.     var actual = analysis.IsEstablishmentAChineseRestraurant(establishmentName);
  20.     Assert.AreEqual(expected, actual);
  21. }
  22.  
  23. [TestMethod]
  24. public void AverageScoreForChineseRestaurants_ReturnsExpected()
  25. {
  26.     var analysis = new RestaurantAnalysis();
  27.     var actual = analysis.AverageScoreForChineseRestaurants();
  28.     Assert.IsNotNull(actual);
  29. }

When I put a break on the value of the average, it was apparent that Chinese restaurants scored significantly lower than the overall average of 96:

image

So then I applied one more segmentation: Chinese versus non-Chinese scores by inspector:

  1. member public x.AverageScoresOfChineseAndNonChineseByInspector () =
  2.     let dataSet = SqlConnection.GetDataContext(connectionString).Restaurants
  3.                     |> Seq.map(fun x -> x.EstablishmentName, x.InspectorID,x.InspectionScore.Value)
  4.     let chineseRestraurants = dataSet
  5.                                 |> Seq.filter(fun (a,b,c) -> x.IsEstablishmentAChineseRestraurant(a))
  6.                                 |> Seq.map(fun (a,b,c) -> b,c)
  7.                                 |> Seq.groupBy(fun x -> fst x)
  8.                                 |> Seq.map(fun (x,y) -> (x,y |> Seq.averageBy snd))
  9.                                 |> Seq.map(fun (x,y) -> x, System.Math.Round(y,2))
  10.                                 |> Seq.toArray
  11.                                 |> Array.sort
  12.     let nonChineseRestraurants = dataSet
  13.                                 |> Seq.filter(fun (a,b,c) -> not(x.IsEstablishmentAChineseRestraurant(a)))
  14.                                 |> Seq.map(fun (a,b,c) -> b,c)
  15.                                 |> Seq.groupBy(fun x -> fst x)
  16.                                 |> Seq.map(fun (x,y) -> (x,y |> Seq.averageBy snd))
  17.                                 |> Seq.map(fun (x,y) -> x, System.Math.Round(y,2))
  18.                                 |> Seq.toArray
  19.                                 |> Array.sort
  20.     Seq.zip chineseRestraurants nonChineseRestraurants
  21.            |> Seq.map(fun ((a,b),(c,d)) -> a,b,d)
  22.            |> Seq.toList

And in graphics using a double-bar chart:

image

So this is kinda interesting.  The lead inspector (1168), who grades everyone lower, actually gives Chinese restaurants higher marks.  Everyone else pretty much grades Chinese restaurants lower, except for one inspector.  Also, 1708 must really not like Chinese restaurants, or else their inspection list has a series of really bad Chinese restaurants.

Note that this may not be statistically significant (I didn’t control for sample size, etc.), but further analysis might be warranted, no?  If you are interested, here is the endpoint: http://restaurantchicken.cloudapp.net/odata/Restaurant

Finally, when I presented this analysis to TRINUG last week, lots of people became interested in F# and analytics (ok, maybe 3).  You can see the comments here.  Also, I now have an appointment with the head of the health department and the CIO of Wake County later this week, so let’s see what they say…

 

 

Unit Testing F# Projects

I am a big believer in unit tests, and as I write more and more code, I suffer from what psychologists call “confirmation bias”, whereby I keep finding more and more reasons to believe I am right.  I am at the point where I don’t believe in developer documentation (sorry Sandcastle), Reflector (sorry Redgate), code comments, or architectural diagrams.  I believe in the code, the whole code, and nothing but the code.  Or as Rasheed Wallace might say if he were a coder instead of a professional basketball player: “Code Don’t Lie!”

And the only code that tells you what a module is doing is its unit tests.  If you want to see how the module behaves, look at the green unit tests.  If you want to see how the module is supposed to behave but does not, look at the red unit tests.  If you want to see how the module might or might not behave because the code is out of control, look for the non-existent unit tests.

So when I started writing code in F#, the unit tests went along for the ride.  F# folks will tell you to use the REPL to get your code working and/or to use a unit test project written in F#.  I don’t do either, because:

1) The REPL is designed for quick prototyping, not for serving as a durable, canonical example of module behavior.  A full suite of unit tests gives you code coverage, tests that run with the build, and a fail-proof way of documenting the code’s behavior.

2) Most of the other developers I work with are CSharpers.  Having the unit tests in C# allows them to understand the behavior in a language with which they are familiar.  Since the tests need to communicate the working code’s intent, having that in a language they understand is critical.  They don’t have to understand F# to use an F# module.  Also, porting the C# tests from MSTest to NUnit is a snap.

So when you add a C# unit test project to a solution that has an F# module you want to test, there are a couple of things you need to do.

1) Add a reference from the Unit Test Project to the F# Project (Right Click, Add Reference)

image

2) Add a reference to F#

image

3) If you are using type providers (and who doesn’t?) and you have the connection string in the .config file of the working code project, add a .config file to the unit test project and copy over the connection string.

image

Note that your code has to reflect that the connection string is being used in 2 different ways by the type provider, as explained in this post.  Your F# code needs to look like this:

  1. type internal SqlConnection = SqlEntityConnection<ConnectionStringName="azureData">
  2.  
  3.  
  4. type public RestaurantAnalysis () =
  5.     
  6.     let connectionString = ConfigurationManager.ConnectionStrings.["azureData"].ConnectionString;
  7.     member public x.GetScoresByMonth () =
  8.         SqlConnection.GetDataContext(connectionString).Restaurants
  9.             |> Seq.map(fun x -> x.InspectionDate.Value.Month, x.InspectionScore.Value)
  10.             |> Seq.groupBy(fun x -> fst x)
  11.             |> Seq.toList

4) Finally, you have to rebuild the F# project each time you want the unit tests to pick up changes.  That is different from a C# unit test project referencing a C# working code project.

Finally, and not related to unit testing but too short for its own blog post: if you are using type providers (and who doesn’t) and you need to expose your classes publicly, you can’t use the SqlEntity provider; you need to use the SqlData provider.  The catch is that SqlData does not work with Azure SQL storage as far as I can tell.  In my case, I used SqlEntity and exposed tuples and custom types publicly.  Not the best, but still better than using C# and Entity Framework…

Basic Insert Operation Using F#

So the more I use F#, the more I come to understand the benefits and limitations of the language.  Since I spend the majority of my day job in C# and JavaScript, comparing those two languages with F# comes naturally.  One of the tenets of F# is ‘less noise, more signal’.  After looking at some projects, I am coming to the conclusion that Entity Framework, LinqToSql, or any other ORM is just noise.  It is expensive noise at that: if you have worked on a production app using EF and tried to do anything outside of the examples on MSDN, you know what I mean.

So can type providers replace the overhead, code bloat, and additional costs of Entity Framework?  I decided to do a small test to see.  I needed to load into SQL Server 27,481 records of crash data that I got from the North Carolina DOT.  The records came to me in an Excel spreadsheet, which I pumped into MS Access.  Then, instead of using SQL Server’s SSIS/bulk data load functions, I decided to create an app that pulls the data from the Access database and loads it into the SQL Server database via a type provider.

My first step was to look for an MS Access type provider.  No luck.  I then hit up Stack Overflow and found this and this article for working with Access.  I coded up a solution to get the data into a DataReader like so:

  1. static member GetCrashData =
  2.     let connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=E:\Documents\Road Alert\WakeCountyCrashes.accdb; Persist Security Info=False;"
  3.     use connection = new OleDbConnection(connectionString)
  4.     let commandText = "Select * from Data"
  5.     use command = new OleDbCommand(commandText,connection)
  6.     connection.Open()
  7.     use reader = command.ExecuteReader()

My first attempt to get the data from the reader used a tuple, like so:

  1. [while reader.Read() do
  2.      yield reader.GetInt32(0),
  3.      reader.GetFieldValue(1).ToString(),
  4.      reader.GetFieldValue(2).ToString(),
  5.      reader.GetDouble(3),
  6.      reader.GetFieldValue(4).ToString(),
  7.      reader.GetFieldValue(5).ToString(),
  8.      reader.GetFieldValue(6).ToString(),
  9.      reader.GetDateTime(7),
  10.      reader.GetDateTime(8),
  11.      reader.GetFieldValue(9).ToString(),
  12.      reader.GetFieldValue(10).ToString()
  13.  ]

Sure enough, it works like a champ from my C# UI (once I ran AccessDatabaseEngine.exe on my machine – sigh)

  1. private static void GetCrashData()
  2. {
  3.     var results = CrashDataLoader.GetCrashData;
  4.     Console.WriteLine(results.Count().ToString());
  5. }

Gives

image

The next thing I did was to create a Type Provider

image

And then create the method to insert the data into the database:

  1. static member LoadCrashData  =
  2.     let targetDatabase = targetSchema.GetDataContext()
  3.     let rows = CrashDataLoader.GetCrashData
  4.     targetDatabase.TrafficCrashes.InsertAllOnSubmit(rows)
  5.     targetDatabase.DataContext.SubmitChanges()
  6.     true

The problem I ran into was that GetCrashData was returning a tuple while LoadCrashData was expecting a typed CrashData element.  I searched for a bit and then gave up trying to figure out how to map the two without explicitly assigning each field.  So then I did it the old-fashioned way, like so:

  1. static member TrafficCrashFromReader(reader: OleDbDataReader) =
  2.     let trafficCrash = new targetSchema.ServiceTypes.TrafficCrashes()
  3.     trafficCrash.NCDmvCrashId <- System.Nullable<float> (float (reader.GetFieldValue(0).ToString()))
  4.     trafficCrash.Municipality <- reader.GetFieldValue(1).ToString()  
  5.     trafficCrash.OnRoad <- reader.GetFieldValue(2).ToString()  
  6.     trafficCrash.Miles <- System.Nullable<double> (double (reader.GetFieldValue(3).ToString()))
  7.     trafficCrash.Direction <- reader.GetFieldValue(4).ToString()
  8.     trafficCrash.FromRoad <- reader.GetFieldValue(5).ToString()
  9.     trafficCrash.TowardRoad <- reader.GetFieldValue(6).ToString()
  10.     trafficCrash.DateOfCrash <- System.Nullable<DateTime> (reader.GetDateTime(7))
  11.     trafficCrash.TimeOfCrash <- System.Nullable<DateTime> (reader.GetDateTime(8))
  12.     trafficCrash.CrashType <- reader.GetFieldValue(9).ToString()
  13.     trafficCrash.CrashSeverity <- reader.GetFieldValue(10).ToString()
  14.     trafficCrash

The fact that I am using the <- symbol is a code smell, but I am not sure how to get around it.

In any event, once I ran it in my Console app:

  1. private static void LoadCrashData()
  2. {
  3.     Stopwatch stopWatch = new Stopwatch();
  4.     stopWatch.Start();
  5.     CrashDataLoader.LoadCrashData();
  6.     stopWatch.Stop();
  7.     Console.WriteLine("The load took " + stopWatch.Elapsed.TotalSeconds + " seconds.");
  8. }

I got nothing after 30 minutes!  Yikes.

I then went back and wrote a function to insert 1 row at a time:

  1. static member LoadCrashDataRow dataRow =
  2.     let targetDatabase = targetSchema.GetDataContext()
  3.     targetDatabase.TrafficCrashes.InsertOnSubmit(dataRow)
  4.     targetDatabase.DataContext.SubmitChanges()
  5.     true

And the consuming app:

  1. private static void LoadCrashData()
  2. {
  3.  
  4.     var crashRows = CrashDataLoader.GetCrashData;
  5.     Stopwatch stopWatch = new Stopwatch();
  6.     stopWatch.Start();
  7.     foreach (var crashRow in crashRows)
  8.     {
  9.         CrashDataLoader.LoadCrashDataRow(crashRow);
  10.         Console.WriteLine(crashRow.NCDmvCrashId + " loaded.");
  11.     }
  12.     stopWatch.Stop();
  13.     Console.WriteLine("The load took " + stopWatch.Elapsed.TotalSeconds + " seconds.");
  14. }

Sure enough, it works like a champ.

image

So it is slow, though I am not sure EF is any faster.  But since you don’t have to deal with the .edmx files, the .tt files, and the whatever-else-we-throw-in files, I think further research is definitely warranted.  Also, there are some other things I think F# type providers need to have:

1) Ability to handle proxies

2) Making plural table names singular (the table name is Crashes; the type should be Crash)

3) An MS Access TP would be great

4) An Azure Sql Database TP would be doubly great 
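On the singular/plural point, the mapping is mostly mechanical.  Here is a naive, hypothetical sketch of the kind of rule a type provider could apply (real pluralization services, like the one Entity Framework uses, handle many more cases):

```javascript
// Naive English singularization of table names: check the most specific
// plural endings first. This only covers the common regular cases.
function singularize(tableName) {
  if (/ies$/.test(tableName)) {
    return tableName.replace(/ies$/, "y");        // Cities -> City
  }
  if (/(s|x|z|ch|sh)es$/.test(tableName)) {
    return tableName.replace(/es$/, "");          // Crashes -> Crash
  }
  if (/s$/.test(tableName)) {
    return tableName.replace(/s$/, "");           // Restaurants -> Restaurant
  }
  return tableName;                               // already singular
}
// singularize("Crashes") === "Crash"
```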

Traffic Stop Visualization Using D3

One of the comments I got from the TRINUG Data SIG was that the data and analysis were exciting but the results were, well, boring.  So I went back to the traffic stop data and thought about how I could sexy it up.  Since lots of people have been using D3 to present data, I thought it would be a good place to start.

My 1st step was to look at their samples page, where they have a simple bar chart that seemed like a good introduction to the library.  I created an endpoint on my webapi for the summary data like so:

  1. [HttpGet]
  2. [Route("api/TrafficStopSearch/StopsByMonth/")]
  3. public dynamic StopsByMonth()
  4. {
  5.     return ChickenSoftware.RoadAlert.Analysis.AnalysisEngine.TrafficStopsByMonth;
  6.  
  7. }

I then spun up an empty asp.net website and created an index page based on their sample.  I then added an ajax call to the controller and replaced the reference to the data.tsv file:

  1. $.ajax({
  2.    url: "http://localhost:17680/api/TrafficStopSearch/StopsByMonth/",
  3.    dataType: "json",
  4.    success: function (data) {
  5.        x.domain(data.map(function (d) { return d.m_Item1; }));
  6.        y.domain([0, d3.max(data, function (d) { return d.m_Item6; })]);
  7.  
  8.        svg.append("g")
  9.            .attr("class", "x axis")
  10.            .attr("transform", "translate(0," + height + ")")
  11.            .call(xAxis);
  12.  
  13.        svg.append("g")
  14.            .attr("class", "y axis")
  15.            .call(yAxis)
  16.          .append("text")
  17.            .attr("transform", "rotate(-90)")
  18.            .attr("y", 6)
  19.            .attr("dy", ".71em")
  20.            .style("text-anchor", "end")
  21.            .text("Frequency");
  22.  
  23.        svg.selectAll(".bar")
  24.            .data(data)
  25.          .enter().append("rect")
  26.            .attr("class", "bar")
  27.            .attr("x", function (d) { return x(d.m_Item1); })
  28.            .attr("width", x.rangeBand())
  29.            .attr("y", function (d) { return y(d.m_Item6); })
  30.            .attr("height", function (d) { return height - y(d.m_Item6); });
  31.        
  32.    },
  33.    error: function(e){
  34.        alert("error");
  35.    }
  36. });

One thing to note is that the tuple that was created in F# and then passed through via the C# controller had its name changed.  Specifically, Tuple.Item1 became Tuple.m_Item1.  I think that passing out tuple.anything is a horrible idea, so I created a POCO that actually lets the consumer know what each field means:

  1. public class OutputValue
  2. {
  3.     public Int32 Month { get; set; }
  4.     public Int32 ExpectedStops { get; set; }
  5.     public Int32 ActualStops { get; set; }
  6.     public Double DifferenceBetweenExpectedAndActual { get; set; }
  7.     public Double PercentDifferenceBetweenExpectedAndActual { get; set; }
  8.     public Double Frequency { get; set; }
  9. }

and then I adjusted the controller like so:

  1.  
  2. [HttpGet]
  3. [Route("api/TrafficStopSearch/StopsByMonth/")]
  4. public dynamic StopsByMonth()
  5. {
  6.     var outputs = new List<OutputValue>();
  7.     var resultSet= ChickenSoftware.RoadAlert.Analysis.AnalysisEngine.TrafficStopsByMonth;
  8.     foreach (var tuple in resultSet)
  9.     {
  10.         var outputValue = new OutputValue()
  11.         {
  12.             Month = tuple.Item1,
  13.             ExpectedStops = tuple.Item2,
  14.             ActualStops = tuple.Item3,
  15.             DifferenceBetweenExpectedAndActual = tuple.Item4,
  16.             PercentDifferenceBetweenExpectedAndActual = tuple.Item5,
  17.             Frequency = tuple.Item6
  18.         };
  19.         outputs.Add(outputValue);
  20.     }
  21.  
  22.     return outputs;
  23. }

So I adjusted the javascript and voila: a bar chart:

image
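The JavaScript adjustment itself amounts to swapping the mangled tuple accessors for the POCO’s property names.  A sketch, with a made-up sample row shaped like the OutputValue POCO:

```javascript
// Hypothetical sketch: the D3 accessor functions now read the POCO's
// named properties instead of the mangled tuple fields.
function monthAccessor(d) { return d.Month; }          // was: d.m_Item1
function frequencyAccessor(d) { return d.Frequency; }  // was: d.m_Item6

// Invented sample row for illustration
var sampleRow = {
  Month: 3, ExpectedStops: 120, ActualStops: 131,
  DifferenceBetweenExpectedAndActual: 11,
  PercentDifferenceBetweenExpectedAndActual: 0.09,
  Frequency: 0.12
};
// monthAccessor(sampleRow) === 3, frequencyAccessor(sampleRow) === 0.12
```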

Up next – some real charts…

 

Screen Scraping The Department Of Health

As part of TRINUG’s Analytics SIG, some people were interested in the health inspections found here.  I created a Public Records Request (PRR) using the intake form on their website:

image

And in the comments, I said this:

I would like to make a Public Records Request on this data.  I would like the following fields for all inspections 1/1/2013 to 12/31/2013: 

InspectionId
InspectorId
EstablishmentId
EstablishmentName
EstablishmentAddress
EstablishmentCity
EstablishmentZip
EstablishmentLat
EstablishmentLong
EstablishmentTypeId
EstablishmentTypeDesc
InspectionScore
NumberOfNonCriticalViolations
NumberOfCriticalViolations
InspectionDate

 

After a week, I had not heard back (their response is supposed to come within 48 hours), so I emailed the director:

image

Not wanting to wait any longer for the data, I decided to do some screen scraping.  To that end, I ran an on-line report and got something like this:

image

with the pages of the report down here:

image

I then went into the page source and checked out the pagination.  Fortunately, it was uri-based, so there are 144 different uris like this one:

image

I pulled down all of the uris and put them into Excel.  I then trimmed off the leading <a and the text after the word “class”:

image

Fortunately, there was no restaurant with the word “class” in its name.  I now had 144 uris ready to go, so I saved them into a .csv.
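The same trim could also be scripted instead of done in Excel.  A sketch, using an invented anchor line (the real markup may differ); it relies on the same observation that “class” appears only in the markup, never in a restaurant name:

```javascript
// Keep everything between the leading '<a ' and the first occurrence of
// 'class'. No error handling: this is a sketch, not production scraping code.
function trimAnchor(line) {
  var start = line.indexOf("<a ") + 3;   // skip past '<a '
  var end = line.indexOf("class");       // first 'class' is the attribute
  return line.substring(start, end).trim();
}

// Invented anchor markup for illustration
var raw = '<a href="estabs_insp.cfm?page=7" class="pagelink">7</a>';
// trimAnchor(raw) === 'href="estabs_insp.cfm?page=7"'
```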

My next step was to pull these uris into a database.  I have learned the hard way that screen scraping is fraught with malformed data and unstable connections.  Therefore, I will make a request, pull down a page, and store it while the getting is good.  I will then parse each page separately.  Since the data is historical, I am less concerned about it changing after I pull it local.

When I spun up an instance of SQL Server and tried to import the data, I kept getting things like this:

image

 

So it is pretty obvious that SQL Server doesn’t make it easy to import text like uris (and I can imagine HTML).  I decided to spin up an instance of MongoDb instead.  Also, because the F# MongoDb driver is not in NuGet, I decided to go with C#.

image

I then fired up a C# project and read all of the uris into a List:

  1. static List<String> GetUrisFromFileSystem()
  2. {
  3.     var path = @"C:\HealthDepartment\Inspections.csv";
  4.     var contents = File.ReadAllText(path);
  5.     var splitContents = contents.Split('\n');
  6.     var contentList = splitContents.ToList<String>();
  7.     contentList.RemoveAt(0);
  8.     return contentList;
  9. }

I then wrote the list into MongoDb.

  1. static void LoadDataIntoMongo(List<String> uris)
  2. {
  3.     var connectionString = "mongodb://localhost";
  4.     var client = new MongoClient(connectionString);
  5.     var server = client.GetServer();
  6.     var database = server.GetDatabase("HealthDepartment");
  7.     var collection = database.GetCollection<UriEntity>("UriEntities");
  8.     foreach (String uri in uris)
  9.     {
  10.         var entity = new UriEntity { Uri = uri };
  11.         collection.Insert(entity);
  12.         var id = entity.Id;
  13.         Console.WriteLine(id);
  14.     }
  15.  
  16.     
  17. }

 

The one gotcha is that I made my UriEntity class have a string as the Id.  This is not idiomatic Mongo, and I got this:

image

The Id needs to be of type ObjectId.  Once I made that switch, I got this:

image

The fact that MongoDb makes things so much easier than SqlServer is really impressive.

With the uris in Mongo, I then wanted to make a request to the individual pages.  I created a method that gets the contents of a page:

  1. static String GetHtmlForAUri(String uri)
  2. {
  3.     var fullyQualifiedUri = "http://wake.digitalhealthdepartment.com/" + uri;
  4.     var request = WebRequest.Create(fullyQualifiedUri);
  5.     var response = request.GetResponse();
  6.     using(var stream = response.GetResponseStream())
  7.     {
  8.         using(var reader = new StreamReader(stream))
  9.         {
  10.             return reader.ReadToEnd();
  11.         }
  12.     }
  13.  
  14. }

I then created a class to hold the contents associated with each uri:

  1. public class PageContentEntity
  2. {
  3.     public ObjectId Id { get; set; }
  4.     public ObjectId UriId { get; set; }
  5.     public String PageContent { get; set; }
  6. }

And created a method to persist the contents:

  1. static void LoadPageContentIntoMongo(PageContentEntity entity)
  2. {
  3.     var connectionString = "mongodb://localhost";
  4.     var client = new MongoClient(connectionString);
  5.     var server = client.GetServer();
  6.     var database = server.GetDatabase("HealthDepartment");
  7.     var collection = database.GetCollection<PageContentEntity>("PageContentEntities");
  8.     collection.Insert(entity);
  9.     Console.WriteLine(entity.Id);
  10. }

And then a method to put everything together

  1. static void LoadAllPageContentIntoMongo()
  2. {
  3.     var connectionString = "mongodb://localhost";
  4.     var client = new MongoClient(connectionString);
  5.     var server = client.GetServer();
  6.     var database = server.GetDatabase("HealthDepartment");
  7.     var collection = database.GetCollection<UriEntity>("UriEntities");
  8.     foreach(var uriEntity in collection.FindAllAs<UriEntity>())
  9.     {
  10.         String pageContent = GetHtmlForAUri(uriEntity.Uri);
  11.         var pageEntity = new PageContentEntity()
  12.         {
  13.             UriId = uriEntity.Id,
  14.             PageContent =pageContent
  15.         };
  16.         LoadPageContentIntoMongo(pageEntity);
  17.     }
  18. }

So sure enough, I now have all of the pages local.  Doing something with them is the next trick…

Note that as soon as I finished up this piece, I got a note from the director of the department saying that they are looking at my request and will get back to me soon.

Traffic Stop Disposition: Classification Using F# and KNN

I have already looked at the summary statistics of the traffic stop data I received from the town here.  My next step was to try a machine learning exercise with the data.  One of the more interesting questions I wanted to answer is what factors into whether a person gets a warning or a ticket (called the disposition).  Of all of the factors that may be involved, the dataset that I have is fairly limited:

image_thumb1

Using dispositionId as the result variable, there are StopDateTime and Location (Latitude/Longitude).  Fortunately, the DateTime can be decomposed into several input variables.  For this exercise, I wanted to use the following:

  • TimeOfDay
  • DayOfWeek
  • DayOfMonth
  • MonthOfYear
  • Location (Latitude:Longitude)

And the resulting variable is the disposition.  To make the analysis easier, I limited the analysis set to records where finalDisposition was either “verbal warning” or “citation”.  I decided to do K-Nearest Neighbors because it is regarded as an easy machine learning algorithm to learn, and the question does seem to be a classification problem.
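The feature decomposition described above can be sketched in a few lines.  The field layout here is illustrative, mirroring the F# pipeline that follows rather than reproducing it:

```javascript
// Decompose a stop's timestamp and location into the KNN input features.
// Hypothetical sketch: field names and coordinates are invented.
function toFeatures(stopDateTime, latitude, longitude) {
  return {
    timeOfDay: stopDateTime.getHours(),
    dayOfWeek: stopDateTime.getDay(),          // 0 = Sunday .. 6 = Saturday
    dayOfMonth: stopDateTime.getDate(),
    monthOfYear: stopDateTime.getMonth() + 1,  // JS months are 0-based
    // round lat/long to 3 places and join into a "lat:long" bucket
    location: latitude.toFixed(3) + ":" + longitude.toFixed(3)
  };
}

// Month argument 5 = June (0-based); coordinates are invented
var features = toFeatures(new Date(2012, 5, 15, 14, 30), 35.78794, -78.78123);
// features: { timeOfDay: 14, dayOfWeek: 5, dayOfMonth: 15,
//             monthOfYear: 6, location: "35.788:-78.781" }
```

Rounding the coordinates to three decimal places effectively buckets nearby stops into the same location key, which is what makes the location usable as a categorical feature.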

My first step was to decide whether to write or borrow the KNN algorithm.  After looking at the kind of code that would be needed to write my own, and then looking at some other libraries, I decided to use Accord.Net.

My next step was to get the data via the web service I spun up here:

  1. namespace ChickenSoftware.RoadAlert.Analysis
  2.  
  3. open FSharp.Data
  4. open Microsoft.FSharp.Data.TypeProviders
  5. open Accord.MachineLearning
  6.  
  7. type roadAlert2 = JsonProvider<"http://chickensoftware.com/roadalert/api/trafficstopsearch/Sample">
  8. type MachineLearningEngine =
  9.     static member RoadAlertDoc = roadAlert2.Load("http://chickensoftware.com/roadalert/api/trafficstopsearch")

My next step was to filter the data to only verbal warnings (7) or citations (15):

static member BaseDataSet =
    MachineLearningEngine.RoadAlertDoc
        |> Seq.filter(fun x -> x.DispositionId = 7 || x.DispositionId = 15)
        |> Seq.map(fun x -> x.Id, x.StopDateTime, x.Latitude, x.Longitude, x.DispositionId)
        |> Seq.map(fun (a,b,c,d,e) -> a, b, System.Math.Round(c,3), System.Math.Round(d,3), e)
        |> Seq.map(fun (a,b,c,d,e) -> a, b, c.ToString() + ":" + d.ToString(), e)
        |> Seq.map(fun (a,b,c,d) -> a, b, c, match d with
                                             | 7 -> 0
                                             | 15 -> 1
                                             | _ -> 1)
        |> Seq.map(fun (a,b,c,d) -> a, b.Hour, b.DayOfWeek.GetHashCode(), b.Day, b.Month, c, d)
        |> Seq.toList

You will notice that I had to transform the dispositionIds from 7 and 15 to 0 and 1.  The reason is that the KNN method in Accord.NET assumes that the class label values match index positions in its internal arrays.  I had to dig into the source code of Accord.NET to figure that one out.
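As a minimal sketch of that constraint (illustrative only, not part of the project): with two classes, the labels have to be the zero-based indices 0 and 1, so the raw disposition ids get remapped before training.

```fsharp
// Illustrative only: Accord.NET's KNN expects class labels that are
// zero-based indices (0..classCount-1), not arbitrary ids like 7 and 15.
let toClassIndex dispositionId =
    match dispositionId with
    | 7  -> 0   // verbal warning
    | 15 -> 1   // citation
    | _  -> failwith "unexpected dispositionId"
```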

My next step was to divide the dataset in half: one half being the training sample and the other the validation sample:

static member TrainingSample =
    let midNumber = MachineLearningEngine.NumberOfRecords / 2
    MachineLearningEngine.BaseDataSet
        |> Seq.filter(fun (a,b,c,d,e,f,g) -> a < midNumber)
        |> Seq.toList

static member ValidationSample =
    let midNumber = MachineLearningEngine.NumberOfRecords / 2
    MachineLearningEngine.BaseDataSet
        |> Seq.filter(fun (a,b,c,d,e,f,g) -> a > midNumber)
        |> Seq.toList

The next step was to actually run the KNN.  Before I could do that, though, I had to create the distance function.  Since this was my first time, I dropped the geocoordinates and focused only on the time-of-day derivatives.

static member RunKNN inputs outputs input =
    let distanceFunction (a:int,b:int,c:int,d:int) (e:int,f:int,g:int,h:int) =
        let b1 = b * 4
        let f1 = f * 4
        let d1 = d * 2
        let h1 = h * 2
        float((pown (a-e) 2) + (pown (b1-f1) 2) + (pown (c-g) 2) + (pown (d1-h1) 2))

    let distanceDelegate =
        System.Func<(int * int * int * int),(int * int * int * int),float>(distanceFunction)

    let knn = new KNearestNeighbors<int*int*int*int>(10, 2, inputs, outputs, distanceDelegate)
    knn.Compute(input)

You will notice I tried to normalize the values so that they all had roughly the same scale: Hour runs 0–23, so multiplying DayOfWeek (0–6) by 4 and Month (1–12) by 2 brings them close to the same range.  They are not exact, but they are close.  You will also notice that I had to create a delegate for the distanceFunction (thanks to Mimo on SO).  This is because Accord.NET was written in C# with C# consumers in mind, and F# has a couple of places where the interfaces are not as seamless as one would hope.

In any event, once the KNN function was written, I wrote a function that took the validation sample, made a guess via KNN, and then reported the result:

static member GetValidationsViaKNN =
    let inputs = MachineLearningEngine.TrainingInputClass
    let outputs = MachineLearningEngine.TrainingOutputClass
    let validations = MachineLearningEngine.ValidationClass

    validations
        |> Seq.map(fun (a,b,c,d,e) -> e, MachineLearningEngine.RunKNN inputs outputs (a,b,c,d))
        |> Seq.toList

static member GetSuccessPercentageOfValidations =
    let validations = MachineLearningEngine.GetValidationsViaKNN
    let matches = validations
                    |> Seq.map(fun (a,b) -> match (a=b) with
                                            | true -> 1
                                            | false -> 0)

    let recordCount = validations |> Seq.length
    let numberCorrect = matches |> Seq.sum
    let successPercentage = double(numberCorrect) / double(recordCount)
    recordCount, numberCorrect, successPercentage

I then hopped over to my UI console app and looked at the success percentage.

 

private static void GetSuccessPercentageOfValidations()
{
    var output = MachineLearningEngine.GetSuccessPercentageOfValidations;
    Console.WriteLine(output.Item1.ToString() + ":" + output.Item2.ToString() + ":" + output.Item3.ToString());
}

[image: console output of the validation run]

So there are 12,837 records in the validation sample and the classifier guessed the correct disposition 9,001 times – a success percentage of 70%.

So it looks like there is something there.  However, it is not clear that this is a good classifier without further tests – specifically, comparing it to the naive baseline of always predicting the most common disposition.  Also, I would assume that to make this a true ‘machine learning’ exercise, I would have to feed the results back into the distance function to see if I can alter it to push the success percentage higher.
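One way to sketch that baseline check (the helper below is hypothetical, not part of the project): compute the accuracy you would get by always predicting the validation set's most common class, and compare it to the classifier's 70%.

```fsharp
// Hypothetical helper: accuracy of always predicting the most common class.
// If the majority class alone accounts for, say, 65% of records, then a KNN
// accuracy of 70% is only a modest improvement over blind guessing.
let baselineAccuracy (actuals: int list) =
    let _, majorityCount =
        actuals
        |> List.countBy id      // count occurrences of each class label
        |> List.maxBy snd       // take the most common one
    float majorityCount / float (List.length actuals)
```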

One quick note about methodology – I used unit tests pretty extensively to understand how the KNN works.  I created a series of tests with some sample data to see how the function reacted.

[TestMethod]
public void TestKNN_ReturnsExpected()
{
    Tuple<int, int, int, int>[] inputs = {
        new Tuple<int, int, int, int>(1, 0, 15, 1),
        new Tuple<int, int, int, int>(1, 0, 11, 1)};
    int[] outputs = { 1, 1 };

    var input = new Tuple<int, int, int, int>(1, 1, 1, 1);

    var output = MachineLearningEngine.RunKNN(inputs, outputs, input);
}

This was a big help to get me up and running (walking, really..)…