D3 | Jamie Dixon's Home

Analysis of Health Inspection Data using F#

February 11, 2014 4 Comments

As part of the TRINUG F#/Analytics SIG, I did a public records request from Wake County for all of the restaurant inspections in 2013. If you are not familiar, the inspectors go out and then give a score to the restaurant. The restaurant then has to display their score like this:

After some back and forth, I got the data as an Excel spreadsheet that looks like this

I then loaded the spreadsheet into a sql server and exposed it as some OData endpoints.

// GET odata/Restaurant
[Queryable]
public IQueryable<Restaurant> GetRestaurant()
{
    return db.Restaurants;
}
 
// GET odata/Restaurant(5)
[Queryable]
public SingleResult<Restaurant> GetRestaurant([FromODataUri] int key)
{
    return SingleResult.Create(db.Restaurants.Where(restaurant => restaurant.Id == key));
}

I then dove into the data to see if there were any interesting conclusions to be found. Following my pattern of doing analytics using F# and unit testing using C#, I created a project with the following code:

namespace ChickenSoftware.RestraurantChicken.Analysis
 
open System.Linq
open System.Configuration
open Microsoft.FSharp.Linq
open Microsoft.FSharp.Data.TypeProviders
 
type internal SqlConnection = SqlEntityConnection<ConnectionStringName="azureData">
 
type public RestaurantAnalysis () =
    
    let connectionString = ConfigurationManager.ConnectionStrings.["azureData"].ConnectionString;

Note that I am using the connection string in two places – the 1st for the type provider to do its magic at design time and the second for actually accessing the data at run time. With that set up, the 1st question I had was “ is there seasonality in inspection scores like there are in traffic tickets?” To that end, I created the following function:

member public x.GetAverageScoreByMonth () =
    SqlConnection.GetDataContext(connectionString).Restaurants
        |> Seq.map(fun x -> x.InspectionDate.Value.Month, x.InspectionScore.Value)
        |> Seq.groupBy(fun x -> fst x)
        |> Seq.map(fun (x,y) -> (x,y |> Seq.averageBy snd))
        |> Seq.map(fun (x,y) -> x, System.Math.Round(y,2))
        |> Seq.toArray
        |> Array.sort

This is pretty vanilla F# code, with the tricky part being the average by month (lines 4 and 5 here). What the code is doing is grouping up the 4,000 or so tuples that were created on line 3 into another tuple – with the fst being the groupBy value (in this case month) and then the second tuple being a tuple with the month and score. Then, by averaging up the score of the second tuple, we get an average for each month. I create a unit (really integration) test like so:

[TestMethod]
public void GetAverageScoreByMonth_ReturnsTwelveItems()
{
    var analysis = new RestaurantAnalysis();
    var scores = analysis.GetAverageScoreByMonth();
    Int32 expected = 12;
    Int32 actual = scores.Length;
    Assert.AreEqual(expected, actual);
}

And the result ran green.

Putting a break on the Assert and a watch on scores, you can see the values:

A couple of things stand out

1) The overall average is around 96 and change

2) There does not seem to be any significant variance among months.

Since I am trying to also teach myself D3, I then added a MVC5 project to my solution and added an analysis controller that calls the function in the analysis module and serves the results as json:

public JsonResult AverageScoreByMonth()
{
    var analysis = new RestaurantAnalysis();
    var scores = analysis.GetAverageScoreByMonth();
    return Json(scores,JsonRequestBehavior.AllowGet);
}

I then made a page with a simple D3 chart that calls this controller

@{
    Layout = "~/Views/Shared/_Layout.cshtml";
}
 
<svg class="chart"></svg>
 
<style>
    .bar {
        fill: steelblue;
    }
 
        .bar:hover {
            fill: brown;
        }
 
    .axis {
        font: 10px sans-serif;
    }
 
        .axis path,
        .axis line {
            fill: none;
            stroke: #000;
            shape-rendering: crispEdges;
        }
 
    .x.axis path {
        display: none;
    }
</style>
 
 
 
<script>
 
    var margin = { top: 20, right: 20, bottom: 30, left: 40 },
        width = 960 – margin.left – margin.right,
        height = 500 – margin.top – margin.bottom;
 
    var x = d3.scale.ordinal()
        .rangeRoundBands([0, width], .1);
 
    var y = d3.scale.linear()
        .range([height, 0]);
 
    var xAxis = d3.svg.axis()
        .scale(x)
        .orient("bottom");
 
    var yAxis = d3.svg.axis()
        .scale(y)
        .orient("left")
        .ticks(10, "%");
 
    var svg = d3.select("body").append("svg")
        .attr("width", width + margin.left + margin.right)
        .attr("height", height + margin.top + margin.bottom)
      .append("g")
        .attr("transform", "translate(" + margin.left + "," + margin.top + ")");
 
 
 
    $.ajax({
        url: "http://localhost:3057/Analysis/AverageScoreByMonth/&quot;,
        dataType: "json",
        success: function (data) {
            x.domain(data.map(function (d) { return d.Item1; }));
            y.domain([0, d3.max(data, function (d) { return d.Item2; })]);
 
            svg.append("g")
                .attr("class", "x axis")
                .attr("transform", "translate(0," + height + ")")
                .call(xAxis);
 
            svg.append("g")
                .attr("class", "y axis")
                .call(yAxis)
              .append("text")
                .attr("transform", "rotate(-90)")
                .attr("y", 6)
                .attr("dy", ".71em")
                .style("text-anchor", "end")
                .text("Frequency");
 
            svg.selectAll(".bar")
                .data(data)
              .enter().append("rect")
                .attr("class", "bar")
                .attr("x", function (d) { return x(d.Item1); })
                .attr("width", x.rangeBand())
                .attr("y", function (d) { return y(d.Item2); })
                .attr("height", function (d) { return height – y(d.Item2); });
 
        },
        error: function (e) {
            alert("error");
        }
    });
 
    function type(d) {
        d.Item2 = +d.Item2;
        return d;
    }
</script>

And when I run it, a run-of-the mill barchart (I did have to adjust the F# to shift the decimal to the left two positions so that I could match the scale of the chart’s template. For me, it is easier to alter the F# than the javascript:

Following this pattern, I did some other season analysis like average by DayOfMonth

DayOf Week.

So there does not seem to be any seasonality in inspection scores.

I then did an average of inspectors

And there looks to be some variance, but it is getting lost of the scale of the map. The problem is that the range of the scores is not 0 to 100

Here is a function that counts the number of scores (rounded to 0)

member public x.CountOfRoundedScores () =
    SqlConnection.GetDataContext(connectionString).Restaurants
        |> Seq.map(fun x -> System.Math.Round(x.InspectionScore.Value,0), x.InspectionID)
        |> Seq.groupBy(fun x -> fst x)
        |> Seq.map(fun (x,y) -> (x,y |> Seq.countBy snd))
        |> Seq.map(fun (x,y) -> (x,y |> Seq.sumBy snd))
        |> Seq.toArray

That graphically looks like:

So back to inspectors, I needed to adjust the scale from 0 to 100 to 80 to 100. I also needed to remove the null inspection Ids and the records that were for the ‘test facility’ and the 6 records that were below 80.

member public x.AverageScoreByInspector () =
    SqlConnection.GetDataContext(connectionString).Restaurants
        |> Seq.filter(fun x -> x.EstablishmentName <> "Test Facility")
        |> Seq.filter(fun x -> x.InspectionScore.Value > 80.)
        |> Seq.filter(fun x -> x.InspectionID <> null)
        |> Seq.map(fun x -> x.InspectorID, x.InspectionScore.Value)
        |> Seq.groupBy(fun x -> fst x)
        |> Seq.map(fun (x,y) -> (x,y |> Seq.averageBy snd))
        |> Seq.map(fun (x,y) -> x, y/100.)
        |> Seq.map(fun (x,y) -> x, System.Math.Round(y,4))
        |> Seq.toArray
        |> Array.sort

I then adjusted the scale of the inspector graph to have to domain from 80 to 100 (versus 0 to 100) and the scale of the y axis. This was a good article explaining Scales and Domains in D3.

var yAxis = d3.svg.axis()
    .scale(y)
    .orient("left")
    .ticks(10);

$.ajax({
    url: "http://localhost:3057/Analysis/AverageScoreByInspector/&quot;,
    dataType: "json",
    success: function (data) {
        x.domain(data.map(function (d) { return d.Item1; }));
        y.domain([80, d3.max(data, function (d) { return d.Item2; })]);

and now there is pretty good graph showing the variance among inspectors:

So the interesting this is that #1168 is 2 below the average – which of a domain of 10 is pretty significant. Interestingly, 1168 is also the inspector who has all of the “Test facility” records – so they are probably the trainer and/or lead inspector. With this analysis in the back pocket, ran a function that did the inspection score by establishment type:

This is kinda interesting (esp that pushcarts got the highest scores) but I wanted to see if there was any truth the the common perception that Chinese restaurants are less sanitary than other kinds of restaurants. To that end, I created a rudimentary classifier that searched the name of the establishment to see if it had a name that is typically associated with fast-food Chinese:

member public x.IsEstablishmentAChineseRestraurant (establishmentName:string) =
    let upperCaseEstablishmentName = establishmentName.ToUpper()
    let numberOfMatchedWords = upperCaseEstablishmentName.Split(' ')
                                |> Seq.map(fun x -> match x with
                                                        | "ASIA" -> 1
                                                        | "ASIAN" -> 1
                                                        | "CHINA" -> 1
                                                        | "CHINESE" -> 1
                                                        | "PANDA" -> 1
                                                        | "PEKING" -> 1
                                                        | "WOK" -> 1
                                                        | _ -> 0)
                                |> Seq.sum
    match numberOfMatchedWords with
        | 0 -> false
        | _ -> true

I then created a function that returned the average and ran my unit tests.

[TestMethod]
public void IsEstablishmentAChineseRestraurantUsingWOK_ReturnsTrue()
{
    var analysis = new RestaurantAnalysis();
    String establishmentName = "JAMIE'S WOK";
 
    var expected = true;
    var actual = analysis.IsEstablishmentAChineseRestraurant(establishmentName);
    Assert.AreEqual(expected, actual);
}
 
[TestMethod]
public void IsEstablishmentAChineseRestraurantUsingWok_ReturnsTrue()
{
    var analysis = new RestaurantAnalysis();
    String establishmentName = "Jamie's Wok";
 
    var expected = true;
    var actual = analysis.IsEstablishmentAChineseRestraurant(establishmentName);
    Assert.AreEqual(expected, actual);
}
 
[TestMethod]
public void AverageScoreForChineseRestaurants_ReturnsExpected()
{
    var analysis = new RestaurantAnalysis();
    var actual = analysis.AverageScoreForChineseRestaurants();
    Assert.IsNotNull(actual);
}

When a break was put on the value of the average, it was apparent that Chinese restaurants scored significantly lower than the average of 96

So then I applied 1 more segmentation: Chinese versus Non-Chinese scores by inspector:

member public x.AverageScoresOfChineseAndNonChineseByInspector () =
    let dataSet = SqlConnection.GetDataContext(connectionString).Restaurants
                    |> Seq.map(fun x -> x.EstablishmentName, x.InspectorID,x.InspectionScore.Value)
    let chineseRestraurants = dataSet
                                |> Seq.filter(fun (a,b,c) -> x.IsEstablishmentAChineseRestraurant(a))
                                |> Seq.map(fun (a,b,c) -> b,c)
                                |> Seq.groupBy(fun x -> fst x)
                                |> Seq.map(fun (x,y) -> (x,y |> Seq.averageBy snd))
                                |> Seq.map(fun (x,y) -> x, System.Math.Round(y,2))
                                |> Seq.toArray
                                |> Array.sort
    let nonChineseRestraurants = dataSet
                                |> Seq.filter(fun (a,b,c) -> not(x.IsEstablishmentAChineseRestraurant(a)))
                                |> Seq.map(fun (a,b,c) -> b,c)
                                |> Seq.groupBy(fun x -> fst x)
                                |> Seq.map(fun (x,y) -> (x,y |> Seq.averageBy snd))
                                |> Seq.map(fun (x,y) -> x, System.Math.Round(y,2))
                                |> Seq.toArray
                                |> Array.sort
    Seq.zip chineseRestraurants nonChineseRestraurants
           |> Seq.map(fun ((a,b),(c,d)) -> a,b,d)
           |> Seq.toList

And in graphics using a double-bar chart:

So this is kinda interesting. The lead inspector (1168) who grades everyone lower actually gives Chinese restaurants higher marks. Everyone else pretty much grades Chinese restaurants lower except for 1 inspector. Also, 1708 must really not like Chinese restaurants – or their inspection list has a series of really bad Chinese restaurants.

Note that this may not be statistically significant (I didn’t control for sample size, etc..) – but further analysis might be warranted, no? If you are interested, here is the endpoint: http://restaurantchicken.cloudapp.net/odata/Restaurant

Finally, when I presented this analysis to TRINUG last week, lots of people became interested in F# and analytics (ok, maybe 3). You can see the comments here. Also, I now have an appointment with the head of the health department department and the CIO of Wake County later this week – let’s see what they say…

Filed under Analytics, D3, F#

Traffic Stop Visualization Using D3

January 28, 2014 Leave a comment

One of the comments I got from the TRINUG Data SIG was that the data and analysis were exciting but the results were, well, boring. So I went back to the traffic stop data and thought about how I could sexy it up. Since lots of people have been using D3 to present the data. I thought it would be a good place to start.

My 1st step was to look at their samples page and they have a simple bar chart that seems like a good introduction to the library. I created an endpoint on my webapi for the summary data like so:

[HttpGet]
[Route("api/TrafficStopSearch/StopsByMonth/")]
public dynamic StopsByMonth()
{
    return ChickenSoftware.RoadAlert.Analysis.AnalysisEngine.TrafficStopsByMonth;
 
}

I then spun up an empty asp.net website and created an index page based on their sample. I then added an ajax call to the controller and replaced the reference to the data.tsv file:

$.ajax({
   url: "http://localhost:17680/api/TrafficStopSearch/StopsByMonth/&quot;,
   dataType: "json",
   success: function (data) {
       x.domain(data.map(function (d) { return d.m_Item1; }));
       y.domain([0, d3.max(data, function (d) { return d.m_Item6; })]);
 
       svg.append("g")
           .attr("class", "x axis")
           .attr("transform", "translate(0," + height + ")")
           .call(xAxis);
 
       svg.append("g")
           .attr("class", "y axis")
           .call(yAxis)
         .append("text")
           .attr("transform", "rotate(-90)")
           .attr("y", 6)
           .attr("dy", ".71em")
           .style("text-anchor", "end")
           .text("Frequency");
 
       svg.selectAll(".bar")
           .data(data)
         .enter().append("rect")
           .attr("class", "bar")
           .attr("x", function (d) { return x(d.m_Item1); })
           .attr("width", x.rangeBand())
           .attr("y", function (d) { return y(d.m_Item6); })
           .attr("height", function (d) { return height – y(d.m_Item6); });
       
   },
   error: function(e){
       alert("error");
   }
});

On thing to note is that the tuple that was created in F# and then passed though via the C# controller had its name changed. Specially, Tuple.Item1 became Tuple.m_Item1. I think that passing out tuple.anything is a horrible idea, so I created a POCO that actually lets the consumer know what each field means:

public class OutputValue
{
    public Int32 Month { get; set; }
    public Int32 ExpectedStops { get; set; }
    public Int32 ActualStops { get; set; }
    public Double DifferenceBetweenExpectedAndActual { get; set; }
    public Double PercentDifferenceBetweenExpectedAndActual { get; set; }
    public Double Frequency { get; set; }
}

and then I adjusted the controller like so:

 
[HttpGet]
[Route("api/TrafficStopSearch/StopsByMonth/")]
public dynamic StopsByMonth()
{
    var outputs = new List<OutputValue>();
    var resultSet= ChickenSoftware.RoadAlert.Analysis.AnalysisEngine.TrafficStopsByMonth;
    foreach (var tuple in resultSet)
    {
        var outputValue = new OutputValue()
        {
            Month = tuple.Item1,
            ExpectedStops = tuple.Item2,
            ActualStops = tuple.Item3,
            DifferenceBetweenExpectedAndActual = tuple.Item4,
            PercentDifferenceBetweenExpectedAndActual = tuple.Item5,
            Frequency = tuple.Item6
        };
        outputs.Add(outputValue);
    }
 
    return outputs;
}

So I adjusted the javascript and voila: a bar chart:

Up next – some real charts…

Filed under D3

Jamie Dixon's Home

Analysis of Health Inspection Data using F#

Traffic Stop Visualization Using D3

Categories

Recent Posts

Archives

Blogroll

Meta