Association Rule Learning Via F# (Part 2)

Continuing on the path of re-writing the association rule learning code found in this month's MSDN, I started with the next function on the list:

MakeAntecedent

Here is the original C# code:

  1. public static int[] MakeAntecedent(int[] itemSet, int[] comb)
  2. {
  3.   // if item-set = (1 3 4 6 8) and combination = (0 2)
  4.   // then antecedent = (1 4)
  5.   int[] result = new int[comb.Length];
  6.   for (int i = 0; i < comb.Length; ++i)
  7.   {
  8.     int idx = comb[i];
  9.     result[i] = itemSet[idx];
  10.   }
  11.   return result;
  12. }

 

and the F# code:

  1. static member MakeAntecedent(itemSet:int[] , comb:int[]) =
  2.     comb |> Array.map(fun x -> itemSet.[x])

 

It is much easier to figure out what is going on via the F# code.  The function takes in 2 arrays: the first has the values, and the second has the indexes into the first array that are needed.  Using Array.map, I return an array where each index number is swapped out for the actual value.  The unit tests run green:

  1. [TestMethod]
  2. public void MakeAntecedentCSUsingExample_ReturnsExpectedValue()
  3. {
  4.     int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
  5.     int[] combo = new int[2] { 0, 2 };
  6.     int[] expected = new int[2] { 1, 4 };
  7.     var actual = CS.AssociationRuleProgram.MakeAntecedent(itemSet, combo);
  8.     Assert.AreEqual(expected.Length, actual.Length);
  9.     Assert.AreEqual(expected[0], actual[0]);
  10.     Assert.AreEqual(expected[1], actual[1]);
  11. }
  12.  
  13. [TestMethod]
  14. public void MakeAntecedentFSUsingExample_ReturnsExpectedValue()
  15. {
  16.     int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
  17.     int[] combo = new int[2] { 0, 2 };
  18.     int[] expected = new int[2] { 1, 4 };
  19.     var actual = FS.AssociationRuleProgram.MakeAntecedent(itemSet, combo);
  20.     Assert.AreEqual(expected.Length, actual.Length);
  21.     Assert.AreEqual(expected[0], actual[0]);
  22.     Assert.AreEqual(expected[1], actual[1]);
  23. }

 

MakeConsequent

Here is the original C# code:

  1. public static int[] MakeConsequent(int[] itemSet, int[] comb)
  2. {
  3.   // if item-set = (1 3 4 6 8) and combination = (0 2)
  4.   // then consequent = (3 6 8)
  5.   int[] result = new int[itemSet.Length - comb.Length];
  6.   int j = 0; // ptr into combination
  7.   int p = 0; // ptr into result
  8.   for (int i = 0; i < itemSet.Length; ++i)
  9.   {
  10.     if (j < comb.Length && i == comb[j]) // we are at an antecedent
  11.       ++j; // so continue
  12.     else
  13.       result[p++] = itemSet[i]; // at a consequent so add it
  14.   }
  15.   return result;
  16. }

 

Here is the F# Code:

  1. static member MakeConsequent(itemSet:int[] , comb:int[])=   
  2.     let isNotInComb x = not(Array.exists(fun elem -> elem = x) comb)
  3.     itemSet
  4.         |> Array.mapi(fun indexer value -> value,indexer )
  5.         |> Array.filter(fun (value,indexer) -> isNotInComb indexer)
  6.         |> Array.map(fun x -> fst x)

 

Again, it is easier to look at the F# code to figure out what is going on.  In this case, we have to take all of the items in the first array that are not referenced by the second array.  The trick is that the second array does not contain values to be checked, but rather index positions.  If you combine the antecedent and the consequent, you get back the entire original array.

This code took me a bit of time to figure out because I kept trying to use the out-of-the-box Array features (including slicing) in F# when it hit me that it would be much easier to create a tuple from the original array –> the value and the index.  I could then look up each index in the second array, confirm it is not there, and keep only the ones that are missing.  The map function at the end removes the index part of the tuple because it is not needed anymore.
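As an aside, here is a minimal alternative sketch of the same idea (my code, not McCaffrey's): build the full index range, drop the antecedent indices, and look the survivors up in the item-set.

    // Build every index, remove the ones named in comb, then resolve the rest to values.
    let makeConsequent (itemSet: int[]) (comb: int[]) =
        [| 0 .. itemSet.Length - 1 |]
        |> Array.filter (fun idx -> not (Array.exists ((=) idx) comb))
        |> Array.map (fun idx -> itemSet.[idx])

    // makeConsequent [| 1; 3; 4; 6; 8 |] [| 0; 2 |] returns [| 3; 6; 8 |]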

Sure enough, my unit tests ran green:

  1. [TestMethod]
  2. public void MakeConsequentCSUsingExample_ReturnsExpectedValue()
  3. {
  4.     int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
  5.     int[] combo = new int[2] { 0, 2 };
  6.     int[] expected = new int[3] { 3, 6, 8 };
  7.     var actual = CS.AssociationRuleProgram.MakeConsequent(itemSet, combo);
  8.     Assert.AreEqual(expected.Length, actual.Length);
  9.     Assert.AreEqual(expected[0], actual[0]);
  10.     Assert.AreEqual(expected[1], actual[1]);
  11.     Assert.AreEqual(expected[2], actual[2]);
  12. }
  13.  
  14. [TestMethod]
  15. public void MakeConsequentFSUsingExample_ReturnsExpectedValue()
  16. {
  17.     int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
  18.     int[] combo = new int[2] { 0, 2 };
  19.     int[] expected = new int[3] { 3, 6, 8 };
  20.     var actual = FS.AssociationRuleProgram.MakeConsequent(itemSet, combo);
  21.     Assert.AreEqual(expected.Length, actual.Length);
  22.     Assert.AreEqual(expected[0], actual[0]);
  23.     Assert.AreEqual(expected[1], actual[1]);
  24.     Assert.AreEqual(expected[2], actual[2]);
  25. }

 

IndexOf

I then decided to tackle the remaining three functions in reverse order because they depend on each other (CountInTrans –> IsSubsetOf –> IndexOf).  IndexOf did not have any code comments or example cases, but the C# code is clear:

  1. public static int IndexOf(int[] array, int item, int startIdx)
  2. {
  3.   for (int i = startIdx; i < array.Length; ++i)
  4.   {
  5.     if (i > item) return -1; // i is past where the target could possibly be
  6.     if (array[i] == item) return i;
  7.   }
  8.   return -1;
  9. }

 

What is even clearer is the F# code that does the same thing (yes, I am happy that FindIndex returns a -1 when not found, just like McCaffrey's code does):

  1. static member IndexOf(array:int[] , item:int, startIdx:int) =
  2.     Array.FindIndex(array, fun x -> x=item)
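One note: the one-liner above does not use the startIdx parameter.  If that ever needs to be honored the way it is in the C# version, System.Array.FindIndex has an overload that takes a start index; a minimal sketch (my naming, not McCaffrey's):

    // Honors the start index via the System.Array.FindIndex overload that takes one;
    // it still returns -1 when the item is not found.
    let indexOfFrom (array: int[]) (item: int) (startIdx: int) =
        System.Array.FindIndex(array, startIdx, fun x -> x = item)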

 

And I built some unit tests that run green and that I think reflect McCaffrey's intent:

  1. [TestMethod]
  2. public void IndexOfCSUsingExample_ReturnsExpectedValue()
  3. {
  4.     int[] itemSet = new int[4] { 0, 1, 4, 5 };
  5.     Int32 item = 1;
  6.     Int32 startIndx = 1;
  7.  
  8.     int expected = 1;
  9.     int actual = CS.AssociationRuleProgram.IndexOf(itemSet, item, startIndx);
  10.  
  11.     Assert.AreEqual(expected, actual);
  12. }
  13. [TestMethod]
  14. public void IndexOfFSUsingExample_ReturnsExpectedValue()
  15. {
  16.     int[] itemSet = new int[4] { 0, 1, 4, 5 };
  17.     Int32 item = 1;
  18.     Int32 startIndx = 1;
  19.  
  20.     int expected = 1;
  21.     int actual = FS.AssociationRuleProgram.IndexOf(itemSet, item, startIndx);
  22.  
  23.     Assert.AreEqual(expected, actual);
  24. }

 

IsSubsetOf

In the C# implementation, IndexOf is called to keep track of where the search is currently pointed. 

  1. public static bool IsSubsetOf(int[] itemSet, int[] trans)
  2. {
  3.   // 'trans' is an ordered transaction like [0 1 4 5 8]
  4.   int foundIdx = -1;
  5.   for (int j = 0; j < itemSet.Length; ++j)
  6.   {
  7.     foundIdx = IndexOf(trans, itemSet[j], foundIdx + 1);
  8.     if (foundIdx == -1) return false;
  9.   }
  10.   return true;
  11. }

In the F# version, that is not needed:

  1. static member IsSubsetOf(itemSet:int[] , trans:int[]) =
  2.     let isInTrans x = (Array.exists(fun elem -> elem = x) trans)
  3.     let filteredItemSet = itemSet
  4.                             |> Array.map(fun value -> value, isInTrans value)
  5.                             |> Array.filter(fun (value, trans) -> trans = false)
  6.     if filteredItemSet.Length = 0 then true
  7.         else false
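As a side note, a shorter equivalent sketch using Array.forall (same behavior, just more direct): every item in the item-set must appear somewhere in the transaction.

    // True only when every item-set element exists in the transaction.
    let isSubsetOf (itemSet: int[]) (trans: int[]) =
        itemSet |> Array.forall (fun item -> Array.exists ((=) item) trans)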

CountInTrans

Here is the original C# code, which uses the IsSubsetOf function:

  1. public static int CountInTrans(int[] itemSet, List<int[]> trans, Dictionary<int[], int> countDict)
  2. {
  3.    //number of times itemSet occurs in transactions, using a lookup dict
  4.  
  5.     if (countDict.ContainsKey(itemSet) == true)
  6.     return countDict[itemSet]; // use already computed count
  7.  
  8.   int ct = 0;
  9.   for (int i = 0; i < trans.Count; ++i)
  10.     if (IsSubsetOf(itemSet, trans[i]) == true)
  11.       ++ct;
  12.   countDict.Add(itemSet, ct);
  13.   return ct;
  14. }

And here is the F# code, which also uses that subfunction:

  1. static member CountInTrans(itemSet: int[], trans: List<int[]>, countDict: Dictionary<int[], int>) =
  2.     let trans' = trans |> Seq.map(fun value -> value, AssociationRuleProgram.IsSubsetOf (itemSet,value))
  3.     trans' |> Seq.filter(fun item -> snd item = true)
  4.            |> Seq.length

 

 GetHighConfRules

With the subfunctions created and running green, I then tackled the point of the exercise –> GetHighConfRules.  The C# implementation is pretty verbose and there are lots of things happening: 

  1.     public static List<Rule> GetHighConfRules(List<int[]> freqItemSets, List<int[]> trans, double minConfidencePct)
  2.     {
  3.       // generate candidate rules from freqItemSets, save rules that meet min confidence against transactions
  4.       List<Rule> result = new List<Rule>();
  5.  
  6.       Dictionary<int[], int> itemSetCountDict = new Dictionary<int[], int>(); // count of item sets
  7.  
  8.       for (int i = 0; i < freqItemSets.Count; ++i) // each freq item-set generates multiple candidate rules
  9.       {
  10.         int[] currItemSet = freqItemSets[i]; // for clarity only
  11.         int ctItemSet = CountInTrans(currItemSet, trans, itemSetCountDict); // needed for each candidate rule
  12.         for (int len = 1; len <= currItemSet.Length - 1; ++len) // antecedent len = 1, 2, 3, . .
  13.         {
  14.           int[] c = NewCombination(len); // a mathematical combination
  15.  
  16.           while (c != null) // each combination makes a candidate rule
  17.           {
  18.             int[] ante = MakeAntecedent(currItemSet, c);
  19.             int[] cons = MakeConsequent(currItemSet, c); // could defer this until known if needed
  20.           
  21.             int ctAntecendent = CountInTrans(ante, trans, itemSetCountDict); // use lookup if possible
  22.             double confidence = (ctItemSet * 1.0) / ctAntecendent;
  23.  
  24.             if (confidence >= minConfidencePct) // we have a winner!
  25.             {
  26.               Rule r = new Rule(ante, cons, confidence);
  27.               result.Add(r); // if freq item-sets are distinct, no dup rules ever created
  28.             }
  29.             c = NextCombination(c, currItemSet.Length);
  30.           } // while each combination
  31.         } // len each possible antecedent for curr item-set
  32.       } // i each freq item-set
  33.  
  34.       return result;
  35.     } // GetHighConfRules

In the F# code, I decided to work inside out and get the rules for a single item-set.  I think the code reads pretty clearly, with each step laid out:

  1. static member GetHighConfRules(freqItemSets:List<int[]>, trans:List<int[]>,  minConfidencePct:float) =
  2.     let returnValue = new List<Rule>()
  3.     freqItemSets
  4.         |> Seq.map (fun i -> i, AssociationRuleProgram.CountInTrans'(i,trans))
  5.         |> Seq.filter(fun (i,c) -> (float)c > minConfidencePct)
  6.         |> Seq.map(fun (i,mcp) -> i,mcp,AssociationRuleProgram.MakeAntecedent(i, trans.[0]))
  7.         |> Seq.map(fun (i,mcp,a) -> i,mcp, a, AssociationRuleProgram.MakeConsequent(i, trans.[0]))
  8.         |> Seq.iter(fun (i,mcp,a,c) -> returnValue.Add(new Rule(a,c,mcp)))
  9.     returnValue

I then attempted to put this block into a larger block (trans.[0]), but then I realized that I was going about this the wrong way.  Instead of using the C# code as my baseline, I need to approach the problem from a functional viewpoint.  That will be the subject of my blog next week…

Association Rule Learning Via F# (Part 1)

I was reading the most recent MSDN when I came across this article.  How awesome is this?  McCaffrey did a great job explaining a really interesting area of analytics and I am loving the fact that MSDN is including articles about data analytics.  When I was reading the article, I ran across this sentence: "The demo program is coded in C# but you should be able to refactor the code to other .NET languages such as Visual Basic or Iron Python without too much difficulty."  Iron Python?  Iron Python!  What about F#, the language that matches analytics the way peanut butter goes with chocolate?  Challenge accepted!

The first thing I did was to download his source code from here.  When I first opened the source code, I realized that the code would be a little bit hard to port because it is written from a scientific angle, not a business application point of view.  34 FxCop errors in 259 lines of code confirmed this:

image

Also, there are tons of comments, which I find very distracting.  I generally hate comments, but I figure that since it is an MSDN article and it is supposed to explain what is going on, comments are OK.  However, many of the comments can be refactored into more descriptive variable and method names.  For example:

imageimage

In any event, let's look at the code.  The first thing I did was change the CS project from a console app to a library and move the test data into another project.  I then moved the console code to the UI.  I also moved the Rule class code into its own file, made sure the namespaces matched, and made the AssociationRuleProgram public.  Yup, it still runs:

imageimage

So then I created a FSharp library in the solution and set up the class with the single method:

image

A couple of things to note:

1) I left the parameter naming the same, even though it is not particularly intention-revealing

2) F# is type-inferred, so I don't have to assign the types to the parameters

Next, I started looking at the supporting functions to GetHighConfRules.  Up first was the function called NewCombination.  Here is the side-by-side between the imperative style and the functional style:

imageimage

The next function, NextCombination, was more difficult for me to understand.  I stopped what I was doing and built a unit test project that proved correctness, using the commented examples as the expected values.  I used one test project for both the C# and F# projects so I could see both side by side.  An interesting side note is that the unit test naming is different than usual –> instead of naming the class XXXXTests where XXXX is the name of another class, XXXX is the function name that both classes are implementing:

So going back to the example,

image

I wrote two unit tests that match the two comments

image

When I ran the tests, the 1st test passed but the second did not:

image

The problem with the failing test is that null is not being returned, but rather {3,4,6}.  So now I have a problem: do I base the F# implementation on the code comments or the code itself?  I decided to base it on the code, because comments often lie but CODE DON'T LIE (thanks 'Sheed).  I adjusted the unit test and got green.

One of the reasons the code is pretty hard to read/understand is the use of 'i', 'j', 'k', and 'n' as variable names.  I went back to the article and McCaffrey explains what is going on at the bottom left of page 60.  Another name for the function 'NextCombination' could be 'GetLexicographicalSuccessor', and the variable 'n' could be called 'numberOfPossibleItems'.  With that mental vocabulary in place, I went through the function and divided it into 4 parts:

1) Checking to see if the value of the first element is of a certain length

image

2) Creating a result array that is seeded with the values of the input array

image

3) Looping backwards to identify the 1st number in the array that will be adjusted

image

4) From that target element, looping forward and adjusting all subsequent items

image

#1 I will not worry about now and #2 is not needed in F#, so #3 is the first place to start.  What I need is a way of splitting the array into two parts.  Part 1 has the original values that will not change and part 2 has the values that will change.  Seq.Take and Seq.Skip are perfect for this:

  1. let i = Array.LastIndexOf(comb,n)
  2. let i' = if i = -1 then 0 else i
  3. let comb' = comb |> Seq.take(i') |> Seq.toArray
  4. let comb'' = comb |> Seq.skip(i') |> Seq.toArray

Looking at #4, I now need to increment the values in part 2 by 1.  Seq.scan will work:

image

And then putting part 1 and part 2 back together via Array.Append, we have equivalence*:

imageimage

*Equivalence is defined by my unit tests, which both pass green.  I have no idea whether other inputs will work.  Note that the second unit test runs red, so I really think that the code is wrong and that the comment to return null is correct.  The value I am getting for (3;4;5)(5) is (3;4;1), which seems to make sense.

I am not crazy about these explanatory variables (comb’, comb’’, and comb’’’) but I am not sure how to combine them without sacrificing readability.  I definitely want to combine the i and i’ into 1 statement…
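Here is one sketch of folding that pair into a single binding (same Array.LastIndexOf call as above; -1 means "not found", which maps to a starting index of 0):

    // Pattern match on the lookup result instead of keeping two bindings.
    let i' =
        match Array.LastIndexOf(comb, n) with
        | -1 -> 0
        | i -> i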

I am not sure why Scan is returning 4 items in an array when I am passing in an array that has a length of 3.  I am running out of time today so I just hacked in a Seq.Take.
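(For what it is worth, Seq.scan always yields the initial accumulator as its first element and then one element per input item, which is why a 3-element input produces 4 results.  A tiny example:)

    // Running totals: the leading 0 is the seed; Seq.skip 1 drops it if it is not wanted.
    let runningTotals = [ 1; 2; 3 ] |> Seq.scan (+) 0 |> Seq.toList
    // runningTotals = [0; 1; 3; 6]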

I’ll continue this exercise in my blog next week.

 

Kaplan-Meier Survival Analysis Using F#

I was reading the most recent issue of MSDN a couple of days ago when I came across this article on doing a Kaplan-Meier survival analysis.  I thought the article was great and I am excited that MSDN is starting to publish articles on data analytics.  However, I did notice that there wasn’t any code in the article, which is odd, so I went to the on-line article and others had a similar question:

image

I decided to implement a Kaplan-Meier survival (KMS) analysis using F#.  After reading the article a couple of times, I was still a bit unclear on how the KMS is implemented, and there does not seem to be a pre-rolled implementation in the standard .NET stats libraries out there.  I went on over to this site, where there was an excellent description of how the survival probability is calculated.  I went ahead and built an Excel spreadsheet to match the NIH one and then compared it to what Topol is doing:

image

Notice that Topol censored the data for the article.  If we only cared about the probability of crashes, then we would not censor the data for when the device was turned off.
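For reference, the number the spreadsheet builds up is the standard Kaplan-Meier product: at each event time, the running survival probability is multiplied by (1 - crashes / devices still at risk).  A minimal sketch of that product as a fold, with made-up counts (the names are mine, not from the article):

    // (crashesAtTime, devicesAtRisk) pairs in time order -> cumulative survival probability
    let kaplanMeier (steps: (int * int) list) =
        steps |> List.fold (fun survival (crashes, atRisk) -> survival * (1.0 - float crashes / float atRisk)) 1.0

    // kaplanMeier [ (1, 10); (1, 5) ] = 0.9 * 0.8 = 0.72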

I was then ready to start coding, so I spun up a solution with an F# project for the analysis and a C# project for the testing.

image

I then loaded the dataset that Topol used into the unit test project:

  1. [TestMethod]
  2. public void EstimateForApplicationX_ReturnsExpected()
  3. {
  4.     var appX = new CrashMetaData[]
  5.     {
  6.         new CrashMetaData(0,1,false),
  7.         new CrashMetaData(1,5,true),
  8.         new CrashMetaData(2,5,false),
  9.         new CrashMetaData(3,8,false),
  10.         new CrashMetaData(4,10,false),
  11.         new CrashMetaData(5,12,true),
  12.         new CrashMetaData(6,15,false),
  13.         new CrashMetaData(7,18,true),
  14.         new CrashMetaData(8,21,false),
  15.         new CrashMetaData(9,22,true),
  16.     };
  17. }

I could then wire up the unit tests to compare the output to the article and what I had come up with.

  1. public void EstimateForApplicationX_ReturnsExpected()
  2. {
  3.     var appX = new CrashMetaData[]
  4.     {
  5.         new CrashMetaData(0,1,false),
  6.         new CrashMetaData(1,5,true),
  7.         new CrashMetaData(2,5,false),
  8.         new CrashMetaData(3,8,false),
  9.         new CrashMetaData(4,10,false),
  10.         new CrashMetaData(5,12,true),
  11.         new CrashMetaData(6,15,false),
  12.         new CrashMetaData(7,18,true),
  13.         new CrashMetaData(8,21,false),
  14.         new CrashMetaData(9,22,true),
  15.     };
  16.  
  17.     var expected = new SurvivalProbabilityData[]
  18.     {
  19.         new SurvivalProbabilityData(0,1.000),
  20.         new SurvivalProbabilityData(5,.889),
  21.         new SurvivalProbabilityData(12,.711),
  22.         new SurvivalProbabilityData(18,.474),
  23.         new SurvivalProbabilityData(22,.000)
  24.     };
  25.  
  26.     KaplanMeierEstimator estimator = new KaplanMeierEstimator();
  27.     var actual = estimator.CalculateSurvivalProbability(appX);
  28.  
  29.     Assert.AreSame(expected, actual);
  30. }

 

However, one of the neat features of F# is the REPL, so I don't need to keep running unit tests to prove correctness when I am proving out a concept.  So I added equivalent test code at the beginning of the F# project so I could try out my ideas in the REPL:

  1. type CrashMetaData = {userId: int; crashTime: int; crashed: bool}
  2.  
  3. type KaplanMeierAnalysis() =
  4.     member this.GenerateXAppData ()=
  5.                     [|  {userId=0; crashTime=1; crashed=false};{userId=1; crashTime=5; crashed=true};
  6.                         {userId=2; crashTime=5; crashed=false};{userId=3; crashTime=8; crashed=false};
  7.                         {userId=4; crashTime=10; crashed=false};{userId=5; crashTime=12; crashed=true};
  8.                         {userId=6; crashTime=15; crashed=false};{userId=7; crashTime=18; crashed=true};
  9.                         {userId=8; crashTime=21; crashed=false};{userId=9; crashTime=22; crashed=true}|]
  10.     
  11.     member this.RunAnalysis(crashMetaData: array<CrashMetaData>) =

The first thing I did was duplicate the 1st 3 columns of the Excel spreadsheet:

  1. let crashSequence = crashMetaData
  2.                         |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
  3.                                                                                 | true -> 1
  4.                                                                                 | false -> 0),
  5.                                                                  (match crash.crashed with
  6.                                                                                 | true -> 0
  7.                                                                                 | false -> 1))

 

In the REPL:

image

The fourth column is tricky because it is a cumulative calculation.  Instead of for..eaching in an imperative style, I took advantage of the functional language constructs to make the code much more readable.  Once I calculated that column outside of the base sequence, I added it back in via Seq.Zip:

  1. let cumulativeDevices = crashMetaData.Length
  2.  
  3. let crashSequence = crashMetaData
  4.                         |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
  5.                                                                                 | true -> 1
  6.                                                                                 | false -> 0),
  7.                                                                  (match crash.crashed with
  8.                                                                                 | true -> 0
  9.                                                                                 | false -> 1))
  10. let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1 ) cumulativeDevices crashSequence
  11.  
  12. let crashSequence' = Seq.zip crashSequence availableDeviceSequence
  13.                             |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)

 

In the REPL:

image

The next two columns were a snap –> they were just calculations based on the existing values:

  1. let cumulativeDevices = crashMetaData.Length
  2.  
  3. let crashSequence = crashMetaData
  4.                         |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
  5.                                                                                 | true -> 1
  6.                                                                                 | false -> 0),
  7.                                                                  (match crash.crashed with
  8.                                                                                 | true -> 0
  9.                                                                                 | false -> 1))
  10. let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1 ) cumulativeDevices crashSequence
  11.  
  12. let crashSequence' = Seq.zip crashSequence availableDeviceSequence
  13.                             |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)
  14.  
  15. let crashSequence'' = crashSequence'
  16.                             |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c/ float cumld, 1.-(float c/ float cumld))

 

The last column was another cumulative calculation, so I added another accumulator and used Seq.scan and Seq.Zip:

  1. let cumulativeDevices = crashMetaData.Length
  2. let cumulativeSurvivalProbability = 1.
  3.  
  4. let crashSequence = crashMetaData
  5.                         |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
  6.                                                                                 | true -> 1
  7.                                                                                 | false -> 0),
  8.                                                                  (match crash.crashed with
  9.                                                                                 | true -> 0
  10.                                                                                 | false -> 1))
  11. let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1 ) cumulativeDevices crashSequence
  12.  
  13. let crashSequence' = Seq.zip crashSequence availableDeviceSequence
  14.                             |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)
  15.  
  16. let crashSequence'' = crashSequence'
  17.                             |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c/ float cumld, 1.-(float c/ float cumld))
  18.  
  19. let survivalProbabilitySequence = Seq.scan(fun cumulativeSurvivalProbability (t,c,nc,cumld,dp,sp) -> cumulativeSurvivalProbability * sp ) cumulativeSurvivalProbability crashSequence''
  20. let survivalProbabilitySequence' = survivalProbabilitySequence
  21.                                             |> Seq.skip 1

The last step was to map all of the columns and only output what was in the article.  The final answer is:

  1. namespace ChickenSoftware.SurvivalAnalysis
  2.  
  3. type CrashMetaData = {userId: int; crashTime: int; crashed: bool}
  4. type public SurvivalProbabilityData = {crashTime: int; survivalProbability: float}
  5.  
  6. type KaplanMeierEstimator() =
  7.     member this.CalculateSurvivalProbability(crashMetaData: array<CrashMetaData>) =
  8.             let cumulativeDevices = crashMetaData.Length
  9.             let cumulativeSurvivalProbability = 1.
  10.  
  11.             let crashSequence = crashMetaData
  12.                                     |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
  13.                                                                                             | true -> 1
  14.                                                                                             | false -> 0),
  15.                                                                              (match crash.crashed with
  16.                                                                                             | true -> 0
  17.                                                                                             | false -> 1))
  18.             let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1 ) cumulativeDevices crashSequence
  19.  
  20.             let crashSequence' = Seq.zip crashSequence availableDeviceSequence
  21.                                         |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)
  22.  
  23.             let crashSequence'' = crashSequence'
  24.                                         |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c/ float cumld, 1.-(float c/ float cumld))
  25.  
  26.             let survivalProbabilitySequence = Seq.scan(fun cumulativeSurvivalProbability (t,c,nc,cumld,dp,sp) -> cumulativeSurvivalProbability * sp ) cumulativeSurvivalProbability crashSequence''
  27.             let survivalProbabilitySequence' = survivalProbabilitySequence
  28.                                                         |> Seq.skip 1
  29.  
  30.             let crashSequence''' = Seq.zip crashSequence'' survivalProbabilitySequence'
  31.                                         |> Seq.map(fun ((t,c,nc,cumld,dp,sp),cumlsp) -> t,c,nc,cumld,dp,sp,cumlsp)
  32.             crashSequence'''
  33.                     |> Seq.filter(fun (t,c,nc,cumld,dp,sp,cumlsp) -> c=1 )
  34.                     |> Seq.map(fun (t,c,nc,cumld,dp,sp,cumlsp) -> t,System.Math.Round(cumlsp,3))

image

And this matches the article (almost exactly).  The article also has a row for iteration zero, which I did not bake in.  Instead of fixing my code, I changed the unit test and removed that first row.  In any event, I ran the test and it ran red –> but the values are identical, so I assume it is a problem with the Assert.AreSame() function (AreSame checks reference equality, so two structurally identical collections will still fail it).  I would take the time to figure it out, but it is 75 degrees on a Sunday afternoon and I want to go play catch with my kids…

image

Note it also matches the other data set Topol has in the article:

image

In any event, this code reads pretty much the way I was thinking about the problem – each column of the Excel spreadsheet has a 1 to 1 correspondence to the F# code block.   I did use explanatory variables liberally which might offend the more advanced functional programmers but taking each step in turn really helped me focus on getting each step correct before going to the next one.

1) I had to offset the cumulativeSurvivalProbability by one because the calculation is how many crashed on a day compared to how many were working at the start of the day.  The Seq.Scan increments the counter for the next row of the sequence and I need it for the current row.  Perhaps there is an overload for Seq.Scan?

2) I adopted the functional convention of using ticks to denote different physical manifestations of the same logic concept (crashedDeviceSequence “became” crashedDeviceSequence’, etc…).  Since everything is immutable by default in F#, this kind of naming convention makes a lot of sense to me.  However, I can see it quickly becoming unwieldy.

3) I could not figure out how to operate on the base tuple so instead  I used a couple of supporting Sequences and then put everything together using Seq.Zip.  I assume there is a more efficient way to do that.

4) One of the knocks against functional/scientific programming is that values are named poorly.  To combat that, I used the full names in my tuples to start.  After a certain point, though, the names got too unwieldy, so I resorted to their initials.  I am not sure what the right answer is here, or even if there is a right answer.

Apriori Algorithm and F# Using Elevator Inspection Data

Now that I have the elevator dataset in a workable state, I wanted to see what I could see with the data.  I was reading Machine Learning In Action and the author suggested an Apriori algorithm as a way to quantify associations among data points.  I read both Harrington's code and Wikipedia's description and I found both to be impenetrable – the former because the code was unreadable and the latter because the mathematical formulas depended on a level of algebra that I don't have.

Fortunately, I found a C# project on Codeproject that had both an excellent example/introduction and C# code.  I used the examples on the website to formulate my F# implementation.

The first thing I did was create a class that matched the 1st grid in the example

image

  1. namespace ChickenSoftware.ElevatorChicken.Analysis
  2.  
  3. open System.Collections.Generic
  4.  
  5. type Transaction = {TID: string; Items: List<string> }
  6.  
  7. type Apriori(database: List<Transaction>, support: float, confidence: float) =
  8.     member this.Database = database
  9.     member this.Support = support
  10.     member this.Confidence = confidence

Note that because F# values are immutable by default, the properties are read-only.  I then created a unit test project that makes sure the constructor works without exceptions.  The data matches the example:

  1. public AprioriTests()
  2. {
  3.     var database = new List<Transaction>();
  4.     database.Add(new Transaction("100", new List<string>() { "A", "C", "D" }));
  5.     database.Add(new Transaction("200", new List<string>() { "B", "C", "E" }));
  6.     database.Add(new Transaction("300", new List<string>() { "A", "B", "C", "E" }));
  7.     database.Add(new Transaction("400", new List<string>() { "B", "E" }));
  8.  
  9.     _apriori = new Apriori(database, .5, .80);
  10.  
  11. }
  12.  
  13. [TestMethod]
  14. public void ConstructorUsingValidArguments_ReturnsExpected()
  15. {
  16.     Assert.IsNotNull(_apriori);
  17. }

I then needed a function to count up all of the items in the item-sets.  I refused to use loops, so I first started using Seq.Fold, but I was having zero luck because I was trying to fold a Seq of List.  I then started experimenting with other functions when I found Seq.Collect, which was perfect (it maps each element to a collection and flattens the results into one sequence).  So I created a function like this:

  1. member this.GetC1() =
  2.     database
  3.  
  4. member this.GetL1() =
  5.     let numberOfTransactions = this.GetC1().Count
  6.  
  7.     this.GetC1()
  8.         |> Seq.collect(fun d -> d.Items)
  9.         |> Seq.countBy(fun i -> i)
  10.         |> Seq.map(fun (t,i) -> t, i, float i/ float numberOfTransactions)
  11.         |> Seq.filter(fun (t,i,p) -> p >= support)
  12.         |> Seq.map(fun (t,i,p) -> t,i)
  13.         |> Seq.sort
  14.         |> Seq.toList

Note that the numberOfTransactions is for the database, not the individual items in the List<Item>.  And the results match the example:

imageimage
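As a sanity check on the support filter in GetL1: with the four-transaction example database and support = 0.5, an item has to appear in at least two transactions to survive.  A quick sketch using the raw counts from that database:

    // Raw counts from the example database: A appears in 2 transactions, D in only 1, etc.
    let numberOfTransactions = 4
    let support = 0.5
    let rawCounts = [ "A", 2; "B", 3; "C", 3; "D", 1; "E", 3 ]
    let l1 =
        rawCounts
        |> List.map (fun (item, count) -> item, count, float count / float numberOfTransactions)
        |> List.filter (fun (_, _, pct) -> pct >= support)
        |> List.map (fun (item, count, _) -> item, count)
    // l1 = [("A", 2); ("B", 3); ("C", 3); ("E", 3)]; "D" is pruned at 1/4 = 0.25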

So this is great.  My next step was to build a list of pair combinations of the remaining values:

image

The trick is that it is not a Cartesian join of the original sets – only the surviving sets are needed.  My first attempt looked like this:

  1. let C1 = database
  2.  
  3. let L1 = C1
  4.         |> Seq.map(fun t -> t.Items)
  5.         |> Seq.collect(fun i -> i)
  6.         |> Seq.countBy(fun i -> i)
  7.         |> Seq.map(fun (t,i) -> t, i, float i/ float numberOftransactions)
  8.         |> Seq.filter(fun (t,i,p) -> p >= support)
  9.         |> Seq.toArray
  10. let C2A = L1
  11.             |> Seq.map(fun (x,y,z) -> x)
  12.             |> Seq.toArray
  13. let C2B = L1
  14.             |> Seq.map(fun (x,y,z) -> x)
  15.             |> Seq.toArray
  16. let C2 = C2A |> Seq.collect(fun x -> C2B |> Seq.map(fun y -> x+y))
  17. C2   

With the output like this:

image

I was running out of Saturday morning, so I went over to Stack Overflow and got a couple of responses.  I was on the right track with the concat, but I didn't think about the List.Filter(), which would prune my list.  With this in mind, I copied Mark's code and got what I was looking for:

  1. member this.GetC2() =
  2.     let l1Itemset = this.GetL1()
  3.                     |> Seq.map(fun (i,s) -> i)
  4.  
  5.     let itemset =
  6.         l1Itemset
  7.             |> Seq.map(fun x -> l1Itemset |> Seq.map(fun y -> (x,y)))
  8.             |> Seq.concat
  9.             |> Seq.filter(fun (x,y) -> x < y)
  10.             |> Seq.sort
  11.             |> Seq.toList         
  12.     
  13.     let listContainsItem(l:List<string>, a,b) =
  14.             l.Contains(a) && l.Contains(b)
  15.     
  16.     let someFunctionINeedToRename(l1:List<string>, l2)=
  17.             l2 |> Seq.map(fun (x,y) -> listContainsItem(l1,x,y))
  18.  
  19.     let itemsetMatches = this.GetC1()
  20.                             |> Seq.map(fun t -> t.Items)
  21.                             |> Seq.map(fun i -> someFunctionINeedToRename(i,itemset))
  22.  
  23.     let itemSupport = itemsetMatches
  24.                             |> Seq.map(Seq.map(fun i -> if i then 1 else 0))
  25.                             |> Seq.reduce(Seq.map2(+))
  26.  
  27.     itemSupport
  28.         |> Seq.zip(itemset)
  29.         |> Seq.toList

So now I have C2 filling correctly:

image

 

Taking the results, I needed to get L2.

image

That was much simpler than getting C2 –> here is the code:

  1. member this.GetL2() =
  2.     let numberOfTransactions = this.GetC1().Count
  3.     
  4.     this.GetC2()
  5.             |> Seq.map(fun (i,n) -> i,n,float n/float numberOfTransactions)
  6.             |> Seq.filter(fun (i,n,p) -> p >= support)
  7.             |> Seq.map(fun (t,i,p) -> t,i)
  8.             |> Seq.sort
  9.             |> Seq.toList    

And when I run it – it matches this example exactly:

image

Finally, I added in a C3 and L3.  This code is identical to the C2/L2 code with one exception: mapping a triple and not a pair.  The C2 code maps like this:

  1. let itemset =
  2.     l1Itemset
  3.         |> Seq.map(fun x -> l1Itemset |> Seq.map(fun y -> (x,y)))
  4.         |> Seq.concat
  5.         |> Seq.filter(fun (x,y) -> x < y)
  6.         |> Seq.sort
  7.         |> Seq.toList     

and the C3 code looks like this (took me 15 minutes to figure out line 3 below):

  1. let itemset =
  2.     l2Itemset
  3.         |> Seq.map(fun x -> l2Itemset |> Seq.map(fun y-> l2Itemset |> Seq.map(fun z->(fst x,fst y,snd z))))
  4.         |> Seq.concat
  5.         |> Seq.collect(fun d -> d)
  6.         |> Seq.filter(fun (x,y,z) -> x < y && y < z)
  7.         |> Seq.distinct
  8.         |> Seq.sort
  9.         |> Seq.toList    

With the C3 and L3 matching the example also:

image

image

 

I was now ready to put the elevator data into the analysis.  I think I am getting better at F# because I did the mapping, filtering, and transformation of the data from the server without looking at any other material, and it took only 15 minutes.

  1. type public ElevatorBuilder() =
  2.     let connectionString = ConfigurationManager.ConnectionStrings.["localData2"].ConnectionString;
  3.  
  4.     member public this.GetElevatorTransactions() =
  5.         let transactions = this.GetElevators()
  6.                               |> Seq.map(fun e ->this.ConvertElevatorToTransaction(e))
  7.         let transactionsList = new System.Collections.Generic.List<Transaction>(transactions)
  8.         transactionsList
  9.  
  10.     member public this.ConvertElevatorToTransaction(i: string, t:string, c:string, s:string) =
  11.         let items = new System.Collections.Generic.List<String>()
  12.         items.Add(t)
  13.         items.Add(c)
  14.         items.Add(s)
  15.         let transaction = {TID=i; Items=items}
  16.         transaction
  17.  
  18.     member public this.GetElevators () =
  19.         SqlConnection.GetDataContext(connectionString).ElevatorData201402
  20.             |> Seq.map(fun e -> e.ID, e.EquipType,e.Capacity,e.Speed)
  21.             |> Seq.filter(fun (i,et,c,s) -> not(String.IsNullOrEmpty(et)))
  22.             |> Seq.filter(fun (i,et,c,s) -> c.HasValue)
  23.             |> Seq.filter(fun (i,et,c,s) -> s.HasValue)
  24.             |> Seq.map(fun (i,t,c,s) -> i, this.CatagorizeEquipmentType(t),c,s)
  25.             |> Seq.map(fun (i,t,c,s) -> i,t,this.CatagorizeCapacity(c.Value),s)
  26.             |> Seq.map(fun (i,t,c,s) -> i,t,c,this.CatagorizeSpeed(s.Value))
  27.             |> Seq.map(fun (i,t,c,s) -> i.ToString(),t,c,s)

The longest part was aggregating the free-form text of the Equipment Type field (here is a partial snip; you get the idea…):

  1. member public this.CatagorizeEquipmentType(et: string) =
  2.     match et.Trim() with
  3.         | "OTIS" -> "OTIS"
  4.         | "OTIS (1-2)" -> "OTIS"
  5.         | "OTIS (2-1)" -> "OTIS"
  6.         | "OTIS hydro" -> "OTIS"
  7.         | "OTIS, HYD" -> "OTIS"
  8.         | "OTIS/ ASHEVILLE " -> "OTIS"
  9.         | "OTIS/ MOUNTAIN " -> "OTIS"
  10.         | "OTIS/#1" -> "OTIS"
  11.         | "OTIS/#19 " -> "OTIS"

Assigning categories for speed and capacity was a snap using F#

  1. member public this.CatagorizeCapacity(c: int) =
  2.     let lowerBound = (c/25 * 25) + 1
  3.     let upperBound = lowerBound + 24
  4.     lowerBound.ToString() + "-" + upperBound.ToString()        
  5.  
  6. member public this.CatagorizeSpeed(s: int) =
  7.     let lowerBound = (s/50 * 50) + 1
  8.     let upperBound = lowerBound + 49
  9.     lowerBound.ToString() + "-" + upperBound.ToString()    
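To make the binning concrete, here is the same arithmetic pulled out as a plain function, checked with a couple of hypothetical inputs:

    // Same arithmetic as CatagorizeCapacity above, as a standalone function for the REPL.
    let catagorizeCapacity (c: int) =
        let lowerBound = (c / 25 * 25) + 1
        let upperBound = lowerBound + 24
        sprintf "%d-%d" lowerBound upperBound

    // catagorizeCapacity 2513 = "2501-2525"
    // catagorizeCapacity 2530 = "2526-2550"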

With this in hand, I created a console app that takes the 27K records and pushes them through the Apriori algorithm:

  1. private static void RunElevatorAnalysis()
  2. {
  3.     Stopwatch stopwatch = new Stopwatch();
  4.     stopwatch.Start();
  5.     ElevatorBuilder builder = new ElevatorBuilder();
  6.     var transactions = builder.GetElevatorTransactions();
  7.     stopwatch.Stop();
  8.     Console.WriteLine("Building " + transactions.Count + " transactions took: " + stopwatch.Elapsed.TotalSeconds);
  9.     var apriori = new Apriori(transactions, .1, .75);
  10.     var c2 = apriori.GetC2();
  11.     stopwatch.Reset();
  12.     stopwatch.Start();
  13.     var l1 = apriori.GetL1();
  14.     Console.WriteLine("Getting L1 took: " + stopwatch.Elapsed.TotalSeconds);
  15.     var l2 = apriori.GetL2();
  16.     Console.WriteLine("Getting L2 took: " + stopwatch.Elapsed.TotalSeconds);
  17.     var l3 = apriori.GetL3();
  18.     Console.WriteLine("Getting L3 took: " + stopwatch.Elapsed.TotalSeconds);
  19.     stopwatch.Stop();
  20.     Console.WriteLine("–L1");
  21.     foreach (var t in l1)
  22.     {
  23.         Console.WriteLine(t.Item1 + ":" + t.Item2);
  24.     }
  25.     Console.WriteLine("–L2");
  26.     foreach (var t in l2)
  27.     {
  28.         Console.WriteLine(t.Item1 + ":" + t.Item2);
  29.     }
  30.     Console.WriteLine("–L3");
  31.     foreach (var t in l3)
  32.     {
  33.         Console.WriteLine(t.Item1 + ":" + t.Item2);
  34.     }
  35. }

I then made an offering to the F# Gods and hit F5:

image

Doh!  The gods were not pleased.  I then went back to my initial filtering function and added a Seq.Take(25000) and the results:

image

So there are a couple of things to draw from this exercise.

1) The Apriori algorithm is the wrong technique for this dataset.  I had to bring the support way down (10%) to even get any readings.  Also, there is too much dispersion in the values.  This kind of algorithm works much better when each transaction has a variable number of items drawn from a small set of possible values, rather than a fixed number of items drawn from a large set of values.

2) Even so, how cool is this?  Compare the files needed just to make the C#/OO version work versus the F# version:

imageimage

And the total LOC is 539 for C# versus 120 for F# – and the F# can be optimized by using a better way to create the search and item-sets.  Hard-coding each level was a hack I did to get things working and to give me an understanding of how the Apriori algorithm works.  I bet this can be consolidated to well under 75 lines without sacrificing readability.

3) I think the StackOverflow exception is because I am doing a Cartesian join and then paring the result.  Using one of the other techniques suggested on SO will give much better results.

In any event, what a fun project!  I can't wait to optimize this and perhaps throw a different algorithm at the dataset in the coming weeks.

Analysis of Health Inspection Data using F#

As part of the TRINUG F#/Analytics SIG, I did a public records request from Wake County for all of the restaurant inspections in 2013.  If you are not familiar, the inspectors go out and then give a score to the restaurant.  The restaurant then has to display their score like this:

image

After some back and forth, I got the data as an Excel spreadsheet that looks like this

image

I then loaded the spreadsheet into SQL Server and exposed it as some OData endpoints.

  1. // GET odata/Restaurant
  2. [Queryable]
  3. public IQueryable<Restaurant> GetRestaurant()
  4. {
  5.     return db.Restaurants;
  6. }
  7.  
  8. // GET odata/Restaurant(5)
  9. [Queryable]
  10. public SingleResult<Restaurant> GetRestaurant([FromODataUri] int key)
  11. {
  12.     return SingleResult.Create(db.Restaurants.Where(restaurant => restaurant.Id == key));
  13. }

I then dove into the data to see if there were any interesting conclusions to be found.  Following my pattern of doing analytics using F# and unit testing using C#, I created a project with the following code:

  1. namespace ChickenSoftware.RestraurantChicken.Analysis
  2.  
  3. open System.Linq
  4. open System.Configuration
  5. open Microsoft.FSharp.Linq
  6. open Microsoft.FSharp.Data.TypeProviders
  7.  
  8. type internal SqlConnection = SqlEntityConnection<ConnectionStringName="azureData">
  9.  
  10. type public RestaurantAnalysis () =
  11.     
  12.     let connectionString = ConfigurationManager.ConnectionStrings.["azureData"].ConnectionString;

Note that I am using the connection string in two places – the 1st for the type provider to do its magic at design time and the 2nd for actually accessing the data at run time.  With that set up, the 1st question I had was "Is there seasonality in inspection scores like there is in traffic tickets?"  To that end, I created the following function:

  1. member public x.GetAverageScoreByMonth () =
  2.     SqlConnection.GetDataContext(connectionString).Restaurants
  3.         |> Seq.map(fun x -> x.InspectionDate.Value.Month, x.InspectionScore.Value)
  4.         |> Seq.groupBy(fun x -> fst x)
  5.         |> Seq.map(fun (x,y) -> (x,y |> Seq.averageBy snd))
  6.         |> Seq.map(fun (x,y) -> x, System.Math.Round(y,2))
  7.         |> Seq.toArray
  8.         |> Array.sort
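To show the grouping trick in isolation, here is a minimal sketch of the Seq.groupBy / Seq.averageBy pattern the function relies on, using made-up scores:

    let monthScores = [ (1, 95.0); (1, 97.0); (2, 96.0) ]   // (month, score)
    let monthlyAverages =
        monthScores
        |> Seq.groupBy fst                                   // key = month, value = seq of (month, score)
        |> Seq.map (fun (month, group) -> month, group |> Seq.averageBy snd)
        |> Seq.toList
    // monthlyAverages = [(1, 96.0); (2, 96.0)]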

This is pretty vanilla F# code, with the tricky part being the average by month (the Seq.groupBy and Seq.averageBy steps).  What the code is doing is grouping the 4,000 or so (month, score) tuples created by the Seq.map step into one tuple per month – with the fst being the groupBy key (in this case the month) and the snd being the sequence of (month, score) tuples for that month.  Then, by averaging the score element of each group, we get an average for each month.  I created a unit (really integration) test like so:

  1. [TestMethod]
  2. public void GetAverageScoreByMonth_ReturnsTwelveItems()
  3. {
  4.     var analysis = new RestaurantAnalysis();
  5.     var scores = analysis.GetAverageScoreByMonth();
  6.     Int32 expected = 12;
  7.     Int32 actual = scores.Length;
  8.     Assert.AreEqual(expected, actual);
  9. }

And the result ran green. 

image

Putting a break on the Assert and a watch on scores, you can see the values:

image

A couple of things stand out

1) The overall average is around 96 and change

2) There does not seem to be any significant variance among months.

Since I am also trying to teach myself D3, I then added an MVC5 project to my solution and added an analysis controller that calls the function in the analysis module and serves the results as JSON:

  1. public JsonResult AverageScoreByMonth()
  2. {
  3.     var analysis = new RestaurantAnalysis();
  4.     var scores = analysis.GetAverageScoreByMonth();
  5.     return Json(scores,JsonRequestBehavior.AllowGet);
  6. }

I then made a page with a simple D3 chart that calls this controller

  1. @{
  2.     Layout = "~/Views/Shared/_Layout.cshtml";
  3. }
  4.  
  5. <svg class="chart"></svg>
  6.  
  7. <style>
  8.     .bar {
  9.         fill: steelblue;
  10.     }
  11.  
  12.         .bar:hover {
  13.             fill: brown;
  14.         }
  15.  
  16.     .axis {
  17.         font: 10px sans-serif;
  18.     }
  19.  
  20.         .axis path,
  21.         .axis line {
  22.             fill: none;
  23.             stroke: #000;
  24.             shape-rendering: crispEdges;
  25.         }
  26.  
  27.     .x.axis path {
  28.         display: none;
  29.     }
  30. </style>
  31.  
  32.  
  33.  
  34. <script>
  35.  
  36.     var margin = { top: 20, right: 20, bottom: 30, left: 40 },
  37.         width = 960 - margin.left - margin.right,
  38.         height = 500 - margin.top - margin.bottom;
  39.  
  40.     var x = d3.scale.ordinal()
  41.         .rangeRoundBands([0, width], .1);
  42.  
  43.     var y = d3.scale.linear()
  44.         .range([height, 0]);
  45.  
  46.     var xAxis = d3.svg.axis()
  47.         .scale(x)
  48.         .orient("bottom");
  49.  
  50.     var yAxis = d3.svg.axis()
  51.         .scale(y)
  52.         .orient("left")
  53.         .ticks(10, "%");
  54.  
  55.     var svg = d3.select("body").append("svg")
  56.         .attr("width", width + margin.left + margin.right)
  57.         .attr("height", height + margin.top + margin.bottom)
  58.       .append("g")
  59.         .attr("transform", "translate(" + margin.left + "," + margin.top + ")");
  60.  
  61.  
  62.  
  63.     $.ajax({
  64.         url: "http://localhost:3057/Analysis/AverageScoreByMonth/",
  65.         dataType: "json",
  66.         success: function (data) {
  67.             x.domain(data.map(function (d) { return d.Item1; }));
  68.             y.domain([0, d3.max(data, function (d) { return d.Item2; })]);
  69.  
  70.             svg.append("g")
  71.                 .attr("class", "x axis")
  72.                 .attr("transform", "translate(0," + height + ")")
  73.                 .call(xAxis);
  74.  
  75.             svg.append("g")
  76.                 .attr("class", "y axis")
  77.                 .call(yAxis)
  78.               .append("text")
  79.                 .attr("transform", "rotate(-90)")
  80.                 .attr("y", 6)
  81.                 .attr("dy", ".71em")
  82.                 .style("text-anchor", "end")
  83.                 .text("Frequency");
  84.  
  85.             svg.selectAll(".bar")
  86.                 .data(data)
  87.               .enter().append("rect")
  88.                 .attr("class", "bar")
  89.                 .attr("x", function (d) { return x(d.Item1); })
  90.                 .attr("width", x.rangeBand())
  91.                 .attr("y", function (d) { return y(d.Item2); })
  92.                 .attr("height", function (d) { return height - y(d.Item2); });
  93.  
  94.         },
  95.         error: function (e) {
  96.             alert("error");
  97.         }
  98.     });
  99.  
  100.     function type(d) {
  101.         d.Item2 = +d.Item2;
  102.         return d;
  103.     }
  104. </script>

And when I run it, I get a run-of-the-mill bar chart.  (I did have to adjust the F# to shift the decimal two positions to the left so that I could match the scale of the chart's template; for me, it is easier to alter the F# than the JavaScript.)

image

Following this pattern, I did some other seasonal analysis, like the average by DayOfMonth:

image

and by DayOfWeek:

image

So there does not seem to be any seasonality in inspection scores.

I then did an average by inspector:

image

And there looks to be some variance, but it is getting lost in the scale of the chart.  The problem is that the range of the scores is not 0 to 100.

Here is a function that counts the number of scores (rounded to whole numbers):

  1. member public x.CountOfRoundedScores () =
  2.     SqlConnection.GetDataContext(connectionString).Restaurants
  3.         |> Seq.map(fun x -> System.Math.Round(x.InspectionScore.Value,0), x.InspectionID)
  4.         |> Seq.groupBy(fun x -> fst x)
  5.         |> Seq.map(fun (x,y) -> (x,y |> Seq.countBy snd))
  6.         |> Seq.map(fun (x,y) -> (x,y |> Seq.sumBy snd))
  7.         |> Seq.toArray

That graphically looks like:

image

So back to inspectors: I needed to adjust the scale so it ran from 80 to 100 instead of 0 to 100.  I also needed to remove the null inspection IDs, the records that were for the 'test facility', and the 6 records that were below 80.

  1. member public x.AverageScoreByInspector () =
  2.     SqlConnection.GetDataContext(connectionString).Restaurants
  3.         |> Seq.filter(fun x -> x.EstablishmentName <> "Test Facility")
  4.         |> Seq.filter(fun x -> x.InspectionScore.Value > 80.)
  5.         |> Seq.filter(fun x -> x.InspectionID <> null)
  6.         |> Seq.map(fun x -> x.InspectorID, x.InspectionScore.Value)
  7.         |> Seq.groupBy(fun x -> fst x)
  8.         |> Seq.map(fun (x,y) -> (x,y |> Seq.averageBy snd))
  9.         |> Seq.map(fun (x,y) -> x, y/100.)
  10.         |> Seq.map(fun (x,y) -> x, System.Math.Round(y,4))
  11.         |> Seq.toArray
  12.         |> Array.sort

I then adjusted the scale of the inspector graph to have a domain from 80 to 100 (versus 0 to 100), along with the scale of the y axis.  This was a good article explaining scales and domains in D3.

  1. var yAxis = d3.svg.axis()
  2.     .scale(y)
  3.     .orient("left")
  4.     .ticks(10);

  1. $.ajax({
  2.     url: "http://localhost:3057/Analysis/AverageScoreByInspector/",
  3.     dataType: "json",
  4.     success: function (data) {
  5.         x.domain(data.map(function (d) { return d.Item1; }));
  6.         y.domain([80, d3.max(data, function (d) { return d.Item2; })]);

and now there is a pretty good graph showing the variance among inspectors:

image

So the interesting thing is that #1168 is 2 below the average – which, on a domain of 10, is pretty significant.  Interestingly, 1168 is also the inspector who has all of the "Test Facility" records – so they are probably the trainer and/or lead inspector.  With this analysis in the back pocket, I ran a function that did the inspection score by establishment type:

image

This is kinda interesting (especially that pushcarts got the highest scores), but I wanted to see if there was any truth to the common perception that Chinese restaurants are less sanitary than other kinds of restaurants.  To that end, I created a rudimentary classifier that searches the name of the establishment to see if it contains a word that is typically associated with fast-food Chinese restaurants:

  1. member public x.IsEstablishmentAChineseRestraurant (establishmentName:string) =
  2.     let upperCaseEstablishmentName = establishmentName.ToUpper()
  3.     let numberOfMatchedWords = upperCaseEstablishmentName.Split(' ')
  4.                                 |> Seq.map(fun x -> match x with
  5.                                                         | "ASIA" -> 1
  6.                                                         | "ASIAN" -> 1
  7.                                                         | "CHINA" -> 1
  8.                                                         | "CHINESE" -> 1
  9.                                                         | "PANDA" -> 1
  10.                                                         | "PEKING" -> 1
  11.                                                         | "WOK" -> 1
  12.                                                         | _ -> 0)
  13.                                 |> Seq.sum
  14.     match numberOfMatchedWords with
  15.         | 0 -> false
  16.         | _ -> true

I then created a function that returned the average and ran my unit tests.

  1. [TestMethod]
  2. public void IsEstablishmentAChineseRestraurantUsingWOK_ReturnsTrue()
  3. {
  4.     var analysis = new RestaurantAnalysis();
  5.     String establishmentName = "JAMIE'S WOK";
  6.  
  7.     var expected = true;
  8.     var actual = analysis.IsEstablishmentAChineseRestraurant(establishmentName);
  9.     Assert.AreEqual(expected, actual);
  10. }
  11.  
  12. [TestMethod]
  13. public void IsEstablishmentAChineseRestraurantUsingWok_ReturnsTrue()
  14. {
  15.     var analysis = new RestaurantAnalysis();
  16.     String establishmentName = "Jamie's Wok";
  17.  
  18.     var expected = true;
  19.     var actual = analysis.IsEstablishmentAChineseRestraurant(establishmentName);
  20.     Assert.AreEqual(expected, actual);
  21. }
  22.  
  23. [TestMethod]
  24. public void AverageScoreForChineseRestaurants_ReturnsExpected()
  25. {
  26.     var analysis = new RestaurantAnalysis();
  27.     var actual = analysis.AverageScoreForChineseRestaurants();
  28.     Assert.IsNotNull(actual);
  29. }

When a break was put on the value of the average, it was apparent that Chinese restaurants scored significantly lower than the overall average of 96:

image

So then I applied 1 more segmentation: Chinese versus Non-Chinese scores by inspector:

  1. member public x.AverageScoresOfChineseAndNonChineseByInspector () =
  2.     let dataSet = SqlConnection.GetDataContext(connectionString).Restaurants
  3.                     |> Seq.map(fun x -> x.EstablishmentName, x.InspectorID,x.InspectionScore.Value)
  4.     let chineseRestraurants = dataSet
  5.                                 |> Seq.filter(fun (a,b,c) -> x.IsEstablishmentAChineseRestraurant(a))
  6.                                 |> Seq.map(fun (a,b,c) -> b,c)
  7.                                 |> Seq.groupBy(fun x -> fst x)
  8.                                 |> Seq.map(fun (x,y) -> (x,y |> Seq.averageBy snd))
  9.                                 |> Seq.map(fun (x,y) -> x, System.Math.Round(y,2))
  10.                                 |> Seq.toArray
  11.                                 |> Array.sort
  12.     let nonChineseRestraurants = dataSet
  13.                                 |> Seq.filter(fun (a,b,c) -> not(x.IsEstablishmentAChineseRestraurant(a)))
  14.                                 |> Seq.map(fun (a,b,c) -> b,c)
  15.                                 |> Seq.groupBy(fun x -> fst x)
  16.                                 |> Seq.map(fun (x,y) -> (x,y |> Seq.averageBy snd))
  17.                                 |> Seq.map(fun (x,y) -> x, System.Math.Round(y,2))
  18.                                 |> Seq.toArray
  19.                                 |> Array.sort
  20.     Seq.zip chineseRestraurants nonChineseRestraurants
  21.            |> Seq.map(fun ((a,b),(c,d)) -> a,b,d)
  22.            |> Seq.toList

And in graphics using a double-bar chart:

image

So this is kinda interesting.  The lead inspector (1168) who grades everyone lower actually gives Chinese restaurants higher marks.  Everyone else pretty much grades Chinese restaurants lower except for 1 inspector.  Also, 1708 must really not like Chinese restaurants – or their inspection list has a series of really bad Chinese restaurants.

Note that this may not be statistically significant (I didn’t control for sample size, etc..) – but further analysis might be warranted, no?  If you are interested, here is the endpoint: http://restaurantchicken.cloudapp.net/odata/Restaurant

Finally, when I presented this analysis to TRINUG last week, lots of people became interested in F# and analytics (ok, maybe 3).  You can see the comments here.  Also, I now have an appointment with the head of the health department and the CIO of Wake County later this week – let’s see what they say…

 

 

Traffic Stop Disposition: Classification Using F# and KNN

I have already looked at the summary statistics of the traffic stop data I received from the town here.  My next stop was to try and do a machine learning exercise with the data.  One of the more interesting questions I want to answer is what factors into whether a person gets a warning or a ticket (called the disposition)?  Of all of the factors that may be involved, the dataset that I have is fairly limited:

image_thumb1

Using dispositionId as the result variable, there is StopDateTime and Location (Latitude/Longitude).  Fortunately, DateTime can be decomposed into several input variables.  For this exercise, I wanted to use the following:

  • TimeOfDay
  • DayOfWeek
  • DayOfMonth
  • MonthOfYear
  • Location (Latitude:Longitude)

And the resulting variable is the disposition.  To make it easier for analysis, I limited the analysis set to records where the finalDisposition was either “verbal warning” or “citation”.  I decided to do a K-Nearest Neighbor because it is regarded as an easy machine learning algorithm to learn and the question does seem to be a classification problem.
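Concretely, decomposing the StopDateTime into those input variables looks something like this (just a sketch – the real version is done inline in BaseDataSet further down):

    // Sketch: pull the date/time based features (plus a rounded location key) out of one stop.
    let toFeatures (stopDateTime:System.DateTime) (latitude:float) (longitude:float) =
        stopDateTime.Hour,                          // TimeOfDay
        int stopDateTime.DayOfWeek,                 // DayOfWeek
        stopDateTime.Day,                           // DayOfMonth
        stopDateTime.Month,                         // MonthOfYear
        System.Math.Round(latitude,3).ToString() + ":" + System.Math.Round(longitude,3).ToString()  // Location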

My first step was to decide whether to write or borrow the KNN algorithm.  After looking at what kind of code would be needed to write my own and then looking at some other libraries, I decided to use Accord.Net.

My next step was to get the data via the web service I spun up here.

  1. namespace ChickenSoftware.RoadAlert.Analysis
  2.  
  3. open FSharp.Data
  4. open Microsoft.FSharp.Data.TypeProviders
  5. open Accord.MachineLearning
  6.  
  7. type roadAlert2 = JsonProvider<"http://chickensoftware.com/roadalert/api/trafficstopsearch/Sample">
  8. type MachineLearningEngine =
  9.     static member RoadAlertDoc = roadAlert2.Load("http://chickensoftware.com/roadalert/api/trafficstopsearch")

My next step was to filter the data to only verbal warnings (7) or citations (15).

  1.   static member BaseDataSet =
  2.       MachineLearningEngine.RoadAlertDoc
  3.           |> Seq.filter(fun x -> x.DispositionId = 7 || x.DispositionId = 15)
  4.           |> Seq.map(fun x -> x.Id, x.StopDateTime, x.Latitude, x.Longitude, x.DispositionId)
  5.           |> Seq.map(fun (a,b,c,d,e) -> a, b, System.Math.Round(c,3), System.Math.Round(d,3), e)
  6.           |> Seq.map(fun (a,b,c,d,e) -> a, b, c.ToString() + ":" + d.ToString(), e)
  7.           |> Seq.map(fun (a,b,c,d) -> a,b,c, match d with
  8.                                               |7 -> 0
  9.                                               |15 -> 1
  10.                                               |_ -> 1)
  11.           |> Seq.map(fun (a,b,c,d) -> a, b.Hour, b.DayOfWeek.GetHashCode(), b.Day, b.Month, c, d)
  12.           |> Seq.toList

You will notice that I had to transform the dispositionIds from 7 and 15 to 0 and 1.  The reason why is that the KNN method in Accord.Net assumes that the values match the index position in the array.  I had to dig into the source code of Accord.Net to figure that one out.
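A more general way to do that remapping (hypothetical – the actual code above just hard-codes the two values) is to build a 0-based index over whatever labels show up:

    // Hypothetical helper: map arbitrary label values (7, 15, ...) onto 0..n-1,
    // which (per the digging above) is the range the Accord.Net KNN expects.
    let encodeLabels (labels:int seq) =
        let classIndex =
            labels
            |> Seq.distinct
            |> Seq.sort
            |> Seq.mapi(fun index label -> label, index)
            |> Map.ofSeq
        labels |> Seq.map(fun label -> classIndex.[label]) |> Seq.toList

    // encodeLabels [7; 15; 7; 15] -> [0; 1; 0; 1]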

My next step was to divide the dataset in half: one half being the training sample and the other the validation sample:

  1. static member TrainingSample =
  2.     let midNumber = MachineLearningEngine.NumberOfRecords/ 2
  3.     MachineLearningEngine.BaseDataSet
  4.         |> Seq.filter(fun (a,b,c,d,e,f,g) -> a < midNumber)
  5.         |> Seq.toList
  6.  
  7. static member ValidationSample =
  8.     let midNumber = MachineLearningEngine.NumberOfRecords/ 2
  9.     MachineLearningEngine.BaseDataSet
  10.         |> Seq.filter(fun (a,b,c,d,e,f,g) -> a > midNumber)
  11.         |> Seq.toList

The next step was to actually run the KNN.  Before I could do that though, I had to create the distance function.  Since this was my 1st time, I dropped the geocoordinates and focused only on the date/time derivatives.

  1. static member RunKNN inputs outputs input =
  2.     let distanceFunction (a:int,b:int,c:int,d:int) (e:int,f:int,g:int,h:int) =  
  3.       let b1 = b * 4
  4.       let f1 = f * 4
  5.       let d1 = d * 2
  6.       let h1 = h * 2
  7.       float((pown(a-e) 2) + (pown(b1-f1) 2) + (pown(c-g) 2) + (pown(d1-h1) 2))
  8.  
  9.     let distanceDelegate =
  10.           System.Func<(int * int * int * int),(int * int * int * int),float>(distanceFunction)
  11.     
  12.     let knn = new KNearestNeighbors<int*int*int*int>(10,2,inputs,outputs,distanceDelegate)
  13.     knn.Compute(input)

You will notice I tried to normalize the values so that they all had the same basis.  They are not exact, but they are close.  You will also notice that I had to create a delegate from the distanceFunction (thanks to Mimo on SO).  This is because Accord.NET was written in C# with C# consumers in mind and F# has a couple of places where the interfaces are not as seamless as one would hope.
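Spelling that normalization out: after the weighting, each feature spans roughly the same range before it gets squared:

    // Approximate range of each feature after the weighting in distanceFunction:
    let hourRange       = (0, 23)          // Hour, unweighted
    let dayOfWeekRange  = (0 * 4, 6 * 4)   // DayOfWeek * 4  -> (0, 24)
    let dayOfMonthRange = (1, 31)          // Day, unweighted
    let monthRange      = (1 * 2, 12 * 2)  // Month * 2      -> (2, 24)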

In any event, once the KNN function was written, I wrote a function that took the validation sample, made a guess via KNN, and then reported the result:

  1. static member GetValidationsViaKKN  =
  2.     let inputs = MachineLearningEngine.TrainingInputClass
  3.     let outputs = MachineLearningEngine.TrainingOutputClass
  4.     let validations = MachineLearningEngine.ValidationClass
  5.  
  6.     validations
  7.         |> Seq.map(fun (a,b,c,d,e) -> e, MachineLearningEngine.RunKNN inputs outputs (a,b,c,d))
  8.         |> Seq.toList
  9.  
  10. static member GetSuccessPercentageOfValidations =
  11.     let validations = MachineLearningEngine.GetValidationsViaKKN
  12.     let matches = validations
  13.                     |> Seq.map(fun (a,b) -> match (a=b) with
  14.                                                 | true -> 1
  15.                                                 | false -> 0)
  16.  
  17.     let recordCount =  validations |> Seq.length
  18.     let numberCorrect = matches |> Seq.sum
  19.     let successPercentage = double(numberCorrect) / double(recordCount)
  20.     recordCount, numberCorrect, successPercentage

I then hopped over to my UI console app and looked at the success percentage.

 

  1. private static void GetSuccessPercentageOfValidations()
  2. {
  3.     var output = MachineLearningEngine.GetSuccessPercentageOfValidations;
  4.     Console.WriteLine(output.Item1.ToString() + ":" + output.Item2.ToString() + ":" + output.Item3.ToString());
  5. }

image

So there are 12,837 records in the validation sample and the classifier guessed the correct disposition 9,001 times – a success percentage of 70%.

So it looks like there is something there.  However, it is not clear that this is a good classifier without further tests – specifically, seeing how the most common case performs when pushed through the classifier.  Also, I would assume that to make this a true ‘machine learning’ algorithm I would have to feed the results back into the distance function to see if I can alter it to get the success percentage higher.
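One quick check along those lines would be a majority-class baseline: always guess the most common disposition in the validation sample and see how close that alone gets to 70%.  A sketch (assuming the last element of each ValidationSample tuple is the 0/1 disposition produced in BaseDataSet):

    // Hypothetical baseline: success percentage from always guessing the most common disposition.
    static member BaselineSuccessPercentage =
        let outputs = MachineLearningEngine.ValidationSample
                        |> Seq.map(fun (a,b,c,d,e,f,g) -> g)
                        |> Seq.toList
        let majorityClass = outputs |> Seq.countBy id |> Seq.maxBy snd |> fst
        let numberCorrect = outputs |> Seq.filter(fun x -> x = majorityClass) |> Seq.length
        let recordCount = outputs |> Seq.length
        recordCount, numberCorrect, double(numberCorrect) / double(recordCount)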

One quick note about methodology – I used unit tests pretty extensively to understand how the KNN works.  I created a series of tests with some sample data to see how the function reacted.

  1. [TestMethod]
  2. public void TestKKN_ReturnsExpected()
  3. {
  4.  
  5.     Tuple<int, int, int, int>[] inputs = {
  6.         new Tuple<int, int, int, int>(1, 0, 15, 1),
  7.         new Tuple<int,int,int,int>(1,0,11,1)};
  8.     int[] outputs = { 1, 1 };
  9.  
  10.     var input = new Tuple<int, int, int, int>(1, 1, 1, 1);
  11.  
  12.     var output = MachineLearningEngine.RunKNN(inputs, outputs, input);
  13.  
  14. }

This was a big help to get me up and running (walking, really..)…

Traffic Stop Analysis Using F#

Now that I have the traffic stop services up and running, it is time to actually do something with the data.  The data set is all traffic stops in my town for 2012 with some limited information: date/time of the stop, the geolocation of the stop, and the final disposition of the stop.  The data looks like this:

image

My 1st step was to look at the Date/Time and see if there are any patterns in DayOfMonth, MonthOfYear, and TimeOfDay.  To that end, I spun up an F# project and added my 1st method that determines the total number of records in the dataset:

  1. type roadAlert = JsonProvider<"http://chickensoftware.com/roadalert/api/trafficstopsearch/Sample">
  2. type AnalysisEngine =
  3.     static member RoadAlertDoc = roadAlert.Load("http://chickensoftware.com/roadalert/api/trafficstopsearch")
  4.  
  5.     static member NumberOfRecords =
  6.         AnalysisEngine.RoadAlertDoc
  7.             |> Seq.length

Since I am a TDDer more than a REPLer, I went and wrote a covering unit test.

  1. [TestMethod]
  2. public void NumberOfRecords_ReturnsExpected()
  3. {
  4.     Int32 notExpected = 0;
  5.     Int32 actual = AnalysisEngine.NumberOfRecords;
  6.     Assert.AreNotEqual(notExpected, actual);
  7. }

A couple of things to note about this:

1) This is really an integration test, not a unit test.  I could have written the test like this:

  1. [TestMethod]
  2. public void NumberOfRecordsFor2012DataSet_ReturnsExpected()
  3. {
  4.     Int32 expected = 27778;
  5.     Int32 actual = AnalysisEngine.NumberOfRecords;
  6.     Assert.AreEqual(expected, actual);
  7. }

But that means I am tying the test to the specific data sample (in its current state) – and I don’t want to do that.

2) I am finding that my F# code has many more functions than the code written by other people – esp data scientists.  I think it has to do with contrasting methodologies.  Instead of spending time in the REPL with a small piece of code to get it right and then adding the code into the larger code base, I am writing very small pieces of code in the class and then using unit tests to get them right.  The upshot of that is that there are lots of small, independently testable pieces of code – I think this stems from my background of writing production apps that are for business problems and not for academic papers.  Also, I use classes in source files versus script files because I plan to plug the code into larger .NET applications that will be written in C# and/or VB.NET.

In any event, once I had the total number of records, I went to see how they broke down by month:

  1. static member ActualTrafficStopsByMonth =
  2.     AnalysisEngine.RoadAlertDoc
  3.         |> Seq.map(fun x -> x.StopDateTime.Month)
  4.         |> Seq.countBy(fun x-> x)
  5.         |> Seq.toList

  1. [TestMethod]
  2. public void ActualTrafficStopsByMonth_ReturnsExpected()
  3. {
  4.     Int32 notExpected = 0;
  5.     var stops = AnalysisEngine.ActualTrafficStopsByMonth;
  6.     Assert.AreNotEqual(notExpected, stops.Length);
  7.  
  8. }

 

I then created a function that shows the expected number of stops by month.  Pattern matching with F# makes creating the month list a snap.  Note that this is a true unit test because I am not dependent on external data:

  1. static member Months =
  2.     let monthList = [1..12]
  3.     Seq.map (fun x ->
  4.             match x with
  5.                 | 1 | 3 | 5 | 7 | 8 | 10 | 12 -> x,31,31./365.
  6.                 | 2 -> x,28,28./365.
  7.                 | 4 | 6 | 9 | 11 -> x,30, 30./365.
  8.                 | _ -> x,0,0.                    
  9.         ) monthList
  10.     |> Seq.toList   

  1. static member ExpectedTrafficStopsByMonth numberOfStops =
  2.     AnalysisEngine.Months
  3.         |> Seq.map(fun (x,y,z) ->
  4.             x, int(z*numberOfStops))
  5.         |> Seq.toList

  1. [TestMethod]
  2. public void ExpectedTrafficStopsByMonth_ReturnsExpected()
  3. {
  4.     var stops = AnalysisEngine.ExpectedTrafficStopsByMonth(27778);
  5.     double expected = 2359;
  6.     double actual =stops[0].Item2;
  7.  
  8.     Assert.AreEqual(expected, actual);
  9. }

With the actual and expected ready to go, I then put the two side by side:

  1. static member TrafficStopsByMonth =
  2.     let numberOfStops = float(AnalysisEngine.NumberOfRecords)
  3.     let monthlyExpected = AnalysisEngine.ExpectedTrafficStopsByMonth numberOfStops
  4.     let monthlyActual = AnalysisEngine.ActualTrafficStopsByMonth
  5.     Seq.zip monthlyExpected monthlyActual
  6.         |> Seq.map(fun (x,y) -> fst x, snd x, snd y, snd y - snd x, (float(snd y) - float(snd x))/float(snd x))
  7.         |> Seq.toList

  1. [TestMethod]
  2. public void TrafficStopsByMonth_ReturnsExpected()
  3. {
  4.     var output = AnalysisEngine.TrafficStopsByMonth;
  5.     Assert.IsNotNull(output);
  6.  
  7. }

All of my unit tests ran green

image

so now I am ready to roll.  I created a quick console UI

  1. static void Main(string[] args)
  2. {
  3.     Console.WriteLine("Start");
  4.  
  5.     foreach (var tuple in AnalysisEngine.TrafficStopsByMonth)
  6.     {
  7.         Console.WriteLine(tuple.Item1 + ":" + tuple.Item2 + ":" + tuple.Item3 + ":" + tuple.Item4 + ":" + tuple.Item5);
  8.     }
  9.  
  10.     Console.WriteLine("End");
  11.     Console.ReadKey();
  12. }

image

With the output.  Obviously, a UX person could put some real pizzazz in front of this data, but that is something to do another day.  If you didn’t see it in the code above, the tuple is constructed as: Month, ExpectedStops, ActualStops, Difference, %Difference.  So the really interesting thing is that September was 47% higher than expected with December 26% less.  That kind of wide variation begs for more analysis.

I then did a similar analysis by DayOfMonth:

  1. static member ActualTrafficStopsByDay =
  2.     AnalysisEngine.RoadAlertDoc
  3.         |> Seq.map(fun x -> x.StopDateTime.Day)
  4.         |> Seq.countBy(fun x-> x)
  5.         |> Seq.toList
  6.  
  7. static member Days =
  8.     let dayList = [1..31]
  9.     Seq.map (fun x ->
  10.             match x with
  11.                 | x when x < 29 -> x, 12, 12./365.
  12.                 | 29 | 30 -> x, 11, 11./365.
  13.                 | 31 -> x, 7, 7./365.
  14.                 | _ -> x, 0, 0.                 
  15.         ) dayList
  16.     |> Seq.toList     
  17.  
  18. static member ExpectedTrafficStopsByDay numberOfStops =
  19.     AnalysisEngine.Days
  20.         |> Seq.map(fun (x,y,z) ->
  21.             x, int(z*numberOfStops))
  22.         |> Seq.toList    
  23.  
  24. static member TrafficStopsByDay =
  25.     let numberOfStops = float(AnalysisEngine.NumberOfRecords)
  26.     let dailyExpected = AnalysisEngine.ExpectedTrafficStopsByDay numberOfStops
  27.     let dailyActual = AnalysisEngine.ActualTrafficStopsByDay
  28.     Seq.zip dailyExpected dailyActual
  29.         |> Seq.map(fun (x,y) -> fst x, snd x, snd y, snd y - snd x, (float(snd y) - float(snd x))/float(snd x))
  30.         |> Seq.toList

image

The interesting thing is that there are higher than expected traffic stops in the last half of the month (esp the 25th and 26th) and much lower in the 1st part of the month.

And by TimeOfDay

  1. static member ActualTrafficStopsByHour =
  2.     AnalysisEngine.RoadAlertDoc
  3.         |> Seq.map(fun x -> x.StopDateTime.Hour)
  4.         |> Seq.countBy(fun x-> x)
  5.         |> Seq.toList
  6.  
  7. static member Hours =
  8.     let hourList = [1..24]
  9.     Seq.map (fun x ->
  10.                 x,1, 1./24.
  11.         ) hourList
  12.     |> Seq.toList     
  13.  
  14. static member ExpectedTrafficStopsByHour numberOfStops =
  15.     AnalysisEngine.Hours
  16.         |> Seq.map(fun (x,y,z) ->
  17.             x, int(z*numberOfStops))
  18.         |> Seq.toList    
  19.  
  20. static member TrafficStopsByHour =
  21.     let numberOfStops = float(AnalysisEngine.NumberOfRecords)
  22.     let hourlyExpected = AnalysisEngine.ExpectedTrafficStopsByHour numberOfStops
  23.     let hourlyActual = AnalysisEngine.ActualTrafficStopsByHour
  24.     Seq.zip hourlyExpected hourlyActual
  25.         |> Seq.map(fun (x,y) -> fst x, snd x, snd y, snd y - snd x, (float(snd y) - float(snd x))/float(snd x))
  26.         |> Seq.toList

image

 

The interesting thing here is that there is a much higher than expected number of traffic stops from 1-2 AM (61% and 123%) with significantly fewer between 8 PM and midnight.  Finally, I looked at the GPS location for the stops.

  1. static member ActualTrafficStopsByGPS =  
  2.     AnalysisEngine.RoadAlertDoc
  3.         |> Seq.map(fun x -> System.Math.Round(x.Latitude,3).ToString() + ":" + System.Math.Round(x.Longitude,3).ToString())
  4.         |> Seq.countBy(fun x-> x)
  5.         |> Seq.sortBy snd
  6.         |> Seq.toList
  7.         |> List.rev
  8.  
  9. static member GetVarianceOfTrafficStopsByGPS =
  10.     let trafficStopList = AnalysisEngine.ActualTrafficStopsByGPS
  11.                             |> Seq.map(fun x -> double(snd x))
  12.                             |> Seq.toList
  13.     AnalysisEngine.Variance(trafficStopList)
  14.  
  15. static member GetAverageOfTrafficStopsByGPS =
  16.     AnalysisEngine.ActualTrafficStopsByGPS
  17.         |> Seq.map(fun x -> double(snd x))
  18.         |> Seq.average

 

You can see that I rounded the Latitude and Longitude to 3 decimal places.  Using Wikipedia, which says that 4 decimals at 23N is 10.24M and at 45N is 7.87M for longitude, I imputed that at 35N it is about 8.94M.  With 1 M = 3.28 feet, that means that 4 decimals is within about 30 feet, 3 decimals is within about 300 feet, and 2 decimals is within about 3,000 feet.  300 feet seems like a good compromise so I ran with that.
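Here is the back-of-the-envelope arithmetic behind those numbers (interpolating the two Wikipedia figures cited above):

    // Linear interpolation of meters-per-0.0001-degree between 23N and 45N, then feet.
    let metersAt35N     = 10.24 + (35.0 - 23.0) / (45.0 - 23.0) * (7.87 - 10.24)  // ~8.95 m, close to the 8.94 above
    let feetAt4Decimals = metersAt35N * 3.28                                      // ~29 ft
    let feetAt3Decimals = feetAt4Decimals * 10.0                                  // ~293 ft
    let feetAt2Decimals = feetAt3Decimals * 10.0                                  // ~2,930 ft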

So running the average and variance and the top GPS locations:

image

With an average of 11 stops per GPS location (less than 1 a month) and a variance of 725, there does not seem to be a strong relationship between GPS location and traffic stops.

The upshot of all of this analysis seems to be that, to avoid getting stopped, it is less important where you are than when you are.  This is confirmed anecdotally too – the Town actually broadcasts on Twitter and the like when they will have heightened traffic surveillance.  Ignore open data at your own risk.

In any event, my next step is to run this data through a machine-learning algorithm to see if there is anything else to uncover.

Correlation Between Recruit Rankings and Final Standings in Big Ten Football

Following up on my last post about screen scraping college football in F#, I took the next step and analyzed the data that I scraped.  I am a big believer in Domain Specific Languages, so ‘Rankings’ means the ranking assigned by Rivals for how well a school recruits players.  ‘Standings’ means the final position in the Big Ten after the games have been played.  Rankings are for recruiting and standings are for actually playing the games.

Going back to the code, the 1st thing I did was to separate the Standings call from the search for a given school – so that the XmlDocument is loaded once and then searched several times versus loading it for each search.  This improved performance dramatically:

  1. static member getAnnualConferenceStandings(year:int)=
  2.     let url = "http://espn.go.com/college-football/conferences/standings/_/id/5/year/"+year.ToString()+"/big-ten-conference";
  3.     let request = WebRequest.Create(Uri(url))
  4.     use response = request.GetResponse()
  5.     use stream = response.GetResponseStream()
  6.     use reader = new IO.StreamReader(stream)
  7.     let htmlString = reader.ReadToEnd()
  8.     let divMarkerStartPosition = htmlString.IndexOf("my-teams-table");
  9.     let tableStartPosition = htmlString.IndexOf("<table",divMarkerStartPosition);
  10.     let tableEndPosition = htmlString.IndexOf("</table",tableStartPosition);
  11.     let data = htmlString.Substring(tableStartPosition, tableEndPosition- tableStartPosition+8)
  12.     let xmlDocument = new XmlDocument();
  13.     xmlDocument.LoadXml(data);
  14.     xmlDocument        
  15.  
  16. static member getSchoolStanding(xmlDocument: XmlDocument,school) =
  17.     let keyNode = xmlDocument.GetElementsByTagName("td")
  18.                         |> Seq.cast<XmlNode>
  19.                         |> Seq.find (fun node -> node.InnerText = school)
  20.     let valueNode = keyNode.NextSibling
  21.     let returnValue = (keyNode.InnerText, valueNode.InnerText)
  22.     returnValue
  23.  
  24. static member getConferenceStandings(year:int) =
  25.     let xmlDocument = RankingProvider.getAnnualConferenceStandings(year)
  26.     Seq.map(fun school -> RankingProvider.getSchoolStanding(xmlDocument,school)) RankingProvider.schools
  27.         |> Seq.sortBy snd
  28.         |> Seq.toList
  29.         |> List.rev
  30.         |> Seq.mapi(fun index (school,ranking) -> school, index+1)
  31.         |> Seq.sortBy fst
  32.         |> Seq.toList

Thanks to Valera Kolupaev for showing me how to use mapi to create a tuple from the list of schools and what rank they were in the list in getConferenceStandings().
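In case the mapi trick is not obvious, here is a tiny illustration (made-up standings, not real data):

    // Seq.mapi hands each element its position, which becomes the standing.
    ["Ohio State"; "Michigan State"; "Wisconsin"]
        |> Seq.mapi(fun index school -> school, index + 1)
        |> Seq.toList
    // [("Ohio State", 1); ("Michigan State", 2); ("Wisconsin", 3)]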

I then went to the rankings call and added a way to parse down only the schools I am interested in.  That way I can compare individual schools, groups of schools, or the entire conference:

  1. static member getConferenceRankings(year) =
  2.     RankingProvider.schools
  3.             |> Seq.map(fun schoolName -> RankingProvider.getSchoolInSequence(year, schoolName))
  4.             |> Seq.toList
  5.     
  6.  
  7. static member getSchoolInSequence(year, schoolName) =
  8.     RankingProvider.getRecrutRankings(year)
  9.                     |> Seq.find(fun (school,rank) -> school = schoolName)

After these two refactorings, my unit tests still ran green so I was ready to do the analysis.

image

I went out to my project of a couple of weeks ago for correlation and copied in the module.  The Correlation function takes in two lists of doubles.  The first list would be a school’s ranking and the second would be the standings:

  1. static member getCorrelationBetweenRankingsAndStandings(year, rankings, standings ) =
  2.     let ranks = Seq.map(fun (school,rank) -> rank) rankings
  3.     let stands = Seq.map(fun (school,standing) -> standing) standings
  4.     Calculations.Correlation(ranks,stands)
  5.  
  6. static member getCorrelation(year:int) =
  7.     let rankings = RankingProvider.getConferenceRankings year
  8.                     |> Seq.map(fun (school,rank) -> school,Convert.ToDouble(rank))
  9.     let standings = RankingProvider.getConferenceStandings(year+RankingProvider.yearDifferenceBetwenRankingsAndStandings)
  10.                     |> Seq.map(fun (school, standing) -> school, Convert.ToDouble(standing))
  11.     let correlation = RankingProvider.getCorrelationBetweenRankingsAndStandings(year,rankings, standings)
  12.     (year, correlation)

A couple of things to note:

1) This function assumes that both the rankings and the standings are the same length and are in order by school name.  A production application would check this as part of standard argument validation.

2) I used Convert.ToDouble() to change the Int32 of the ranking into the Double that the correlation function expects.  Having these .NET assemblies available at key points in the application really moved things along.
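A minimal sketch of the kind of check meant in point 1 (hypothetical – it is not in the code above):

    // Hypothetical guard: both sequences must cover the same schools in the same order.
    let validateAlignment (rankings:(string*double) seq) (standings:(string*double) seq) =
        let rankedSchools   = rankings  |> Seq.map fst |> Seq.toList
        let standingSchools = standings |> Seq.map fst |> Seq.toList
        if rankedSchools <> standingSchools then
            invalidArg "standings" "rankings and standings must contain the same schools in the same order"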

In any event, all that was left was to list the Big Ten schools to analyze, the number of years to analyze, and the year difference between the recruit rankings and the standings from the games they played in.

As a first step, I did all of the original Big Ten schools with 7 years of recruiting and a 1, 2, 3, 4 year difference (2002 rankings compared to the 2003, 2004, 2005, 2006 standings, etc…):

imageimage

imageimage

The average is .3303/.2650/.5138/.6065

And so yeah – there is a really strong correlation between a recruit ranking and the outcome on the field.  Also, the class seems to have the most impact senior year – which makes sense.  I don’t have a hypothesis on why it drops sophomore year – perhaps the ‘impact freshmen’ leave after 1 year?

Also of interest, the correlation does not seem to follow a normal distribution.  If you only look at the schools that have an emphasis on academics, the correlation drops significantly – to a negative correlation! 

imageimage

imageimage

The average is .1485/-.1446/-.2817/-.0381

So another great reason to create the new Big Ten – sometimes a really good recruiting class does not do well on the field and other times a poorly-ranked recruiting class does well on the field.  This kind of unpredictability is both exciting and probably much more likely to bring in the casual fans.

Based on this analysis, here is what is going to happen in the Big Ten next year:

  • Michigan State and Ohio State will be the leaders
  • Michigan and Penn State are in the best position to beat Michigan State and Ohio State

But you didn’t need a statistical analysis to tell you that.  The key surprises that this analysis tells you are that

  • Nebraska will have a significant improvement in the standings in 2014
  • Indiana will have a significant improvement in the standings in 2015 and 2016

As a final note, I got this after doing a bunch of requests to Yahoo:

image

image

 

So I wonder if I hit the page too many times and my IP was flagged as a bot?  I waited a day for the server to reset to finish my analysis.  Perhaps this is a case where I should get the data when the getting is good and take their pages and bring them locally?

 

The Big Ten and F#

I was talking to fellow TRINUGer David Green about football schools a couple of weeks ago.  He went to Northwestern and I went to Michigan and we were discussing the relative merits of universities doing well in football.  Assuming Goro was counting, on one hand, it is great to have a sport that can bring in tons of money to the school to fund non-football sports and activities, on the second hand it keeps alumni interested in their school, on the 3rd hand it can give locals a source of pride in the school, and on the last hand it can take the focus away from the other parts of the academic institution.

I then was talking to a professor at Ohio State University – she cares absolutely zero about the football team.  I made the comment that the smartest kids in Ohio don’t go to OSU.  They will go and root for their gladiators on Saturday but when it comes down to their academic and subsequent professional success, they look elsewhere.  She agreed.

Putting those two conversations together, it put OSU’s and MSU’s continued success in the Big Ten in context – as well as the inevitable bellyaching that those teams get the short stick when compared to the SEC.  For example, OSU and MSU both would be undefeated in the Ivy League in 2013 – does that mean they should be considered in the same conversation as Alabama and Auburn for the national championship?  I think the biggest problem that OSU and MSU have is that they are in the Big Ten – which historically has been about geography, academic success, and athletic competition (in that order).

Looking at the Big Ten schools, I pulled their most recent academic ranking from US News and World Report and their BCS ranking.  I then went over to MathIsFun to get the recipe for correlation:

image
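For reference, the recipe being implemented is the standard Pearson correlation coefficient:

    r = Σ((x − x̄)(y − ȳ)) / √( Σ(x − x̄)² × Σ(y − ȳ)² )

where x̄ and ȳ are the means of the two lists – which is exactly what the F# code below computes step by step.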

I then went over to Visual Studio and created a solution like so:

image

Learning from my last project, I created my unit test first to verify that the calculation is correct:

  1. [TestMethod]
  2. public void FindCorrelationUsingStandardInput_ReturnsExpectedValue()
  3. {
  4.     Double[] tempatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
  5.     Double[] sales = new Double[12] { 215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408 };
  6.  
  7.     Double expected = .9575;
  8.     Double actual = Calculations.Correlation(tempatures, sales);
  9.     Assert.AreEqual(expected, actual);
  10. }

I then hopped over to my working code and started coding:

  1. type Calculations() =
  2.     static member Correlation(x:IEnumerable<double>, y:IEnumerable<double>) =
  3.         let meanX = Seq.average x
  4.         let meanY = Seq.average y
  5.         
  6.         let a = Seq.map(fun x -> x-meanX) x
  7.         let b = Seq.map(fun y -> y-meanY) y
  8.  
  9.         let ab = Seq.zip a b
  10.         let abProduct = Seq.map(fun (a,b) -> a * b) ab
  11.  
  12.         let aSquare = Seq.map(fun a -> a * a) a
  13.         let bSquare = Seq.map(fun b -> b * b) b
  14.         
  15.         let abSum = Seq.sum abProduct
  16.         let aSquareSum = Seq.sum aSquare
  17.         let bSquareSum = Seq.sum bSquare
  18.  
  19.         let sums = aSquareSum * bSquareSum
  20.         let squareRootOfSums = sqrt(sums)
  21.  
  22.         abSum/squareRootOfSums

What I noticed is that those intermediate variables make the code much more wordy than it needs to be – so a mathematician might think that the code is too verbose – but a developer might appreciate that each step is laid out.  In fact, I would argue that a better component design would be to break out each of the steps into its own function that can be independently testable (and perhaps reused by other functions):

  1. [TestMethod]
  2. public void GetMeanUsingStandardInputReturnsExpectedValue()
  3. {
  4.     Double[] tempatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
  5.     Double expected = 18.675;
  6.     Double actual = Calculations.Mean(tempatures);
  7.     Assert.AreEqual(expected, actual);
  8. }
  9.  
  10. [TestMethod]
  11. public void GetBothMeansProductUsingStandardInputReturnsExpectedValue()
  12. {
  13.     Double[] tempatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
  14.     Double[] sales = new Double[12] { 215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408 };
  15.  
  16.     Double expected = 5325;
  17.     Double actual = Calculations.MeanProduct(tempatures);
  18.     Assert.AreEqual(expected, actual);
  19. }
  20.  
  21. [TestMethod]
  22. public void GetMeanSquareUsingStandardInputReturnsExpectedValue()
  23. {
  24.     Double[] tempatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
  25.     Double[] sales = new Double[12] { 215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408 };
  26.  
  27.     Double expected = 177;
  28.     Double actual = Calculations.MeanSquared(tempatures);
  29.     Assert.AreEqual(expected, actual);
  30. }

I’ll leave that implementation for another day as it is already getting late.  In any event, I ran the unit test and I got red (pink, really):

image

The spreadsheet rounded and my calculation does not.  I adjusted the unit test appropriately:

  1. [TestMethod]
  2. public void FindCorrelationUsingStandardInput_ReturnsExpectedValue()
  3. {
  4.     Double[] tempatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
  5.     Double[] sales = new Double[12] { 215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408 };
  6.  
  7.     Double correlation = Calculations.Correlation(tempatures, sales);
  8.     Double expected = .9575;
  9.  
  10.     Double actual = Math.Round(correlation, 4);
  11.     Assert.AreEqual(expected, actual);
  12. }

And now I am green:

image

So going back to the original question, I took the current Big Ten Schools and put their academic rankings and football rankings side by side:

image

 

I then made a revised Big Ten that had a much higher academic ranking based on schools that play in a power football conference but still maintain high academics.

image

Note that I left Penn State out of both of these lists b/c they have a NaN for their football ranking – but they certainly have a high enough academic score to be part of the revised Big Ten.

And then when I put those values through the correlation function via a Console UI:

  1. static void Main(string[] args)
  2. {
  3.     Console.WriteLine("Start");
  4.  
  5.     Double[] academicRanking = new Double[12] { 12,28,41,41,52,62,68,69,73,73,75,101 };
  6.     Double[] footballRanking = new Double[12] { 65,41,82,19,7,61,105,36,4,34,63,37 };
  7.  
  8.     Double originalCorrelation = Calculations.Correlation(academicRanking, footballRanking);
  9.     Console.WriteLine("Original BigTen Correlation {0}", originalCorrelation);
  10.  
  11.     academicRanking = new Double[10] { 7,12,17,18,23,23,28,30,41,41 };
  12.     footballRanking = new Double[10] { 24, 65, 32, 26, 94, 84, 41, 58, 82, 19 };
  13.     Double revisedCorrelation = Calculations.Correlation(academicRanking, footballRanking);
  14.     Console.WriteLine("Revised BigTen Correlation {0}", revisedCorrelation);
  15.  
  16.     
  17.     Console.WriteLine("End");
  18.     Console.ReadKey();
  19. }

I get:

image

And just looking at the data seems to support this.  There is a negative correlation between academics and football success in the current Big Ten – the higher the academics, the lower the football ranking, and vice versa.  In the revised Big Ten, there is a positive correlation of the same magnitude – higher academics go with higher (relative) football rankings.  Put another way, the new Big Ten has a much stronger academic ranking and pretty much the same football ranking.

Looking at a map, this new conference is like a doughnut with Ohio, West Virginia, and Kentucky in the middle.  Perhaps they can have a football championship sponsored by Krispy Kreme?  In any event, OSU and MSU are much closer academically and football-wise to the Alabamas and Auburns than the Northwesterns and Michigans of the world.  In terms of geographic proximity, Columbus, Ohio is closer to Tuscaloosa, AL than to Lincoln, NE.  So perhaps the OSU and MSU fans would be better served in a conference that is more aligned with their University’s priorities?  If they went undefeated or even had 1 loss, they would still be in the national championship discussion.