Association Rule Learning Via F# (Part 2)

Continuing on the path of re-writing the association rule learning code found in this month's MSDN, I started with the next function on the list:

MakeAntecedent

Here is the original C# code:

  1. public static int[] MakeAntecedent(int[] itemSet, int[] comb)
  2. {
  3.   // if item-set = (1 3 4 6 8) and combination = (0 2)
  4.   // then antecedent = (1 4)
  5.   int[] result = new int[comb.Length];
  6.   for (int i = 0; i < comb.Length; ++i)
  7.   {
  8.     int idx = comb[i];
  9.     result[i] = itemSet[idx];
  10.   }
  11.   return result;
  12. }

 

and the F# code:

  1. static member MakeAntecedent(itemSet:int[] , comb:int[]) =
  2.     comb |> Array.map(fun x -> itemSet.[x])

 

It is much easier to figure out what is going on via the F# code.  The function takes in 2 arrays: the first has the values, and the second has the indexes into the first array that are needed.  Using Array.map, I return an array where each index number is swapped out for the actual value.  The unit tests run green:

  1. [TestMethod]
  2. public void MakeAntecedentCSUsingExample_ReturnsExpectedValue()
  3. {
  4.     int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
  5.     int[] combo = new int[2] { 0, 2 };
  6.     int[] expected = new int[2] { 1, 4 };
  7.     var actual = CS.AssociationRuleProgram.MakeAntecedent(itemSet, combo);
  8.     Assert.AreEqual(expected.Length, actual.Length);
  9.     Assert.AreEqual(expected[0], actual[0]);
  10.     Assert.AreEqual(expected[1], actual[1]);
  11. }
  12.  
  13. [TestMethod]
  14. public void MakeAntecedentFSUsingExample_ReturnsExpectedValue()
  15. {
  16.     int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
  17.     int[] combo = new int[2] { 0, 2 };
  18.     int[] expected = new int[2] { 1, 4 };
  19.     var actual = FS.AssociationRuleProgram.MakeAntecedent(itemSet, combo);
  20.     Assert.AreEqual(expected.Length, actual.Length);
  21.     Assert.AreEqual(expected[0], actual[0]);
  22.     Assert.AreEqual(expected[1], actual[1]);
  23. }

 

MakeConsequent

Here is the original C# code:

  1. public static int[] MakeConsequent(int[] itemSet, int[] comb)
  2. {
  3.   // if item-set = (1 3 4 6 8) and combination = (0 2)
  4.   // then consequent = (3 6 8)
  5.   int[] result = new int[itemSet.Length - comb.Length];
  6.   int j = 0; // ptr into combination
  7.   int p = 0; // ptr into result
  8.   for (int i = 0; i < itemSet.Length; ++i)
  9.   {
  10.     if (j < comb.Length && i == comb[j]) // we are at an antecedent
  11.       ++j; // so continue
  12.     else
  13.       result[p++] = itemSet[i]; // at a consequent so add it
  14.   }
  15.   return result;
  16. }

 

Here is the F# Code:

  1. static member MakeConsequent(itemSet:int[] , comb:int[])=   
  2.     let isNotInComb x = not(Array.exists(fun elem -> elem = x) comb)
  3.     itemSet
  4.         |> Array.mapi(fun indexer value -> value,indexer )
  5.         |> Array.filter(fun (value,indexer) -> isNotInComb indexer)
  6.         |> Array.map(fun x -> fst x)

 

Again, it is easier to look at the F# code to figure out what is going on.  In this case, we have to take all of the items in the first array that are not referenced by the second array.  The trick is that the second array does not contain values to be checked, but rather index positions.  If you combine the antecedent and the consequent, you get back the entire original array.

This code took me a bit of time to figure out because I kept trying to use the out-of-the-box Array features (including slicing) in F# when it hit me that it would be much easier to create a tuple from the original array –> the value and the index.  I could then look up each index in the second array, confirm it is not there, and keep only the ones that are missing.  The map function at the end removes the index part of the tuple because it is not needed anymore.
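As an aside, here is a minimal alternative sketch of the same idea (my code, not McCaffrey's): build the full index range, drop the antecedent indices, and look the survivors up in the item-set.

    // Build every index, remove the ones named in comb, then resolve the rest to values.
    let makeConsequent (itemSet: int[]) (comb: int[]) =
        [| 0 .. itemSet.Length - 1 |]
        |> Array.filter (fun idx -> not (Array.exists ((=) idx) comb))
        |> Array.map (fun idx -> itemSet.[idx])

    // makeConsequent [| 1; 3; 4; 6; 8 |] [| 0; 2 |] returns [| 3; 6; 8 |]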

Sure enough, my unit tests ran green:

  1. [TestMethod]
  2. public void MakeConsequentCSUsingExample_ReturnsExpectedValue()
  3. {
  4.     int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
  5.     int[] combo = new int[2] { 0, 2 };
  6.     int[] expected = new int[3] { 3, 6, 8 };
  7.     var actual = CS.AssociationRuleProgram.MakeConsequent(itemSet, combo);
  8.     Assert.AreEqual(expected.Length, actual.Length);
  9.     Assert.AreEqual(expected[0], actual[0]);
  10.     Assert.AreEqual(expected[1], actual[1]);
  11.     Assert.AreEqual(expected[2], actual[2]);
  12. }
  13.  
  14. [TestMethod]
  15. public void MakeConsequentFSUsingExample_ReturnsExpectedValue()
  16. {
  17.     int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
  18.     int[] combo = new int[2] { 0, 2 };
  19.     int[] expected = new int[3] { 3, 6, 8 };
  20.     var actual = FS.AssociationRuleProgram.MakeConsequent(itemSet, combo);
  21.     Assert.AreEqual(expected.Length, actual.Length);
  22.     Assert.AreEqual(expected[0], actual[0]);
  23.     Assert.AreEqual(expected[1], actual[1]);
  24.     Assert.AreEqual(expected[2], actual[2]);
  25. }

 

IndexOf

I then decided to tackle the remaining three functions in reverse order because they depend on each other (CountInTrans –> IsSubsetOf –> IndexOf).  IndexOf did not have any code comments or example cases, but the C# code is clear:

  1. public static int IndexOf(int[] array, int item, int startIdx)
  2. {
  3.   for (int i = startIdx; i < array.Length; ++i)
  4.   {
  5.     if (i > item) return -1; // i is past where the target could possibly be
  6.     if (array[i] == item) return i;
  7.   }
  8.   return -1;
  9. }

 

What is even clearer is the F# code that does the same thing (yes, I am happy that FindIndex returns a -1 when not found, just like McCaffrey's code does):

  1. static member IndexOf(array:int[] , item:int, startIdx:int) =
  2.     Array.FindIndex(array, fun x -> x=item)
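One note: the one-liner above does not use the startIdx parameter.  If that ever needs to be honored the way it is in the C# version, System.Array.FindIndex has an overload that takes a start index; a minimal sketch (my naming, not McCaffrey's):

    // Honors the start index via the System.Array.FindIndex overload that takes one;
    // it still returns -1 when the item is not found.
    let indexOfFrom (array: int[]) (item: int) (startIdx: int) =
        System.Array.FindIndex(array, startIdx, fun x -> x = item)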

 

And I built some unit tests that run green and that I think reflect McCaffrey's intent:

  1. [TestMethod]
  2. public void IndexOfCSUsingExample_ReturnsExpectedValue()
  3. {
  4.     int[] itemSet = new int[4] { 0, 1, 4, 5 };
  5.     Int32 item = 1;
  6.     Int32 startIndx = 1;
  7.  
  8.     int expected = 1;
  9.     int actual = CS.AssociationRuleProgram.IndexOf(itemSet, item, startIndx);
  10.  
  11.     Assert.AreEqual(expected, actual);
  12. }
  13. [TestMethod]
  14. public void IndexOfFSUsingExample_ReturnsExpectedValue()
  15. {
  16.     int[] itemSet = new int[4] { 0, 1, 4, 5 };
  17.     Int32 item = 1;
  18.     Int32 startIndx = 1;
  19.  
  20.     int expected = 1;
  21.     int actual = FS.AssociationRuleProgram.IndexOf(itemSet, item, startIndx);
  22.  
  23.     Assert.AreEqual(expected, actual);
  24. }

 

IsSubsetOf

In the C# implementation, IndexOf is called to keep track of where the search is currently pointed. 

  1. public static bool IsSubsetOf(int[] itemSet, int[] trans)
  2. {
  3.   // 'trans' is an ordered transaction like [0 1 4 5 8]
  4.   int foundIdx = -1;
  5.   for (int j = 0; j < itemSet.Length; ++j)
  6.   {
  7.     foundIdx = IndexOf(trans, itemSet[j], foundIdx + 1);
  8.     if (foundIdx == -1) return false;
  9.   }
  10.   return true;
  11. }

In the F# version, that is not needed:

  1. static member IsSubsetOf(itemSet:int[] , trans:int[]) =
  2.     let isInTrans x = (Array.exists(fun elem -> elem = x) trans)
  3.     let filteredItemSet = itemSet
  4.                             |> Array.map(fun value -> value, isInTrans value)
  5.                             |> Array.filter(fun (value, trans) -> trans = false)
  6.     if filteredItemSet.Length = 0 then true
  7.         else false
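As a side note, a shorter equivalent sketch using Array.forall (same behavior, just more direct): every item in the item-set must appear somewhere in the transaction.

    // True only when every item-set element exists in the transaction.
    let isSubsetOf (itemSet: int[]) (trans: int[]) =
        itemSet |> Array.forall (fun item -> Array.exists ((=) item) trans)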

CountInTrans

Here is the original C# code, which uses the IsSubsetOf function:

  1. public static int CountInTrans(int[] itemSet, List<int[]> trans, Dictionary<int[], int> countDict)
  2. {
  3.    //number of times itemSet occurs in transactions, using a lookup dict
  4.  
  5.     if (countDict.ContainsKey(itemSet) == true)
  6.     return countDict[itemSet]; // use already computed count
  7.  
  8.   int ct = 0;
  9.   for (int i = 0; i < trans.Count; ++i)
  10.     if (IsSubsetOf(itemSet, trans[i]) == true)
  11.       ++ct;
  12.   countDict.Add(itemSet, ct);
  13.   return ct;
  14. }

And here is the F# code, which also uses that subfunction:

  1. static member CountInTrans(itemSet: int[], trans: List<int[]>, countDict: Dictionary<int[], int>) =
  2.     let trans' = trans |> Seq.map(fun value -> value, AssociationRuleProgram.IsSubsetOf (itemSet,value))
  3.     trans' |> Seq.filter(fun item -> snd item = true)
  4.            |> Seq.length

 

 GetHighConfRules

With the subfunctions created and running green, I then tackled the point of the exercise –> GetHighConfRules.  The C# implementation is pretty verbose and there are lots of things happening: 

  1.     public static List<Rule> GetHighConfRules(List<int[]> freqItemSets, List<int[]> trans, double minConfidencePct)
  2.     {
  3.       // generate candidate rules from freqItemSets, save rules that meet min confidence against transactions
  4.       List<Rule> result = new List<Rule>();
  5.  
  6.       Dictionary<int[], int> itemSetCountDict = new Dictionary<int[], int>(); // count of item sets
  7.  
  8.       for (int i = 0; i < freqItemSets.Count; ++i) // each freq item-set generates multiple candidate rules
  9.       {
  10.         int[] currItemSet = freqItemSets[i]; // for clarity only
  11.         int ctItemSet = CountInTrans(currItemSet, trans, itemSetCountDict); // needed for each candidate rule
  12.         for (int len = 1; len <= currItemSet.Length - 1; ++len) // antecedent len = 1, 2, 3, . .
  13.         {
  14.           int[] c = NewCombination(len); // a mathematical combination
  15.  
  16.           while (c != null) // each combination makes a candidate rule
  17.           {
  18.             int[] ante = MakeAntecedent(currItemSet, c);
  19.             int[] cons = MakeConsequent(currItemSet, c); // could defer this until known if needed
  20.           
  21.             int ctAntecendent = CountInTrans(ante, trans, itemSetCountDict); // use lookup if possible
  22.             double confidence = (ctItemSet * 1.0) / ctAntecendent;
  23.  
  24.             if (confidence >= minConfidencePct) // we have a winner!
  25.             {
  26.               Rule r = new Rule(ante, cons, confidence);
  27.               result.Add(r); // if freq item-sets are distinct, no dup rules ever created
  28.             }
  29.             c = NextCombination(c, currItemSet.Length);
  30.           } // while each combination
  31.         } // len each possible antecedent for curr item-set
  32.       } // i each freq item-set
  33.  
  34.       return result;
  35.     } // GetHighConfRules

In the F# code, I decided to work inside out and get the rules for a single item-set.  I think the code reads pretty clearly, with each step laid out:

  1. static member GetHighConfRules(freqItemSets:List<int[]>, trans:List<int[]>,  minConfidencePct:float) =
  2.     let returnValue = new List<Rule>()
  3.     freqItemSets
  4.         |> Seq.map (fun i -> i, AssociationRuleProgram.CountInTrans'(i,trans))
  5.         |> Seq.filter(fun (i,c) -> (float)c > minConfidencePct)
  6.         |> Seq.map(fun (i,mcp) -> i,mcp,AssociationRuleProgram.MakeAntecedent(i, trans.[0]))
  7.         |> Seq.map(fun (i,mcp,a) -> i,mcp, a, AssociationRuleProgram.MakeConsequent(i, trans.[0]))
  8.         |> Seq.iter(fun (i,mcp,a,c) -> returnValue.Add(new Rule(a,c,mcp)))
  9.     returnValue

I then attempted to put this block into a larger block (trans.[0]), but then I realized that I was going about this the wrong way.  Instead of using the C# code as my baseline, I need to approach the problem from a functional viewpoint.  That will be the subject of my blog next week…

Association Rule Learning Via F# (Part 1)

I was reading the most recent MSDN when I came across this article.  How awesome is this?  McCaffrey did a great job explaining a really interesting area of analytics and I am loving the fact that MSDN is including articles about data analytics.  When I was reading the article, I ran across this sentence: "The demo program is coded in C# but you should be able to refactor the code to other .NET languages such as Visual Basic or Iron Python without too much difficulty."  Iron Python?  Iron Python!  What about F#, the language that matches analytics the way peanut butter goes with chocolate?  Challenge accepted!

The first thing I did was to download his source code from here.  When I first opened the source code, I realized that the code would be a little bit hard to port because it is written from a scientific angle, not a business application point of view.  34 FxCop errors in 259 lines of code confirmed this:

image

Also, there are tons of comments, which I find very distracting.  I generally hate comments, but I figure that since it is an MSDN article and it is supposed to explain what is going on, comments are OK.  However, many of the comments can be refactored into more descriptive variable and method names.  For example:

imageimage

In any event, let's look at the code.  The first thing I did was change the CS project from a console app to a library and move the test data into another project.  I then moved the console code to the UI.  I also moved the Rule class code into its own file, made sure the namespaces matched, and made the AssociationRuleProgram public.  Yup, it still runs:

imageimage

So then I created a FSharp library in the solution and set up the class with the single method:

image

A couple of things to note:

1) I left the parameter naming the same, even though it is not particularly intention-revealing

2) F# is type-inferred, so I don't have to assign the types to the parameters

Next, I started looking at the supporting functions to GetHighConfRules.  Up first was the function called NewCombination.  Here is the side-by-side between the imperative style and the functional style:

imageimage

The next function, NextCombination, was more difficult for me to understand.  I stopped what I was doing and built a unit test project that proved correctness, using the commented examples as the expected values.  I used one test project for both the C# and F# projects so I could see both side by side.  An interesting side note is that the unit test naming is different than usual –> instead of naming the class XXXXTests where XXXX is the name of another class, XXXX is the function name that both classes are implementing:

So going back to the example,

image

I wrote two unit tests that match the two comments

image

When I ran the tests, the 1st test passed but the second did not:

image

The problem with the failing test is that null is not being returned, but rather {3,4,6}.  So now I have a problem: do I base the F# implementation on the code comments or the code itself?  I decided to base it on the code, because comments often lie but CODE DON'T LIE (thanks 'Sheed).  I adjusted the unit test and got green.

One of the reasons the code is pretty hard to read/understand is the use of 'i', 'j', 'k', and 'n' as variable names.  I went back to the article and McCaffrey explains what is going on at the bottom left of page 60.  Another name for the function 'NextCombination' could be 'GetLexicographicalSuccessor', and the variable 'n' could be called 'numberOfPossibleItems'.  With that mental vocabulary in place, I went through the function and divided it into 4 parts:

1) Checking to see if the value of the first element is of a certain length

image

2) Creating a result array that is seeded with the values of the input array

image

3) Looping backwards to identify the 1st number in the array that will be adjusted

image

4) From that target element, looping forward and adjusting all subsequent items

image

#1 I will not worry about now and #2 is not needed in F#, so #3 is the first place to start.  What I need is a way of splitting the array into two parts.  Part 1 has the original values that will not change and part 2 has the values that will change.  Seq.Take and Seq.Skip are perfect for this:

  1. let i = Array.LastIndexOf(comb,n)
  2. let i' = if i = -1 then 0 else i
  3. let comb' = comb |> Seq.take(i') |> Seq.toArray
  4. let comb'' = comb |> Seq.skip(i') |> Seq.toArray

Looking at #4, I now need to increment the values in part 2 by 1.  Seq.scan will work:

image

And then putting part 1 and part 2 back together via Array.Append, we have equivalence*:

imageimage

*Equivalence is defined by my unit tests, which both pass green.  I have no idea whether other inputs will work.  Note that the second unit test runs red, so I really think that the code is wrong and that the comment to return null is correct.  The value I am getting for (3;4;5)(5) is (3;4;1), which seems to make sense.

I am not crazy about these explanatory variables (comb’, comb’’, and comb’’’) but I am not sure how to combine them without sacrificing readability.  I definitely want to combine the i and i’ into 1 statement…
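Here is one sketch of folding that pair into a single binding (same Array.LastIndexOf call as above; -1 means "not found", which maps to a starting index of 0):

    // Pattern match on the lookup result instead of keeping two bindings.
    let i' =
        match Array.LastIndexOf(comb, n) with
        | -1 -> 0
        | i -> i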

I am not sure why Scan is returning 4 items in an array when I am passing in an array that has a length of 3.  I am running out of time today so I just hacked in a Seq.Take.
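(For what it is worth, Seq.scan always yields the initial accumulator as its first element and then one element per input item, which is why a 3-element input produces 4 results.  A tiny example:)

    // Running totals: the leading 0 is the seed; Seq.skip 1 drops it if it is not wanted.
    let runningTotals = [ 1; 2; 3 ] |> Seq.scan (+) 0 |> Seq.toList
    // runningTotals = [0; 1; 3; 6]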

I’ll continue this exercise in my blog next week.

 

Kaplan-Meier Survival Analysis Using F#

I was reading the most recent issue of MSDN a couple of days ago when I came across this article on doing a Kaplan-Meier survival analysis.  I thought the article was great and I am excited that MSDN is starting to publish articles on data analytics.  However, I did notice that there wasn’t any code in the article, which is odd, so I went to the on-line article and others had a similar question:

image

I decided to implement a Kaplan-Meier survival (KMS) analysis using F#.  After reading the article a couple of times, I was still a bit unclear on how the KMS is implemented, and there does not seem to be a pre-rolled implementation in the standard .NET stats libraries out there.  I went on over to this site, where there was an excellent description of how the survival probability is calculated.  I went ahead and built an Excel spreadsheet to match the NIH one and then compared it to what Topol is doing:

image

Notice that Topol censored the data for the article.  If we only cared about the probability of crashes, then we would not censor the data for when the device was turned off.
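For reference, the number the spreadsheet builds up is the standard Kaplan-Meier product: at each event time, the running survival probability is multiplied by (1 - crashes / devices still at risk).  A minimal sketch of that product as a fold, with made-up counts (the names are mine, not from the article):

    // (crashesAtTime, devicesAtRisk) pairs in time order -> cumulative survival probability
    let kaplanMeier (steps: (int * int) list) =
        steps |> List.fold (fun survival (crashes, atRisk) -> survival * (1.0 - float crashes / float atRisk)) 1.0

    // kaplanMeier [ (1, 10); (1, 5) ] = 0.9 * 0.8 = 0.72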

I was then ready to start coding, so I spun up a solution with an F# project for the analysis and a C# project for the testing.

image

I then loaded the dataset that Topol used into the unit test project:

  1. [TestMethod]
  2. public void EstimateForApplicationX_ReturnsExpected()
  3. {
  4.     var appX = new CrashMetaData[]
  5.     {
  6.         new CrashMetaData(0,1,false),
  7.         new CrashMetaData(1,5,true),
  8.         new CrashMetaData(2,5,false),
  9.         new CrashMetaData(3,8,false),
  10.         new CrashMetaData(4,10,false),
  11.         new CrashMetaData(5,12,true),
  12.         new CrashMetaData(6,15,false),
  13.         new CrashMetaData(7,18,true),
  14.         new CrashMetaData(8,21,false),
  15.         new CrashMetaData(9,22,true),
  16.     };
  17. }

I could then wire up the unit tests to compare the output to the article and what I had come up with.

  1. public void EstimateForApplicationX_ReturnsExpected()
  2. {
  3.     var appX = new CrashMetaData[]
  4.     {
  5.         new CrashMetaData(0,1,false),
  6.         new CrashMetaData(1,5,true),
  7.         new CrashMetaData(2,5,false),
  8.         new CrashMetaData(3,8,false),
  9.         new CrashMetaData(4,10,false),
  10.         new CrashMetaData(5,12,true),
  11.         new CrashMetaData(6,15,false),
  12.         new CrashMetaData(7,18,true),
  13.         new CrashMetaData(8,21,false),
  14.         new CrashMetaData(9,22,true),
  15.     };
  16.  
  17.     var expected = new SurvivalProbabilityData[]
  18.     {
  19.         new SurvivalProbabilityData(0,1.000),
  20.         new SurvivalProbabilityData(5,.889),
  21.         new SurvivalProbabilityData(12,.711),
  22.         new SurvivalProbabilityData(18,.474),
  23.         new SurvivalProbabilityData(22,.000)
  24.     };
  25.  
  26.     KaplanMeierEstimator estimator = new KaplanMeierEstimator();
  27.     var actual = estimator.CalculateSurvivalProbability(appX);
  28.  
  29.     Assert.AreSame(expected, actual);
  30. }

 

However, one of the neat features of F# is the REPL, so I don't need to keep running unit tests to prove correctness when I am proving out a concept.  So I added equivalent test code at the beginning of the F# project so I could try out my ideas in the REPL:

  1. type CrashMetaData = {userId: int; crashTime: int; crashed: bool}
  2.  
  3. type KaplanMeierAnalysis() =
  4.     member this.GenerateXAppData ()=
  5.                     [|  {userId=0; crashTime=1; crashed=false};{userId=1; crashTime=5; crashed=true};
  6.                         {userId=2; crashTime=5; crashed=false};{userId=3; crashTime=8; crashed=false};
  7.                         {userId=4; crashTime=10; crashed=false};{userId=5; crashTime=12; crashed=true};
  8.                         {userId=6; crashTime=15; crashed=false};{userId=7; crashTime=18; crashed=true};
  9.                         {userId=8; crashTime=21; crashed=false};{userId=9; crashTime=22; crashed=true}|]
  10.     
  11.     member this.RunAnalysis(crashMetaData: array<CrashMetaData>) =

The first thing I did was duplicate the 1st 3 columns of the Excel spreadsheet:

  1. let crashSequence = crashMetaData
  2.                         |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
  3.                                                                                 | true -> 1
  4.                                                                                 | false -> 0),
  5.                                                                  (match crash.crashed with
  6.                                                                                 | true -> 0
  7.                                                                                 | false -> 1))

 

In the REPL:

image

The fourth column is tricky because it is a cumulative calculation.  Instead of for..eaching in an imperative style, I took advantage of the functional language constructs to make the code much more readable.  Once I calculated that column outside of the base sequence, I added it back in via Seq.Zip:

  1. let cumulativeDevices = crashMetaData.Length
  2.  
  3. let crashSequence = crashMetaData
  4.                         |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
  5.                                                                                 | true -> 1
  6.                                                                                 | false -> 0),
  7.                                                                  (match crash.crashed with
  8.                                                                                 | true -> 0
  9.                                                                                 | false -> 1))
  10. let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1 ) cumulativeDevices crashSequence
  11.  
  12. let crashSequence' = Seq.zip crashSequence availableDeviceSequence
  13.                             |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)

 

In the REPL:

image

The next two columns were a snap –> they were just calculations based on the existing values:

  1. let cumulativeDevices = crashMetaData.Length
  2.  
  3. let crashSequence = crashMetaData
  4.                         |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
  5.                                                                                 | true -> 1
  6.                                                                                 | false -> 0),
  7.                                                                  (match crash.crashed with
  8.                                                                                 | true -> 0
  9.                                                                                 | false -> 1))
  10. let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1 ) cumulativeDevices crashSequence
  11.  
  12. let crashSequence' = Seq.zip crashSequence availableDeviceSequence
  13.                             |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)
  14.  
  15. let crashSequence'' = crashSequence'
  16.                             |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c/ float cumld, 1.-(float c/ float cumld))

 

The last column was another cumulative calculation, so I added another accumulator and used Seq.scan and Seq.Zip:

  1. let cumulativeDevices = crashMetaData.Length
  2. let cumulativeSurvivalProbability = 1.
  3.  
  4. let crashSequence = crashMetaData
  5.                         |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
  6.                                                                                 | true -> 1
  7.                                                                                 | false -> 0),
  8.                                                                  (match crash.crashed with
  9.                                                                                 | true -> 0
  10.                                                                                 | false -> 1))
  11. let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1 ) cumulativeDevices crashSequence
  12.  
  13. let crashSequence' = Seq.zip crashSequence availableDeviceSequence
  14.                             |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)
  15.  
  16. let crashSequence'' = crashSequence'
  17.                             |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c/ float cumld, 1.-(float c/ float cumld))
  18.  
  19. let survivalProbabilitySequence = Seq.scan(fun cumulativeSurvivalProbability (t,c,nc,cumld,dp,sp) -> cumulativeSurvivalProbability * sp ) cumulativeSurvivalProbability crashSequence''
  20. let survivalProbabilitySequence' = survivalProbabilitySequence
  21.                                             |> Seq.skip 1

The last step was to map all of the columns and only output what was in the article.  The final answer is:

  1. namespace ChickenSoftware.SurvivalAnalysis
  2.  
  3. type CrashMetaData = {userId: int; crashTime: int; crashed: bool}
  4. type public SurvivalProbabilityData = {crashTime: int; survivalProbability: float}
  5.  
  6. type KaplanMeierEstimator() =
  7.     member this.CalculateSurvivalProbability(crashMetaData: array<CrashMetaData>) =
  8.             let cumulativeDevices = crashMetaData.Length
  9.             let cumulativeSurvivalProbability = 1.
  10.  
  11.             let crashSequence = crashMetaData
  12.                                     |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
  13.                                                                                             | true -> 1
  14.                                                                                             | false -> 0),
  15.                                                                              (match crash.crashed with
  16.                                                                                             | true -> 0
  17.                                                                                             | false -> 1))
  18.             let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1 ) cumulativeDevices crashSequence
  19.  
  20.             let crashSequence' = Seq.zip crashSequence availableDeviceSequence
  21.                                         |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)
  22.  
  23.             let crashSequence'' = crashSequence'
  24.                                         |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c/ float cumld, 1.-(float c/ float cumld))
  25.  
  26.             let survivalProbabilitySequence = Seq.scan(fun cumulativeSurvivalProbability (t,c,nc,cumld,dp,sp) -> cumulativeSurvivalProbability * sp ) cumulativeSurvivalProbability crashSequence''
  27.             let survivalProbabilitySequence' = survivalProbabilitySequence
  28.                                                         |> Seq.skip 1
  29.  
  30.             let crashSequence''' = Seq.zip crashSequence'' survivalProbabilitySequence'
  31.                                         |> Seq.map(fun ((t,c,nc,cumld,dp,sp),cumlsp) -> t,c,nc,cumld,dp,sp,cumlsp)
  32.             crashSequence'''
  33.                     |> Seq.filter(fun (t,c,nc,cumld,dp,sp,cumlsp) -> c=1 )
  34.                     |> Seq.map(fun (t,c,nc,cumld,dp,sp,cumlsp) -> t,System.Math.Round(cumlsp,3))

image

And this matches the article (almost exactly).  The article also has a row for iteration zero, which I did not bake in.  Instead of fixing my code, I changed the unit test and removed that first row.  In any event, I ran the test and it ran red –> but the values are identical, so I assume it is a problem with the Assert.AreSame() function (AreSame checks reference equality, so two structurally identical collections will still fail it).  I would take the time to figure it out, but it is 75 degrees on a Sunday afternoon and I want to go play catch with my kids…

image

Note it also matches the other data set Topol has in the article:

image

In any event, this code reads pretty much the way I was thinking about the problem – each column of the Excel spreadsheet has a 1 to 1 correspondence to the F# code block.   I did use explanatory variables liberally which might offend the more advanced functional programmers but taking each step in turn really helped me focus on getting each step correct before going to the next one.

1) I had to offset the cumulativeSurvivalProbability by one because the calculation is how many crashed on a day compared to how many were working at the start of the day.  The Seq.Scan increments the counter for the next row of the sequence and I need it for the current row.  Perhaps there is an overload for Seq.Scan?

2) I adopted the functional convention of using ticks to denote different physical manifestations of the same logic concept (crashedDeviceSequence “became” crashedDeviceSequence’, etc…).  Since everything is immutable by default in F#, this kind of naming convention makes a lot of sense to me.  However, I can see it quickly becoming unwieldy.

3) I could not figure out how to operate on the base tuple so instead  I used a couple of supporting Sequences and then put everything together using Seq.Zip.  I assume there is a more efficient way to do that.

4) One of the knocks against functional/scientific programming is that values are named poorly.  To combat that, I used the full names in my tuples to start.  After a certain point, though, the names got too unwieldy, so I resorted to their initials.  I am not sure what the right answer is here, or even if there is a right answer.

Apriori Algorithm and F# Using Elevator Inspection Data

Now that I have the elevator dataset in a workable state, I wanted to see what I could see with the data.  I was reading Machine Learning In Action and the author suggested an Apriori algorithm as a way to quantify associations among data points.  I read both Harrington's code and Wikipedia's description and I found both to be impenetrable – the former because the code was unreadable and the latter because the mathematical formulas depended on a level of algebra that I don't have.

Fortunately, I found a C# project on Codeproject that had both an excellent example/introduction and C# code.  I used the examples on the website to formulate my F# implementation.

The first thing I did was create a class that matched the 1st grid in the example

image

  1. namespace ChickenSoftware.ElevatorChicken.Analysis
  2.  
  3. open System.Collections.Generic
  4.  
  5. type Transaction = {TID: string; Items: List<string> }
  6.  
  7. type Apriori(database: List<Transaction>, support: float, confidence: float) =
  8.     member this.Database = database
  9.     member this.Support = support
  10.     member this.Confidence = confidence

Note that because F# values are immutable by default, the properties are read-only.  I then created a unit test project that makes sure the constructor works without exceptions.  The data matches the example:

  1. public AprioriTests()
  2. {
  3.     var database = new List<Transaction>();
  4.     database.Add(new Transaction("100", new List<string>() { "A", "C", "D" }));
  5.     database.Add(new Transaction("200", new List<string>() { "B", "C", "E" }));
  6.     database.Add(new Transaction("300", new List<string>() { "A", "B", "C", "E" }));
  7.     database.Add(new Transaction("400", new List<string>() { "B", "E" }));
  8.  
  9.     _apriori = new Apriori(database, .5, .80);
  10.  
  11. }
  12.  
  13. [TestMethod]
  14. public void ConstructorUsingValidArguments_ReturnsExpected()
  15. {
  16.     Assert.IsNotNull(_apriori);
  17. }

I then needed a function to count up all of the items in the item-sets.  I refused to use loops, so I first started using Seq.Fold, but I was having zero luck because I was trying to fold a Seq of List.  I then started experimenting with other functions when I found Seq.Collect, which was perfect (it maps each element to a collection and flattens the results into one sequence).  So I created a function like this:

  1. member this.GetC1() =
  2.     database
  3.  
  4. member this.GetL1() =
  5.     let numberOfTransactions = this.GetC1().Count
  6.  
  7.     this.GetC1()
  8.         |> Seq.collect(fun d -> d.Items)
  9.         |> Seq.countBy(fun i -> i)
  10.         |> Seq.map(fun (t,i) -> t, i, float i/ float numberOfTransactions)
  11.         |> Seq.filter(fun (t,i,p) -> p >= support)
  12.         |> Seq.map(fun (t,i,p) -> t,i)
  13.         |> Seq.sort
  14.         |> Seq.toList

Note that the numberOfTransactions is for the database, not the individual items in the List<Item>.  And the results match the example:

imageimage
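As a sanity check on the support filter in GetL1: with the four-transaction example database and support = 0.5, an item has to appear in at least two transactions to survive.  A quick sketch using the raw counts from that database:

    // Raw counts from the example database: A appears in 2 transactions, D in only 1, etc.
    let numberOfTransactions = 4
    let support = 0.5
    let rawCounts = [ "A", 2; "B", 3; "C", 3; "D", 1; "E", 3 ]
    let l1 =
        rawCounts
        |> List.map (fun (item, count) -> item, count, float count / float numberOfTransactions)
        |> List.filter (fun (_, _, pct) -> pct >= support)
        |> List.map (fun (item, count, _) -> item, count)
    // l1 = [("A", 2); ("B", 3); ("C", 3); ("E", 3)]; "D" is pruned at 1/4 = 0.25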

So this is great.  My next step was to build a list of pair combinations of the remaining values:

image

The trick is that it is not a Cartesian join of the original sets – only the surviving sets are needed.  My first attempt looked like this:

  1. let C1 = database
  2.  
  3. let L1 = C1
  4.         |> Seq.map(fun t -> t.Items)
  5.         |> Seq.collect(fun i -> i)
  6.         |> Seq.countBy(fun i -> i)
  7.         |> Seq.map(fun (t,i) -> t, i, float i/ float numberOftransactions)
  8.         |> Seq.filter(fun (t,i,p) -> p >= support)
  9.         |> Seq.toArray
  10. let C2A = L1
  11.             |> Seq.map(fun (x,y,z) -> x)
  12.             |> Seq.toArray
  13. let C2B = L1
  14.             |> Seq.map(fun (x,y,z) -> x)
  15.             |> Seq.toArray
  16. let C2 = C2A |> Seq.collect(fun x -> C2B |> Seq.map(fun y -> x+y))
  17. C2   

With the output like this:

image

I was running out of Saturday morning, so I went over to Stack Overflow and got a couple of responses.  I was on the right track with the concat, but I didn't think about the List.Filter(), which would prune my list.  With this in mind, I copied Mark's code and got what I was looking for:

  1. member this.GetC2() =
  2.     let l1Itemset = this.GetL1()
  3.                     |> Seq.map(fun (i,s) -> i)
  4.  
  5.     let itemset =
  6.         l1Itemset
  7.             |> Seq.map(fun x -> l1Itemset |> Seq.map(fun y -> (x,y)))
  8.             |> Seq.concat
  9.             |> Seq.filter(fun (x,y) -> x < y)
  10.             |> Seq.sort
  11.             |> Seq.toList         
  12.     
  13.     let listContainsItem(l:List<string>, a,b) =
  14.             l.Contains(a) && l.Contains(b)
  15.     
  16.     let someFunctionINeedToRename(l1:List<string>, l2)=
  17.             l2 |> Seq.map(fun (x,y) -> listContainsItem(l1,x,y))
  18.  
  19.     let itemsetMatches = this.GetC1()
  20.                             |> Seq.map(fun t -> t.Items)
  21.                             |> Seq.map(fun i -> someFunctionINeedToRename(i,itemset))
  22.  
  23.     let itemSupport = itemsetMatches
  24.                             |> Seq.map(Seq.map(fun i -> if i then 1 else 0))
  25.                             |> Seq.reduce(Seq.map2(+))
  26.  
  27.     itemSupport
  28.         |> Seq.zip(itemset)
  29.         |> Seq.toList

So now I have C2 filling correctly:

image

 

Taking the results, I needed to get L2.

image

That was much simpler than getting C2 –> here is the code:

  1. member this.GetL2() =
  2.     let numberOfTransactions = this.GetC1().Count
  3.     
  4.     this.GetC2()
  5.             |> Seq.map(fun (i,n) -> i,n,float n/float numberOfTransactions)
  6.             |> Seq.filter(fun (i,n,p) -> p >= support)
  7.             |> Seq.map(fun (t,i,p) -> t,i)
  8.             |> Seq.sort
  9.             |> Seq.toList    

And when I run it – it matches this example exactly:

image

Finally, I added in a C3 and L3.  This code is identical to the C2/L2 code with one exception: mapping a triple and not a pair.  The C2 code maps like this:

  1. let itemset =
  2.     l1Itemset
  3.         |> Seq.map(fun x -> l1Itemset |> Seq.map(fun y -> (x,y)))
  4.         |> Seq.concat
  5.         |> Seq.filter(fun (x,y) -> x < y)
  6.         |> Seq.sort
  7.         |> Seq.toList     

and the C3 code looks like this (took me 15 minutes to figure out line 3 below):

  1. let itemset =
  2.     l2Itemset
  3.         |> Seq.map(fun x -> l2Itemset |> Seq.map(fun y-> l2Itemset |> Seq.map(fun z->(fst x,fst y,snd z))))
  4.         |> Seq.concat
  5.         |> Seq.collect(fun d -> d)
  6.         |> Seq.filter(fun (x,y,z) -> x < y && y < z)
  7.         |> Seq.distinct
  8.         |> Seq.sort
  9.         |> Seq.toList    

With the C3 and L3 matching the example also:

image

image

 

I was now ready to put the elevator data into the analysis.  I think I am getting better at F# because I did the mapping, filtering, and transformation of the data from the server without looking at any other material, and it took only 15 minutes.

  1. type public ElevatorBuilder() =
  2.     let connectionString = ConfigurationManager.ConnectionStrings.["localData2"].ConnectionString;
  3.  
  4.     member public this.GetElevatorTransactions() =
  5.         let transactions = this.GetElevators()
  6.                               |> Seq.map(fun e ->this.ConvertElevatorToTransaction(e))
  7.         let transactionsList = new System.Collections.Generic.List<Transaction>(transactions)
  8.         transactionsList
  9.  
  10.     member public this.ConvertElevatorToTransaction(i: string, t:string, c:string, s:string) =
  11.         let items = new System.Collections.Generic.List<String>()
  12.         items.Add(t)
  13.         items.Add(c)
  14.         items.Add(s)
  15.         let transaction = {TID=i; Items=items}
  16.         transaction
  17.  
  18.     member public this.GetElevators () =
  19.         SqlConnection.GetDataContext(connectionString).ElevatorData201402
  20.             |> Seq.map(fun e -> e.ID, e.EquipType,e.Capacity,e.Speed)
  21.             |> Seq.filter(fun (i,et,c,s) -> not(String.IsNullOrEmpty(et)))
  22.             |> Seq.filter(fun (i,et,c,s) -> c.HasValue)
  23.             |> Seq.filter(fun (i,et,c,s) -> s.HasValue)
  24.             |> Seq.map(fun (i,t,c,s) -> i, this.CatagorizeEquipmentType(t),c,s)
  25.             |> Seq.map(fun (i,t,c,s) -> i,t,this.CatagorizeCapacity(c.Value),s)
  26.             |> Seq.map(fun (i,t,c,s) -> i,t,c,this.CatagorizeSpeed(s.Value))
  27.             |> Seq.map(fun (i,t,c,s) -> i.ToString(),t,c,s)

The longest part was aggregating the free-form text of the Equipment Type field (here is a partial snip; you get the idea…):

  1. member public this.CatagorizeEquipmentType(et: string) =
  2.     match et.Trim() with
  3.         | "OTIS" -> "OTIS"
  4.         | "OTIS (1-2)" -> "OTIS"
  5.         | "OTIS (2-1)" -> "OTIS"
  6.         | "OTIS hydro" -> "OTIS"
  7.         | "OTIS, HYD" -> "OTIS"
  8.         | "OTIS/ ASHEVILLE " -> "OTIS"
  9.         | "OTIS/ MOUNTAIN " -> "OTIS"
  10.         | "OTIS/#1" -> "OTIS"
  11.         | "OTIS/#19 " -> "OTIS"

Assigning categories for speed and capacity was a snap using F#

  1. member public this.CatagorizeCapacity(c: int) =
  2.     let lowerBound = (c/25 * 25) + 1
  3.     let upperBound = lowerBound + 24
  4.     lowerBound.ToString() + "-" + upperBound.ToString()        
  5.  
  6. member public this.CatagorizeSpeed(s: int) =
  7.     let lowerBound = (s/50 * 50) + 1
  8.     let upperBound = lowerBound + 49
  9.     lowerBound.ToString() + "-" + upperBound.ToString()    
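To make the binning concrete, here is the same arithmetic pulled out as a plain function, checked with a couple of hypothetical inputs:

    // Same arithmetic as CatagorizeCapacity above, as a standalone function for the REPL.
    let catagorizeCapacity (c: int) =
        let lowerBound = (c / 25 * 25) + 1
        let upperBound = lowerBound + 24
        sprintf "%d-%d" lowerBound upperBound

    // catagorizeCapacity 2513 = "2501-2525"
    // catagorizeCapacity 2530 = "2526-2550"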

With this in hand, I created a console app that takes the 27K records and pushes them through the Apriori algorithm:

  1. private static void RunElevatorAnalysis()
  2. {
  3.     Stopwatch stopwatch = new Stopwatch();
  4.     stopwatch.Start();
  5.     ElevatorBuilder builder = new ElevatorBuilder();
  6.     var transactions = builder.GetElevatorTransactions();
  7.     stopwatch.Stop();
  8.     Console.WriteLine("Building " + transactions.Count + " transactions took: " + stopwatch.Elapsed.TotalSeconds);
  9.     var apriori = new Apriori(transactions, .1, .75);
  10.     var c2 = apriori.GetC2();
  11.     stopwatch.Reset();
  12.     stopwatch.Start();
  13.     var l1 = apriori.GetL1();
  14.     Console.WriteLine("Getting L1 took: " + stopwatch.Elapsed.TotalSeconds);
  15.     var l2 = apriori.GetL2();
  16.     Console.WriteLine("Getting L2 took: " + stopwatch.Elapsed.TotalSeconds);
  17.     var l3 = apriori.GetL3();
  18.     Console.WriteLine("Getting L3 took: " + stopwatch.Elapsed.TotalSeconds);
  19.     stopwatch.Stop();
  20.     Console.WriteLine("–L1");
  21.     foreach (var t in l1)
  22.     {
  23.         Console.WriteLine(t.Item1 + ":" + t.Item2);
  24.     }
  25.     Console.WriteLine("–L2");
  26.     foreach (var t in l2)
  27.     {
  28.         Console.WriteLine(t.Item1 + ":" + t.Item2);
  29.     }
  30.     Console.WriteLine("–L3");
  31.     foreach (var t in l3)
  32.     {
  33.         Console.WriteLine(t.Item1 + ":" + t.Item2);
  34.     }
  35. }

I then made an offering to the F# Gods and hit F5:

image

Doh!  The gods were not pleased.  I then went back to my initial filtering function and added a Seq.Take(25000) and the results:

image

So there are a couple of things to draw from this exercise.

1) The Apriori algorithm is the wrong technique for this dataset.  I had to bring the support way down (10%) to even get any readings.  Also, there is too much dispersion in the values.  This kind of algorithm works much better when each transaction has a variable number of items drawn from a small set of possible values, rather than a fixed number of items drawn from a large set of values.

2) Even so, how cool is this?  Compare the files needed just to make the C#/OO version work versus the F# version:

imageimage

And the total LOC is 539 for C# versus 120 for F# – and the F# can be optimized by using a better way to create the search and item-sets.  Hard-coding each level was a hack I did to get things working and to give me an understanding of how the Apriori algorithm works.  I bet this can be consolidated to well under 75 lines without sacrificing readability.

3) I think the StackOverflow exception is because I am doing a Cartesian join and then paring the result.  Using one of the other techniques suggested on SO will give much better results.

In any event, what a fun project!  I can't wait to optimize this and perhaps throw a different algorithm at the dataset in the coming weeks.

Analysis of Health Inspection Data using F#

As part of the TRINUG F#/Analytics SIG, I did a public records request from Wake County for all of the restaurant inspections in 2013.  If you are not familiar, the inspectors go out and then give a score to the restaurant.  The restaurant then has to display their score like this:

image

After some back and forth, I got the data as an Excel spreadsheet that looks like this

image

I then loaded the spreadsheet into SQL Server and exposed it as some OData endpoints.

  1. // GET odata/Restaurant
  2. [Queryable]
  3. public IQueryable<Restaurant> GetRestaurant()
  4. {
  5.     return db.Restaurants;
  6. }
  7.  
  8. // GET odata/Restaurant(5)
  9. [Queryable]
  10. public SingleResult<Restaurant> GetRestaurant([FromODataUri] int key)
  11. {
  12.     return SingleResult.Create(db.Restaurants.Where(restaurant => restaurant.Id == key));
  13. }

I then dove into the data to see if there were any interesting conclusions to be found.  Following my pattern of doing analytics using F# and unit testing using C#, I created a project with the following code:

  1. namespace ChickenSoftware.RestraurantChicken.Analysis
  2.  
  3. open System.Linq
  4. open System.Configuration
  5. open Microsoft.FSharp.Linq
  6. open Microsoft.FSharp.Data.TypeProviders
  7.  
  8. type internal SqlConnection = SqlEntityConnection<ConnectionStringName="azureData">
  9.  
  10. type public RestaurantAnalysis () =
  11.     
  12.     let connectionString = ConfigurationManager.ConnectionStrings.["azureData"].ConnectionString;

Note that I am using the connection string in two places – the 1st for the type provider to do its magic at design time and the 2nd for actually accessing the data at run time.  With that set up, the 1st question I had was "Is there seasonality in inspection scores like there is in traffic tickets?"  To that end, I created the following function:

  1. member public x.GetAverageScoreByMonth () =
  2.     SqlConnection.GetDataContext(connectionString).Restaurants
  3.         |> Seq.map(fun x -> x.InspectionDate.Value.Month, x.InspectionScore.Value)
  4.         |> Seq.groupBy(fun x -> fst x)
  5.         |> Seq.map(fun (x,y) -> (x,y |> Seq.averageBy snd))
  6.         |> Seq.map(fun (x,y) -> x, System.Math.Round(y,2))
  7.         |> Seq.toArray
  8.         |> Array.sort
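To show the grouping trick in isolation, here is a minimal sketch of the Seq.groupBy / Seq.averageBy pattern the function relies on, using made-up scores:

    let monthScores = [ (1, 95.0); (1, 97.0); (2, 96.0) ]   // (month, score)
    let monthlyAverages =
        monthScores
        |> Seq.groupBy fst                                   // key = month, value = seq of (month, score)
        |> Seq.map (fun (month, group) -> month, group |> Seq.averageBy snd)
        |> Seq.toList
    // monthlyAverages = [(1, 96.0); (2, 96.0)]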

This is pretty vanilla F# code, with the tricky part being the average by month (the Seq.groupBy and Seq.averageBy steps).  What the code is doing is grouping the 4,000 or so (month, score) tuples created by the Seq.map step into one tuple per month – with the fst being the groupBy key (in this case the month) and the snd being the sequence of (month, score) tuples for that month.  Then, by averaging the score element of each group, we get an average for each month.  I created a unit (really integration) test like so:

  1. [TestMethod]
  2. public void GetAverageScoreByMonth_ReturnsTwelveItems()
  3. {
  4.     var analysis = new RestaurantAnalysis();
  5.     var scores = analysis.GetAverageScoreByMonth();
  6.     Int32 expected = 12;
  7.     Int32 actual = scores.Length;
  8.     Assert.AreEqual(expected, actual);
  9. }

And the result ran green. 

image

Putting a break on the Assert and a watch on scores, you can see the values:

image

A couple of things stand out

1) The overall average is around 96 and change

2) There does not seem to be any significant variance among months.

Since I am also trying to teach myself D3, I then added an MVC5 project to my solution and added an analysis controller that calls the function in the analysis module and serves the results as JSON:

  1. public JsonResult AverageScoreByMonth()
  2. {
  3.     var analysis = new RestaurantAnalysis();
  4.     var scores = analysis.GetAverageScoreByMonth();
  5.     return Json(scores,JsonRequestBehavior.AllowGet);
  6. }

I then made a page with a simple D3 chart that calls this controller

  1. @{
  2.     Layout = "~/Views/Shared/_Layout.cshtml";
  3. }
  4.  
  5. <svg class="chart"></svg>
  6.  
  7. <style>
  8.     .bar {
  9.         fill: steelblue;
  10.     }
  11.  
  12.         .bar:hover {
  13.             fill: brown;
  14.         }
  15.  
  16.     .axis {
  17.         font: 10px sans-serif;
  18.     }
  19.  
  20.         .axis path,
  21.         .axis line {
  22.             fill: none;
  23.             stroke: #000;
  24.             shape-rendering: crispEdges;
  25.         }
  26.  
  27.     .x.axis path {
  28.         display: none;
  29.     }
  30. </style>
  31.  
  32.  
  33.  
  34. <script>
  35.  
  36.     var margin = { top: 20, right: 20, bottom: 30, left: 40 },
  37.         width = 960 - margin.left - margin.right,
  38.         height = 500 - margin.top - margin.bottom;
  39.  
  40.     var x = d3.scale.ordinal()
  41.         .rangeRoundBands([0, width], .1);
  42.  
  43.     var y = d3.scale.linear()
  44.         .range([height, 0]);
  45.  
  46.     var xAxis = d3.svg.axis()
  47.         .scale(x)
  48.         .orient("bottom");
  49.  
  50.     var yAxis = d3.svg.axis()
  51.         .scale(y)
  52.         .orient("left")
  53.         .ticks(10, "%");
  54.  
  55.     var svg = d3.select("body").append("svg")
  56.         .attr("width", width + margin.left + margin.right)
  57.         .attr("height", height + margin.top + margin.bottom)
  58.       .append("g")
  59.         .attr("transform", "translate(" + margin.left + "," + margin.top + ")");
  60.  
  61.  
  62.  
  63.     $.ajax({
  64.         url: "http://localhost:3057/Analysis/AverageScoreByMonth/",
  65.         dataType: "json",
  66.         success: function (data) {
  67.             x.domain(data.map(function (d) { return d.Item1; }));
  68.             y.domain([0, d3.max(data, function (d) { return d.Item2; })]);
  69.  
  70.             svg.append("g")
  71.                 .attr("class", "x axis")
  72.                 .attr("transform", "translate(0," + height + ")")
  73.                 .call(xAxis);
  74.  
  75.             svg.append("g")
  76.                 .attr("class", "y axis")
  77.                 .call(yAxis)
  78.               .append("text")
  79.                 .attr("transform", "rotate(-90)")
  80.                 .attr("y", 6)
  81.                 .attr("dy", ".71em")
  82.                 .style("text-anchor", "end")
  83.                 .text("Frequency");
  84.  
  85.             svg.selectAll(".bar")
  86.                 .data(data)
  87.               .enter().append("rect")
  88.                 .attr("class", "bar")
  89.                 .attr("x", function (d) { return x(d.Item1); })
  90.                 .attr("width", x.rangeBand())
  91.                 .attr("y", function (d) { return y(d.Item2); })
  92.                 .attr("height", function (d) { return height - y(d.Item2); });
  93.  
  94.         },
  95.         error: function (e) {
  96.             alert("error");
  97.         }
  98.     });
  99.  
  100.     function type(d) {
  101.         d.Item2 = +d.Item2;
  102.         return d;
  103.     }
  104. </script>

And when I run it, I get a run-of-the-mill bar chart.  (I did have to adjust the F# to shift the decimal two positions to the left so that I could match the scale of the chart's template; for me, it is easier to alter the F# than the JavaScript.)

image

Following this pattern, I did some other seasonal analysis, like the average by DayOfMonth:

image

and by DayOfWeek:

image

So there does not seem to be any seasonality in inspection scores.

I then did an average by inspector:

image

And there looks to be some variance, but it is getting lost in the scale of the chart.  The problem is that the range of the scores is not 0 to 100.

Here is a function that counts the number of scores (rounded to whole numbers):

  1. member public x.CountOfRoundedScores () =
  2.     SqlConnection.GetDataContext(connectionString).Restaurants
  3.         |> Seq.map(fun x -> System.Math.Round(x.InspectionScore.Value,0), x.InspectionID)
  4.         |> Seq.groupBy(fun x -> fst x)
  5.         |> Seq.map(fun (x,y) -> (x,y |> Seq.countBy snd))
  6.         |> Seq.map(fun (x,y) -> (x,y |> Seq.sumBy snd))
  7.         |> Seq.toArray

That graphically looks like:

image

So back to inspectors: I needed to adjust the scale so it ran from 80 to 100 instead of 0 to 100.  I also needed to remove the null inspection IDs, the records that were for the 'test facility', and the 6 records that were below 80.

  1. member public x.AverageScoreByInspector () =
  2.     SqlConnection.GetDataContext(connectionString).Restaurants
  3.         |> Seq.filter(fun x -> x.EstablishmentName <> "Test Facility")
  4.         |> Seq.filter(fun x -> x.InspectionScore.Value > 80.)
  5.         |> Seq.filter(fun x -> x.InspectionID <> null)
  6.         |> Seq.map(fun x -> x.InspectorID, x.InspectionScore.Value)
  7.         |> Seq.groupBy(fun x -> fst x)
  8.         |> Seq.map(fun (x,y) -> (x,y |> Seq.averageBy snd))
  9.         |> Seq.map(fun (x,y) -> x, y/100.)
  10.         |> Seq.map(fun (x,y) -> x, System.Math.Round(y,4))
  11.         |> Seq.toArray
  12.         |> Array.sort

I then adjusted the scale of the inspector graph to have a domain from 80 to 100 (versus 0 to 100), along with the scale of the y axis.  This was a good article explaining scales and domains in D3.

  1. var yAxis = d3.svg.axis()
  2.     .scale(y)
  3.     .orient("left")
  4.     .ticks(10);

  1. $.ajax({
  2.     url: "http://localhost:3057/Analysis/AverageScoreByInspector/",
  3.     dataType: "json",
  4.     success: function (data) {
  5.         x.domain(data.map(function (d) { return d.Item1; }));
  6.         y.domain([80, d3.max(data, function (d) { return d.Item2; })]);

and now there is a pretty good graph showing the variance among inspectors:

image

So the interesting thing is that #1168 is 2 below the average – which, on a domain of 10, is pretty significant.  Interestingly, 1168 is also the inspector who has all of the "Test Facility" records – so they are probably the trainer and/or lead inspector.  With this analysis in the back pocket, I ran a function that did the inspection score by establishment type:

image

This is kinda interesting (especially that pushcarts got the highest scores), but I wanted to see if there was any truth to the common perception that Chinese restaurants are less sanitary than other kinds of restaurants.  To that end, I created a rudimentary classifier that searches the name of the establishment to see if it contains a word that is typically associated with fast-food Chinese restaurants:

  1. member public x.IsEstablishmentAChineseRestraurant (establishmentName:string) =
  2.     let upperCaseEstablishmentName = establishmentName.ToUpper()
  3.     let numberOfMatchedWords = upperCaseEstablishmentName.Split(' ')
  4.                                 |> Seq.map(fun x -> match x with
  5.                                                         | "ASIA" -> 1
  6.                                                         | "ASIAN" -> 1
  7.                                                         | "CHINA" -> 1
  8.                                                         | "CHINESE" -> 1
  9.                                                         | "PANDA" -> 1
  10.                                                         | "PEKING" -> 1
  11.                                                         | "WOK" -> 1
  12.                                                         | _ -> 0)
  13.                                 |> Seq.sum
  14.     match numberOfMatchedWords with
  15.         | 0 -> false
  16.         | _ -> true

I then created a function that returned the average and ran my unit tests.

  1. [TestMethod]
  2. public void IsEstablishmentAChineseRestraurantUsingWOK_ReturnsTrue()
  3. {
  4.     var analysis = new RestaurantAnalysis();
  5.     String establishmentName = "JAMIE'S WOK";
  6.  
  7.     var expected = true;
  8.     var actual = analysis.IsEstablishmentAChineseRestraurant(establishmentName);
  9.     Assert.AreEqual(expected, actual);
  10. }
  11.  
  12. [TestMethod]
  13. public void IsEstablishmentAChineseRestraurantUsingWok_ReturnsTrue()
  14. {
  15.     var analysis = new RestaurantAnalysis();
  16.     String establishmentName = "Jamie's Wok";
  17.  
  18.     var expected = true;
  19.     var actual = analysis.IsEstablishmentAChineseRestraurant(establishmentName);
  20.     Assert.AreEqual(expected, actual);
  21. }
  22.  
  23. [TestMethod]
  24. public void AverageScoreForChineseRestaurants_ReturnsExpected()
  25. {
  26.     var analysis = new RestaurantAnalysis();
  27.     var actual = analysis.AverageScoreForChineseRestaurants();
  28.     Assert.IsNotNull(actual);
  29. }

When a break was put on the value of the average, it was apparent that Chinese restaurants scored significantly lower than the overall average of 96:

image

So then I applied 1 more segmentation: Chinese versus Non-Chinese scores by inspector:

  1. member public x.AverageScoresOfChineseAndNonChineseByInspector () =
  2.     let dataSet = SqlConnection.GetDataContext(connectionString).Restaurants
  3.                     |> Seq.map(fun x -> x.EstablishmentName, x.InspectorID,x.InspectionScore.Value)
  4.     let chineseRestraurants = dataSet
  5.                                 |> Seq.filter(fun (a,b,c) -> x.IsEstablishmentAChineseRestraurant(a))
  6.                                 |> Seq.map(fun (a,b,c) -> b,c)
  7.                                 |> Seq.groupBy(fun x -> fst x)
  8.                                 |> Seq.map(fun (x,y) -> (x,y |> Seq.averageBy snd))
  9.                                 |> Seq.map(fun (x,y) -> x, System.Math.Round(y,2))
  10.                                 |> Seq.toArray
  11.                                 |> Array.sort
  12.     let nonChineseRestraurants = dataSet
  13.                                 |> Seq.filter(fun (a,b,c) -> not(x.IsEstablishmentAChineseRestraurant(a)))
  14.                                 |> Seq.map(fun (a,b,c) -> b,c)
  15.                                 |> Seq.groupBy(fun x -> fst x)
  16.                                 |> Seq.map(fun (x,y) -> (x,y |> Seq.averageBy snd))
  17.                                 |> Seq.map(fun (x,y) -> x, System.Math.Round(y,2))
  18.                                 |> Seq.toArray
  19.                                 |> Array.sort
  20.     Seq.zip chineseRestraurants nonChineseRestraurants
  21.            |> Seq.map(fun ((a,b),(c,d)) -> a,b,d)
  22.            |> Seq.toList

And in graphics using a double-bar chart:

image

So this is kinda interesting.  The lead inspector (1168) who grades everyone lower actually gives Chinese restaurants higher marks.  Everyone else pretty much grades Chinese restaurants lower except for 1 inspector.  Also, 1708 must really not like Chinese restaurants – or their inspection list has a series of really bad Chinese restaurants.

Note that this may not be statistically significant (I didn’t control for sample size, etc..) – but further analysis might be warranted, no?  If you are interested, here is the endpoint: http://restaurantchicken.cloudapp.net/odata/Restaurant

Finally, when I presented this analysis to TRINUG last week, lots of people became interested in F# and analytics (ok, maybe 3).  You can see the comments here.  Also, I now have an appointment with the head of the health department and the CIO of Wake County later this week – let’s see what they say…

 

 

Traffic Stop Disposition: Classification Using F# and KNN

I have already looked at the summary statistics of the traffic stop data I received from the town here.  My next stop was to try and do a machine learning exercise with the data.  One of the more interesting questions I want to answer is what factors into whether a person gets a warning or a ticket (called the disposition)?  Of all of the factors that may be involved, the dataset that I have is fairly limited:

image_thumb1

Using dispositionId as the result variable, there is StopDateTime and Location (Latitude/Longitude).  Fortunately, DateTime can be decomposed into several input variables.  For this exercise, I wanted to use the following:

  • TimeOfDay
  • DayOfWeek
  • DayOfMonth
  • MonthOfYear
  • Location (Latitude:Longitude)

And the resulting variable is the disposition.  To make it easier for analysis, I limited the analysis set to records where the finalDisposition was either “verbal warning” or “citation”.  I decided to do a K-Nearest Neighbor because it is regarded as an easy machine learning algorithm to learn and the question does seem to be a classification problem.
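Concretely, decomposing the StopDateTime into those input variables looks something like this (just a sketch – the real version is done inline in BaseDataSet further down):

    // Sketch: pull the date/time based features (plus a rounded location key) out of one stop.
    let toFeatures (stopDateTime:System.DateTime) (latitude:float) (longitude:float) =
        stopDateTime.Hour,                          // TimeOfDay
        int stopDateTime.DayOfWeek,                 // DayOfWeek
        stopDateTime.Day,                           // DayOfMonth
        stopDateTime.Month,                         // MonthOfYear
        System.Math.Round(latitude,3).ToString() + ":" + System.Math.Round(longitude,3).ToString()  // Location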

My first step was to decide whether to write or borrow the KNN algorithm.  After looking at what kind of code would be needed to write my own and then looking at some other libraries, I decided to use Accord.Net.

My next step was to get the data via the web service I spun up here.

  1. namespace ChickenSoftware.RoadAlert.Analysis
  2.  
  3. open FSharp.Data
  4. open Microsoft.FSharp.Data.TypeProviders
  5. open Accord.MachineLearning
  6.  
  7. type roadAlert2 = JsonProvider<"http://chickensoftware.com/roadalert/api/trafficstopsearch/Sample">
  8. type MachineLearningEngine =
  9.     static member RoadAlertDoc = roadAlert2.Load("http://chickensoftware.com/roadalert/api/trafficstopsearch")

My next step was to filter the data to only verbal warnings (7) or citations (15).

  1.   static member BaseDataSet =
  2.       MachineLearningEngine.RoadAlertDoc
  3.           |> Seq.filter(fun x -> x.DispositionId = 7 || x.DispositionId = 15)
  4.           |> Seq.map(fun x -> x.Id, x.StopDateTime, x.Latitude, x.Longitude, x.DispositionId)
  5.           |> Seq.map(fun (a,b,c,d,e) -> a, b, System.Math.Round(c,3), System.Math.Round(d,3), e)
  6.           |> Seq.map(fun (a,b,c,d,e) -> a, b, c.ToString() + ":" + d.ToString(), e)
  7.           |> Seq.map(fun (a,b,c,d) -> a,b,c, match d with
  8.                                               |7 -> 0
  9.                                               |15 -> 1
  10.                                               |_ -> 1)
  11.           |> Seq.map(fun (a,b,c,d) -> a, b.Hour, b.DayOfWeek.GetHashCode(), b.Day, b.Month, c, d)
  12.           |> Seq.toList

You will notice that I had to transform the dispositionIds from 7 and 15 to 0 and 1.  The reason why is that the KNN method in Accord.Net assumes that the values match the index position in the array.  I had to dig into the source code of Accord.Net to figure that one out.
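A more general way to do that remapping (hypothetical – the actual code above just hard-codes the two values) is to build a 0-based index over whatever labels show up:

    // Hypothetical helper: map arbitrary label values (7, 15, ...) onto 0..n-1,
    // which (per the digging above) is the range the Accord.Net KNN expects.
    let encodeLabels (labels:int seq) =
        let classIndex =
            labels
            |> Seq.distinct
            |> Seq.sort
            |> Seq.mapi(fun index label -> label, index)
            |> Map.ofSeq
        labels |> Seq.map(fun label -> classIndex.[label]) |> Seq.toList

    // encodeLabels [7; 15; 7; 15] -> [0; 1; 0; 1]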

My next step was to divide the dataset in half: one half being the training sample and the other the validation sample:

  1. static member TrainingSample =
  2.     let midNumber = MachineLearningEngine.NumberOfRecords/ 2
  3.     MachineLearningEngine.BaseDataSet
  4.         |> Seq.filter(fun (a,b,c,d,e,f,g) -> a < midNumber)
  5.         |> Seq.toList
  6.  
  7. static member ValidationSample =
  8.     let midNumber = MachineLearningEngine.NumberOfRecords/ 2
  9.     MachineLearningEngine.BaseDataSet
  10.         |> Seq.filter(fun (a,b,c,d,e,f,g) -> a > midNumber)
  11.         |> Seq.toList

The next step was to actually run the KNN.  Before I could do that though, I had to create the distance function.  Since this was my 1st time, I dropped the geocoordinates and focused only on the date/time derivatives.

  1. static member RunKNN inputs outputs input =
  2.     let distanceFunction (a:int,b:int,c:int,d:int) (e:int,f:int,g:int,h:int) =  
  3.       let b1 = b * 4
  4.       let f1 = f * 4
  5.       let d1 = d * 2
  6.       let h1 = h * 2
  7.       float((pown(a-e) 2) + (pown(b1-f1) 2) + (pown(c-g) 2) + (pown(d1-h1) 2))
  8.  
  9.     let distanceDelegate =
  10.           System.Func<(int * int * int * int),(int * int * int * int),float>(distanceFunction)
  11.     
  12.     let knn = new KNearestNeighbors<int*int*int*int>(10,2,inputs,outputs,distanceDelegate)
  13.     knn.Compute(input)

You will notice I tried to normalize the values so that they all had the same basis.  They are not exact, but they are close.  You will also notice that I had to create a delegate from the distanceFunction (thanks to Mimo on SO).  This is because Accord.NET was written in C# with C# consumers in mind and F# has a couple of places where the interfaces are not as seamless as one would hope.
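Spelling that normalization out: after the weighting, each feature spans roughly the same range before it gets squared:

    // Approximate range of each feature after the weighting in distanceFunction:
    let hourRange       = (0, 23)          // Hour, unweighted
    let dayOfWeekRange  = (0 * 4, 6 * 4)   // DayOfWeek * 4  -> (0, 24)
    let dayOfMonthRange = (1, 31)          // Day, unweighted
    let monthRange      = (1 * 2, 12 * 2)  // Month * 2      -> (2, 24)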

In any event, once the KNN function was written, I wrote a function that took the validation sample, made a guess via KNN, and then reported the result:

  1. static member GetValidationsViaKKN  =
  2.     let inputs = MachineLearningEngine.TrainingInputClass
  3.     let outputs = MachineLearningEngine.TrainingOutputClass
  4.     let validations = MachineLearningEngine.ValidationClass
  5.  
  6.     validations
  7.         |> Seq.map(fun (a,b,c,d,e) -> e, MachineLearningEngine.RunKNN inputs outputs (a,b,c,d))
  8.         |> Seq.toList
  9.  
  10. static member GetSuccessPercentageOfValidations =
  11.     let validations = MachineLearningEngine.GetValidationsViaKKN
  12.     let matches = validations
  13.                     |> Seq.map(fun (a,b) -> match (a=b) with
  14.                                                 | true -> 1
  15.                                                 | false -> 0)
  16.  
  17.     let recordCount =  validations |> Seq.length
  18.     let numberCorrect = matches |> Seq.sum
  19.     let successPercentage = double(numberCorrect) / double(recordCount)
  20.     recordCount, numberCorrect, successPercentage

I then hopped over to my UI console app and looked at the success percentage.

 

  1. private static void GetSuccessPercentageOfValidations()
  2. {
  3.     var output = MachineLearningEngine.GetSuccessPercentageOfValidations;
  4.     Console.WriteLine(output.Item1.ToString() + ":" + output.Item2.ToString() + ":" + output.Item3.ToString());
  5. }

image

So there are 12,837 records in the validation sample and the classifier guessed the correct disposition 9,001 times – a success percentage of 70%.

So it looks like there is something there.  However, it is not clear that this is a good classifier without further tests – specifically, seeing how the most common case performs when pushed through the classifier.  Also, I would assume that to make this a true ‘machine learning’ algorithm I would have to feed the results back into the distance function to see if I can alter it to get the success percentage higher.
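One quick check along those lines would be a majority-class baseline: always guess the most common disposition in the validation sample and see how close that alone gets to 70%.  A sketch (assuming the last element of each ValidationSample tuple is the 0/1 disposition produced in BaseDataSet):

    // Hypothetical baseline: success percentage from always guessing the most common disposition.
    static member BaselineSuccessPercentage =
        let outputs = MachineLearningEngine.ValidationSample
                        |> Seq.map(fun (a,b,c,d,e,f,g) -> g)
                        |> Seq.toList
        let majorityClass = outputs |> Seq.countBy id |> Seq.maxBy snd |> fst
        let numberCorrect = outputs |> Seq.filter(fun x -> x = majorityClass) |> Seq.length
        let recordCount = outputs |> Seq.length
        recordCount, numberCorrect, double(numberCorrect) / double(recordCount)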

One quick note about methodology – I used unit tests pretty extensively to understand how the KNN works.  I created a series of tests with some sample data to see how the function reacted.

  1. [TestMethod]
  2. public void TestKKN_ReturnsExpected()
  3. {
  4.  
  5.     Tuple<int, int, int, int>[] inputs = {
  6.         new Tuple<int, int, int, int>(1, 0, 15, 1),
  7.         new Tuple<int,int,int,int>(1,0,11,1)};
  8.     int[] outputs = { 1, 1 };
  9.  
  10.     var input = new Tuple<int, int, int, int>(1, 1, 1, 1);
  11.  
  12.     var output = MachineLearningEngine.RunKNN(inputs, outputs, input);
  13.  
  14. }

This was a big help to get me up and running (walking, really..)…

Traffic Stop Analysis Using F#

Now that I have the traffic stop services up and running, it is time to actually do something with the data.  The data set is all traffic stops in my town for 2012 with some limited information: date/time of the stop, the geolocation of the stop, and the final disposition of the stop.  The data looks like this:

image

My 1st step was to look at the Date/Time and see if there are any patterns in DayOfMonth, MonthOfYear, and TimeOfDay.  To that end, I spun up an F# project and added my 1st method that determines the total number of records in the dataset:

  1. type roadAlert = JsonProvider<"http://chickensoftware.com/roadalert/api/trafficstopsearch/Sample">
  2. type AnalysisEngine =
  3.     static member RoadAlertDoc = roadAlert.Load("http://chickensoftware.com/roadalert/api/trafficstopsearch")
  4.  
  5.     static member NumberOfRecords =
  6.         AnalysisEngine.RoadAlertDoc
  7.             |> Seq.length

Since I am a TDDer more than a REPLer, I went and wrote a covering unit test.

  1. [TestMethod]
  2. public void NumberOfRecords_ReturnsExpected()
  3. {
  4.     Int32 notExpected = 0;
  5.     Int32 actual = AnalysisEngine.NumberOfRecords;
  6.     Assert.AreNotEqual(notExpected, actual);
  7. }

A couple of things to note about this:

1) This is really an integration test, not a unit test.  I could have written the test like this:

  1. [TestMethod]
  2. public void NumberOfRecordsFor2012DataSet_ReturnsExpected()
  3. {
  4.     Int32 expected = 27778;
  5.     Int32 actual = AnalysisEngine.NumberOfRecords;
  6.     Assert.AreEqual(expected, actual);
  7. }

But that means I am tying the test to the specific data sample (in its current state) – and I don’t want to do that.

2) I am finding that my F# code has many more functions than the code written by other people – esp data scientists.  I think it has to do with contrasting methodologies.  Instead of spending time in the REPL with a small piece of code to get it right and then adding the code into the larger code base, I am writing very small pieces of code in the class and then using unit tests to get them right.  The upshot of that is that there are lots of small, independently testable pieces of code – I think this stems from my background of writing production apps that are for business problems and not for academic papers.  Also, I use classes in source files versus script files because I plan to plug the code into larger .NET applications that will be written in C# and/or VB.NET.

In any event, once I had the total number of records, I went to see how they broke down by month:

  1. static member ActualTrafficStopsByMonth =
  2.     AnalysisEngine.RoadAlertDoc
  3.         |> Seq.map(fun x -> x.StopDateTime.Month)
  4.         |> Seq.countBy(fun x-> x)
  5.         |> Seq.toList

  1. [TestMethod]
  2. public void ActualTrafficStopsByMonth_ReturnsExpected()
  3. {
  4.     Int32 notExpected = 0;
  5.     var stops = AnalysisEngine.ActualTrafficStopsByMonth;
  6.     Assert.AreNotEqual(notExpected, stops.Length);
  7.  
  8. }

 

I then created a function that shows the expected number of stops by month.  Pattern matching with F# makes creating the month list a snap.  Note that this is a true unit test because I am not dependent on external data:

  1. static member Months =
  2.     let monthList = [1..12]
  3.     Seq.map (fun x ->
  4.             match x with
  5.                 | 1 | 3 | 5 | 7 | 8 | 10 | 12 -> x,31,31./365.
  6.                 | 2 -> x,28,28./365.
  7.                 | 4 | 6 | 9 | 11 -> x,30, 30./365.
  8.                 | _ -> x,0,0.                    
  9.         ) monthList
  10.     |> Seq.toList   

  1. static member ExpectedTrafficStopsByMonth numberOfStops =
  2.     AnalysisEngine.Months
  3.         |> Seq.map(fun (x,y,z) ->
  4.             x, int(z*numberOfStops))
  5.         |> Seq.toList

  1. [TestMethod]
  2. public void ExpectedTrafficStopsByMonth_ReturnsExpected()
  3. {
  4.     var stops = AnalysisEngine.ExpectedTrafficStopsByMonth(27778);
  5.     double expected = 2359;
  6.     double actual =stops[0].Item2;
  7.  
  8.     Assert.AreEqual(expected, actual);
  9. }

With the actual and expected ready to go, I then put the two side by side:

  1. static member TrafficStopsByMonth =
  2.     let numberOfStops = float(AnalysisEngine.NumberOfRecords)
  3.     let monthlyExpected = AnalysisEngine.ExpectedTrafficStopsByMonth numberOfStops
  4.     let monthlyActual = AnalysisEngine.ActualTrafficStopsByMonth
  5.     Seq.zip monthlyExpected monthlyActual
  6.         |> Seq.map(fun (x,y) -> fst x, snd x, snd y, snd y - snd x, (float(snd y) - float(snd x))/float(snd x))
  7.         |> Seq.toList

  1. [TestMethod]
  2. public void TrafficStopsByMonth_ReturnsExpected()
  3. {
  4.     var output = AnalysisEngine.TrafficStopsByMonth;
  5.     Assert.IsNotNull(output);
  6.  
  7. }

All of my unit tests ran green

image

so now I am ready to roll.  I created a quick console UI

  1. static void Main(string[] args)
  2. {
  3.     Console.WriteLine("Start");
  4.  
  5.     foreach (var tuple in AnalysisEngine.TrafficStopsByMonth)
  6.     {
  7.         Console.WriteLine(tuple.Item1 + ":" + tuple.Item2 + ":" + tuple.Item3 + ":" + tuple.Item4 + ":" + tuple.Item5);
  8.     }
  9.  
  10.     Console.WriteLine("End");
  11.     Console.ReadKey();
  12. }

image

With the output.  Obviously, a UX person could put some real pizzazz in front of this data, but that is something to do another day.  If you didn’t see it in the code above, the tuple is constructed as: Month, ExpectedStops, ActualStops, Difference, %Difference.  So the really interesting thing is that September was 47% higher than expected with December 26% less.  That kind of wide variation begs for more analysis.

I then did a similar analysis by DayOfMonth:

  1. static member ActualTrafficStopsByDay =
  2.     AnalysisEngine.RoadAlertDoc
  3.         |> Seq.map(fun x -> x.StopDateTime.Day)
  4.         |> Seq.countBy(fun x-> x)
  5.         |> Seq.toList
  6.  
  7. static member Days =
  8.     let dayList = [1..31]
  9.     Seq.map (fun x ->
  10.             match x with
  11.                 | x when x < 29 -> x, 12, 12./365.
  12.                 | 29 | 30 -> x, 11, 11./365.
  13.                 | 31 -> x, 7, 7./365.
  14.                 | _ -> x, 0, 0.                 
  15.         ) dayList
  16.     |> Seq.toList     
  17.  
  18. static member ExpectedTrafficStopsByDay numberOfStops =
  19.     AnalysisEngine.Days
  20.         |> Seq.map(fun (x,y,z) ->
  21.             x, int(z*numberOfStops))
  22.         |> Seq.toList    
  23.  
  24. static member TrafficStopsByDay =
  25.     let numberOfStops = float(AnalysisEngine.NumberOfRecords)
  26.     let dailyExpected = AnalysisEngine.ExpectedTrafficStopsByDay numberOfStops
  27.     let dailyActual = AnalysisEngine.ActualTrafficStopsByDay
  28.     Seq.zip dailyExpected dailyActual
  29.         |> Seq.map(fun (x,y) -> fst x, snd x, snd y, snd y - snd x, (float(snd y) - float(snd x))/float(snd x))
  30.         |> Seq.toList

image

The interesting thing is that there are higher than expected traffic stops in the last half of the month (esp the 25th and 26th) and much lower in the 1st part of the month.

And by TimeOfDay

  1. static member ActualTrafficStopsByHour =
  2.     AnalysisEngine.RoadAlertDoc
  3.         |> Seq.map(fun x -> x.StopDateTime.Hour)
  4.         |> Seq.countBy(fun x-> x)
  5.         |> Seq.toList
  6.  
  7. static member Hours =
  8.     let hourList = [1..24]
  9.     Seq.map (fun x ->
  10.                 x,1, 1./24.
  11.         ) hourList
  12.     |> Seq.toList     
  13.  
  14. static member ExpectedTrafficStopsByHour numberOfStops =
  15.     AnalysisEngine.Hours
  16.         |> Seq.map(fun (x,y,z) ->
  17.             x, int(z*numberOfStops))
  18.         |> Seq.toList    
  19.  
  20. static member TrafficStopsByHour =
  21.     let numberOfStops = float(AnalysisEngine.NumberOfRecords)
  22.     let hourlyExpected = AnalysisEngine.ExpectedTrafficStopsByHour numberOfStops
  23.     let hourlyActual = AnalysisEngine.ActualTrafficStopsByHour
  24.     Seq.zip hourlyExpected hourlyActual
  25.         |> Seq.map(fun (x,y) -> fst x, snd x, snd y, snd y - snd x, (float(snd y) - float(snd x))/float(snd x))
  26.         |> Seq.toList

image

 

The interesting thing here is that there is a much higher than expected number of traffic stops from 1-2 AM (61% and 123%) with significantly fewer between 8 PM and midnight.  Finally, I looked at the GPS location for the stops.

  1. static member ActualTrafficStopsByGPS =  
  2.     AnalysisEngine.RoadAlertDoc
  3.         |> Seq.map(fun x -> System.Math.Round(x.Latitude,3).ToString() + ":" + System.Math.Round(x.Longitude,3).ToString())
  4.         |> Seq.countBy(fun x-> x)
  5.         |> Seq.sortBy snd
  6.         |> Seq.toList
  7.         |> List.rev
  8.  
  9. static member GetVarianceOfTrafficStopsByGPS =
  10.     let trafficStopList = AnalysisEngine.ActualTrafficStopsByGPS
  11.                             |> Seq.map(fun x -> double(snd x))
  12.                             |> Seq.toList
  13.     AnalysisEngine.Variance(trafficStopList)
  14.  
  15. static member GetAverageOfTrafficStopsByGPS =
  16.     AnalysisEngine.ActualTrafficStopsByGPS
  17.         |> Seq.map(fun x -> double(snd x))
  18.         |> Seq.average

 

You can see that I rounded the Latitude and Longitude to 3 decimal places.  Using Wikipedia, which says that 4 decimals at 23N is 10.24M and at 45N is 7.87M for longitude, I imputed that at 35N it is about 8.94M.  With 1 M = 3.28 feet, that means that 4 decimals is within about 30 feet, 3 decimals is within about 300 feet, and 2 decimals is within about 3,000 feet.  300 feet seems like a good compromise so I ran with that.
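Here is the back-of-the-envelope arithmetic behind those numbers (interpolating the two Wikipedia figures cited above):

    // Linear interpolation of meters-per-0.0001-degree between 23N and 45N, then feet.
    let metersAt35N     = 10.24 + (35.0 - 23.0) / (45.0 - 23.0) * (7.87 - 10.24)  // ~8.95 m, close to the 8.94 above
    let feetAt4Decimals = metersAt35N * 3.28                                      // ~29 ft
    let feetAt3Decimals = feetAt4Decimals * 10.0                                  // ~293 ft
    let feetAt2Decimals = feetAt3Decimals * 10.0                                  // ~2,930 ft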

So running the average and variance and the top GPS locations:

image

With an average of 11 stops per GPS location (less than 1 a month) and a variance of 725, there does not seem to be a strong relationship between GPS location and traffic stops.

The upshot of all of this analysis seems to be that, to avoid getting stopped, it is less important where you are than when you are.  This is confirmed anecdotally too – the Town actually broadcasts on Twitter and the like when they will have heightened traffic surveillance.  Ignore open data at your own risk.

In any event, my next step is to run this data through a machine-learning algorithm to see if there is anything else to uncover.

Correlation Between Recruit Rankings and Final Standings in Big Ten Football

Following up on my last post about screen scraping college football in F#, I took the next step and analyzed the data that I scraped.  I am a big believer in Domain Specific Languages, so ‘Rankings’ means the ranking assigned by Rivals for how well a school recruits players.  ‘Standings’ means the final position in the Big Ten after the games have been played.  Rankings are for recruiting and standings are for actually playing the games.

Going back to the code, the 1st thing I did was to separate the Standings call from the search for a given school – so that the XmlDocument is loaded once and then searched several times versus loading it for each search.  This improved performance dramatically:

  1. static member getAnnualConferenceStandings(year:int)=
  2.     let url = "http://espn.go.com/college-football/conferences/standings/_/id/5/year/"+year.ToString()+"/big-ten-conference";
  3.     let request = WebRequest.Create(Uri(url))
  4.     use response = request.GetResponse()
  5.     use stream = response.GetResponseStream()
  6.     use reader = new IO.StreamReader(stream)
  7.     let htmlString = reader.ReadToEnd()
  8.     let divMarkerStartPosition = htmlString.IndexOf("my-teams-table");
  9.     let tableStartPosition = htmlString.IndexOf("<table",divMarkerStartPosition);
  10.     let tableEndPosition = htmlString.IndexOf("</table",tableStartPosition);
  11.     let data = htmlString.Substring(tableStartPosition, tableEndPosition- tableStartPosition+8)
  12.     let xmlDocument = new XmlDocument();
  13.     xmlDocument.LoadXml(data);
  14.     xmlDocument        
  15.  
  16. static member getSchoolStanding(xmlDocument: XmlDocument,school) =
  17.     let keyNode = xmlDocument.GetElementsByTagName("td")
  18.                         |> Seq.cast<XmlNode>
  19.                         |> Seq.find (fun node -> node.InnerText = school)
  20.     let valueNode = keyNode.NextSibling
  21.     let returnValue = (keyNode.InnerText, valueNode.InnerText)
  22.     returnValue
  23.  
  24. static member getConferenceStandings(year:int) =
  25.     let xmlDocument = RankingProvider.getAnnualConferenceStandings(year)
  26.     Seq.map(fun school -> RankingProvider.getSchoolStanding(xmlDocument,school)) RankingProvider.schools
  27.         |> Seq.sortBy snd
  28.         |> Seq.toList
  29.         |> List.rev
  30.         |> Seq.mapi(fun index (school,ranking) -> school, index+1)
  31.         |> Seq.sortBy fst
  32.         |> Seq.toList

Thanks to Valera Kolupaev for showing me how to use mapi to create a tuple from the list of schools and what rank they were in the list in getConferenceStandings().
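In case the mapi trick is not obvious, here is a tiny illustration (made-up standings, not real data):

    // Seq.mapi hands each element its position, which becomes the standing.
    ["Ohio State"; "Michigan State"; "Wisconsin"]
        |> Seq.mapi(fun index school -> school, index + 1)
        |> Seq.toList
    // [("Ohio State", 1); ("Michigan State", 2); ("Wisconsin", 3)]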

I then went to the rankings call and added a way to parse down only the schools I am interested in.  That way I can compare individual schools, groups of schools, or the entire conference:

  1. static member getConferenceRankings(year) =
  2.     RankingProvider.schools
  3.             |> Seq.map(fun schoolName -> RankingProvider.getSchoolInSequence(year, schoolName))
  4.             |> Seq.toList
  5.     
  6.  
  7. static member getSchoolInSequence(year, schoolName) =
  8.     RankingProvider.getRecrutRankings(year)
  9.                     |> Seq.find(fun (school,rank) -> school = schoolName)

After these two refactorings, my unit tests still ran green so I was ready to do the analysis.

image

I went out to my project of a couple of weeks ago for correlation and copied in the module.  The Correlation function takes in two lists of doubles.  The first list would be a school’s ranking and the second would be the standings:

  1. static member getCorrelationBetweenRankingsAndStandings(year, rankings, standings ) =
  2.     let ranks = Seq.map(fun (school,rank) -> rank) rankings
  3.     let stands = Seq.map(fun (school,standing) -> standing) standings
  4.     Calculations.Correlation(ranks,stands)
  5.  
  6. static member getCorrelation(year:int) =
  7.     let rankings = RankingProvider.getConferenceRankings year
  8.                     |> Seq.map(fun (school,rank) -> school,Convert.ToDouble(rank))
  9.     let standings = RankingProvider.getConferenceStandings(year+RankingProvider.yearDifferenceBetwenRankingsAndStandings)
  10.                     |> Seq.map(fun (school, standing) -> school, Convert.ToDouble(standing))
  11.     let correlation = RankingProvider.getCorrelationBetweenRankingsAndStandings(year,rankings, standings)
  12.     (year, correlation)

A couple of things to note:

1) This function assumes that both the rankings and the standings are the same length and are in order by school name.  A production application would check this as part of standard argument validation.

2) I used Convert.ToDouble() to change the Int32 of the ranking into the Double that the correlation function expects.  Having these .NET assemblies available at key points in the application really moved things along.
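A minimal sketch of the kind of check meant in point 1 (hypothetical – it is not in the code above):

    // Hypothetical guard: both sequences must cover the same schools in the same order.
    let validateAlignment (rankings:(string*double) seq) (standings:(string*double) seq) =
        let rankedSchools   = rankings  |> Seq.map fst |> Seq.toList
        let standingSchools = standings |> Seq.map fst |> Seq.toList
        if rankedSchools <> standingSchools then
            invalidArg "standings" "rankings and standings must contain the same schools in the same order"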

In any event, all that was left was to list the Big Ten schools to analyze, the number of years to analyze, and the year difference between the recruit rankings and the standings from the games they played in.

As a first step, I did all of the original Big Ten schools with 7 years of recruiting and a 1, 2, 3, 4 year difference (2002 rankings compared to the 2003, 2004, 2005, 2006 standings, etc…):

imageimage

imageimage

The average is .3303/.2650/.5138/.6065

And so yeah – there is a really strong correlation between a recruit ranking and the outcome on the field.  Also, the class seems to have the most impact senior year – which makes sense.  I don’t have a hypothesis on why it drops sophomore year – perhaps the ‘impact freshmen’ leave after 1 year?

Also of interest, the correlation does not seem to follow a normal distribution.  If you only look at the schools that have an emphasis on academics, the correlation drops significantly – to a negative correlation! 

imageimage

imageimage

The average is .1485/-.1446/-.2817/-.0381

So another great reason to create the new Big Ten – sometimes a really good recruiting class does not do well on the field and other times a poorly-ranked recruiting class does well on the field.  This kind of unpredictability is both exciting and probably much more likely to bring in the casual fans.

Based on this analysis, here is what is going to happen in the Big Ten next year:

  • Michigan State and Ohio State will be the leaders
  • Michigan and Penn State are in the best position to beat Michigan State and Ohio State

But you didn’t need a statistical analysis to tell you that.  The key surprises that this analysis tells you are that

  • Nebraska will have a significant improvement in the standings in 2014
  • Indiana will have a significant improvement in the standings in 2015 and 2016

As a final note, I got this after doing a bunch of requests to Yahoo:

image

image

 

So I wonder if I hit the page too many times and my IP was flagged as a bot?  I waited a day for the server to reset to finish my analysis.  Perhaps this is a case where I should get the data when the getting is good and take their pages and bring them locally?

 

The Big Ten and F#

I was talking to fellow TRINUGer David Green about football schools a couple of weeks ago.  He went to Northwestern and I went to Michigan and we were discussing the relative merits of universities doing well in football.  Assuming Goro was counting, on one hand, it is great to have a sport that can bring in tons of money to the school to fund non-football sports and activities, on the second hand it keeps alumni interested in their school, on the 3rd hand it can give locals a source of pride in the school, and on the last hand it can take the focus away from the other parts of the academic institution.

I then was talking to a professor at Ohio State University – she cares absolutely zero about the football team.  I made the comment that the smartest kids in Ohio don’t go to OSU.  They will go and root for their gladiators on Saturday but when it comes down to their academic and subsequent professional success, they look elsewhere.  She agreed.

Putting those two conversations together, it put OSU’s and MSU’s continued success in the Big Ten in context – as well as the inevitable bellyaching that those teams get the short stick when compared to the SEC.  For example, OSU and MSU both would be undefeated in the Ivy League in 2013 – does that mean they should be considered in the same conversation as Alabama and Auburn for the national championship?  I think the biggest problem that OSU and MSU have is that they are in the Big Ten – which historically has been about geography, academic success, and athletic competition (in that order).

Looking at the Big Ten schools, I pulled their most recent academic ranking from US News and World Report and their BCS ranking.  I then went over to MathIsFun to get the recipe for correlation:

image
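For reference, the recipe being implemented is the standard Pearson correlation coefficient:

    r = Σ((x − x̄)(y − ȳ)) / √( Σ(x − x̄)² × Σ(y − ȳ)² )

where x̄ and ȳ are the means of the two lists – which is exactly what the F# code below computes step by step.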

I then went over to Visual Studio and created a solution like so:

image

Learning from my last project, I created my unit test first to verify that the calculation is correct:

  1. [TestMethod]
  2. public void FindCorrelationUsingStandardInput_ReturnsExpectedValue()
  3. {
  4.     Double[] tempatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
  5.     Double[] sales = new Double[12] { 215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408 };
  6.  
  7.     Double expected = .9575;
  8.     Double actual = Calculations.Correlation(tempatures, sales);
  9.     Assert.AreEqual(expected, actual);
  10. }

I then hopped over to my working code and started coding:

  1. type Calculations() =
  2.     static member Correlation(x:IEnumerable<double>, y:IEnumerable<double>) =
  3.         let meanX = Seq.average x
  4.         let meanY = Seq.average y
  5.         
  6.         let a = Seq.map(fun x -> x-meanX) x
  7.         let b = Seq.map(fun y -> y-meanY) y
  8.  
  9.         let ab = Seq.zip a b
  10.         let abProduct = Seq.map(fun (a,b) -> a * b) ab
  11.  
  12.         let aSquare = Seq.map(fun a -> a * a) a
  13.         let bSquare = Seq.map(fun b -> b * b) b
  14.         
  15.         let abSum = Seq.sum abProduct
  16.         let aSquareSum = Seq.sum aSquare
  17.         let bSquareSum = Seq.sum bSquare
  18.  
  19.         let sums = aSquareSum * bSquareSum
  20.         let squareRootOfSums = sqrt(sums)
  21.  
  22.         abSum/squareRootOfSums

What I noticed is that those intermediate variables make the code much more wordy than it needs to be – so a mathematician might think that the code is too verbose – but a developer might appreciate that each step is laid out.  In fact, I would argue that a better component design would be to break out each of the steps into its own function that can be independently testable (and perhaps reused by other functions):

  1. [TestMethod]
  2. public void GetMeanUsingStandardInputReturnsExpectedValue()
  3. {
  4.     Double[] tempatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
  5.     Double expected = 18.675;
  6.     Double actual = Calculations.Mean(tempatures);
  7.     Assert.AreEqual(expected, actual);
  8. }
  9.  
  10. [TestMethod]
  11. public void GetBothMeansProductUsingStandardInputReturnsExpectedValue()
  12. {
  13.     Double[] tempatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
  14.     Double[] sales = new Double[12] { 215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408 };
  15.  
  16.     Double expected = 5325;
  17.     Double actual = Calculations.MeanProduct(tempatures);
  18.     Assert.AreEqual(expected, actual);
  19. }
  20.  
  21. [TestMethod]
  22. public void GetMeanSquareUsingStandardInputReturnsExpectedValue()
  23. {
  24.     Double[] tempatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
  25.     Double[] sales = new Double[12] { 215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408 };
  26.  
  27.     Double expected = 177;
  28.     Double actual = Calculations.MeanSquared(tempatures);
  29.     Assert.AreEqual(expected, actual);
  30. }

I’ll leave that implementation for another day as it is already getting late.  In any event, I ran the unit test and I got red (pink, really):

image

The spreadsheet rounded and my calculation does not.  I adjusted the unit test appropriately:

  1. [TestMethod]
  2. public void FindCorrelationUsingStandardInput_ReturnsExpectedValue()
  3. {
  4.     Double[] tempatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
  5.     Double[] sales = new Double[12] { 215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408 };
  6.  
  7.     Double correlation = Calculations.Correlation(tempatures, sales);
  8.     Double expected = .9575;
  9.  
  10.     Double actual = Math.Round(correlation, 4);
  11.     Assert.AreEqual(expected, actual);
  12. }

And now I am green:

image

So going back to the original question, I took the current Big Ten Schools and put their academic rankings and football rankings side by side:

image

 

I then made a revised Big Ten that had a much higher academic ranking based on schools that play in a power football conference but still maintain high academics.

image

Note that I left Penn State out of both of these lists b/c they have a NaN for their football ranking – but they certainly have a high enough academic score to be part of the revised Big Ten.

And then when I put those values through the correlation function via a Console UI:

  1. static void Main(string[] args)
  2. {
  3.     Console.WriteLine("Start");
  4.  
  5.     Double[] academicRanking = new Double[12] { 12,28,41,41,52,62,68,69,73,73,75,101 };
  6.     Double[] footballRanking = new Double[12] { 65,41,82,19,7,61,105,36,4,34,63,37 };
  7.  
  8.     Double originalCorrelation = Calculations.Correlation(academicRanking, footballRanking);
  9.     Console.WriteLine("Original BigTen Correlation {0}", originalCorrelation);
  10.  
  11.     academicRanking = new Double[10] { 7,12,17,18,23,23,28,30,41,41 };
  12.     footballRanking = new Double[10] { 24, 65, 32, 26, 94, 84, 41, 58, 82, 19 };
  13.     Double revisedCorrelation = Calculations.Correlation(academicRanking, footballRanking);
  14.     Console.WriteLine("Revised BigTen Correlation {0}", revisedCorrelation);
  15.  
  16.     
  17.     Console.WriteLine("End");
  18.     Console.ReadKey();
  19. }

I get:

image

And just looking at the data seems to support this.  There is a negative correlation between academics and football success in the current Big Ten – the higher the academics, the lower the football ranking, and vice versa.  In the revised Big Ten, there is a positive correlation of the same magnitude – higher academics go with higher (relative) football rankings.  Put another way, the new Big Ten has a much stronger academic ranking and pretty much the same football ranking.

Looking at a map, this new conference is like a doughnut with Ohio, West Virginia, and Kentucky in the middle.  Perhaps they can have a football championship sponsored by Krispy Kreme?  In any event, OSU and MSU are much closer academically and football-wise to the Alabamas and Auburns than the Northwesterns and Michigans of the world.  In terms of geographic proximity, Columbus, Ohio is closer to Tuscaloosa, AL than to Lincoln, NE.  So perhaps the OSU and MSU fans would be better served in a conference that is more aligned with their University’s priorities?  If they went undefeated or even had 1 loss, they would still be in the national championship discussion.