Association Rule Learning Via F# (Part 2)

Continuing on the path of rewriting the association rule learning code found in this month’s MSDN, I started with the next function on the list:

MakeAntecedent

Here is the original C# code:

    public static int[] MakeAntecedent(int[] itemSet, int[] comb)
    {
      // if item-set = (1 3 4 6 8) and combination = (0 2)
      // then antecedent = (1 4)
      int[] result = new int[comb.Length];
      for (int i = 0; i < comb.Length; ++i)
      {
        int idx = comb[i];
        result[i] = itemSet[idx];
      }
      return result;
    }

 

and the F# code:

    static member MakeAntecedent(itemSet:int[], comb:int[]) =
        comb |> Array.map(fun x -> itemSet.[x])

 

It is much easier to figure out what is going on from the F# code.  The function takes in two arrays: the first holds the values, and the second holds the indexes into the first array that are needed.  Using Array.map, I return an array where each index number is swapped out for the actual value.  The unit tests run green:
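As a quick sanity check outside the class, the same mapping can be exercised as a standalone function in FSI (this is just a sketch of the one-liner above, using the article's example values):

```fsharp
// Standalone version of the MakeAntecedent logic: look up each index in itemSet
let makeAntecedent (itemSet: int[]) (comb: int[]) =
    comb |> Array.map (fun idx -> itemSet.[idx])

// Article's example: item-set (1 3 4 6 8) and combination (0 2)
let antecedent = makeAntecedent [| 1; 3; 4; 6; 8 |] [| 0; 2 |]
// antecedent is [| 1; 4 |]
```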

    [TestMethod]
    public void MakeAntecedentCSUsingExample_ReturnsExpectedValue()
    {
        int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
        int[] combo = new int[2] { 0, 2 };
        int[] expected = new int[2] { 1, 4 };
        var actual = CS.AssociationRuleProgram.MakeAntecedent(itemSet, combo);
        Assert.AreEqual(expected.Length, actual.Length);
        Assert.AreEqual(expected[0], actual[0]);
        Assert.AreEqual(expected[1], actual[1]);
    }

    [TestMethod]
    public void MakeAntecedentFSUsingExample_ReturnsExpectedValue()
    {
        int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
        int[] combo = new int[2] { 0, 2 };
        int[] expected = new int[2] { 1, 4 };
        var actual = FS.AssociationRuleProgram.MakeAntecedent(itemSet, combo);
        Assert.AreEqual(expected.Length, actual.Length);
        Assert.AreEqual(expected[0], actual[0]);
        Assert.AreEqual(expected[1], actual[1]);
    }

 

MakeConsequent

Here is the original C# code:

    public static int[] MakeConsequent(int[] itemSet, int[] comb)
    {
      // if item-set = (1 3 4 6 8) and combination = (0 2)
      // then consequent = (3 6 8)
      int[] result = new int[itemSet.Length - comb.Length];
      int j = 0; // ptr into combination
      int p = 0; // ptr into result
      for (int i = 0; i < itemSet.Length; ++i)
      {
        if (j < comb.Length && i == comb[j]) // we are at an antecedent
          ++j; // so continue
        else
          result[p++] = itemSet[i]; // at a consequent so add it
      }
      return result;
    }

 

Here is the F# code:

    static member MakeConsequent(itemSet:int[], comb:int[]) =
        let isNotInComb x = not (Array.exists (fun elem -> elem = x) comb)
        itemSet
            |> Array.mapi(fun indexer value -> value, indexer)
            |> Array.filter(fun (value, indexer) -> isNotInComb indexer)
            |> Array.map(fun x -> fst x)

 

Again, it is easier to look at the F# code to figure out what is going on.  In this case, we take all of the items in the first array whose positions are not in the second array.  The trick is that the second array does not contain values to be checked, but rather index positions.  If you append the antecedent to the consequent, you get back the original array.

This code took me a bit of time to figure out because I kept trying to use the out-of-the-box Array features (including slicing) in F#, when it hit me that it would be much easier to create a tuple from the original array: the value and the index.  I could then check that the index is not in the second array and filter out the ones that are.  The map at the end removes the index part of the tuple because it is no longer needed.
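The same idea can be expressed in one pass with Array.choose over the indexed values, which filters and maps at the same time. This is my own variation, not the article's code, and it assumes a recent FSharp.Core where Array.contains is available:

```fsharp
// Variation on MakeConsequent: keep each value whose index is NOT in comb
let makeConsequent (itemSet: int[]) (comb: int[]) =
    itemSet
    |> Array.mapi (fun idx value -> idx, value)
    |> Array.choose (fun (idx, value) ->
        if Array.contains idx comb then None else Some value)

// Article's example: item-set (1 3 4 6 8), combination (0 2) -> consequent (3 6 8)
let consequent = makeConsequent [| 1; 3; 4; 6; 8 |] [| 0; 2 |]
// consequent is [| 3; 6; 8 |]
```

A nice property falls out of this: the antecedent and the consequent partition the original item-set, so appending them (and sorting) reconstructs it.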

Sure enough, my unit tests ran green:

    [TestMethod]
    public void MakeConsequentCSUsingExample_ReturnsExpectedValue()
    {
        int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
        int[] combo = new int[2] { 0, 2 };
        int[] expected = new int[3] { 3, 6, 8 };
        var actual = CS.AssociationRuleProgram.MakeConsequent(itemSet, combo);
        Assert.AreEqual(expected.Length, actual.Length);
        Assert.AreEqual(expected[0], actual[0]);
        Assert.AreEqual(expected[1], actual[1]);
        Assert.AreEqual(expected[2], actual[2]);
    }

    [TestMethod]
    public void MakeConsequentFSUsingExample_ReturnsExpectedValue()
    {
        int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
        int[] combo = new int[2] { 0, 2 };
        int[] expected = new int[3] { 3, 6, 8 };
        var actual = FS.AssociationRuleProgram.MakeConsequent(itemSet, combo);
        Assert.AreEqual(expected.Length, actual.Length);
        Assert.AreEqual(expected[0], actual[0]);
        Assert.AreEqual(expected[1], actual[1]);
        Assert.AreEqual(expected[2], actual[2]);
    }

 

IndexOf

I then decided to tackle the remaining three functions in reverse order because they depend on each other (CountInTrans -> IsSubsetOf -> IndexOf).  IndexOf did not have any code comments or example cases, but the C# code is clear:

    public static int IndexOf(int[] array, int item, int startIdx)
    {
      for (int i = startIdx; i < array.Length; ++i)
      {
        if (i > item) return -1; // i is past where the target could possibly be
        if (array[i] == item) return i;
      }
      return -1;
    }

 

What is even clearer is the F# code that does the same thing (yes, I am happy that FindIndex returns a -1 when not found, and so did McCaffrey):

    static member IndexOf(array:int[], item:int, startIdx:int) =
        // use the FindIndex overload that begins at startIdx, so the parameter
        // is honored the same way the C# version honors it
        Array.FindIndex(array, startIdx, fun x -> x = item)

 

And I built some unit tests that run green that I think reflect McCaffrey’s intent:

    [TestMethod]
    public void IndexOfCSUsingExample_ReturnsExpectedValue()
    {
        int[] itemSet = new int[4] { 0, 1, 4, 5 };
        Int32 item = 1;
        Int32 startIndx = 1;

        int expected = 1;
        int actual = CS.AssociationRuleProgram.IndexOf(itemSet, item, startIndx);

        Assert.AreEqual(expected, actual);
    }

    [TestMethod]
    public void IndexOfFSUsingExample_ReturnsExpectedValue()
    {
        int[] itemSet = new int[4] { 0, 1, 4, 5 };
        Int32 item = 1;
        Int32 startIndx = 1;

        int expected = 1;
        int actual = FS.AssociationRuleProgram.IndexOf(itemSet, item, startIndx);

        Assert.AreEqual(expected, actual);
    }

 

IsSubsetOf

In the C# implementation, IndexOf is called to keep track of where the search is currently pointed. 

    public static bool IsSubsetOf(int[] itemSet, int[] trans)
    {
      // 'trans' is an ordered transaction like [0 1 4 5 8]
      int foundIdx = -1;
      for (int j = 0; j < itemSet.Length; ++j)
      {
        foundIdx = IndexOf(trans, itemSet[j], foundIdx + 1);
        if (foundIdx == -1) return false;
      }
      return true;
    }

In the F# version, that is not needed:

    static member IsSubsetOf(itemSet:int[], trans:int[]) =
        let isInTrans x = Array.exists (fun elem -> elem = x) trans
        let filteredItemSet = itemSet
                                |> Array.map(fun value -> value, isInTrans value)
                                |> Array.filter(fun (value, isIn) -> isIn = false)
        filteredItemSet.Length = 0
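The map-then-filter pipeline works, but Array.forall states the intent directly: every element of the item-set must appear somewhere in the transaction. This is a sketch of an alternative, not the article's code, and unlike the C# version it does not exploit the fact that transactions are ordered:

```fsharp
// IsSubsetOf restated: itemSet is a subset of trans when every element is found
let isSubsetOf (itemSet: int[]) (trans: int[]) =
    itemSet |> Array.forall (fun x -> Array.exists ((=) x) trans)

isSubsetOf [| 1; 4 |] [| 0; 1; 4; 5; 8 |]  // true
isSubsetOf [| 1; 9 |] [| 0; 1; 4; 5; 8 |]  // false (9 is not in the transaction)
```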

CountInTrans

Here is the original C# code, which uses the IsSubsetOf function:

    public static int CountInTrans(int[] itemSet, List<int[]> trans, Dictionary<int[], int> countDict)
    {
      // number of times itemSet occurs in transactions, using a lookup dict

      if (countDict.ContainsKey(itemSet) == true)
        return countDict[itemSet]; // use already computed count

      int ct = 0;
      for (int i = 0; i < trans.Count; ++i)
        if (IsSubsetOf(itemSet, trans[i]) == true)
          ++ct;
      countDict.Add(itemSet, ct);
      return ct;
    }

And here is the F# code, which also uses that subfunction:

    static member CountInTrans(itemSet: int[], trans: List<int[]>, countDict: Dictionary<int[], int>) =
        let trans' = trans |> Seq.map(fun value -> value, AssociationRuleProgram.IsSubsetOf(itemSet, value))
        trans' |> Seq.filter(fun item -> snd item = true)
               |> Seq.length
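Note that the F# version above accepts countDict but never uses it, so the memoization the C# code relied on is silently dropped. If that cache matters for performance, it can be kept alongside a functional pipeline. The sketch below is my own, not the article's code; it threads the same Dictionary through, and since int[] keys fall back to reference equality, the caching behavior matches the C# original:

```fsharp
open System.Collections.Generic

// subset check, restated standalone for this sketch
let isSubsetOf (itemSet: int[]) (trans: int[]) =
    itemSet |> Array.forall (fun x -> Array.exists ((=) x) trans)

// CountInTrans with the lookup dictionary preserved
let countInTrans (itemSet: int[]) (trans: List<int[]>) (countDict: Dictionary<int[], int>) =
    match countDict.TryGetValue itemSet with
    | true, count -> count                                    // use already computed count
    | false, _ ->
        let count = trans |> Seq.filter (isSubsetOf itemSet) |> Seq.length
        countDict.[itemSet] <- count                          // remember it for next time
        count
```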

 

GetHighConfRules

With the subfunctions created and running green, I then tackled the point of the exercise: GetHighConfRules.  The C# implementation is pretty verbose and there is a lot happening:

    public static List<Rule> GetHighConfRules(List<int[]> freqItemSets, List<int[]> trans, double minConfidencePct)
    {
      // generate candidate rules from freqItemSets, save rules that meet min confidence against transactions
      List<Rule> result = new List<Rule>();

      Dictionary<int[], int> itemSetCountDict = new Dictionary<int[], int>(); // count of item sets

      for (int i = 0; i < freqItemSets.Count; ++i) // each freq item-set generates multiple candidate rules
      {
        int[] currItemSet = freqItemSets[i]; // for clarity only
        int ctItemSet = CountInTrans(currItemSet, trans, itemSetCountDict); // needed for each candidate rule
        for (int len = 1; len <= currItemSet.Length - 1; ++len) // antecedent len = 1, 2, 3, . .
        {
          int[] c = NewCombination(len); // a mathematical combination

          while (c != null) // each combination makes a candidate rule
          {
            int[] ante = MakeAntecedent(currItemSet, c);
            int[] cons = MakeConsequent(currItemSet, c); // could defer this until known if needed

            int ctAntecendent = CountInTrans(ante, trans, itemSetCountDict); // use lookup if possible
            double confidence = (ctItemSet * 1.0) / ctAntecendent;

            if (confidence >= minConfidencePct) // we have a winner!
            {
              Rule r = new Rule(ante, cons, confidence);
              result.Add(r); // if freq item-sets are distinct, no dup rules ever created
            }
            c = NextCombination(c, currItemSet.Length);
          } // while each combination
        } // len each possible antecedent for curr item-set
      } // i each freq item-set

      return result;
    } // GetHighConfRules

In the F# code, I decided to work inside out and get the rules for one item-set.  I think the code reads clearly, with each step laid out:

    static member GetHighConfRules(freqItemSets:List<int[]>, trans:List<int[]>, minConfidencePct:float) =
        let returnValue = new List<Rule>()
        freqItemSets
            |> Seq.map(fun i -> i, AssociationRuleProgram.CountInTrans'(i, trans))
            |> Seq.filter(fun (i,c) -> (float)c > minConfidencePct)
            |> Seq.map(fun (i,mcp) -> i, mcp, AssociationRuleProgram.MakeAntecedent(i, trans.[0]))
            |> Seq.map(fun (i,mcp,a) -> i, mcp, a, AssociationRuleProgram.MakeConsequent(i, trans.[0]))
            |> Seq.iter(fun (i,mcp,a,c) -> returnValue.Add(new Rule(a,c,mcp)))
        returnValue

I then attempted to put this block into a larger block (to get rid of the hard-coded trans.[0]), but then I realized that I was going about this the wrong way.  Instead of using the C# code as my baseline, I need to approach the problem from a functional viewpoint.  That will be the subject of my blog next week…

Association Rule Learning Via F# (Part 1)

I was reading the most recent MSDN when I came across this article.  How awesome is this?  McCaffrey did a great job explaining a really interesting area of analytics, and I am loving the fact that MSDN is including articles about data analytics.  When I was reading the article, I ran across this sentence: “The demo program is coded in C# but you should be able to refactor the code to other .NET languages such as Visual Basic or Iron Python without too much difficulty.”  Iron Python?  Iron Python!  What about F#, the language that matches analytics the way peanut butter goes with chocolate?  Challenge accepted!

The first thing I did was to download his source code from here.  When I first opened the source code, I realized that the code would be a little bit hard to port because it is written from a scientific angle, not a business application point of view.  34 FxCop errors in 259 lines of code confirmed this:

image

Also, there are tons of comments, which is very distracting. I generally hate comments, but I figure that since it is an MSDN article that is supposed to explain what is going on, comments are OK. However, many of the comments can be refactored into more descriptive variable and method names. For example:

imageimage

In any event, let’s look at the code. The first thing I did was change the CS project from a console app to a library and move the test data into another project. I then moved the console code to the UI. I also moved the Rule class code into its own file, made sure the namespaces matched, and made the AssociationRuleProgram public.  Yup, it still runs:

imageimage

So then I created a FSharp library in the solution and set up the class with the single method:

image

A couple of things to note:

1) I left the parameter naming the same, even though it is not particularly intention-revealing

2) F# is type-inferred, so I don’t have to assign the types to the parameters

Next, I started looking at the supporting functions to GetHighConfRules.  Up first was the function NextCombination.  Here is the side-by-side between the imperative style and the functional style:

imageimage

The NextCombination function was more difficult for me to understand.  I stopped what I was doing and built a unit test project that proved correctness, using the commented examples as the expected values.  I used one test project for both the C# and F# projects so I could see both side by side.  An interesting side note is that the unit test naming is different than usual: instead of naming the class XXXXTests, where XXXX is the name of another class, XXXX is the function name that both classes are implementing:

So going back to the example,

image

I wrote two unit tests that match the two comments

image

When I ran the tests, the 1st test passed but the second did not:

image

The problem with the failing test is that null is not being returned; rather, {3,4,6} is.  So now I have a problem: do I base the F# implementation on the code comments or on the code itself?  I decided to base it on the code, because comments often lie but CODE DON’T LIE (thanks, ’Sheed).  I adjusted the unit test and got green.

One of the reasons the code is pretty hard to read/understand is the use of ‘i’, ‘j’, ‘k’, and ‘n’ as variable names.  I went back to the article, and McCaffrey explains what is going on at the bottom left of page 60.  Another name for the function ‘NextCombination’ could be ‘GetLexicographicalSuccessor’, and the variable ‘n’ could be called ‘numberOfPossibleItems’.   With that mental vocabulary in place, I went through the function and divided it into 4 parts:

1) Checking to see if the value of the first element is of a certain length

image

2) Creating a result array that is seeded with the values of the input array

image

3) Looping backwards to identify the 1st number in the array that will be adjusted

image

4) From that target element, looping forward and adjusting all subsequent items

image

#1 I will not worry about now and #2 is not needed in F#, so #3 is the first place to start.  What I need is a way of splitting the array into two parts.  Part 1 has the original values that will not change and part 2 has the values that will change.  Seq.Take and Seq.Skip are perfect for this:

    let i = Array.LastIndexOf(comb, n)
    let i' = if i = -1 then 0 else i
    let comb' = comb |> Seq.take(i') |> Seq.toArray
    let comb'' = comb |> Seq.skip(i') |> Seq.toArray

Looking at #4, I now need to increment the values in part 2 by 1.  Seq.scan will work:

image

And then putting part 1 and part 2 back together via Array.Append, we have equivalence*:

imageimage

*Equivalence is defined by my unit tests, which both pass green.  I have no idea whether other inputs will work.  Note that the second unit test originally ran red, so I really think that the code is wrong and that the comment to return null is correct.  The value I am getting for (3;4;5)(5) is (3;4;1), which seems to make sense.

I am not crazy about these explanatory variables (comb', comb'', and comb''') but I am not sure how to combine them without sacrificing readability.  I definitely want to combine i and i' into one statement…

I am not sure why Seq.scan is returning 4 items in an array when I am passing in an array that has a length of 3.  I am running out of time today, so I just hacked in a Seq.take.
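It turns out the extra element is by design: Seq.scan emits the initial accumulator state before folding in any input, so its output is always one longer than its input. A quick FSI check makes this visible:

```fsharp
// Seq.scan yields the seed first, then one accumulated state per input element
let states = Seq.scan (+) 0 [ 1; 2; 3 ] |> Seq.toList
// states is [0; 1; 3; 6] -- four elements for a three-element input

// dropping the seed restores a 1:1 correspondence with the input
let states' = Seq.scan (+) 0 [ 1; 2; 3 ] |> Seq.skip 1 |> Seq.toList
// states' is [1; 3; 6]
```

So Seq.skip 1 (rather than Seq.take) is the tidier way to line the scanned column back up with the source sequence.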

I’ll continue this exercise in my blog next week.

 

Kaplan-Meier Survival Analysis Using F#

I was reading the most recent issue of MSDN a couple of days ago when I came across this article on doing a Kaplan-Meier survival analysis.  I thought the article was great and I am excited that MSDN is starting to publish articles on data analytics.  However, I did notice that there wasn’t any code in the article, which is odd, so I went to the on-line article and others had a similar question:

image

I decided to implement a Kaplan-Meier survival (KMS) analysis using F#.  After reading the article a couple of times, I was still a bit unclear on how the KMS is implemented, and there does not seem to be a pre-rolled version in the standard .NET stat libraries out there.  I went on over to this site, where there was an excellent description of how the survival probability is calculated.  I went ahead and built an Excel spreadsheet to match the NIH one and then compared it to what Topol is doing:

image

Notice that Topol censored the data for the article.  If we only cared about the probability of crashes, then we would not censor the data for when the device was turned off.
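The core of the calculation, as the NIH page describes it, is a running product: at each event time, the previous survival probability is multiplied by (1 - deaths / devices-at-risk). A minimal sketch of just that kernel (with made-up counts, not the article's data):

```fsharp
// Kaplan-Meier running product over (deaths, atRisk) pairs, one per event time.
// List.scan threads the survival probability through, starting from 1.0.
let survival (events: (int * int) list) =
    events
    |> List.scan (fun s (deaths, atRisk) -> s * (1.0 - float deaths / float atRisk)) 1.0

survival [ (1, 10); (2, 9); (1, 5) ]
// approximately [1.0; 0.9; 0.7; 0.56] -- the leading 1.0 is the scan's seed
```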

So then I was ready to start coding so spun up a solution with an F# project for the analysis and a C# project for the testing. 

image

I then loaded into the unit test project the datasets that Topol used:

    [TestMethod]
    public void EstimateForApplicationX_ReturnsExpected()
    {
        var appX = new CrashMetaData[]
        {
            new CrashMetaData(0,1,false),
            new CrashMetaData(1,5,true),
            new CrashMetaData(2,5,false),
            new CrashMetaData(3,8,false),
            new CrashMetaData(4,10,false),
            new CrashMetaData(5,12,true),
            new CrashMetaData(6,15,false),
            new CrashMetaData(7,18,true),
            new CrashMetaData(8,21,false),
            new CrashMetaData(9,22,true),
        };
    }

I could then wire up the unit tests to compare the output to the article and what I had come up with.

    [TestMethod]
    public void EstimateForApplicationX_ReturnsExpected()
    {
        var appX = new CrashMetaData[]
        {
            new CrashMetaData(0,1,false),
            new CrashMetaData(1,5,true),
            new CrashMetaData(2,5,false),
            new CrashMetaData(3,8,false),
            new CrashMetaData(4,10,false),
            new CrashMetaData(5,12,true),
            new CrashMetaData(6,15,false),
            new CrashMetaData(7,18,true),
            new CrashMetaData(8,21,false),
            new CrashMetaData(9,22,true),
        };

        var expected = new SurvivalProbabilityData[]
        {
            new SurvivalProbabilityData(0,1.000),
            new SurvivalProbabilityData(5,.889),
            new SurvivalProbabilityData(12,.711),
            new SurvivalProbabilityData(18,.474),
            new SurvivalProbabilityData(22,.000)
        };

        KaplanMeierEstimator estimator = new KaplanMeierEstimator();
        var actual = estimator.CalculateSurvivalProbability(appX);

        Assert.AreSame(expected, actual);
    }

 

However, one of the neat features of F# is the REPL, so I don’t need to keep running unit tests to prove correctness while I am proving out a concept.  So I added equivalent test code at the beginning of the F# project so I could try my ideas in the REPL:

    type CrashMetaData = {userId: int; crashTime: int; crashed: bool}

    type KaplanMeierAnalysis() =
        member this.GenerateXAppData () =
                        [|  {userId=0; crashTime=1; crashed=false};{userId=1; crashTime=5; crashed=true};
                            {userId=2; crashTime=5; crashed=false};{userId=3; crashTime=8; crashed=false};
                            {userId=4; crashTime=10; crashed=false};{userId=5; crashTime=12; crashed=true};
                            {userId=6; crashTime=15; crashed=false};{userId=7; crashTime=18; crashed=true};
                            {userId=8; crashTime=21; crashed=false};{userId=9; crashTime=22; crashed=true}|]

        member this.RunAnalysis(crashMetaData: array<CrashMetaData>) =

The first thing I did was duplicate the 1st 3 columns of the Excel spreadsheet:

    let crashSequence = crashMetaData
                            |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                            | true -> 1
                                                                            | false -> 0),
                                                     (match crash.crashed with
                                                                            | true -> 0
                                                                            | false -> 1))

 

In the REPL:

image

The fourth column is tricky because it is a cumulative calculation.  Instead of foreach-ing in an imperative style, I took advantage of the functional language constructs to make the code much more readable.   Once I calculated that column outside of the base sequence, I added it back in via Seq.zip:

    let cumulativeDevices = crashMetaData.Length

    let crashSequence = crashMetaData
                            |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                            | true -> 1
                                                                            | false -> 0),
                                                     (match crash.crashed with
                                                                            | true -> 0
                                                                            | false -> 1))
    let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1) cumulativeDevices crashSequence

    let crashSequence' = Seq.zip crashSequence availableDeviceSequence
                                |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)

 

In the REPL:

image

The next two columns were a snap: they were just calculations based on the existing values:

    let cumulativeDevices = crashMetaData.Length

    let crashSequence = crashMetaData
                            |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                            | true -> 1
                                                                            | false -> 0),
                                                     (match crash.crashed with
                                                                            | true -> 0
                                                                            | false -> 1))
    let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1) cumulativeDevices crashSequence

    let crashSequence' = Seq.zip crashSequence availableDeviceSequence
                                |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)

    let crashSequence'' = crashSequence'
                                |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c / float cumld, 1. - (float c / float cumld))

 

The last column was another cumulative calculation, so I added another accumulator and used Seq.scan and Seq.zip.

    let cumulativeDevices = crashMetaData.Length
    let cumulativeSurvivalProbability = 1.

    let crashSequence = crashMetaData
                            |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                            | true -> 1
                                                                            | false -> 0),
                                                     (match crash.crashed with
                                                                            | true -> 0
                                                                            | false -> 1))
    let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1) cumulativeDevices crashSequence

    let crashSequence' = Seq.zip crashSequence availableDeviceSequence
                                |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)

    let crashSequence'' = crashSequence'
                                |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c / float cumld, 1. - (float c / float cumld))

    let survivalProbabilitySequence = Seq.scan(fun cumulativeSurvivalProbability (t,c,nc,cumld,dp,sp) -> cumulativeSurvivalProbability * sp) cumulativeSurvivalProbability crashSequence''
    let survivalProbabilitySequence' = survivalProbabilitySequence
                                                |> Seq.skip 1

The last step was to map all of the columns and only output what was in the article.  The final answer is:

    namespace ChickenSoftware.SurvivalAnalysis

    type CrashMetaData = {userId: int; crashTime: int; crashed: bool}
    type public SurvivalProbabilityData = {crashTime: int; survivalProbability: float}

    type KaplanMeierEstimator() =
        member this.CalculateSurvivalProbability(crashMetaData: array<CrashMetaData>) =
                let cumulativeDevices = crashMetaData.Length
                let cumulativeSurvivalProbability = 1.

                let crashSequence = crashMetaData
                                        |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                                        | true -> 1
                                                                                        | false -> 0),
                                                                 (match crash.crashed with
                                                                                        | true -> 0
                                                                                        | false -> 1))
                let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1) cumulativeDevices crashSequence

                let crashSequence' = Seq.zip crashSequence availableDeviceSequence
                                            |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)

                let crashSequence'' = crashSequence'
                                            |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c / float cumld, 1. - (float c / float cumld))

                let survivalProbabilitySequence = Seq.scan(fun cumulativeSurvivalProbability (t,c,nc,cumld,dp,sp) -> cumulativeSurvivalProbability * sp) cumulativeSurvivalProbability crashSequence''
                let survivalProbabilitySequence' = survivalProbabilitySequence
                                                            |> Seq.skip 1

                let crashSequence''' = Seq.zip crashSequence'' survivalProbabilitySequence'
                                            |> Seq.map(fun ((t,c,nc,cumld,dp,sp),cumlsp) -> t,c,nc,cumld,dp,sp,cumlsp)
                crashSequence'''
                        |> Seq.filter(fun (t,c,nc,cumld,dp,sp,cumlsp) -> c = 1)
                        |> Seq.map(fun (t,c,nc,cumld,dp,sp,cumlsp) -> t, System.Math.Round(cumlsp, 3))

image

And this matches the article (almost exactly).  The article also has a row for iteration zero, which I did not bake in.  Instead of fixing my code, I changed the unit test and removed that first row.  In any event, I ran the test and it ran red, but the values are identical.  The problem is Assert.AreSame(): it checks reference equality rather than element-by-element equality, so two different collection instances with the same contents will never match (CollectionAssert.AreEqual is the better fit).  I would take the time to swap it out, but it is 75 degrees on a Sunday afternoon and I want to go play catch with my kids…

image

Note it also matches the other data set Topol has in the article:

image

In any event, this code reads pretty much the way I was thinking about the problem: each column of the Excel spreadsheet has a one-to-one correspondence to an F# code block.   I did use explanatory variables liberally, which might offend the more advanced functional programmers, but taking each step in turn really helped me focus on getting each step correct before going to the next one.

1) I had to offset the cumulativeSurvivalProbability by one because the calculation is how many crashed on a day compared to how many were working at the start of the day.  Seq.scan emits the accumulator for the next row of the sequence, and I need it for the current row.  Perhaps there is an overload of Seq.scan?

2) I adopted the functional convention of using ticks to denote different physical manifestations of the same logic concept (crashedDeviceSequence “became” crashedDeviceSequence’, etc…).  Since everything is immutable by default in F#, this kind of naming convention makes a lot of sense to me.  However, I can see it quickly becoming unwieldy.

3) I could not figure out how to operate on the base tuple, so instead I used a couple of supporting sequences and then put everything together using Seq.zip.  I assume there is a more efficient way to do that.

4) One of the knocks against functional/scientific programming is that values are named poorly.  To combat that, I used the full names in my tuples to start.  After a certain point, though, the names got too unwieldy, so I resorted to their initials.  I am not sure what the right answer is here, or even if there is a right answer.
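On points 3 and 4, one alternative to ever-widening tuples is a small record that names each column once; pipeline steps then use copy-and-update syntax to fill in one field at a time. This is a sketch of the idea (my own field names, not the article's code):

```fsharp
// One row of the spreadsheet as a record rather than a widening tuple
type SurvivalRow =
    { crashTime: int
      crashed: int
      atRisk: int
      survivalProbability: float }

// A step can update a single named field via copy-and-update syntax,
// so downstream code never has to re-destructure a growing tuple
let row = { crashTime = 5; crashed = 1; atRisk = 9; survivalProbability = 1.0 }
let row' = { row with survivalProbability = row.survivalProbability * (1.0 - float row.crashed / float row.atRisk) }
// row'.survivalProbability is 8/9, roughly 0.889
```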

Microsoft Language Stack Analogy

I am getting ready for my presentations at Charlotte Code Camp next Saturday.  My F# session is a business-case driven one: reasons why the average C# developer might want to take a look at F#.  I break the session down into 5 sections:  F# is integrated, fast, expressive, bug-resistant, and analytical.  In the fast piece, I am going to make the analogy of Visual Studio to a garage. 

Consider a man who lives in a nice house in a suburban neighborhood with a three-car garage. Every morning when he gets ready for his commute to work, he opens the door that goes from his house into his garage, and there sitting in the 1st bay is a minivan.

image

Now there is nothing wrong with the minivan – it is dependable, all of the neighbors drive one, and it does many things pretty well.  However, consider that right next to the minivan, never been used, is a Ferrari.  Our suburban programmer has heard about the Ferrari, and has perhaps even glanced at it curiously when he pulls out in the morning, but he:

  • Doesn’t see the point of driving it because the minivan suits him just fine
  • Is afraid to try driving it because he doesn’t drive stick and taking the time to learn would slow him down
  • Doesn’t want to drive it because then he would have to explain to his project-manager wife why he is driving around town in such a car

So the Ferrari sits unused.  To round out the analogy, in the 3rd bay is a helicopter that no one in their right mind will touch.  Finally, there is a junked car around back that no one uses anymore, but he has to keep it around because it is too expensive to haul to the junkyard.

image

 

So this is what happens to a majority of .NET developers when they open their garage called Visual Studio.  They go with the comfortable language of the C# minivan, ignoring the power and expressiveness of the F# Ferrari and certainly not touching the C++ helicopter.  I picked a helicopter for C++ because helicopters can go places cars cannot, are notoriously difficult to pilot, and when they crash, it is often spectacular and brings down others with them.  The junked car is VB.NET, which makes me sad on certain days….

Also, since C# 2.0, the minivan has tried to become more Ferrari-like.  It has added a turbo engine called LINQ, the var keyword, anonymous types, and the dynamic keyword, all in the attempt to become the one minivan that shall rule them all.

image

I don’t know much about Roslyn, but from what I have seen, I think I can take it, remove language syntax, and it will still compile.  If so, I will try to write a C# program that removes all curly braces and semicolons and replaces the var keyword with let.  Is it still C# then?

OT: can you tell which session I am doing at the Hartford Code Camp in 2 weeks?

image

(And no, I did not submit in all caps.  I guess the organizer is very excited about the topic?)

F# and List manipulations

I am preparing for a Beginning F# dojo for TRINUG tomorrow and I decided to do a presentation on Seq.groupBy, Seq.countBy, and Seq.sumBy for tuples.  It is not apparent from the names alone how these constructs differ, and I think a knowledge of them is indispensable when doing any kind of list analysis.

I started with a basic list like so:

  1. let data = [("A",1);("A",3);("B",2);("C",1)]

I then ran a GroupBy through the REPL and got the following results:

  1. let grouping = data
  2.                 |> Seq.groupBy(fun (letter,number) -> letter)
  3.                 |> Seq.iter (printfn "%A")

  1. ("A", seq [("A", 1); ("A", 3)])
  2. ("B", seq [("B", 2)])
  3. ("C", seq [("C", 1)])

I then ran a CountBy through the REPL and got the following results:

  1. let counting = data
  2.                 |> Seq.countBy(fun (letter,number) -> letter)
  3.                 |> Seq.iter (printfn "%A")

  1. ("A", 2)
  2. ("B", 1)
  3. ("C", 1)

I then ran a SumBy through the REPL and got the following results:

  1. let summing = data
  2.                 |> Seq.sumBy(fun (letter,number) -> number)
  3.                 |> printfn "%A"

  1. 7

Now the fun begins.  I combined a GroupBy and a CountBy through the REPL and got the following results:

  1. let groupingAndCounting = data
  2.                         |> Seq.groupBy(fun (letter,number) -> letter)
  3.                         |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.countBy snd))
  4.                         |> Seq.iter (printfn "%A")

  1. ("A", seq [(1, 1); (3, 1)])
  2. ("B", seq [(2, 1)])
  3. ("C", seq [(1, 1)])

Next I combined a GroupBy and a SumBy through the REPL and got the following results:

  1. let groupingAndSumming = data
  2.                             |> Seq.groupBy(fun (letter,number) -> letter)
  3.                             |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.sumBy snd))
  4.                             |> Seq.iter (printfn "%A")

  1. ("A", 4)
  2. ("B", 2)
  3. ("C", 1)

I then combined all three:

  1. let groupingAndCountingSummed = data
  2.                                 |> Seq.groupBy(fun (letter,number) -> letter)
  3.                                 |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.countBy snd))
  4.                                 |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.sumBy snd))
  5.                                 |> Seq.iter (printfn "%A")

  1. ("A", 2)
  2. ("B", 1)
  3. ("C", 1)

With this in hand, I created a way of both counting and summing the second value of a tuple, which is a pretty common task:

  1. let revisedData =
  2.     let summed = data
  3.                     |> Seq.groupBy(fun (letter,number) -> letter)
  4.                     |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.sumBy snd))
  5.     let counted = data
  6.                     |> Seq.groupBy(fun (letter,number) -> letter)
  7.                     |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.countBy snd))
  8.                     |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.sumBy snd))
  9.     Seq.zip summed counted
  10.                     |> Seq.map(fun ((letter,summed),(_,counted)) -> letter,summed,counted)
  11.                     |> Seq.iter (printfn "%A")

  1. ("A", 4, 2)
  2. ("B", 2, 1)
  3. ("C", 1, 1)
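As an alternative to grouping the data twice and zipping, the same (letter, sum, count) triples can come out of a single fold into a Map.  This is just a sketch of the idea, not necessarily better F#:

```fsharp
let data = [("A",1);("A",3);("B",2);("C",1)]

// Fold the list once, accumulating a (sum, count) pair per letter
let summedAndCounted =
    data
    |> List.fold (fun acc (letter, number) ->
        match Map.tryFind letter acc with
        | Some (sum, count) -> Map.add letter (sum + number, count + 1) acc
        | None -> Map.add letter (number, 1) acc) Map.empty
    |> Map.toList
    |> List.map (fun (letter, (sum, count)) -> letter, sum, count)
// [("A", 4, 2); ("B", 2, 1); ("C", 1, 1)]
```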

Finally, Mathias pointed out that I could use this as an entry point to Deedle.  Which is a really good idea….


F# and the Open/Closed Principle

One of the advantages of using F# is that it is a .NET language.  Although F# is a functional-first language, it also supports object-oriented constructs.  One of the most powerful (indeed, the most powerful) techniques in OO programming is using interfaces to follow the Open/Closed principle.  If you are not familiar with it, a good explanation of the Open/Closed principle is found here.

As part of the F# for beginners dojo I am putting on next week, we are consuming and then analyzing Twitter.  The problem with always making calls to Twitter is that

1) The data changes every call

2) You might get throttled

Therefore, it makes good sense to have an in-memory representation of the data for testing and some Twitter data on disk, so that different experiments can be run against the same data to compare results.  Using interfaces in F# makes this a snap.

First, I created an interface:

  1. namespace NewCo.TwitterAnalysis
  2.  
  3. open System
  4. open System.Collections.Generic
  5.  
  6. type ITweeetProvider =
  7.    abstract member GetTweets : string -> IEnumerable<DateTime * int * string>

Next, I created the actual Twitter feed.  Note that I am using Tweetinvi (available on NuGet) and that this file has to be below the interface file in Solution Explorer:

  1. namespace NewCo.TwitterAnalysis
  2.  
  3. open System
  4. open System.Configuration
  5. open Tweetinvi
  6.  
  7. type TwitterProvider() =
  8.     interface ITweeetProvider with
  9.         member this.GetTweets(stockSymbol: string) =
  10.             let consumerKey = ConfigurationManager.AppSettings.["consumerKey"]
  11.             let consumerSecret = ConfigurationManager.AppSettings.["consumerSecret"]
  12.             let accessToken = ConfigurationManager.AppSettings.["accessToken"]
  13.             let accessTokenSecret = ConfigurationManager.AppSettings.["accessTokenSecret"]
  14.         
  15.             TwitterCredentials.SetCredentials(accessToken, accessTokenSecret, consumerKey, consumerSecret)
  16.             let tweets = Search.SearchTweets(stockSymbol);
  17.             tweets
  18.                 |> Seq.map(fun t -> t.CreatedAt, t.RetweetCount, t.Text)

 

I then hooked up a unit (integration, really) test

  1. [TestClass]
  2. public class UnitTest1
  3. {
  4.     [TestMethod]
  5.     public void GetTweetsUsingIBM_returnsExpectedValue()
  6.     {
  7.         ITweeetProvider provider = new TwitterProvider();
  8.         var actual = provider.GetTweets("IBM");
  9.         Assert.IsNotNull(actual);
  10.     }
  11. }

Sure enough, it ran green with actual Twitter data coming back:

image

I then created an In-Memory Tweet provider that can be used to:

1) Provide repeatable results

2) Have 0 external dependencies so that I can monkey with the code and a red unit test really does mean red

Here is its implementation:

  1. namespace NewCo.TwitterAnalysis
  2.  
  3. open System
  4. open System.Collections.Generic
  5.  
  6. type InMemoryProvider() =
  7.     interface ITweeetProvider with
  8.         member this.GetTweets(stockSymbol: string) =
  9.             let list = new List<(DateTime*int*string)>()
  10.             list.Add(DateTime.Now, 1,"Test1")
  11.             list.Add(DateTime.Now, 0,"Test2")
  12.             list :> IEnumerable<(DateTime*int*string)>

The only really interesting thing is the smiley/bird character (:>), which is F#’s upcast operator.  F# implements interfaces a bit differently than what I was used to – F# implements interfaces explicitly.  I then fired up a true unit test and it also ran green:

  1. [TestClass]
  2. public class InMemoryProviderTests
  3. {
  4.     [TestMethod]
  5.     public void GetTweetsUsingValidInput_ReturnsExpectedValue()
  6.     {
  7.         ITweeetProvider provider = new InMemoryProvider();
  8.         var tweets = provider.GetTweets("TEST");
  9.         var tweetList = tweets.ToList();
  10.         Int32 expected = 2;
  11.         Int32 actual = tweetList.Count;
  12.         Assert.AreEqual(expected, actual);
  13.     }
  14. }
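Because the implementation is explicit, an F# caller (unlike the C# test above, where the variable is already typed as the interface) has to upcast before GetTweets is even visible.  A small sketch:

```fsharp
// The upcast makes the explicitly implemented member callable;
// InMemoryProvider().GetTweets would not compile
let provider = InMemoryProvider() :> ITweeetProvider
let tweets = provider.GetTweets "TEST"
```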

Finally, I created a file-system bound provider so that I can download and then hold static a large dataset.  Based on past experience dealing with on-line data sources, getting data local to run multiple tests against is generally a good idea.  Here is the implementation:

  1. namespace NewCo.TwitterAnalysis
  2.  
  3. open System
  4. open System.Collections.Generic
  5. open System.IO
  6.  
  7. type FileSystemProvider(filePath: string) =
  8.     interface ITweeetProvider with
  9.         member this.GetTweets(stockSymbol: string) =
  10.             let fileContents = File.ReadLines(filePath)
  11.                                 |> Seq.map(fun line -> line.Split([|'\t'|]))
  12.                                 |> Seq.map(fun values -> DateTime.Parse(values.[0]),int values.[1], string values.[2])
  13.             fileContents

And the covering unit (integration really) tests look like this:

  1. [TestClass]
  2. public class FileSystemProviderTests
  3. {
  4.     [TestMethod]
  5.     public void GetTweetsUsingValidInput_ReturnsExpectedValue()
  6.     {
  7.         var baseDir = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
  8.         var testFile = Path.Combine(baseDir, "TweetData.csv");
  9.         ITweeetProvider provider = new FileSystemProvider(testFile);
  10.         var tweets = provider.GetTweets("TEST");
  11.         var tweetList = tweets.ToList();
  12.         Int32 expected = 2;
  13.         Int32 actual = tweetList.Count;
  14.         Assert.AreEqual(expected, actual);
  15.     }
  16. }

Note that I had to add the actual file to the test project.

image

Finally, the F# code needs to include try…with blocks for the external calls (web service and disk) and some argument validation for the strings coming in.
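A sketch of what that hardening might look like for the file-bound provider – the member name GetTweetsSafe is hypothetical, and note that File.ReadLines is lazy, so the sequence has to be materialized inside the try for the handler to actually catch I/O failures:

```fsharp
// Inside FileSystemProvider: validate up front, force evaluation
// inside the try, and fall back to an empty sequence on I/O errors
member this.GetTweetsSafe(stockSymbol: string) =
    if String.IsNullOrWhiteSpace(filePath) then
        invalidArg "filePath" "a file path is required"
    try
        File.ReadLines(filePath)
        |> Seq.map(fun line -> line.Split([|'\t'|]))
        |> Seq.map(fun values -> DateTime.Parse(values.[0]), int values.[1], string values.[2])
        |> Seq.toList          // force evaluation so IO errors surface here
        :> seq<DateTime * int * string>
    with
    | :? IOException -> Seq.empty
```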

In any event, I now have 3 different implementations that I can swap out depending on my needs.  I love having the power of Interfaces combined with benefits of using a functional-first language.

Consuming Twitter With F#

I set up a meetup for TRINUG’s F#/data analytics SIG to center around consuming and analyzing tweets.  Since Twitter is just JSON, I assumed it would be easy enough to search tweets for a given subject in a given time period.  How wrong I was.  I spent several hours researching different ways to consume Twitter, with varying degrees of success.  My 1st stop was to investigate some of the more common libraries that C# developers use to consume Twitter.  Here is my survey of some of the more popular ones:

Twitterizer: No longer maintained

  1. // Install-Package twitterizer -Version 2.4.2
  2. // Update-Package Newtonsoft.Json -Reinstall
  3. open Twitterizer
  4.  
  5. type public TwitterProvider() =
  6.     member this.GetTweetsForDateRange(ticker:string, startDate: DateTime, endDate: DateTime) =
  7.         let consumerKey = ConfigurationManager.AppSettings.["consumerKey"]
  8.         let consumerSecret = ConfigurationManager.AppSettings.["consumerSecret"]
  9.         let accessToken = ConfigurationManager.AppSettings.["accessToken"]
  10.         let accessTokenSecret = ConfigurationManager.AppSettings.["accessTokenSecret"]
  11.         
  12.         let tokens = new OAuthTokens()
  13.         tokens.set_ConsumerKey(consumerKey)
  14.         tokens.set_ConsumerSecret(consumerSecret)
  15.         tokens.set_AccessToken(accessToken)
  16.         tokens.set_AccessTokenSecret(accessTokenSecret)
  17.  
  18.         let searchOptions = new SearchOptions()
  19.         searchOptions.SinceDate <- startDate
  20.         searchOptions.UntilDate <- endDate
  21.         let results = TwitterSearch.Search(tokens, ticker,searchOptions)
  22.         results.ResponseObject
  23.                     |> Seq.map(fun r -> r.CreatedDate, r.Text)

TweetSharp: No longer maintained

  1. open TweetSharp
  2.  
  3. type public TwitterProvider() =
  4.     member this.GetTweetsForDateRange(ticker:string, startDate: DateTime, endDate: DateTime) =
  5.         let consumerKey = ConfigurationManager.AppSettings.["consumerKey"]
  6.         let consumerSecret = ConfigurationManager.AppSettings.["consumerSecret"]
  7.         let accessToken = ConfigurationManager.AppSettings.["accessToken"]
  8.         let accessTokenSecret = ConfigurationManager.AppSettings.["accessTokenSecret"]
  9.         
  10.         let service = new TwitterService(consumerKey, consumerSecret)
  11.         service.AuthenticateWith(accessToken, accessTokenSecret)
  12.  
  13.         let searchOptions = new SearchOptions()
  14.         searchOptions.Q <- "IBM%20since%3A2014-03-01&src=typd"
  15.         service.Search(searchOptions).Statuses
  16.                                         |> Seq.map(fun s -> s.CreatedDate, s.Text)

Note that I did try and add a date range the way the Twitter API instructs, but it still came back with only 20 tweets.

LinqToTwitter: Active, but you have to use LINQ syntax.  Ugh!

Tweetinvi: Active, but it does not have date-range functionality

  1. open System
  2. open System.Configuration
  3. open Tweetinvi
  4.  
  5. type public TwitterProvider() =
  6.     member this.GetTodaysTweets(ticker: string) =
  7.         let consumerKey = ConfigurationManager.AppSettings.["consumerKey"]
  8.         let consumerSecret = ConfigurationManager.AppSettings.["consumerSecret"]
  9.         let accessToken = ConfigurationManager.AppSettings.["accessToken"]
  10.         let accessTokenSecret = ConfigurationManager.AppSettings.["accessTokenSecret"]
  11.  
  12.         TwitterCredentials.SetCredentials(accessToken, accessTokenSecret, consumerKey, consumerSecret)
  13.         let tweets = Search.SearchTweets(ticker);
  14.         tweets |> Seq.map(fun t -> t.CreatedAt, t.RetweetCount)
  15.  
  16.     member this.GetTweetsForDateRange(ticker: string, startDate: DateTime)=
  17.         let consumerKey = ConfigurationManager.AppSettings.["consumerKey"]
  18.         let consumerSecret = ConfigurationManager.AppSettings.["consumerSecret"]
  19.         let accessToken = ConfigurationManager.AppSettings.["accessToken"]
  20.         let accessTokenSecret = ConfigurationManager.AppSettings.["accessTokenSecret"]
  21.  
  22.         TwitterCredentials.SetCredentials(accessToken, accessTokenSecret, consumerKey, consumerSecret)
  23.         let searchParameter = Search.GenerateSearchTweetParameter(ticker)
  24.         searchParameter.Until <- startDate;
  25.         let tweets = Search.SearchTweets(searchParameter);
  26.         tweets |> Seq.map(fun t -> t.CreatedAt, t.RetweetCount)

So without an out-of-the-box API to use, I thought about using a JSON type provider the way Lincoln Atkinson did.  The problem is that his example is for V1 of the Twitter API, and V1.1 uses OAuth.  If you run his code, you get

image

I then thought about a 3rd-party API that captures tweets.  I ran across Gnip ($500!) and Topsy (no longer accepting new licenses because Apple bought them), so I am back to square one.

So finally I thought about rolling my own (with OAuth being the hard part), but I am quickly running out of time to get ready for the SIG and I don’t want to spend all of it on only this part.

Why isn’t there a Twitter type provider?  I’ll add it to the list….

JavaScript Signature Capture Panel

I am attempting to teach myself some more JavaScript.  To that end, I decided to replicate some of the projects I did in WPF/C# in HTML5/JavaScript.  One of the 1st ‘hello world’ projects I did in WPF was creating a signature panel – so it seemed like a good place to start.  The original blog post is here.  The original WPF project took advantage of the InkCanvas class.  Below is a code snippet of how the events were captured in the original project:

  1. private void inkSignature_MouseDown(object sender, MouseButtonEventArgs e)
  2. {
  3.     IsCapturing = true;
  4.     glyph = new Glyph();
  5.  
  6. }
  7.  
  8. private void inkSignature_MouseUp(object sender, MouseButtonEventArgs e)
  9. {
  10.     IsCapturing = false;
  11.     _signature.Glyphs.Add(glyph);
  12.     startPoint = new Point();
  13.     endPoint = new Point();
  14.  
  15. }
  16.  
  17. private void inkSignature_MouseMove(object sender, MouseEventArgs e)
  18. {
  19.     if (IsCapturing)
  20.     {
  21.         if (startPoint.X == 0 && startPoint.Y == 0 && endPoint.X == 0 && endPoint.Y == 0)
  22.         {
  23.             endPoint = new Point(e.GetPosition(this).X, e.GetPosition(this).Y);
  24.         }
  25.         else
  26.         {
  27.             startPoint = endPoint;
  28.             endPoint = new Point(e.GetPosition(this).X, e.GetPosition(this).Y);
  29.             Line line = new Line(startPoint, endPoint);
  30.             glyph.Lines.Add(line);
  31.         }
  32.  
  33.     }
  34.  
  35. }

To have the same effect in the browser, I swapped out the InkCanvas with the Canvas tag.

  1. <canvas id="myCanvas" width="578" height="200" style="border:solid"></canvas>
  2. <br />
  3. <button id="resultButton" onclick="showSignature()">Show Signature</button>

I then stubbed out the ‘mousedown’, ‘mouseup’, and ‘mousemove’ events to see if I was hooked up to them correctly and they were firing as expected:

  1. <body>
  2.  
  3.     <script>
  4.         canvas.addEventListener('mousemove', function (event) {
  5.         }, false);
  6.  
  7.         canvas.addEventListener('mousedown', function (event) {
  8.             alert("mousedown");
  9.         }, false);
  10.  
  11.         canvas.addEventListener('mouseup', function (event) {
  12.             alert("mouseup");
  13.         }, false);
  14.     </script>
  15.  
  16. </body>

I then thought about how to implement the InkCanvas code in JavaScript, so I added some variables that all of the examples on Stack Overflow use:

  1. var canvas = document.getElementById('myCanvas');
  2. var context = canvas.getContext('2d');

I then needed a function to calculate the mouse position relative to the signature panel (versus the screen).  This was also pretty common on Stack Overflow:

  1. function getMousePosition(canvas, event) {
  2.     var rectangle = canvas.getBoundingClientRect();
  3.     return {
  4.         x: event.clientX - rectangle.left,
  5.         y: event.clientY - rectangle.top
  6.     };
  7. };

Finally, I could implement the WPF-equivalent logic.  First was the variables to maintain state:

  1. var isCapturing = false;
  2. var startX = 0;
  3. var startY = 0;
  4. var endX = 0;
  5. var endY = 0;
  6. var signature = [];
  7. var glyph = [];

And then the 3 event handlers:

  1. canvas.addEventListener('mousemove', function (event) {
  2.     if (isCapturing) {
  3.         var mousePosition = getMousePosition(canvas, event);
  4.  
  5.         if (startX === 0 && startY === 0 && endX === 0 && endY === 0) {
  6.             endX = mousePosition.x;
  7.             endY = mousePosition.y;
  8.         }
  9.         else {
  10.             startX = endX;
  11.             startY = endY;
  12.             endX = mousePosition.x;
  13.             endY = mousePosition.y;
  14.  
  15.             context.beginPath();
  16.             context.moveTo(startX, startY);
  17.             context.lineTo(endX, endY);
  18.             context.stroke()
  19.  
  20.             glyph.push(startX, startY, endX, endY);
  21.         }
  22.     }
  23. }, false);
  24.  
  25. canvas.addEventListener('mousedown', function (event) {
  26.     isCapturing = true;
  27.     glyph = [];
  28. }, false);
  29.  
  30. canvas.addEventListener('mouseup', function (event) {
  31.     isCapturing = false;
  32.     signature.push(glyph);
  33.     var startX = 0;
  34.     var startY = 0;
  35.     var endX = 0;
  36.     var endY = 0;
  37. }, false);

When I ran it, I <almost> got it right:

image

The problem was that the mouseup event was not resetting the starting value of the next point to 0 (the var keywords made those resets local to the handler), so the signature was coming out as 1 long line.  After sleeping on it (my pattern is: write bugs at night, fix them in the morning), I realized I just had to reset the start and end coordinates on mouseup and then inspect them in mousemove.  Here is the complete final code:

  1. <!DOCTYPE html>
  2. <html xmlns="http://www.w3.org/1999/xhtml">
  3. <head>
  4.     <title></title>
  5. </head>
  6. <body>
  7.     <canvas id="myCanvas" width="578" height="200" style="border:solid"></canvas>
  8.     <br />
  9.     <button id="resultButton" onclick="showSignature()">Show Signature</button>
  10.  
  11.  
  12.     <script>
  13.         function showSignature() {
  14.             alert(signature.length);
  15.         };
  16.     </script>
  17.  
  18.     <script>
  19.         var canvas = document.getElementById('myCanvas');
  20.         var context = canvas.getContext('2d');
  21.         var isCapturing = false;
  22.         var startX = 0;
  23.         var startY = 0;
  24.         var endX = 0;
  25.         var endY = 0;
  26.         var signature = [];
  27.         var glyph = [];
  28.  
  29.         function getMousePosition(canvas, event) {
  30.             var rectangle = canvas.getBoundingClientRect();
  31.             return {
  32.                 x: event.clientX - rectangle.left,
  33.                 y: event.clientY - rectangle.top
  34.             };
  35.         };
  36.  
  37.         canvas.addEventListener('mousemove', function (event) {
  38.             if (isCapturing) {
  39.                 var mousePosition = getMousePosition(canvas, event);
  40.  
  41.                 if (endX === 0 && endY === 0) {
  42.                     endX = mousePosition.x;
  43.                     endY = mousePosition.y;
  44.                 }
  45.                 else {
  46.                     startX = endX;
  47.                     startY = endY;
  48.                     endX = mousePosition.x;
  49.                     endY = mousePosition.y;
  50.  
  51.                     context.beginPath();
  52.                     context.moveTo(startX, startY);
  53.                     context.lineTo(endX, endY);
  54.                     context.stroke()
  55.  
  56.                     glyph.push(startX, startY, endX, endY);
  57.                 }
  58.             }
  59.         }, false);
  60.  
  61.         canvas.addEventListener('mousedown', function (event) {
  62.             isCapturing = true;
  63.             glyph = [];
  64.  
  65.             var mousePosition = getMousePosition(canvas, event);
  66.             startX = mousePosition.x;
  67.             startY = mousePosition.y;
  68.         }, false);
  69.  
  70.         canvas.addEventListener('mouseup', function (event) {
  71.             isCapturing = false;
  72.             signature.push(glyph);
  73.  
  74.             startX = 0;
  75.             startY = 0;
  76.             endX = 0;
  77.             endY = 0;
  78.         }, false);
  79.     </script>
  80.  
  81. </body>
  82. </html>

And here it is in action:

image

Now all I have to do is put the points into the same data structures that I used in the WPF project: Signature –> Glyphs[] –> Lines[] –> Line.StartPoint && Line.EndPoint.


Apriori Algorithm and F# Using Elevator Inspection Data

Now that I have the elevator dataset in a workable state, I wanted to see what I could see with the data.  I was reading Machine Learning In Action and the authors suggested the Apriori algorithm as a way to quantify associations among data points.  I read both Harrington’s code and Wikipedia’s description and I found both to be impenetrable – the former because the code was unreadable and the latter because the mathematical formulas depended on a level of algebra that I don’t have.

Fortunately, I found a C# project on Codeproject that had both an excellent example/introduction and C# code.  I used the examples on the website to formulate my F# implementation.

The first thing I did was create a class that matched the 1st grid in the example

image

  1. namespace ChickenSoftware.ElevatorChicken.Analysis
  2.  
  3. open System.Collections.Generic
  4.  
  5. type Transaction = {TID: string; Items: List<string> }
  6.  
  7. type Apriori(database: List<Transaction>, support: float, confidence: float) =
  8.     member this.Database = database
  9.     member this.Support = support
  10.     member this.Confidence = confidence

Note that because F# is immutable by default, the properties are read-only.  I then created a unit test project that makes sure the constructor works without exceptions.  The data matches the example:

  1. public AprioriTests()
  2. {
  3.     var database = new List<Transaction>();
  4.     database.Add(new Transaction("100", new List<string>() { "A", "C", "D" }));
  5.     database.Add(new Transaction("200", new List<string>() { "B", "C", "E" }));
  6.     database.Add(new Transaction("300", new List<string>() { "A", "B", "C", "E" }));
  7.     database.Add(new Transaction("400", new List<string>() { "B", "E" }));
  8.  
  9.     _apriori = new Apriori(database, .5, .80);
  10.  
  11. }
  12.  
  13. [TestMethod]
  14. public void ConstructorUsingValidArguments_ReturnsExpected()
  15. {
  16.     Assert.IsNotNull(_apriori);
  17. }

I then needed a function to count up all of the items in the itemsets.  I refused to use loops, so I first started using Seq.fold, but I was having zero luck because I was trying to fold a Seq of Lists.  I then started experimenting with other functions when I found Seq.collect – which was perfect.  So I created a function like this:

  1. member this.GetC1() =
  2.     database
  3.  
  4. member this.GetL1() =
  5.     let numberOfTransactions = this.GetC1().Count
  6.  
  7.     this.GetC1()
  8.         |> Seq.collect(fun d -> d.Items)
  9.         |> Seq.countBy(fun i -> i)
  10.         |> Seq.map(fun (t,i) -> t, i, float i/ float numberOfTransactions)
  11.         |> Seq.filter(fun (t,i,p) -> p >= support)
  12.         |> Seq.map(fun (t,i,p) -> t,i)
  13.         |> Seq.sort
  14.         |> Seq.toList

Note that the numberOfTransactions is for the database, not the individual items in the List<Item>.  And the results match the example:

imageimage

So this is great.  My next stop was to build a list of pair combinations of the remaining values

image

The trick is that it is not a Cartesian join of the original sets – only the surviving sets are needed.  My first attempt looked like this:

  1. let C1 = database
  2.  
  3. let L1 = C1
  4.         |> Seq.map(fun t -> t.Items)
  5.         |> Seq.collect(fun i -> i)
  6.         |> Seq.countBy(fun i -> i)
  7.         |> Seq.map(fun (t,i) -> t, i, float i/ float numberOftransactions)
  8.         |> Seq.filter(fun (t,i,p) -> p >= support)
  9.         |> Seq.toArray
  10. let C2A = L1
  11.             |> Seq.map(fun (x,y,z) -> x)
  12.             |> Seq.toArray
  13. let C2B = L1
  14.             |> Seq.map(fun (x,y,z) -> x)
  15.             |> Seq.toArray
  16. let C2 = C2A |> Seq.collect(fun x -> C2B |> Seq.map(fun y -> x+y))
  17. C2   

With the output like this:

image

I was running out of Saturday morning, so I went over to Stack Overflow and got a couple of responses.  I was on the right track with the concat, but I didn’t think about List.filter(), which would prune my list.  With this in mind, I adapted Mark’s code and got what I was looking for:

  1. member this.GetC2() =
  2.     let l1Itemset = this.GetL1()
  3.                     |> Seq.map(fun (i,s) -> i)
  4.  
  5.     let itemset =
  6.         l1Itemset
  7.             |> Seq.map(fun x -> l1Itemset |> Seq.map(fun y -> (x,y)))
  8.             |> Seq.concat
  9.             |> Seq.filter(fun (x,y) -> x < y)
  10.             |> Seq.sort
  11.             |> Seq.toList         
  12.     
  13.     let listContainsItem(l:List<string>, a,b) =
  14.             l.Contains(a) && l.Contains(b)
  15.     
  16.     let someFunctionINeedToRename(l1:List<string>, l2)=
  17.             l2 |> Seq.map(fun (x,y) -> listContainsItem(l1,x,y))
  18.  
  19.     let itemsetMatches = this.GetC1()
  20.                             |> Seq.map(fun t -> t.Items)
  21.                             |> Seq.map(fun i -> someFunctionINeedToRename(i,itemset))
  22.  
  23.     let itemSupport = itemsetMatches
  24.                             |> Seq.map(Seq.map(fun i -> if i then 1 else 0))
  25.                             |> Seq.reduce(Seq.map2(+))
  26.  
  27.     itemSupport
  28.         |> Seq.zip(itemset)
  29.         |> Seq.toList

So now I have C2 filling correctly:

image

 

Taking the results, I needed to get L2.

image

That was much simpler than getting C2 – here is the code:

  1. member this.GetL2() =
  2.     let numberOfTransactions = this.GetC1().Count
  3.     
  4.     this.GetC2()
  5.             |> Seq.map(fun (i,n) -> i,n,float n/float numberOfTransactions)
  6.             |> Seq.filter(fun (i,n,p) -> p >= support)
  7.             |> Seq.map(fun (t,i,p) -> t,i)
  8.             |> Seq.sort
  9.             |> Seq.toList    

And when I run it – it matches this example exactly:

image

Finally, I added in a C3 and L3.  This code is identical to the C2/L2 code with one exception: mapping a triple and not a pair.  The C2 code maps like this:

  1. let itemset =
  2.     l1Itemset
  3.         |> Seq.map(fun x -> l1Itemset |> Seq.map(fun y -> (x,y)))
  4.         |> Seq.concat
  5.         |> Seq.filter(fun (x,y) -> x < y)
  6.         |> Seq.sort
  7.         |> Seq.toList     

and the C3 code looks like this (took me 15 minutes to figure out line 3 below):

  1. let itemset =
  2.     l2Itemset
  3.         |> Seq.map(fun x -> l2Itemset |> Seq.map(fun y-> l2Itemset |> Seq.map(fun z->(fst x,fst y,snd z))))
  4.         |> Seq.concat
  5.         |> Seq.collect(fun d -> d)
  6.         |> Seq.filter(fun (x,y,z) -> x < y && y < z)
  7.         |> Seq.distinct
  8.         |> Seq.sort
  9.         |> Seq.toList    

With the C3 and L3 matching the example also:

image

image
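As an aside, the hand-written pair and triple generators above could in principle be collapsed into one recursive k-combination function.  A sketch (the function name and shape are mine, not from the CodeProject example):

```fsharp
// All k-element combinations of a list, preserving order, so
// combinations 2 yields the C2 pairs and combinations 3 the C3 triples
let rec combinations k items =
    match k, items with
    | 0, _ -> [ [] ]
    | _, [] -> []
    | k, x :: rest ->
        (combinations (k - 1) rest |> List.map (fun c -> x :: c))
        @ combinations k rest

combinations 2 ["A"; "B"; "C"]
// [["A"; "B"]; ["A"; "C"]; ["B"; "C"]]
```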

 

I was now ready to put the elevator data into the analysis.  I think I am getting better at F# because I did the mapping, filtering, and transformation of the data from the server without looking at any other material, and it took only 15 minutes.

  1. type public ElevatorBuilder() =
  2.     let connectionString = ConfigurationManager.ConnectionStrings.["localData2"].ConnectionString;
  3.  
  4.     member public this.GetElevatorTransactions() =
  5.         let transactions = this.GetElevators()
  6.                               |> Seq.map(fun e ->this.ConvertElevatorToTransaction(e))
  7.         let transactionsList = new System.Collections.Generic.List<Transaction>(transactions)
  8.         transactionsList
  9.  
  10.     member public this.ConvertElevatorToTransaction(i: string, t:string, c:string, s:string) =
  11.         let items = new System.Collections.Generic.List<String>()
  12.         items.Add(t)
  13.         items.Add(c)
  14.         items.Add(s)
  15.         let transaction = {TID=i; Items=items}
  16.         transaction
  17.  
  18.     member public this.GetElevators () =
  19.         SqlConnection.GetDataContext(connectionString).ElevatorData201402
  20.             |> Seq.map(fun e -> e.ID, e.EquipType,e.Capacity,e.Speed)
  21.             |> Seq.filter(fun (i,et,c,s) -> not(String.IsNullOrEmpty(et)))
  22.             |> Seq.filter(fun (i,et,c,s) -> c.HasValue)
  23.             |> Seq.filter(fun (i,et,c,s) -> s.HasValue)
  24.             |> Seq.map(fun (i,t,c,s) -> i, this.CatagorizeEquipmentType(t),c,s)
  25.             |> Seq.map(fun (i,t,c,s) -> i,t,this.CatagorizeCapacity(c.Value),s)
  26.             |> Seq.map(fun (i,t,c,s) -> i,t,c,this.CatagorizeSpeed(s.Value))
  27.             |> Seq.map(fun (i,t,c,s) -> i.ToString(),t,c,s)
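The core of GetElevators is the filter-then-unwrap dance with the nullable columns. A self-contained sketch of just that step, assuming rows shaped like the tuples above (the sample values are invented):

```fsharp
open System

// Drop rows with a missing equipment type, capacity, or speed, then unwrap the Nullables.
let clean (rows: seq<int * string * Nullable<int> * Nullable<int>>) =
    rows
    |> Seq.filter (fun (_, et, c, s) ->
        not (String.IsNullOrEmpty et) && c.HasValue && s.HasValue)
    |> Seq.map (fun (i, et, c, s) -> i, et, c.Value, s.Value)
    |> Seq.toList

let sample =
    [ (1, "OTIS",  Nullable 2500,   Nullable 200)
      (2, "",      Nullable 2500,   Nullable 200)    // no equipment type
      (3, "DOVER", Nullable<int>(), Nullable 150) ]  // no capacity
// clean sample = [(1, "OTIS", 2500, 200)]
```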

The longest part was aggregating the free-form text of the Equipment Type field (here is a partial snip; you get the idea…)

  1. member public this.CatagorizeEquipmentType(et: string) =
  2.     match et.Trim() with
  3.         | "OTIS" -> "OTIS"
  4.         | "OTIS (1-2)" -> "OTIS"
  5.         | "OTIS (2-1)" -> "OTIS"
  6.         | "OTIS hydro" -> "OTIS"
  7.         | "OTIS, HYD" -> "OTIS"
  8.         | "OTIS/ ASHEVILLE " -> "OTIS"
  9.         | "OTIS/ MOUNTAIN " -> "OTIS"
  10.         | "OTIS/#1" -> "OTIS"
  11.         | "OTIS/#19 " -> "OTIS"
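Since every variant in that snip starts with "OTIS", a prefix check could collapse the whole list. This is my sketch, not the post's code, and unlike the exact-match version it also uppercases any unmatched value:

```fsharp
// Normalize free-form equipment-type text by prefix instead of exhaustive matching.
let categorizeEquipmentType (et: string) =
    let t = et.Trim().ToUpperInvariant()
    if t.StartsWith "OTIS" then "OTIS"
    else t
// categorizeEquipmentType "OTIS/ ASHEVILLE " = "OTIS"
```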

Assigning categories for speed and capacity was a snap using F#:

  1. member public this.CatagorizeCapacity(c: int) =
  2.     let lowerBound = (c/25 * 25) + 1
  3.     let upperBound = lowerBound + 24
  4.     lowerBound.ToString() + "-" + upperBound.ToString()        
  5.  
  6. member public this.CatagorizeSpeed(s: int) =
  7.     let lowerBound = (s/50 * 50) + 1
  8.     let upperBound = lowerBound + 49
  9.     lowerBound.ToString() + "-" + upperBound.ToString()    
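Worked on a couple of sample values: integer division snaps a number to the bottom of its band, and the +1 shifts to a 1-based label. Note that an exact multiple of the band size (say a capacity of 2500) lands in the next band up, which may or may not be intended:

```fsharp
// Same arithmetic as CatagorizeCapacity above, reproduced for the worked examples.
let catagorizeCapacity (c: int) =
    let lowerBound = (c / 25 * 25) + 1   // integer division floors to the band start
    sprintf "%d-%d" lowerBound (lowerBound + 24)

// catagorizeCapacity 30   = "26-50"
// catagorizeCapacity 2510 = "2501-2525"
// catagorizeCapacity 2500 = "2501-2525"  (the boundary quirk)
```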

With this in hand, I created a Console app that takes the 27K records and pushes them through the Apriori algorithm:

  1. private static void RunElevatorAnalysis()
  2. {
  3.     Stopwatch stopwatch = new Stopwatch();
  4.     stopwatch.Start();
  5.     ElevatorBuilder builder = new ElevatorBuilder();
  6.     var transactions = builder.GetElevatorTransactions();
  7.     stopwatch.Stop();
  8.     Console.WriteLine("Building " + transactions.Count + " transactions took: " + stopwatch.Elapsed.TotalSeconds);
  9.     var apriori = new Apriori(transactions, .1, .75);
  10.     var c2 = apriori.GetC2();
  11.     stopwatch.Reset();
  12.     stopwatch.Start();
  13.     var l1 = apriori.GetL1();
  14.     Console.WriteLine("Getting L1 took: " + stopwatch.Elapsed.TotalSeconds);
  15.     var l2 = apriori.GetL2();
  16.     Console.WriteLine("Getting L2 took: " + stopwatch.Elapsed.TotalSeconds);
  17.     var l3 = apriori.GetL3();
  18.     Console.WriteLine("Getting L3 took: " + stopwatch.Elapsed.TotalSeconds);
  19.     stopwatch.Stop();
  20.     Console.WriteLine("–L1");
  21.     foreach (var t in l1)
  22.     {
  23.         Console.WriteLine(t.Item1 + ":" + t.Item2);
  24.     }
  25.     Console.WriteLine("–L2");
  26.     foreach (var t in l2)
  27.     {
  28.         Console.WriteLine(t.Item1 + ":" + t.Item2);
  29.     }
  30.     Console.WriteLine("–L3");
  31.     foreach (var t in l3)
  32.     {
  33.         Console.WriteLine(t.Item1 + ":" + t.Item2);
  34.     }
  35. }

I then made an offering to the F# Gods and hit F5:

[Screenshot: StackOverflowException]

Doh!  The gods were not pleased.  I then went back to my initial filtering function, added a Seq.take(25000), and got these results:

[Screenshot: results after limiting to 25,000 records]

So there are a couple of things to draw from this exercise.

1) The Apriori algorithm is the wrong technique for this dataset.  I had to bring the support way down (10%) to get any readings at all.  Also, there is too much dispersion in the values.  This kind of algorithm works much better on many transactions drawn from a small set of distinct values than on transactions whose fields take on a large number of distinct values.

2) Even so, how cool is this?  Compare the files needed just to make the C#/OO version work versus the F# version:

[Screenshots: C# project files vs. F# project files]

And the total LOC is 539 for C# versus 120 for F# – and the F# can still be trimmed with a better way to create the candidate itemsets.  Hard-coding each level was a hack I did to get things working and give me an understanding of how the Apriori algorithm works.  I bet this can be consolidated to well under 75 lines without sacrificing readability.

3) I think the StackOverflowException is because I am doing a Cartesian join and then paring down the result.  Using one of the other techniques suggested on SO should give much better results.
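One such fix (my sketch, not any particular SO answer) is to build the ordered pairs directly from indices, so the pairs that would be filtered away are never generated at all:

```fsharp
// Ordered pairs without the full Cartesian product: only i < j combinations are produced.
let orderedPairs (xs: 'a list) =
    [ for i in 0 .. xs.Length - 2 do
        for j in i + 1 .. xs.Length - 1 do
            yield xs.[i], xs.[j] ]
// orderedPairs [1; 2; 3] = [(1, 2); (1, 3); (2, 3)]
```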

In any event, what a fun project!  I can’t wait to optimize this and perhaps throw a different algorithm at the dataset in the coming weeks.

 

 

 

Elevator App: Part 1 – Data Layer Using F#

 

At Open Data Day, fellow TRINUGer Elaine Cahill told me about a website where you can get all of the elevator inspection data for the state.  It is found here.  She went ahead and put the Wake County data onto Socrata.  I wanted to look at the entire state, so I went to the report page like so:

 

[Screenshot: elevator report download page]

Unfortunately, when you try to pull down the entire state, you cause a server exception:

 

[Screenshot: server exception]

 

So I split the download in half.  I then imported it into Access and SSISed it into Azure SQL.  Next, I created a project to serve the data, and I decided to use F# type providers as a replacement for Entity Framework as my ORM.  I could use either the SqlEntity TP or the SqlDataConnection TP to access the SQL database on Azure.  Neither works out of the box.

SqlDataConnection

I could not get SqlDataConnection to work at all.  When I hooked it up to a standard connection string in the config file, I got:

[Screenshot: connection string error]

When I copied and pasted the connection string into the TP directly, it did make the connection to Azure, but then it came back with this exception:

[Screenshot: exception from the TP]

Without looking at the source, my guess is that the TP has a hard-coded reference to ‘syscomments’ and, alas, Azure does not have that table.

SqlEntity

I then headed over to the SqlEntity TP to see if I would have better luck.  Fortunately, SqlEntity works with an Azure connection string in the .config file and can make a connection to an Azure database.

The problem I ran into was when I wanted to expose the SqlConnection to the WebAPI project that I wrote in C#.  You cannot mark SqlEntity TPs as public:

[Screenshot: compiler error when marking SqlEntity types public]

Note that the SqlDataConnection can be marked as public. <sigh>.  I marked the SqlEntityTP as internal and then created a POCO to map between the SqlEntity type and a type that can be consumed by the outside world:

  1. type public Elevator ={
  2.         ID: int
  3.         County: string
  4.         StateId: string
  5.         Type: string
  6.         Operation: string
  7.         Owner: string
  8.         O_Address1: string
  9.         O_Address2: string
  10.         O_City: string
  11.         O_State: string
  12.         O_Zip: string
  13.         User: string
  14.         U_Address1: string
  15.         U_Address2: string
  16.         U_City: string
  17.         U_State: string
  18.         U_Zip: string
  19.         U_Lat: double
  20.         U_Long: double
  21.         Installed: DateTime
  22.         Complied: DateTime
  23.         Capacity: int
  24.         CertStatus: int
  25.         EquipType: string
  26.         Drive: string
  27.         Volts: string
  28.         Speed: int
  29.         FloorTo: string
  30.         FloorFrom: string
  31.         Landing: string
  32.         Entrances: string
  33.         Ropes: string
  34.         RopeSize: string
  35.     }
  36.  
  37. type public DataRepository() =
  38.     let connectionString = ConfigurationManager.ConnectionStrings.["azureData"].ConnectionString;
  39.  
  40.     member public this.GetElevators () =
  41.         SqlConnection.GetDataContext(connectionString).ElevatorData201402
  42.         |> Seq.map(fun x -> this.GetElevatorFromElevatorData(x))
  43.  
  44.     member public this.GetElevator (id: int) =
  45.         SqlConnection.GetDataContext(connectionString).ElevatorData201402
  46.         |> Seq.where(fun x -> x.ID = id)
  47.         |> Seq.map(fun x -> this.GetElevatorFromElevatorData(x))
  48.         |> Seq.head
  49.  
  50.     member internal this.GetElevatorFromElevatorData(elevatorData: SqlConnection.ServiceTypes.ElevatorData201402) =
  51.         let elevator = {ID= elevatorData.ID;
  52.             County=elevatorData.County;
  53.             StateId=elevatorData.StateID;
  54.             Type=elevatorData.Type;
  55.             Operation=elevatorData.Operation;
  56.             Owner=elevatorData.Owner;
  57.             O_Address1=elevatorData.O_Address1;
  58.             O_Address2=elevatorData.O_Address2;
  59.             O_City=elevatorData.O_City;
  60.             O_State=elevatorData.O_St;
  61.             O_Zip=elevatorData.O_Zip;
  62.             User=elevatorData.User;
  63.             U_Address1=elevatorData.U_Address1;
  64.             U_Address2=elevatorData.U_Address2;
  65.             U_City=elevatorData.U_City;
  66.             U_State=elevatorData.U_St;
  67.             U_Zip=elevatorData.U_Zip;
  68.             U_Lat=elevatorData.U_lat;
  69.             U_Long=elevatorData.U_long;
  70.             Installed=elevatorData.Installed.Value;
  71.             Complied=elevatorData.Complied.Value;
  72.             Capacity=elevatorData.Capacity.Value;
  73.             CertStatus=elevatorData.CertStatus.Value;
  74.             EquipType=elevatorData.EquipType;
  75.             Drive=elevatorData.Drive;
  76.             Volts=elevatorData.Volts;
  77.             Speed=int elevatorData.Speed;
  78.             FloorTo=elevatorData.FloorTo;
  79.             FloorFrom=elevatorData.FloorFrom;
  80.             Landing=elevatorData.Landing;
  81.             Entrances=elevatorData.Entrances;
  82.             Ropes=elevatorData.Ropes;
  83.             RopeSize=elevatorData.RopeSize
  84.         }
  85.         elevator

I am not happy about writing any of this code.  I have 84 lines of code for a single class; I might as well have used the code gen of EF.  I could have taken the performance hit and used System.Reflection to map fields of the same names (I have done that on other projects), but that also feels like a hack.  In any event, I then added a reference to my F# project in my C# WebAPI project.  I did have to add a reference to FSharp.Core in the C# project (which further vexed me), but then I created a couple of GET methods to expose the data:
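For reference, the reflection approach mentioned above can be sketched in a few lines. This is a hypothetical helper, not the code I shipped: there is no null/option handling, and it assumes every record field has a same-named property on the source object.

```fsharp
open Microsoft.FSharp.Reflection

// Build an F# record by pulling same-named properties off any source object.
let mapToRecord<'T> (source: obj) : 'T =
    let srcProps =
        source.GetType().GetProperties()
        |> Array.map (fun p -> p.Name, p)
        |> dict
    let values =
        FSharpType.GetRecordFields typeof<'T>          // record fields in declaration order
        |> Array.map (fun f -> srcProps.[f.Name].GetValue(source))
    FSharpValue.MakeRecord(typeof<'T>, values) :?> 'T
```

It trades the 84 lines for one helper, at the cost of a reflection hit per row and no compile-time checking of the field names.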

 

  1. public class ElevatorController : ApiController
  2. {
  3.     // GET api/Elevator
  4.     public IEnumerable<Elevator> Get()
  5.     {
  6.         DataRepository repository = new DataRepository();
  7.         return repository.GetElevators();
  8.     }
  9.  
  10.     // GET api/Elevator/5
  11.     public Elevator Get(int id)
  12.     {
  13.         DataRepository repository = new DataRepository();
  14.         return repository.GetElevator(id);
  15.     }
  16.  
  17. }

 

When I viewed the JSON from a handy browser, it looked like, well, junk:

[Screenshot: JSON with x0040-suffixed field names]

So now I had to get rid of those random characters (the x0040 suffix), which meant yet a third POCO, this one in C#:

  1. public class ElevatorController : ApiController
  2. {
  3.     // GET api/Elevator
  4.     public IEnumerable<CS.Elevator> Get()
  5.     {
  6.         List<CS.Elevator> elevators = new List<CS.Elevator>();
  7.         FS.DataRepository repository = new FS.DataRepository();
  8.         var fsElevators = repository.GetElevators();
  9.         foreach (var fsElevator in fsElevators)
  10.         {
  11.             elevators.Add(GetElevatorFromFSharpElevator(fsElevator));
  12.         }
  13.         return elevators;
  14.     }
  15.  
  16.     // GET api/Elevator/5
  17.     public CS.Elevator Get(int id)
  18.     {
  19.         FS.DataRepository repository = new FS.DataRepository();
  20.         return GetElevatorFromFSharpElevator(repository.GetElevator(id));
  21.     }
  22.  
  23.     internal CS.Elevator GetElevatorFromFSharpElevator(FS.Elevator fsElevator)
  24.     {
  25.         CS.Elevator elevator = new CS.Elevator();
  26.         elevator.ID = fsElevator.ID;
  27.         elevator.County = fsElevator.County;
  28.         elevator.StateId = fsElevator.StateId;
  29.         elevator.Type = fsElevator.Type;
  30.         elevator.Operation = fsElevator.Operation;
  31.         elevator.Owner = fsElevator.Owner;
  32.         elevator.O_Address1 = fsElevator.O_Address1;
  33.         elevator.O_Address2 = fsElevator.O_Address2;
  34.         elevator.O_City = fsElevator.O_City;
  35.         elevator.O_State = fsElevator.O_State;
  36.         elevator.O_Zip = fsElevator.O_Zip;
  37.         elevator.User = fsElevator.User;
  38.         elevator.U_Address1 = fsElevator.U_Address1;
  39.         elevator.U_Address2 = fsElevator.U_Address2;
  40.         elevator.U_City = fsElevator.U_City;
  41.         elevator.U_State = fsElevator.U_State;
  42.         elevator.U_Zip = fsElevator.U_Zip;
  43.         elevator.Installed = fsElevator.Installed;
  44.         elevator.Complied = fsElevator.Complied;
  45.         elevator.Capacity = fsElevator.Capacity;
  46.         elevator.CertStatus = fsElevator.CertStatus;
  47.         elevator.EquipType = fsElevator.EquipType;
  48.         elevator.Drive = fsElevator.Drive;
  49.         elevator.Volts = fsElevator.Volts;
  50.         elevator.Speed = fsElevator.Speed;
  51.         elevator.FloorTo = fsElevator.FloorTo;
  52.         elevator.FloorFrom = fsElevator.FloorFrom;
  53.         elevator.Landing = fsElevator.Landing;
  54.         elevator.Entrances = fsElevator.Entrances;
  55.         elevator.Ropes = fsElevator.Ropes;
  56.         elevator.RopeSize = fsElevator.RopeSize;
  57.         return elevator;
  58.     }
  59.  
  60. }

 

So that gives me what I want:

[Screenshot: clean JSON output]

As a side note, I learned the hard way that the only way to force the SqlEntity TP to update after a schema change in the DB is to change the connection string in the .config file.

Finally, when I published the WebAPI project to Azure, I got an exception. 

<Error><Message>An error has occurred.</Message><ExceptionMessage>Could not load file or assembly 'FSharp.Core, Version=4.3.1.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a' or one of its dependencies. The system cannot find the file specified.</ExceptionMessage><ExceptionType>System.IO.FileNotFoundException</ExceptionType><StackTrace> at System.Web.Http.ApiController.<InvokeActionWithExceptionFilters>d__1.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at System.Web.Http.Dispatcher.HttpControllerDispatcher.<SendAsync>d__0.MoveNext()</StackTrace></Error>

Turns out you need to not only add a reference to the F# project and FSharp.Core, you have to deploy the .dlls to Azure also.  Thanks to hocho on SO for that one.

In conclusion, I love the promise of TPs.  I want nothing more than to throw away all of the EF code gen, .tt files, seeding-for-code-first nonsense, etc… and replace it with a single-line TP.  I have done this on a local project, but when I did it with Azure, things were harder than they should have been.  Since it is easier to throw hand grenades than to catch them, I made a list of the things I want to help the open-source FSharp.Data project accomplish in the coming months:

1) SqlDataConnection working with Azure SQL Database

2) MSAccessConnection needed

3) ActiveDirectoryConnection needed

4) Json and WsdlService ability to handle proxies

5) SqlEntityConnection exposing classes publicly

Regardless of what the open-source community does, MSFT will still have to make a better commitment to F# on Azure, IMHO…