Association Rule Problem: Part 3

After spending a couple of weeks working through the imperative code, I decided to approach the problem from an F#/functional point of view.  Going back to the original article, there are several steps that McCaffrey walks through:

  • Get a series of transactions
  • Get the frequent item-sets for the transactions
  • For each item-set, get all possible combinations.  Each combination is broken into an antecedent and a consequent
  • Count the frequency of each antecedent across all transactions
  • If the item-set’s frequency divided by the antecedent’s frequency is greater than the minimum confidence level, include the rule in the final set

For the purposes of this article, Step #1 and Step #2 were already done.  My code starts with step #3.  Instead of for..eaching and if..thening my way through the item-sets, I decided to look at how permutations and combinations are done in F#.  Interestingly, one of the first articles on permutations and combinations that Google turns up is from McCaffrey in MSDN from four years ago.  Unfortunately, that article was of limited use because the code is decidedly non-functional, so it might as well have been written in C# (this was pointed out in the comments).  Stack Overflow, on the other hand, has plenty of good examples of getting combinations in F#.  After playing with the code samples for a bit (my favorite one was this), it hit me that the ordinal positions are the same for any array of a given size.  Going back to McCaffrey’s example, there are only item-sets of length two and three, so I can hard-code the results and leave the general calculation for another time.

static member GetCombinationsForDouble(itemSet: int[]) =
    let combinations = new List<int[]*int[]*int[]>()
    combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1]|])
    combinations

static member GetCombinationsForTriple(itemSet: int[]) =
    let combinations = new List<int[]*int[]*int[]>()
    combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1];itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[1]|],[|itemSet.[0];itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[2]|],[|itemSet.[0];itemSet.[1]|])
    combinations.Add(itemSet, [|itemSet.[0];itemSet.[1]|],[|itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[0];itemSet.[2]|],[|itemSet.[1]|])
    combinations.Add(itemSet, [|itemSet.[1];itemSet.[2]|],[|itemSet.[0]|])
    combinations

I used a tuple to hold the item-set, the antecedent array, and the consequent array.  I then spun up a unit test to compare results based on McCaffrey’s detailed example:

[TestMethod]
public void GetValuesForATriple_ReturnsExpectedValue()
{
    var expected = new List<Tuple<int[], int[]>>();
    expected.Add(Tuple.Create<int[], int[]>(new int[1] { 3 }, new int[2] { 4, 7 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[1] { 4 }, new int[2] { 3, 7 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[1] { 7 }, new int[2] { 3, 4 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[2] { 3, 4 }, new int[1] { 7 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[2] { 3, 7 }, new int[1] { 4 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[2] { 4, 7 }, new int[1] { 3 }));

    var itemSet = new int[3] { 3, 4, 7 };
    var actual = FS.AssociationRuleProgram2.GetCombinationsForTriple(itemSet);

    Assert.AreEqual(expected.Count, actual.Count);
}

A couple of things to note about the unit test:

1) The rules about variable naming and whatnot that apply in business application development quickly fall down when applied to scientific computing.  For example, there is no way that this

List<Tuple<int[], int[]>> expected = new List<Tuple<int[], int[]>>();

is more readable than this

var expected = new List<Tuple<int[], int[]>>();

In fact, it is less readable.  The use of complex data structures and algorithms forces a different set of naming conventions.  Applying FxCop or other framework naming conventions to scientific programming is as useful as applying scientific naming conventions to framework development.  If it is a screw, use a screwdriver.  If it is a nail, use a hammer…

2) I don’t have a good way of comparing the results of a tuple of paired arrays for equivalence – there is certainly nothing out of the box in Microsoft.VisualStudio.TestTools.UnitTesting.  I toyed (briefly) with creating a method to compare equivalence in arrays but, in the interest of time, I did not.  That would be a welcome addition to the testing namespace.
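Interestingly, if the comparison were done on the F# side, structural equality would handle it; a minimal sketch, assuming the expected and actual values are expressed as F# tuples of arrays:

  let expected = ([|3|], [|4; 7|])
  let actual   = ([|3|], [|4; 7|])
  // F# (=) compares tuples and arrays structurally, element by element
  let equivalent = (expected = actual)   // true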

Sure enough, the unit tests using McCaffrey’s data all run green.

With step 3 knocked out, I now needed to determine the frequency of the antecedent in the transactions list.  This step is better broken down into a couple of sub-steps.  I used McCaffrey’s detailed example of 3,4,7 as proof of correctness in my unit tests:

[image]

I need a way of taking the antecedent of 3 and comparing it to all transactions (which are arrays) to see how often it appears.  As an additional layer of complexity, that 3 is not an int, it is an array (albeit an array of one).  I could not find an equivalent question on Stack Overflow (meaning I am probably asking the wrong question), so I went ahead and made a mental model where I map the tryFindIndex function against each item of the subset to see if that value is in the original set.  The result is a tuple with the original value and its ordinal position in the set.  The key thing is that if the item was not found, tryFindIndex returns None.  So I just have to filter on that flag, and if the length of the filtered result is greater than zero, I know that something was not found and the function can return false.

[image]


In code, it pretty much looks like the way I just described it:

static member SetContainsSubset(set: int[], subset: int[]) =
    let notIncluded = subset
                        |> Seq.map(fun i -> i, set |> Seq.tryFindIndex(fun j -> j = i))
                        |> Seq.filter(fun (i,j) -> j = None )
                        |> Seq.toArray
    if notIncluded.Length > 0 then false else true
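The same check can also be sketched more directly with Array.forall and Array.exists – every item of the subset has to appear somewhere in the set:

  let setContainsSubset (set: int[]) (subset: int[]) =
      // true only when every subset item exists in the set
      subset |> Array.forall (fun i -> set |> Array.exists (fun j -> j = i))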

And I generated my unit tests out of the example too: 

[TestMethod]
public void SetContainsSubsetUsingMatched_ReturnsTrue()
{
    var set = new int[4] { 1, 3, 4, 7 };
    var subset = new int[3] { 3, 4, 7 };

    Boolean expected = true;
    Boolean actual = FS.AssociationRuleProgram2.SetContainsSubset(set, subset);

    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void SetContainsSubsetUsingUnMatched_ReturnsFalse()
{
    var set = new int[3] { 1, 4, 7 };
    var subset = new int[3] { 3, 4, 7 };

    Boolean expected = false;
    Boolean actual = FS.AssociationRuleProgram2.SetContainsSubset(set, subset);

    Assert.AreEqual(expected, actual);
}

With this supporting function ready, I can then apply it to all of the transactions and see how many trues I get.  That is the Count value in Figure 2 of the article.  Seq.map fits this task perfectly.

static member ItemSetCountInTransactions(itemSet: int[], transactions: List<int[]>) =
    transactions
        |> Seq.map(fun (t) -> t, AssociationRuleProgram2.SetContainsSubset(t,itemSet))
        |> Seq.filter(fun (t,f) -> f = true)
        |> Seq.length
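As a side note, the intermediate tuple is not strictly necessary; the same count can be sketched with a direct filter:

  let itemSetCountInTransactions (itemSet: int[]) (transactions: seq<int[]>) =
      transactions
      |> Seq.filter (fun t -> AssociationRuleProgram2.SetContainsSubset(t, itemSet))
      |> Seq.length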

And the subsequent unit test also runs green:

[TestMethod]
public void CountItemSetInTransactions_ReturnsExpected()
{
    List<int[]> transactions = new List<int[]>();
    transactions.Add(new int[] { 0, 3, 4, 11 });
    transactions.Add(new int[] { 1, 4, 5 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 0, 5 });
    transactions.Add(new int[] { 3, 5, 9 });
    transactions.Add(new int[] { 2, 3, 4, 7 });
    transactions.Add(new int[] { 2, 5, 8 });
    transactions.Add(new int[] { 0, 1, 2, 5, 10 });
    transactions.Add(new int[] { 2, 3, 5, 6, 7, 9 });

    var itemSet = new int[1] { 3 };

    Int32 expected = 6;
    Int32 actual = FS.AssociationRuleProgram2.ItemSetCountInTransactions(itemSet, transactions);

    Assert.AreEqual(expected, actual);
}

So with this in place, I am ready for the next column, the confidence column.  McCaffrey used a numerator of 3, which is shown here:

[image]

So I assume that this count is the number of times 3,4,7 shows up in the transaction set.  If so, the supporting function ItemSetCountInTransactions can also be used.  I created a unit test and it ran green:

[TestMethod]
public void CountItemSetInTransactionsUsing347_ReturnsThree()
{
    List<int[]> transactions = new List<int[]>();
    transactions.Add(new int[] { 0, 3, 4, 11 });
    transactions.Add(new int[] { 1, 4, 5 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 0, 5 });
    transactions.Add(new int[] { 3, 5, 9 });
    transactions.Add(new int[] { 2, 3, 4, 7 });
    transactions.Add(new int[] { 2, 5, 8 });
    transactions.Add(new int[] { 0, 1, 2, 5, 10 });
    transactions.Add(new int[] { 2, 3, 5, 6, 7, 9 });

    var itemSet = new int[3] { 3, 4, 7 };

    Int32 expected = 3;
    Int32 actual = FS.AssociationRuleProgram2.ItemSetCountInTransactions(itemSet, transactions);

    Assert.AreEqual(expected, actual);
}

So the last piece was to put it together in the GetHighConfRules method.  I did not change the signature:

static member GetHighConfRules(frequentItemSets: List<int[]>, transactions: List<int[]>, minConfidencePct:float) =
    let returnValue = new List<Rule>()
    let combinations = frequentItemSets |> Seq.collect (fun (a) -> AssociationRuleProgram2.GetCombinations(a))
    combinations
        |> Seq.map(fun (i,a,c) -> i,a,c,AssociationRuleProgram2.ItemSetCountInTransactions(i,transactions))
        |> Seq.map(fun (i,a,c,fisc) -> a,c,fisc,AssociationRuleProgram2.ItemSetCountInTransactions(a,transactions))
        |> Seq.map(fun (a,c,fisc,cc) -> a,c,float fisc/float cc)
        |> Seq.filter(fun (a,c,cp) -> cp > minConfidencePct)
        |> Seq.iter(fun (a,c,cp) -> returnValue.Add(new Rule(a,c,cp)))
    returnValue
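To make the arithmetic concrete, here is a worked check using the counts from the unit tests above – the item-set {3;4;7} appears in 3 transactions and its antecedent {3} appears in 6, so the rule {3} –> {4;7} has a confidence of 0.5:

  // fisc = frequent item-set count, cc = antecedent count
  let confidence = float 3 / float 6   // 0.5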

 

Note that I did add a helper function to get the combinations based on the length of the array:

static member GetCombinations(itemSet: int[]) =
    if itemSet.Length = 2 then AssociationRuleProgram2.GetCombinationsForDouble(itemSet)
    else AssociationRuleProgram2.GetCombinationsForTriple(itemSet)

And when I run that from the console:

[image]

So this is pretty close.  McCaffrey allows for inversion of the numbers in the array (3:4 is not the same as 4:3) and I do not – but his supporting detail does not show that, so I am not sure which is the correct answer.  In any event, this is pretty good.  The F# code can be refactored so that the combinations are calculated for an array of any length (see the sketch at the end of this post).  In the meantime, here are all 43 lines of the program.

open System
open System.Collections.Generic

type AssociationRuleProgram2 =

    static member GetHighConfRules(frequentItemSets: List<int[]>, transactions: List<int[]>, minConfidencePct:float) =
        let returnValue = new List<Rule>()
        let combinations = frequentItemSets |> Seq.collect (fun (a) -> AssociationRuleProgram2.GetCombinations(a))
        combinations
            |> Seq.map(fun (i,a,c) -> i,a,c,AssociationRuleProgram2.ItemSetCountInTransactions(i,transactions))
            |> Seq.map(fun (i,a,c,fisc) -> a,c,fisc,AssociationRuleProgram2.ItemSetCountInTransactions(a,transactions))
            |> Seq.map(fun (a,c,fisc,cc) -> a,c,float fisc/float cc)
            |> Seq.filter(fun (a,c,cp) -> cp > minConfidencePct)
            |> Seq.iter(fun (a,c,cp) -> returnValue.Add(new Rule(a,c,cp)))
        returnValue

    static member ItemSetCountInTransactions(itemSet: int[], transactions: List<int[]>) =
        transactions
            |> Seq.map(fun (t) -> t, AssociationRuleProgram2.SetContainsSubset(t,itemSet))
            |> Seq.filter(fun (t,f) -> f = true)
            |> Seq.length

    static member SetContainsSubset(set: int[], subset: int[]) =
        let notIncluded = subset
                            |> Seq.map(fun i -> i, set |> Seq.tryFindIndex(fun j -> j = i))
                            |> Seq.filter(fun (i,j) -> j = None )
                            |> Seq.toArray
        if notIncluded.Length > 0 then false else true

    static member GetCombinations(itemSet: int[]) =
        if itemSet.Length = 2 then AssociationRuleProgram2.GetCombinationsForDouble(itemSet)
        else AssociationRuleProgram2.GetCombinationsForTriple(itemSet)

    static member GetCombinationsForDouble(itemSet: int[]) =
        let combinations = new List<int[]*int[]*int[]>()
        combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1]|])
        combinations

    static member GetCombinationsForTriple(itemSet: int[]) =
        let combinations = new List<int[]*int[]*int[]>()
        combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1];itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[1]|],[|itemSet.[0];itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[2]|],[|itemSet.[0];itemSet.[1]|])
        combinations.Add(itemSet, [|itemSet.[0];itemSet.[1]|],[|itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[0];itemSet.[2]|],[|itemSet.[1]|])
        combinations.Add(itemSet, [|itemSet.[1];itemSet.[2]|],[|itemSet.[0]|])
        combinations

Note how the code in the GetHighConfRules function matches almost one-for-one the bullet points at the beginning of the post.  F# is a language where the code follows how you think, not the other way around.  Also note how the 43 lines of F# compare to the 136 lines of code in the C# example –> less noise, more signal.
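As for the refactoring mentioned above, here is a rough sketch of how GetCombinations could handle an item-set of any length, assuming the usual recursive definition of k-combinations and distinct items within each item-set:

  // all k-element combinations of a list
  let rec combinations k items =
      match k, items with
      | 0, _ -> [ [] ]
      | _, [] -> []
      | k, x :: xs ->
          (combinations (k - 1) xs |> List.map (fun c -> x :: c)) @ combinations k xs

  // every antecedent/consequent split of an item-set
  let getCombinations (itemSet: int[]) =
      let items = List.ofArray itemSet
      [ for k in 1 .. itemSet.Length - 1 do
          for ante in combinations k items do
              let cons = items |> List.filter (fun i -> not (List.exists (fun a -> a = i) ante))
              yield itemSet, List.toArray ante, List.toArray cons ]

For [|3;4;7|], this yields the same six antecedent/consequent splits that GetCombinationsForTriple hard-codes.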

 

Association Rule Learning Via F# (Part 2)

Continuing on the path of re-writing the association rule learning code found in this month’s MSDN, I started with the next function on the list:

MakeAntecedent

Here is the original C# code:

public static int[] MakeAntecedent(int[] itemSet, int[] comb)
{
  // if item-set = (1 3 4 6 8) and combination = (0 2)
  // then antecedent = (1 4)
  int[] result = new int[comb.Length];
  for (int i = 0; i < comb.Length; ++i)
  {
    int idx = comb[i];
    result[i] = itemSet[idx];
  }
  return result;
}

 

and the F# code:

static member MakeAntecedent(itemSet:int[] , comb:int[]) =
    comb |> Array.map(fun x -> itemSet.[x])

 

It is much easier to figure out what is going on via the F# code.  The function takes in two arrays.  Array #1 has values; array #2 has the indexes of array #1 that are needed.  Using Array.map, I return an array where each index number is swapped out for the actual value.  The unit tests run green:

[TestMethod]
public void MakeAntecedentCSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
    int[] combo = new int[2] { 0, 2 };
    int[] expected = new int[2] { 1, 4 };
    var actual = CS.AssociationRuleProgram.MakeAntecedent(itemSet, combo);
    Assert.AreEqual(expected.Length, actual.Length);
    Assert.AreEqual(expected[0], actual[0]);
    Assert.AreEqual(expected[1], actual[1]);
}

[TestMethod]
public void MakeAntecedentFSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
    int[] combo = new int[2] { 0, 2 };
    int[] expected = new int[2] { 1, 4 };
    var actual = FS.AssociationRuleProgram.MakeAntecedent(itemSet, combo);
    Assert.AreEqual(expected.Length, actual.Length);
    Assert.AreEqual(expected[0], actual[0]);
    Assert.AreEqual(expected[1], actual[1]);
}

 

MakeConsequent

Here is the original C# code:

public static int[] MakeConsequent(int[] itemSet, int[] comb)
{
  // if item-set = (1 3 4 6 8) and combination = (0 2)
  // then consequent = (3 6 8)
  int[] result = new int[itemSet.Length - comb.Length];
  int j = 0; // ptr into combination
  int p = 0; // ptr into result
  for (int i = 0; i < itemSet.Length; ++i)
  {
    if (j < comb.Length && i == comb[j]) // we are at an antecedent
      ++j; // so continue
    else
      result[p++] = itemSet[i]; // at a consequent so add it
  }
  return result;
}

 

Here is the F# code:

static member MakeConsequent(itemSet:int[] , comb:int[]) =
    let isNotInComb x = not(Array.exists(fun elem -> elem = x) comb)
    itemSet
        |> Array.mapi(fun indexer value -> value,indexer )
        |> Array.filter(fun (value,indexer) -> isNotInComb indexer)
        |> Array.map(fun x -> fst x)

 

Again, it is easier to look at the F# code to figure out what is going on.  In this case, we have to take all of the items in the first array that are not in the second array.  The trick is that the second array does not contain values to be checked, but rather index positions.  If you add the antecedent and the consequent together, you have the original array.

This code took me a bit of time to figure out because I kept trying to use the out-of-the-box Array features (including slicing) in F# when it hit me that it would be much easier to create a tuple from the original array –> the value and the index.  I then look up each index in the second array and keep only the ones that are not there.  The map function at the end removes the index part of the tuple because it is not needed anymore.
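For what it is worth, newer F# collapses the first two steps; a sketch assuming Array.indexed and Array.contains (both arrived in F# 4.0):

  let makeConsequent (itemSet: int[]) (comb: int[]) =
      itemSet
      |> Array.indexed                                        // (index, value) pairs
      |> Array.filter (fun (idx, _) -> not (Array.contains idx comb))
      |> Array.map snd                                        // drop the index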

Sure enough, my unit tests ran green:

[TestMethod]
public void MakeConsequentCSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
    int[] combo = new int[2] { 0, 2 };
    int[] expected = new int[3] { 3, 6, 8 };
    var actual = CS.AssociationRuleProgram.MakeConsequent(itemSet, combo);
    Assert.AreEqual(expected.Length, actual.Length);
    Assert.AreEqual(expected[0], actual[0]);
    Assert.AreEqual(expected[1], actual[1]);
    Assert.AreEqual(expected[2], actual[2]);
}

[TestMethod]
public void MakeConsequentFSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
    int[] combo = new int[2] { 0, 2 };
    int[] expected = new int[3] { 3, 6, 8 };
    var actual = FS.AssociationRuleProgram.MakeConsequent(itemSet, combo);
    Assert.AreEqual(expected.Length, actual.Length);
    Assert.AreEqual(expected[0], actual[0]);
    Assert.AreEqual(expected[1], actual[1]);
    Assert.AreEqual(expected[2], actual[2]);
}

 

IndexOf

I then decided to tackle the remaining three functions in reverse because they depend on each other (CountInTrans –> IsSubsetOf –> IndexOf).  IndexOf did not have any code comments or example cases, but the C# code is clear:

public static int IndexOf(int[] array, int item, int startIdx)
{
  for (int i = startIdx; i < array.Length; ++i)
  {
    if (i > item) return -1; // i is past where the target could possibly be
    if (array[i] == item) return i;
  }
  return -1;
}

 

What is even clearer is the F# code that does the same thing (yes, I am happy that FindIndex returns -1 when not found, just like McCaffrey’s does):

static member IndexOf(array:int[] , item:int, startIdx:int) =
    // use the FindIndex overload that begins at startIdx so the behavior matches the C# version
    Array.FindIndex(array, startIdx, fun x -> x = item)

 

And I built some unit tests that run green and that I think reflect McCaffrey’s intent:

[TestMethod]
public void IndexOfCSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[4] { 0, 1, 4, 5 };
    Int32 item = 1;
    Int32 startIndx = 1;

    int expected = 1;
    int actual = CS.AssociationRuleProgram.IndexOf(itemSet, item, startIndx);

    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void IndexOfFSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[4] { 0, 1, 4, 5 };
    Int32 item = 1;
    Int32 startIndx = 1;

    int expected = 1;
    int actual = FS.AssociationRuleProgram.IndexOf(itemSet, item, startIndx);

    Assert.AreEqual(expected, actual);
}

 

IsSubsetOf

In the C# implementation, IndexOf is called to keep track of where the search is currently pointed. 

  1. public static bool IsSubsetOf(int[] itemSet, int[] trans)
  2. {
  3.   // 'trans' is an ordered transaction like [0 1 4 5 8]
  4.   int foundIdx = -1;
  5.   for (int j = 0; j < itemSet.Length; ++j)
  6.   {
  7.     foundIdx = IndexOf(trans, itemSet[j], foundIdx + 1);
  8.     if (foundIdx == -1) return false;
  9.   }
  10.   return true;
  11. }

In the F# version, that is not needed:

static member IsSubsetOf(itemSet:int[] , trans:int[]) =
    let isInTrans x = (Array.exists(fun elem -> elem = x) trans)
    let filteredItemSet = itemSet
                            |> Array.map(fun value -> value, isInTrans value)
                            |> Array.filter(fun (value, found) -> found = false)
    if filteredItemSet.Length = 0 then true
        else false

CountInTrans

Here is the original C# code, which uses the IsSubsetOf function:

public static int CountInTrans(int[] itemSet, List<int[]> trans, Dictionary<int[], int> countDict)
{
  // number of times itemSet occurs in transactions, using a lookup dict
  if (countDict.ContainsKey(itemSet) == true)
    return countDict[itemSet]; // use already computed count

  int ct = 0;
  for (int i = 0; i < trans.Count; ++i)
    if (IsSubsetOf(itemSet, trans[i]) == true)
      ++ct;
  countDict.Add(itemSet, ct);
  return ct;
}

And here is the F# code that also uses that subfunction:

static member CountInTrans(itemSet: int[], trans: List<int[]>, countDict: Dictionary<int[], int>) =
    // countDict is kept for signature parity with the C# version but is not consulted here
    let trans' = trans |> Seq.map(fun value -> value, AssociationRuleProgram.IsSubsetOf (itemSet,value))
    trans' |> Seq.filter(fun item -> snd item = true)
           |> Seq.length
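If the memoization from the C# version were wanted, a sketch of keeping it, assuming the caller reuses the same array instances as keys (Dictionary<int[], int> compares array keys by reference unless a custom comparer is supplied):

  static member CountInTransCached(itemSet: int[], trans: List<int[]>, countDict: Dictionary<int[], int>) =
      match countDict.TryGetValue(itemSet) with
      | true, count -> count    // use the already computed count
      | false, _ ->
          let count = trans |> Seq.filter (fun t -> AssociationRuleProgram.IsSubsetOf(itemSet, t)) |> Seq.length
          countDict.Add(itemSet, count)
          count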

 

GetHighConfRules

With the subfunctions created and running green, I then tackled the point of the exercise –> GetHighConfRules.  The C# implementation is pretty verbose and there are lots of things happening: 

public static List<Rule> GetHighConfRules(List<int[]> freqItemSets, List<int[]> trans, double minConfidencePct)
{
  // generate candidate rules from freqItemSets, save rules that meet min confidence against transactions
  List<Rule> result = new List<Rule>();

  Dictionary<int[], int> itemSetCountDict = new Dictionary<int[], int>(); // count of item sets

  for (int i = 0; i < freqItemSets.Count; ++i) // each freq item-set generates multiple candidate rules
  {
    int[] currItemSet = freqItemSets[i]; // for clarity only
    int ctItemSet = CountInTrans(currItemSet, trans, itemSetCountDict); // needed for each candidate rule
    for (int len = 1; len <= currItemSet.Length - 1; ++len) // antecedent len = 1, 2, 3, . .
    {
      int[] c = NewCombination(len); // a mathematical combination

      while (c != null) // each combination makes a candidate rule
      {
        int[] ante = MakeAntecedent(currItemSet, c);
        int[] cons = MakeConsequent(currItemSet, c); // could defer this until known if needed

        int ctAntecendent = CountInTrans(ante, trans, itemSetCountDict); // use lookup if possible
        double confidence = (ctItemSet * 1.0) / ctAntecendent;

        if (confidence >= minConfidencePct) // we have a winner!
        {
          Rule r = new Rule(ante, cons, confidence);
          result.Add(r); // if freq item-sets are distinct, no dup rules ever created
        }
        c = NextCombination(c, currItemSet.Length);
      } // while each combination
    } // len each possible antecedent for curr item-set
  } // i each freq item-set

  return result;
} // GetHighConfRules

In the F# code, I decided to work inside out and get the rule for one item-set.  I think the code reads pretty clearly, with each step laid out:

static member GetHighConfRules(freqItemSets:List<int[]>, trans:List<int[]>,  minConfidencePct:float) =
    let returnValue = new List<Rule>()
    freqItemSets
        |> Seq.map (fun i -> i, AssociationRuleProgram.CountInTrans'(i,trans))
        |> Seq.filter(fun (i,c) -> (float)c > minConfidencePct)
        |> Seq.map(fun (i,mcp) -> i,mcp,AssociationRuleProgram.MakeAntecedent(i, trans.[0]))
        |> Seq.map(fun (i,mcp,a) -> i,mcp, a, AssociationRuleProgram.MakeConsequent(i, trans.[0]))
        |> Seq.iter(fun (i,mcp,a,c) -> returnValue.Add(new Rule(a,c,mcp)))
    returnValue

I then attempted to put this block into a larger block (trans.[0]) but then I realized that I was going about this the wrong way.  Instead of using the C# code as my baseline, I need to approach the problem from a functional viewpoint.  That will be the subject of my blog next week…

Association Rule Learning Via F# (Part 1)

I was reading the most recent MSDN when I came across this article.  How awesome is this?  McCaffrey did a great job explaining a really interesting area of analytics and I am loving the fact that MSDN is including articles about data analytics.  When I was reading the article, I ran across this sentence “The demo program is coded in C# but you should be able to refactor the code to other .NET languages such as Visual Basic or Iron Python without too much difficulty”  Iron Python?  Iron Python!  What about F#, the language that matches analytics the way peanut butter goes with chocolate?  Challenge accepted!

The first thing I did was to download his source code from here.  When I first opened the source code, I realized that the code would be a little bit hard to port because it is written from a scientific angle, not a business application point of view.  34 FxCop errors in 259 lines of code confirmed this:

[image]

Also, there are tons of comments, which is very distracting.  I generally hate comments, but I figure that since it is an MSDN article and it is supposed to explain what is going on, comments are OK.  However, many of the comments can be refactored into more descriptive variable and method names.  For example:

[image][image]

In any event, let’s look at the code.  The first thing I did was change the C# project from a console app to a library and move the test data into another project.  I then moved the console code to the UI.  I also moved the Rule class code into its own file, made sure the namespaces matched, and made the AssociationRuleProgram public.  Yup, it still runs:

[image][image]

So then I created an F# library in the solution and set up the class with the single method:

[image]

A couple of things to note:

1) I left the parameter naming the same, even though it is not particularly intention-revealing

2) F# is type-inferred, so I don’t have to assign the types to the parameters

Next, I started looking at the supporting functions to GetHighConfRules.  Up first was the function called NewCombination.  Here is the side-by-side between the imperative style and the functional style:

[image][image]

The next function, NextCombination, was more difficult for me to understand.  I stopped what I was doing and built a unit test project that proved correctness, using the commented examples as the expected values.  I used one test project for both the C# and F# projects so I could see both side by side.  An interesting side note is that the unit test naming is different than usual –> instead of naming the class XXXXTests, where XXXX is the name of another class, XXXX is the function name that both classes are implementing:

So going back to the example,

[image]

I wrote two unit tests that match the two comments:

[image]

When I ran the tests, the first test passed but the second did not:

[image]

The problem with the failing test is that null is not being returned, but rather {3,4,6}.  So now I have a problem: do I base the F# implementation on the code comments or on the code itself?  I decided to base it on the code, because comments often lie but CODE DON’T LIE (thanks ‘sheed).  I adjusted the unit test and got green.

One of the reasons the code is pretty hard to read is the use of ‘i’, ‘j’, ‘k’, and ‘n’ as variable names.  I went back to the article, where McCaffrey explains what is going on at the bottom left of page 60.  The function ‘NextCombination’ could be called ‘GetLexicographicalSuccessor’ and the variable ‘n’ could be called ‘numberOfPossibleItems’.  With that mental vocabulary in place, I went through the function and divided it into four parts:

1) Checking to see if the value of the first element is of a certain length

[image]

2) Creating a result array that is seeded with the values of the input array

[image]

3) Looping backwards to identify the first number in the array that will be adjusted

[image]

4) From that target element, looping forward and adjusting all subsequent items

[image]

#1 I will not worry about now and #2 is not needed in F#, so #3 is the first place to start.  What I need is a way of splitting the array into two parts.  Part 1 has the original values that will not change and part 2 has the values that will change.  Seq.take and Seq.skip are perfect for this:

let i = Array.LastIndexOf(comb,n)
let i' = if i = -1 then 0 else i
let comb' = comb |> Seq.take(i') |> Seq.toArray
let comb'' = comb |> Seq.skip(i') |> Seq.toArray

Looking at #4, I now need to increment the values in part 2 by 1.  Seq.scan will work:

[image]
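A minimal sketch of that forward pass, assuming the standard lexicographic successor rule (bump the pivot element, then make every later element one more than its predecessor):

  // given the already-incremented pivot value and the number of elements after it,
  // re-seed the tail as pivot+1, pivot+2, ...
  let successorTail pivotValue tailLength =
      Array.init tailLength (fun offset -> pivotValue + 1 + offset)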

And then, putting part 1 and part 2 back together via Array.append, we have equivalence*:

[image][image]

*Equivalence is defined by my unit tests.  I have no idea whether other inputs will work.  Note that the second unit test (the one based on the code comments) runs red, so I really think that the code is wrong and that the comment to return null is correct.  The value I am getting for (3;4;5)(5) is (3;4;1), which seems to make sense.

I am not crazy about these explanatory variables (comb’, comb’’, and comb’’’) but I am not sure how to combine them without sacrificing readability.  I definitely want to combine the i and i’ into one statement…

Seq.scan returns 4 items in an array when I pass in an array that has a length of 3 because scan emits the initial accumulator state as the first element of its output.  I am running out of time today, so I just hacked in a Seq.take.

I’ll continue this exercise in my blog next week.

 

Kaplan-Meier Survival Analysis Using F#

I was reading the most recent issue of MSDN a couple of days ago when I came across this article on doing a Kaplan-Meier survival analysis.  I thought the article was great and I am excited that MSDN is starting to publish articles on data analytics.  However, I did notice that there wasn’t any code in the article, which is odd, so I went to the on-line version, where others had a similar question:

[image]

I decided to implement a Kaplan-Meier survival (KMS) analysis using F#.  After reading the article a couple of times, I was still a bit unclear on how the KMS is implemented, and there does not seem to be a pre-rolled implementation in the standard .NET stat libraries out there.  I went on over to this site, where there was an excellent description of how the survival probability is calculated.  I went ahead and built an Excel spreadsheet to match the NIH one and then compared it to what Topol is doing:

[image]

Notice that Topol censored the data for the article.  If we only cared about the probability of crashes, then we would not censor the data for when the device was turned off.
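For reference, the calculation the spreadsheet walks through is the standard Kaplan-Meier product estimator: at each crash time t(i), with d(i) crashes out of the n(i) devices still in service,

  S(t) = product over all t(i) <= t of (1 - d(i)/n(i))

so a censored device shrinks n for later rows without ever registering as a crash.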

I was then ready to start coding, so I spun up a solution with an F# project for the analysis and a C# project for the testing.

[image]

I then loaded into the unit test project the datasets that Topol used:

[TestMethod]
public void EstimateForApplicationX_ReturnsExpected()
{
    var appX = new CrashMetaData[]
    {
        new CrashMetaData(0,1,false),
        new CrashMetaData(1,5,true),
        new CrashMetaData(2,5,false),
        new CrashMetaData(3,8,false),
        new CrashMetaData(4,10,false),
        new CrashMetaData(5,12,true),
        new CrashMetaData(6,15,false),
        new CrashMetaData(7,18,true),
        new CrashMetaData(8,21,false),
        new CrashMetaData(9,22,true),
    };
}

I could then wire up the unit tests to compare the output to the article and what I had come up with.

[TestMethod]
public void EstimateForApplicationX_ReturnsExpected()
{
    var appX = new CrashMetaData[]
    {
        new CrashMetaData(0,1,false),
        new CrashMetaData(1,5,true),
        new CrashMetaData(2,5,false),
        new CrashMetaData(3,8,false),
        new CrashMetaData(4,10,false),
        new CrashMetaData(5,12,true),
        new CrashMetaData(6,15,false),
        new CrashMetaData(7,18,true),
        new CrashMetaData(8,21,false),
        new CrashMetaData(9,22,true),
    };

    var expected = new SurvivalProbabilityData[]
    {
        new SurvivalProbabilityData(0,1.000),
        new SurvivalProbabilityData(5,.889),
        new SurvivalProbabilityData(12,.711),
        new SurvivalProbabilityData(18,.474),
        new SurvivalProbabilityData(22,.000)
    };

    KaplanMeierEstimator estimator = new KaplanMeierEstimator();
    var actual = estimator.CalculateSurvivalProbability(appX);

    Assert.AreSame(expected, actual);
}

 

However, one of the neat features of F# is the REPL, so I don’t need to keep running unit tests to prove correctness while I am proving out a concept.  I added equivalent test code at the beginning of the F# project so I could run my ideas in the REPL:

type CrashMetaData = {userId: int; crashTime: int; crashed: bool}

type KaplanMeierAnalysis() =
    member this.GenerateXAppData ()=
                    [|  {userId=0; crashTime=1; crashed=false};{userId=1; crashTime=5; crashed=true};
                        {userId=2; crashTime=5; crashed=false};{userId=3; crashTime=8; crashed=false};
                        {userId=4; crashTime=10; crashed=false};{userId=5; crashTime=12; crashed=true};
                        {userId=6; crashTime=15; crashed=false};{userId=7; crashTime=18; crashed=true};
                        {userId=8; crashTime=21; crashed=false};{userId=9; crashTime=22; crashed=true}|]

    member this.RunAnalysis(crashMetaData: array<CrashMetaData>) =

The first thing I did was duplicate the first three columns of the Excel spreadsheet:

let crashSequence = crashMetaData
                        |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                                | true -> 1
                                                                                | false -> 0),
                                                                 (match crash.crashed with
                                                                                | true -> 0
                                                                                | false -> 1))
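The match on a bool can also collapse into an if/else; a more compact sketch of the same mapping:

  let crashSequence =
      crashMetaData
      |> Seq.map (fun crash ->
          let c = if crash.crashed then 1 else 0   // 1 = crashed, 0 = censored
          crash.crashTime, c, 1 - c)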

 

In the REPL:

[image]

The fourth column is tricky because it is a cumulative calculation.  Instead of for..eaching in an imperative style, I took advantage of the functional language constructs to make the code much more readable.  Once I calculated that column outside of the base sequence, I added it back in via Seq.zip:

let cumulativeDevices = crashMetaData.Length

let crashSequence = crashMetaData
                        |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                                | true -> 1
                                                                                | false -> 0),
                                                                 (match crash.crashed with
                                                                                | true -> 0
                                                                                | false -> 1))
let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1 ) cumulativeDevices crashSequence

let crashSequence' = Seq.zip crashSequence availableDeviceSequence
                            |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)
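The reason the zip lines up: Seq.scan emits its seed as the first element, so scanning the ten crash rows yields eleven device counts, and Seq.zip truncates to the shorter input – pairing each row with the device count before that row's decrement.  A tiny REPL sketch:

  Seq.scan (fun acc _ -> acc - 1) 10 [1; 2; 3] |> Seq.toList   // [10; 9; 8; 7] – the seed is included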

 

In the REPL:

[image]

The next two columns were a snap –> they were just calculations based on the existing values:

let cumulativeDevices = crashMetaData.Length

let crashSequence = crashMetaData
                        |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                                | true -> 1
                                                                                | false -> 0),
                                                                 (match crash.crashed with
                                                                                | true -> 0
                                                                                | false -> 1))
let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1 ) cumulativeDevices crashSequence

let crashSequence' = Seq.zip crashSequence availableDeviceSequence
                            |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)

let crashSequence'' = crashSequence'
                            |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c/ float cumld, 1.-(float c/ float cumld))

 

The last column was another cumulative calculation, so I added another accumulator and used Seq.scan and Seq.zip:

let cumulativeDevices = crashMetaData.Length
let cumulativeSurvivalProbability = 1.

let crashSequence = crashMetaData
                        |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                                | true -> 1
                                                                                | false -> 0),
                                                                 (match crash.crashed with
                                                                                | true -> 0
                                                                                | false -> 1))
let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1 ) cumulativeDevices crashSequence

let crashSequence' = Seq.zip crashSequence availableDeviceSequence
                            |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)

let crashSequence'' = crashSequence'
                            |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c/ float cumld, 1.-(float c/ float cumld))

let survivalProbabilitySequence = Seq.scan(fun cumulativeSurvivalProbability (t,c,nc,cumld,dp,sp) -> cumulativeSurvivalProbability * sp ) cumulativeSurvivalProbability crashSequence''
let survivalProbabilitySequence' = survivalProbabilitySequence
                                            |> Seq.skip 1

The last step was to map all of the columns and only output what was in the article.  The final answer is:

namespace ChickenSoftware.SurvivalAnalysis

type CrashMetaData = {userId: int; crashTime: int; crashed: bool}
type public SurvivalProbabilityData = {crashTime: int; survivalProbability: float}

type KaplanMeierEstimator() =
    member this.CalculateSurvivalProbability(crashMetaData: array<CrashMetaData>) =
            let cumulativeDevices = crashMetaData.Length
            let cumulativeSurvivalProbability = 1.

            let crashSequence = crashMetaData
                                    |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                                            | true -> 1
                                                                                            | false -> 0),
                                                                             (match crash.crashed with
                                                                                            | true -> 0
                                                                                            | false -> 1))
            let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1 ) cumulativeDevices crashSequence

            let crashSequence' = Seq.zip crashSequence availableDeviceSequence
                                        |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)

            let crashSequence'' = crashSequence'
                                        |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c/ float cumld, 1.-(float c/ float cumld))

            let survivalProbabilitySequence = Seq.scan(fun cumulativeSurvivalProbability (t,c,nc,cumld,dp,sp) -> cumulativeSurvivalProbability * sp ) cumulativeSurvivalProbability crashSequence''
            let survivalProbabilitySequence' = survivalProbabilitySequence
                                                        |> Seq.skip 1

            let crashSequence''' = Seq.zip crashSequence'' survivalProbabilitySequence'
                                        |> Seq.map(fun ((t,c,nc,cumld,dp,sp),cumlsp) -> t,c,nc,cumld,dp,sp,cumlsp)
            crashSequence'''
                    |> Seq.filter(fun (t,c,nc,cumld,dp,sp,cumlsp) -> c=1 )
                    |> Seq.map(fun (t,c,nc,cumld,dp,sp,cumlsp) -> t,System.Math.Round(cumlsp,3))
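As a quick sanity check against the expected values in the unit test: the first crash happens at t=5, when nine devices are still in service (one was censored at t=1), so the survival probability is

  let s5 = System.Math.Round(1.0 * (1.0 - 1.0/9.0), 3)   // 0.889

which matches both the article and the SurvivalProbabilityData(5, .889) row above.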

[image]

And this matches the article (almost exactly).  The article also has a row for iteration zero, which I did not bake in.  Instead of fixing my code, I changed the unit test and removed that first row.  In any event, I ran the test and it ran red –> but the values are identical, so the problem is the Assert.AreSame() function, which checks reference equality rather than value equality.  I would take the time to fix it, but it is 75 degrees on a Sunday afternoon and I want to go play catch with my kids…

[image]

Note it also matches the other data set Topol has in the article:

[image]

In any event, this code reads pretty much the way I was thinking about the problem – each column of the Excel spreadsheet has a one-to-one correspondence to an F# code block.  I did use explanatory variables liberally, which might offend the more advanced functional programmers, but taking each step in turn really helped me focus on getting each step correct before going on to the next one.

1) I had to offset the cumulativeSurvivalProbability by one because the calculation is how many crashed on a day compared to how many were working at the start of the day.  Seq.scan increments the accumulator for the next row of the sequence and I need it for the current row.  Perhaps there is an overload for Seq.scan?

2) I adopted the functional convention of using ticks to denote different physical manifestations of the same logical concept (crashedDeviceSequence “became” crashedDeviceSequence’, etc…).  Since everything is immutable by default in F#, this kind of naming convention makes a lot of sense to me.  However, I can see it quickly becoming unwieldy.

3) I could not figure out how to operate on the base tuple, so instead I used a couple of supporting sequences and then put everything together using Seq.zip.  I assume there is a more efficient way to do that.

4) One of the knocks against functional/scientific programming is that values are named poorly.  To combat that, I used the full names in my tuples to start.  After a certain point, though, the names got too unwieldy, so I resorted to their initials.  I am not sure what the right answer is here, or even if there is a right answer.
