Association Rule Problem: Part 3
May 27, 2014
After spending a couple of weeks working through the imperative code, I decided to approach the problem from an F#/functional point of view. Going back to the original article, there are several steps that McCaffrey walks through:
- Get a series of transactions
- Get the frequent item-sets for the transactions
- For each item-set, get all possible combinations. Each combination is broken into an antecedent and consequent
- Get the frequency of each antecedent across all transactions
- If the confidence of the combination (the item-set count divided by the antecedent count) is greater than the minimum confidence level, include it in the final set
For the purposes of this article, steps #1 and #2 were already done, so my code starts with step #3. Instead of for..eaching and if..thening my way through the item-sets, I decided to look at how permutations and combinations are done in F#. Interestingly, one of the first articles on permutations and combinations on Google is from McCaffrey in MSDN from four years ago. Unfortunately, this article was of limited use because the code is decidedly non-functional, so it might as well have been written in C# (this was pointed out in the comments).
Stack Overflow and other sites have plenty of good examples of getting combinations in F#. After playing with the code samples for a bit (my favorite one was this), it hit me that the ordinal positions that make up each combination are the same for every array of a given size. Going back to McCaffrey’s example, there are only item-sets of length 2 and 3. Therefore, I can hard-code the results and leave the actual calculation for another time.
    static member GetCombinationsForDouble(itemSet: int[]) =
        let combinations = new List<int[]*int[]*int[]>()
        combinations.Add(itemSet, [|itemSet.[0]|], [|itemSet.[1]|])
        combinations
    static member GetCombinationsForTriple(itemSet: int[]) =
        let combinations = new List<int[]*int[]*int[]>()
        combinations.Add(itemSet, [|itemSet.[0]|], [|itemSet.[1];itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[1]|], [|itemSet.[0];itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[2]|], [|itemSet.[0];itemSet.[1]|])
        combinations.Add(itemSet, [|itemSet.[0];itemSet.[1]|], [|itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[0];itemSet.[2]|], [|itemSet.[1]|])
        combinations.Add(itemSet, [|itemSet.[1];itemSet.[2]|], [|itemSet.[0]|])
        combinations
I used a tuple to hold the original item-set along with the antecedent array and the consequent array. I then spun up a unit test to compare results based on McCaffrey’s detailed example:
    [TestMethod]
    public void GetValuesForATriple_ReturnsExpectedValue()
    {
        var expected = new List<Tuple<int[], int[]>>();
        expected.Add(Tuple.Create<int[], int[]>(new int[1] { 3 }, new int[2] { 4, 7 }));
        expected.Add(Tuple.Create<int[], int[]>(new int[1] { 4 }, new int[2] { 3, 7 }));
        expected.Add(Tuple.Create<int[], int[]>(new int[1] { 7 }, new int[2] { 3, 4 }));
        expected.Add(Tuple.Create<int[], int[]>(new int[2] { 3, 4 }, new int[1] { 7 }));
        expected.Add(Tuple.Create<int[], int[]>(new int[2] { 3, 7 }, new int[1] { 4 }));
        expected.Add(Tuple.Create<int[], int[]>(new int[2] { 4, 7 }, new int[1] { 3 }));
        var itemSet = new int[3] { 3, 4, 7 };
        var actual = FS.AssociationRuleProgram2.GetCombinationsForTriple(itemSet);
        Assert.AreEqual(expected.Count, actual.Count);
    }
A couple of things to note about the unit test:
1) The rules about variable naming and whatnot that apply in business application development quickly fall down when applied to scientific computing. For example, there is no way that this
List<Tuple<int[], int[]>> expected = new List<Tuple<int[], int[]>>();
is more readable than this
var expected = new List<Tuple<int[], int[]>>();
In fact, it is less readable. The use of complex data structures and algorithms forces a different set of naming conventions. Applying FxCop or other framework naming conventions to scientific programming is as useful as applying scientific naming conventions to framework development. If it is a screw, use a screwdriver. If it is a nail, use a hammer…
2) I don’t have a good way of comparing the results of a tuple of paired arrays for equivalence, which is why the test only compares counts. There is certainly nothing out of the box in Microsoft.VisualStudio.TestTools.UnitTesting. I toyed (briefly) with creating a method to compare equivalence in arrays, but I did not in the interest of time. That would be a welcome addition to the testing namespace.
Sure enough, the unit test runs green using McCaffrey’s data.
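On the comparison problem in point 2, one possible route is to do the check on the F# side, where the = operator compares lists, tuples and arrays element by element. The sketch below is an assumption about how that could look, not something from the original program; sameRules is a hypothetical helper name.

```fsharp
// Hypothetical helper (not part of the original program).
// F# (=) compares lists, tuples and arrays structurally, so converting
// both sides to F# lists gives a deep, order-sensitive comparison.
let sameRules (expected: seq<int[] * int[]>) (actual: seq<int[] * int[]>) =
    List.ofSeq expected = List.ofSeq actual

let expected = [ ([| 3 |], [| 4; 7 |]); ([| 4 |], [| 3; 7 |]) ]

// A .NET List built from fresh arrays, so reference equality would fail.
let actual =
    ResizeArray<int[] * int[]>([ ([| 3 |], [| 4; 7 |]); ([| 4 |], [| 3; 7 |]) ])

printfn "%b" (sameRules expected actual) // prints true
```

A helper like this could be called from the C# test in place of the count comparison, at the cost of making the test sensitive to the order of the rules.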
With step 3 knocked out, I now needed to determine the frequency of the antecedent in the transactions list. This step is better broken down into a couple of sub-steps. I used McCaffrey’s detailed example of 3,4,7 as proof of correctness in my unit tests.
I need a way of taking the antecedent of 3 and comparing it to all transactions (which are arrays) to see how often it appears. As an additional layer of complexity, that 3 is not an int; it is an array (albeit an array of one). I could not find an equivalent question on Stack Overflow (meaning I am probably asking the wrong question), so I went ahead and made a mental model where I would map the tryFindIndex function against each item of the subset to see if that value is in the original set. The result is a tuple with the original value and its ordinal position in the set. The key thing is that if the item was not found, the position is “None”. So I just have to filter on that flag, and if the filter returns anything at all (a length greater than 0), I know that something was not found and the function can return false.
In code, it pretty much looks like the way I just described it:
    static member SetContainsSubset(set: int[], subset: int[]) =
        let notIncluded =
            subset
            |> Seq.map(fun i -> i, set |> Seq.tryFindIndex(fun j -> j = i))
            |> Seq.filter(fun (i,j) -> j = None)
            |> Seq.toArray
        if notIncluded.Length > 0 then false else true
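The same check can be sketched more directly with Array.forall, which stops at the first missing item instead of building an intermediate array. This is an alternative with the same contract, not the code used in the rest of the post; the let-bound setContainsSubset is a hypothetical name.

```fsharp
// Alternative sketch: true when every item of subset appears in set.
let setContainsSubset (set: int[]) (subset: int[]) =
    subset |> Array.forall (fun i -> set |> Array.exists (fun j -> j = i))

printfn "%b" (setContainsSubset [| 1; 3; 4; 7 |] [| 3; 4; 7 |]) // prints true
printfn "%b" (setContainsSubset [| 1; 4; 7 |] [| 3; 4; 7 |])    // prints false
```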
And I generated my unit tests out of the example too:
    [TestMethod]
    public void SetContainsSubsetUsingMatched_ReturnsTrue()
    {
        var set = new int[4] { 1, 3, 4, 7 };
        var subset = new int[3] { 3, 4, 7 };
        Boolean expected = true;
        Boolean actual = FS.AssociationRuleProgram2.SetContainsSubset(set, subset);
        Assert.AreEqual(expected, actual);
    }
    [TestMethod]
    public void SetContainsSubsetUsingUnMatched_ReturnsFalse()
    {
        var set = new int[3] { 1, 4, 7 };
        var subset = new int[3] { 3, 4, 7 };
        Boolean expected = false;
        Boolean actual = FS.AssociationRuleProgram2.SetContainsSubset(set, subset);
        Assert.AreEqual(expected, actual);
    }
With this supporting function ready, I can then apply it to the list of transactions and see how many trues I get. That is the Count value in Figure 2 of the article. Seq.map fits this task perfectly.
    static member ItemSetCountInTransactions(itemSet: int[], transactions: List<int[]>) =
        transactions
        |> Seq.map(fun (t) -> t, AssociationRuleProgram2.SetContainsSubset(t,itemSet))
        |> Seq.filter(fun (t,f) -> f = true)
        |> Seq.length
And the subsequent unit test also runs green:
    [TestMethod]
    public void CountItemSetInTransactions_ReturnsExpected()
    {
        List<int[]> transactions = new List<int[]>();
        transactions.Add(new int[] { 0, 3, 4, 11 });
        transactions.Add(new int[] { 1, 4, 5 });
        transactions.Add(new int[] { 3, 4, 6, 7 });
        transactions.Add(new int[] { 3, 4, 6, 7 });
        transactions.Add(new int[] { 0, 5 });
        transactions.Add(new int[] { 3, 5, 9 });
        transactions.Add(new int[] { 2, 3, 4, 7 });
        transactions.Add(new int[] { 2, 5, 8 });
        transactions.Add(new int[] { 0, 1, 2, 5, 10 });
        transactions.Add(new int[] { 2, 3, 5, 6, 7, 9 });
        var itemSet = new int[1] { 3 };
        Int32 expected = 6;
        Int32 actual = FS.AssociationRuleProgram2.ItemSetCountInTransactions(itemSet, transactions);
        Assert.AreEqual(expected, actual);
    }
So with this in place, I am ready for the next column, the confidence column. McCaffrey used the numerator of 3 which is shown here:
So I assume that this count is the number of times 3,4,7 shows up in the transaction set. If so, the supporting function ItemSetCountInTransactions can also be used. I created a unit test and it ran green:
    [TestMethod]
    public void CountItemSetInTransactionsUsing347_ReturnsThree()
    {
        List<int[]> transactions = new List<int[]>();
        transactions.Add(new int[] { 0, 3, 4, 11 });
        transactions.Add(new int[] { 1, 4, 5 });
        transactions.Add(new int[] { 3, 4, 6, 7 });
        transactions.Add(new int[] { 3, 4, 6, 7 });
        transactions.Add(new int[] { 0, 5 });
        transactions.Add(new int[] { 3, 5, 9 });
        transactions.Add(new int[] { 2, 3, 4, 7 });
        transactions.Add(new int[] { 2, 5, 8 });
        transactions.Add(new int[] { 0, 1, 2, 5, 10 });
        transactions.Add(new int[] { 2, 3, 5, 6, 7, 9 });
        var itemSet = new int[3] { 3, 4, 7 };
        Int32 expected = 3;
        Int32 actual = FS.AssociationRuleProgram2.ItemSetCountInTransactions(itemSet, transactions);
        Assert.AreEqual(expected, actual);
    }
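Putting the two counts from the tests together gives the confidence for the rule with antecedent 3 and consequent 4,7. The snippet below is a standalone sketch of that arithmetic using the same ten transactions; containsAll and countIn are hypothetical helpers that stand in for SetContainsSubset and ItemSetCountInTransactions.

```fsharp
// The ten transactions from the unit tests above.
let transactions =
    [ [| 0; 3; 4; 11 |]
      [| 1; 4; 5 |]
      [| 3; 4; 6; 7 |]
      [| 3; 4; 6; 7 |]
      [| 0; 5 |]
      [| 3; 5; 9 |]
      [| 2; 3; 4; 7 |]
      [| 2; 5; 8 |]
      [| 0; 1; 2; 5; 10 |]
      [| 2; 3; 5; 6; 7; 9 |] ]

// Hypothetical stand-in for SetContainsSubset.
let containsAll (itemSet: int[]) (transaction: int[]) =
    itemSet |> Array.forall (fun i -> transaction |> Array.exists (fun j -> j = i))

// Hypothetical stand-in for ItemSetCountInTransactions.
let countIn (itemSet: int[]) =
    transactions |> List.filter (containsAll itemSet) |> List.length

let itemSetCount = countIn [| 3; 4; 7 |]   // transactions holding 3, 4 and 7
let antecedentCount = countIn [| 3 |]      // transactions holding 3
let confidence = float itemSetCount / float antecedentCount

printfn "%d / %d = %.2f" itemSetCount antecedentCount confidence // prints 3 / 6 = 0.50
```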
So the last piece was to put it together in the GetHighConfRules method. I did not change the signature:
    static member GetHighConfRules(frequentItemSets: List<int[]>, transactions: List<int[]>, minConfidencePct:float) =
        let returnValue = new List<Rule>()
        let combinations = frequentItemSets |> Seq.collect (fun (a) -> AssociationRuleProgram2.GetCombinations(a))
        combinations
        |> Seq.map(fun (i,a,c) -> i,a,c,AssociationRuleProgram2.ItemSetCountInTransactions(i,transactions))
        |> Seq.map(fun (i,a,c,fisc) -> a,c,fisc,AssociationRuleProgram2.ItemSetCountInTransactions(a,transactions))
        |> Seq.map(fun (a,c,fisc,cc) -> a,c,float fisc / float cc)
        |> Seq.filter(fun (a,c,cp) -> cp > minConfidencePct)
        |> Seq.iter(fun (a,c,cp) -> returnValue.Add(new Rule(a,c,cp)))
        returnValue
Note that I did add a helper function to get the combinations based on the length of the array:
    static member GetCombinations(itemSet: int[]) =
        if itemSet.Length = 2 then AssociationRuleProgram2.GetCombinationsForDouble(itemSet)
        else AssociationRuleProgram2.GetCombinationsForTriple(itemSet)
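The length check above only covers item-sets of 2 or 3. As a sketch of how the hard-coded cases might be generalized, the hypothetical getAllSplits below walks every bit mask from 1 to 2^n - 2, putting the items whose bit is set into the antecedent and the rest into the consequent. It is an assumption about one possible refactoring, not code from the original program. For a triple it yields the same six splits as GetCombinationsForTriple (in a different order), but for a pair it yields both directions, whereas GetCombinationsForDouble yields only one.

```fsharp
// Hypothetical generalization of GetCombinationsForDouble/Triple.
// Returns (itemSet, antecedent, consequent) for every way of splitting
// itemSet into two non-empty parts.
let getAllSplits (itemSet: int[]) =
    let n = itemSet.Length
    // Items at the positions whose mask bit equals `wanted`.
    let pick (mask: int) (wanted: bool) =
        [| for i in 0 .. n - 1 do
             if ((mask &&& (1 <<< i)) <> 0) = wanted then
                 yield itemSet.[i] |]
    // Mask 0 (empty antecedent) and mask 2^n - 1 (empty consequent) are skipped.
    [ for mask in 1 .. (1 <<< n) - 2 do
        yield (itemSet, pick mask true, pick mask false) ]

getAllSplits [| 3; 4; 7 |]
|> List.iter (fun (_, antecedent, consequent) ->
    printfn "%A -> %A" antecedent consequent)
```

The bit-mask approach assumes item-sets are short enough for the mask to fit in an int, which holds comfortably for the lengths discussed here.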
And when I run that from the console:
So this is pretty close. McCaffrey allows for inversion of the numbers in the array (3:4 is not the same as 4:3) and I do not, but his supporting detail does not show that, so I am not sure which answer is correct. In any event, this is pretty good. The F# code can be refactored so that all combinations are generated from an array of any length. In the meantime, here are all 43 lines of the program.
    open System
    open System.Collections.Generic
    type AssociationRuleProgram2 =
        static member GetHighConfRules(frequentItemSets: List<int[]>, transactions: List<int[]>, minConfidencePct:float) =
            let returnValue = new List<Rule>()
            let combinations = frequentItemSets |> Seq.collect (fun (a) -> AssociationRuleProgram2.GetCombinations(a))
            combinations
            |> Seq.map(fun (i,a,c) -> i,a,c,AssociationRuleProgram2.ItemSetCountInTransactions(i,transactions))
            |> Seq.map(fun (i,a,c,fisc) -> a,c,fisc,AssociationRuleProgram2.ItemSetCountInTransactions(a,transactions))
            |> Seq.map(fun (a,c,fisc,cc) -> a,c,float fisc / float cc)
            |> Seq.filter(fun (a,c,cp) -> cp > minConfidencePct)
            |> Seq.iter(fun (a,c,cp) -> returnValue.Add(new Rule(a,c,cp)))
            returnValue
        static member ItemSetCountInTransactions(itemSet: int[], transactions: List<int[]>) =
            transactions
            |> Seq.map(fun (t) -> t, AssociationRuleProgram2.SetContainsSubset(t,itemSet))
            |> Seq.filter(fun (t,f) -> f = true)
            |> Seq.length
        static member SetContainsSubset(set: int[], subset: int[]) =
            let notIncluded =
                subset
                |> Seq.map(fun i -> i, set |> Seq.tryFindIndex(fun j -> j = i))
                |> Seq.filter(fun (i,j) -> j = None)
                |> Seq.toArray
            if notIncluded.Length > 0 then false else true
        static member GetCombinations(itemSet: int[]) =
            if itemSet.Length = 2 then AssociationRuleProgram2.GetCombinationsForDouble(itemSet)
            else AssociationRuleProgram2.GetCombinationsForTriple(itemSet)
        static member GetCombinationsForDouble(itemSet: int[]) =
            let combinations = new List<int[]*int[]*int[]>()
            combinations.Add(itemSet, [|itemSet.[0]|], [|itemSet.[1]|])
            combinations
        static member GetCombinationsForTriple(itemSet: int[]) =
            let combinations = new List<int[]*int[]*int[]>()
            combinations.Add(itemSet, [|itemSet.[0]|], [|itemSet.[1];itemSet.[2]|])
            combinations.Add(itemSet, [|itemSet.[1]|], [|itemSet.[0];itemSet.[2]|])
            combinations.Add(itemSet, [|itemSet.[2]|], [|itemSet.[0];itemSet.[1]|])
            combinations.Add(itemSet, [|itemSet.[0];itemSet.[1]|], [|itemSet.[2]|])
            combinations.Add(itemSet, [|itemSet.[0];itemSet.[2]|], [|itemSet.[1]|])
            combinations.Add(itemSet, [|itemSet.[1];itemSet.[2]|], [|itemSet.[0]|])
            combinations
Note how the code in the GetHighConfRules function matches almost one for one to the bullet points at the beginning of the post. F# is a language where the code follows how you think, not the other way around. Also note how the 43 lines of code compare to the 136 lines of code in the C# example: less noise, more signal.