Now that I have the elevator dataset in a workable state, I wanted to see what I could see with the data. I was reading Machine Learning in Action, and the authors suggested the Apriori algorithm as a way to quantify associations among data points. I read both Harrington's code and Wikipedia's description and found both to be impenetrable – the former because the code was unreadable and the latter because the mathematical formulas depended on a level of algebra that I don't have.
Fortunately, I found a C# project on CodeProject that had both an excellent example/introduction and C# code. I used the examples on the website to formulate my F# implementation.
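Before diving into the code, it helps to pin down the two numbers the algorithm is built on. Here is a minimal Python sketch (mine, not the CodeProject example's code) of support and confidence, computed over the four example transactions used throughout this post:

```python
# Sketch, not the post's code: support and confidence over the
# example transactions (TIDs 100-400) used in this post.
transactions = [
    {"A", "C", "D"},       # TID 100
    {"B", "C", "E"},       # TID 200
    {"A", "B", "C", "E"},  # TID 300
    {"B", "E"},            # TID 400
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(A ∪ C) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)
```

So {B, E} appears in three of the four transactions (support 0.75), and every transaction containing B also contains E (confidence 1.0) – which is why the constructor below takes a minimum support and a minimum confidence.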
The first thing I did was create a class that matched the first grid in the example:

- namespace ChickenSoftware.ElevatorChicken.Analysis
-
- open System.Collections.Generic
-
- type Transaction = {TID: string; Items: List<string>}
-
- type Apriori(database: List<Transaction>, support: float, confidence: float) =
-     member this.Database = database
-     member this.Support = support
-     member this.Confidence = confidence
Note that because F# is immutable by default, the properties are read-only. I then created a unit test project to make sure the constructor works without throwing exceptions. The data matches the example:
- public AprioriTests()
- {
-     var database = new List<Transaction>();
-     database.Add(new Transaction("100", new List<string>() { "A", "C", "D" }));
-     database.Add(new Transaction("200", new List<string>() { "B", "C", "E" }));
-     database.Add(new Transaction("300", new List<string>() { "A", "B", "C", "E" }));
-     database.Add(new Transaction("400", new List<string>() { "B", "E" }));
-
-     _apriori = new Apriori(database, .5, .80);
- }
-
- [TestMethod]
- public void ConstructorUsingValidArguments_ReturnsExpected()
- {
-     Assert.IsNotNull(_apriori);
- }
I then needed a function to count up all of the items in the itemsets. I refused to use loops, so I first tried Seq.fold, but I was having zero luck because I was trying to fold a Seq of Lists. I then started experimenting with other functions and found Seq.collect – which was perfect. So I created a function like this:
- member this.GetC1() =
-     database
-
- member this.GetL1() =
-     let numberOfTransactions = this.GetC1().Count
-
-     this.GetC1()
-     |> Seq.collect(fun d -> d.Items)
-     |> Seq.countBy(fun i -> i)
-     |> Seq.map(fun (t,i) -> t, i, float i / float numberOfTransactions)
-     |> Seq.filter(fun (t,i,p) -> p >= support)
-     |> Seq.map(fun (t,i,p) -> t,i)
-     |> Seq.sort
-     |> Seq.toList
Note that numberOfTransactions counts the transactions in the database, not the individual items in each transaction's item list. And the results match the example:
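For readers who don't speak F#, the GetL1 pipeline boils down to flatten, count, and filter by minimum support. A rough Python equivalent (a sketch, not the post's code), with Counter standing in for Seq.countBy:

```python
from collections import Counter

def get_l1(itemsets, min_support):
    """itemsets: one list of items per transaction."""
    n = len(itemsets)  # numberOfTransactions: transactions, not items
    # Seq.collect flattens; Counter plays the role of Seq.countBy
    counts = Counter(item for items in itemsets for item in items)
    # keep items at or above minimum support, sorted like Seq.sort
    return sorted((item, count) for item, count in counts.items()
                  if count / n >= min_support)
```

On the example data with support .5, D drops out (1 of 4 transactions) and A, B, C, and E survive, matching the grid above.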


So this is great. My next step was to build a list of pair combinations of the remaining values.

The trick is that it is not a Cartesian join of the original sets – only the surviving sets are needed. My first attempt looked like this:
- let C1 = database
-
- let L1 =
-     C1
-     |> Seq.map(fun t -> t.Items)
-     |> Seq.collect(fun i -> i)
-     |> Seq.countBy(fun i -> i)
-     |> Seq.map(fun (t,i) -> t, i, float i / float numberOfTransactions)
-     |> Seq.filter(fun (t,i,p) -> p >= support)
-     |> Seq.toArray
- let C2A =
-     L1
-     |> Seq.map(fun (x,y,z) -> x)
-     |> Seq.toArray
- let C2B =
-     L1
-     |> Seq.map(fun (x,y,z) -> x)
-     |> Seq.toArray
- let C2 = C2A |> Seq.collect(fun x -> C2B |> Seq.map(fun y -> x + y))
- C2
With the output like this:

I was running out of Saturday morning, so I went over to Stack Overflow and got a couple of responses. I was on the right track with the concat, but I didn't think about List.filter, which would prune my list. With this in mind, I copied Mark's code and got what I was looking for:
- member this.GetC2() =
-     let l1Itemset =
-         this.GetL1()
-         |> Seq.map(fun (i,s) -> i)
-
-     let itemset =
-         l1Itemset
-         |> Seq.map(fun x -> l1Itemset |> Seq.map(fun y -> (x,y)))
-         |> Seq.concat
-         |> Seq.filter(fun (x,y) -> x < y)
-         |> Seq.sort
-         |> Seq.toList
-
-     let listContainsItem(l: List<string>, a, b) =
-         l.Contains(a) && l.Contains(b)
-
-     let someFunctionINeedToRename(l1: List<string>, l2) =
-         l2 |> Seq.map(fun (x,y) -> listContainsItem(l1,x,y))
-
-     let itemsetMatches =
-         this.GetC1()
-         |> Seq.map(fun t -> t.Items)
-         |> Seq.map(fun i -> someFunctionINeedToRename(i,itemset))
-
-     let itemSupport =
-         itemsetMatches
-         |> Seq.map(Seq.map(fun i -> if i then 1 else 0))
-         |> Seq.reduce(Seq.map2(+))
-
-     itemSupport
-     |> Seq.zip(itemset)
-     |> Seq.toList
So now I have C2 filling correctly:
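In Python terms, the C2 step amounts to generating ordered pairs from the surviving L1 items and then counting how many transactions contain both members of each pair. A sketch (not Mark's code), using itertools.combinations, which enforces the x < y filter for free:

```python
from itertools import combinations

def get_c2(transactions, l1_items):
    """transactions: lists of items; l1_items: items that survived L1."""
    # combinations of the sorted items gives each unordered pair once,
    # equivalent to Seq.filter(fun (x,y) -> x < y) on the cross product
    pairs = list(combinations(sorted(l1_items), 2))
    # count transactions containing both members of each pair
    return [(pair, sum(1 for t in transactions if set(pair) <= set(t)))
            for pair in pairs]
```

On the example data this yields six candidate pairs, with {B, E} appearing in three transactions – the same counts as the grid below.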

Taking the results, I needed to get L2.

That was much simpler than getting C2 – here is the code:
- member this.GetL2() =
-     let numberOfTransactions = this.GetC1().Count
-
-     this.GetC2()
-     |> Seq.map(fun (i,n) -> i, n, float n / float numberOfTransactions)
-     |> Seq.filter(fun (i,n,p) -> p >= support)
-     |> Seq.map(fun (i,n,p) -> i,n)
-     |> Seq.sort
-     |> Seq.toList
And when I run it – it matches this example exactly:

Finally, I added in C3 and L3. This code is identical to the C2/L2 code with one exception: it maps a triple instead of a tuple. The C2 code maps like this:
- let itemset =
-     l1Itemset
-     |> Seq.map(fun x -> l1Itemset |> Seq.map(fun y -> (x,y)))
-     |> Seq.concat
-     |> Seq.filter(fun (x,y) -> x < y)
-     |> Seq.sort
-     |> Seq.toList
and the C3 code looks like this (the triple-nested map took me 15 minutes to figure out):
- let itemset =
-     l2Itemset
-     |> Seq.map(fun x -> l2Itemset |> Seq.map(fun y -> l2Itemset |> Seq.map(fun z -> (fst x, fst y, snd z))))
-     |> Seq.concat
-     |> Seq.collect(fun d -> d)
-     |> Seq.filter(fun (x,y,z) -> x < y && y < z)
-     |> Seq.distinct
-     |> Seq.sort
-     |> Seq.toList
With the C3 and L3 matching the example also:
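The triple-generation step can also be written without the three nested maps. A Python sketch (mine, not the post's approach) of the same x < y < z idea, using itertools.combinations over the items that survive into L2:

```python
from itertools import combinations

def get_c3_candidates(l2_pairs):
    """l2_pairs: the frequent pairs from L2, e.g. [("A","C"), ("B","C"), ...]."""
    # collect the distinct items that appear in any surviving pair
    items = sorted({i for pair in l2_pairs for i in pair})
    # each 3-combination is emitted once in sorted order, which is
    # equivalent to the x < y && y < z filter plus Seq.distinct
    return list(combinations(items, 3))
```

With the example's L2 of {A,C}, {B,C}, {B,E}, {C,E}, this produces four candidate triples including (B, C, E) – the one that survives into L3.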


I was now ready to feed the elevator data into the analysis. I think I am getting better at F# because I did the mapping, filtering, and transformation of the data from the server without looking at any other material, and it took only 15 minutes:
- type public ElevatorBuilder() =
-     let connectionString = ConfigurationManager.ConnectionStrings.["localData2"].ConnectionString
-
-     member public this.GetElevatorTransactions() =
-         let transactions =
-             this.GetElevators()
-             |> Seq.map(fun e -> this.ConvertElevatorToTransaction(e))
-         let transactionsList = new System.Collections.Generic.List<Transaction>(transactions)
-         transactionsList
-
-     member public this.ConvertElevatorToTransaction(i: string, t: string, c: string, s: string) =
-         let items = new System.Collections.Generic.List<string>()
-         items.Add(t)
-         items.Add(c)
-         items.Add(s)
-         let transaction = {TID=i; Items=items}
-         transaction
-
-     member public this.GetElevators() =
-         SqlConnection.GetDataContext(connectionString).ElevatorData201402
-         |> Seq.map(fun e -> e.ID, e.EquipType, e.Capacity, e.Speed)
-         |> Seq.filter(fun (i,et,c,s) -> not(String.IsNullOrEmpty(et)))
-         |> Seq.filter(fun (i,et,c,s) -> c.HasValue)
-         |> Seq.filter(fun (i,et,c,s) -> s.HasValue)
-         |> Seq.map(fun (i,t,c,s) -> i, this.CatagorizeEquipmentType(t), c, s)
-         |> Seq.map(fun (i,t,c,s) -> i, t, this.CatagorizeCapacity(c.Value), s)
-         |> Seq.map(fun (i,t,c,s) -> i, t, c, this.CatagorizeSpeed(s.Value))
-         |> Seq.map(fun (i,t,c,s) -> i.ToString(), t, c, s)
The longest part was aggregating the free-form text of the Equipment Type field (here is a partial snip – you get the idea…):
- member public this.CatagorizeEquipmentType(et: string) =
-     match et.Trim() with
-     | "OTIS" -> "OTIS"
-     | "OTIS (1-2)" -> "OTIS"
-     | "OTIS (2-1)" -> "OTIS"
-     | "OTIS hydro" -> "OTIS"
-     | "OTIS, HYD" -> "OTIS"
-     | "OTIS/ ASHEVILLE " -> "OTIS"
-     | "OTIS/ MOUNTAIN " -> "OTIS"
-     | "OTIS/#1" -> "OTIS"
-     | "OTIS/#19 " -> "OTIS"
Assigning categories for speed and capacity was a snap using F#:
- member public this.CatagorizeCapacity(c: int) =
-     let lowerBound = (c/25 * 25) + 1
-     let upperBound = lowerBound + 24
-     lowerBound.ToString() + "-" + upperBound.ToString()
-
- member public this.CatagorizeSpeed(s: int) =
-     let lowerBound = (s/50 * 50) + 1
-     let upperBound = lowerBound + 49
-     lowerBound.ToString() + "-" + upperBound.ToString()
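Both functions use the same integer-division trick: divide by the bucket width, multiply back up, and add one to get the lower bound. A generic Python sketch of that arithmetic (the categorize name is mine, for illustration):

```python
def categorize(value, width):
    """Bucket a value into a width-sized range label, mirroring the
    (v / width * width) + 1 integer arithmetic in the F# above."""
    lower = (value // width) * width + 1   # integer division floors
    upper = lower + width - 1              # e.g. +24 for width 25
    return f"{lower}-{upper}"
```

So a speed of 120 with width 50 lands in "101-150", and a capacity of 2499 with width 25 lands in "2476-2500", collapsing the raw numbers into a small set of category strings.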
With this in hand, I created a console app that takes the 27K records and pushes them through the Apriori algorithm:
- private static void RunElevatorAnalysis()
- {
-     Stopwatch stopwatch = new Stopwatch();
-     stopwatch.Start();
-     ElevatorBuilder builder = new ElevatorBuilder();
-     var transactions = builder.GetElevatorTransactions();
-     stopwatch.Stop();
-     Console.WriteLine("Building " + transactions.Count + " transactions took: " + stopwatch.Elapsed.TotalSeconds);
-     var apriori = new Apriori(transactions, .1, .75);
-     var c2 = apriori.GetC2();
-     stopwatch.Reset();
-     stopwatch.Start();
-     var l1 = apriori.GetL1();
-     Console.WriteLine("Getting L1 took: " + stopwatch.Elapsed.TotalSeconds);
-     var l2 = apriori.GetL2();
-     Console.WriteLine("Getting L2 took: " + stopwatch.Elapsed.TotalSeconds);
-     var l3 = apriori.GetL3();
-     Console.WriteLine("Getting L3 took: " + stopwatch.Elapsed.TotalSeconds);
-     stopwatch.Stop();
-     Console.WriteLine("--L1");
-     foreach (var t in l1)
-     {
-         Console.WriteLine(t.Item1 + ":" + t.Item2);
-     }
-     Console.WriteLine("--L2");
-     foreach (var t in l2)
-     {
-         Console.WriteLine(t.Item1 + ":" + t.Item2);
-     }
-     Console.WriteLine("--L3");
-     foreach (var t in l3)
-     {
-         Console.WriteLine(t.Item1 + ":" + t.Item2);
-     }
- }
I then made an offering to the F# Gods and hit F5:

D'oh! The gods were not pleased. I then went back to my initial filtering function, added a Seq.take 25000, and here are the results:

So there are a few things to draw from this exercise.
1) The Apriori algorithm is the wrong technique for this dataset. I had to bring the support way down (10%) to even get any readings, and there is too much dispersion in the values. This kind of algorithm is much better suited to many transactions drawn from a small set of possible values than to a fixed number of items drawn from a large set of values.
2) Even so, how cool is this? Compare the files needed just to make the C#/OO version work versus the F# version:


And the total LOC is 539 for C# versus 120 for F# – and the F# can be optimized with a better way to create the candidate itemsets. Hard-coding each level was a hack I did to get things working and give me an understanding of how the Apriori algorithm works. I bet this can be consolidated to well under 75 lines without sacrificing readability.
3) I think the StackOverflowException is because I am doing a Cartesian join and then paring down the result. Using one of the other techniques suggested on Stack Overflow should give much better results.
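For what it's worth, the classic Apriori candidate-generation step avoids the Cartesian join entirely: it joins Lk with itself on the first k-1 items and then prunes any candidate that has an infrequent subset. A hypothetical Python sketch (not code from this post):

```python
def apriori_gen(lk):
    """lk: sorted list of sorted k-tuples (the frequent k-itemsets)."""
    lk_set = set(lk)
    candidates = []
    for a in lk:
        for b in lk:
            # join step: same first k-1 items, strictly larger last item
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # prune step: every k-subset of c must itself be frequent
                if all((c[:i] + c[i+1:]) in lk_set for i in range(len(c))):
                    candidates.append(c)
    return candidates
```

On the example data this produces just ("B", "C", "E") straight from L2, instead of enumerating every triple and filtering afterwards.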
In any event, what a fun project! I can't wait to optimize this and perhaps throw a different algorithm at the dataset in the coming weeks.