Consuming Sky Biometry’s Image Recognition API

I was looking at Sky Biometry a couple of days ago to do some facial recognition.  I am not sure how many other companies out there do this, but working with their API from F# was a great experience. 

Sky Biometry's API speaks both JSON and XML, and they also provide an example client library in C#.  I decided to use F# and the JSON type provider to explore their API and the quality of their recognition algorithms.  They have a good documentation page, and it didn't take very long to get my account up and running and start making API calls.

I thought it would be fun to use an example from the first Terminator movie, when Ah-nold went around looking for Sarah Connor.  I picked two images from the movie series.  The first is the photograph of Sarah Connor taken at the end of the first movie.  I know that the Terminator didn't have this photograph in the movie but, working off-script, pretend that it did.  The second image is from the first movie, when she is in a bar.  So if the Terminator was powered by Sky Biometry and found Sarah in the bar, how closely would it match her to the first photo?

[image]

The first thing I did was to fire up an F# project in Visual Studio and start scripting the API calls that I would need to do facial recognition.

[image]

With all of the calls working in the REPL, I then moved the code into my module.  I declared the types at the top and then created a class that could be consumed by external projects.

[image]

You will notice that the type providers are using a local copy of the JSON to infer the types in the module.  I ran into the problem where the type provider using the web call was not capturing the full graph in the type definition, so I took the JSON and made it local.  This led to an interesting problem because an F# project out of the box in Visual Studio does not support adding folders.  To get around that, I went to my file system and added a folder.

[image]

I then created files in the folder that correspond to the type providers:

[image]

Next, I unloaded the F# project and edited it,

[image] [image]

and in the project file, I added those files to the ItemGroup that brings in all of the files from disk:

<ItemGroup>
  <Compile Include="SkyBiometryImageComparer.fs" />
  <None Include="Script.fsx" />
  <None Include="packages.config" />
  <None Include="app.config" />
  <None Include="SkyBiometryImageJson/AddTags.json" />
  <None Include="SkyBiometryImageJson/FaceDetection.json" />
  <None Include="SkyBiometryImageJson/FaceRecognition.json" />
  <None Include="SkyBiometryImageJson/FaceTraining.json" />
  <None Include="SkyBiometryImageJson/RemoveTags.json" />
</ItemGroup>

Once the project is reloaded, the folder and the files show up in Solution Explorer.

[image]

 

In each of the .json files, I pasted in the results from the service call that I did in the REPL.  For example,

[image]

So then I swapped them out in the module, and now I get the full graph:

[image]
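
Since the screenshots can be hard to read, here is a minimal sketch of what the swapped-out declarations might look like, assuming FSharp.Data's JsonProvider and the sample-file names from the ItemGroup above (the exact names in my module may differ):

open FSharp.Data

// Type providers pointed at the local sample files; the Load calls at runtime
// still hit the live service.
type skybiometryFaceDetection = JsonProvider<"SkyBiometryImageJson/FaceDetection.json">
type skybiometryAddTags = JsonProvider<"SkyBiometryImageJson/AddTags.json">
type skybiometryFaceTraining = JsonProvider<"SkyBiometryImageJson/FaceTraining.json">
type skybiometryFaceRecognition = JsonProvider<"SkyBiometryImageJson/FaceRecognition.json">
type skybiometryRemoveTags = JsonProvider<"SkyBiometryImageJson/RemoveTags.json">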

With that out of the way, I implemented a one-to-one mapping of supporting methods to the API calls:

member this.DetectFace(imageUri:string) =
    let stringBuilder = new StringBuilder()
    stringBuilder.Append(skybiometryUri) |> ignore
    stringBuilder.Append("/fc/faces/detect.json?urls=") |> ignore
    stringBuilder.Append(imageUri) |> ignore
    stringBuilder.Append("&api_key=") |> ignore
    stringBuilder.Append(apiKey) |> ignore
    stringBuilder.Append("&api_secret=") |> ignore
    stringBuilder.Append(apiSecret) |> ignore
    let faceDetection = skybiometryFaceDetection.Load(stringBuilder.ToString())
    faceDetection.Photos.[0].Tags.[0].Tid

member this.SaveTag(tid:string) =
    let stringBuilder = new StringBuilder()
    stringBuilder.Append(skybiometryUri) |> ignore
    stringBuilder.Append("/fc/tags/save.json?uid=") |> ignore
    stringBuilder.Append(uid) |> ignore
    stringBuilder.Append("&tids=") |> ignore
    stringBuilder.Append(tid) |> ignore
    stringBuilder.Append("&api_key=") |> ignore
    stringBuilder.Append(apiKey) |> ignore
    stringBuilder.Append("&api_secret=") |> ignore
    stringBuilder.Append(apiSecret) |> ignore
    let tags = skybiometryAddTags.Load(stringBuilder.ToString())
    tags.Status

member this.TrainFace() =
    let stringBuilder = new StringBuilder()
    stringBuilder.Append(skybiometryUri) |> ignore
    stringBuilder.Append("/fc/faces/train.json?uids=") |> ignore
    stringBuilder.Append(uid) |> ignore
    stringBuilder.Append("&api_key=") |> ignore
    stringBuilder.Append(apiKey) |> ignore
    stringBuilder.Append("&api_secret=") |> ignore
    stringBuilder.Append(apiSecret) |> ignore
    let training = skybiometryFaceTraining.Load(stringBuilder.ToString())
    training.Status

member this.RecognizeFace(imageUri:string) =
    let stringBuilder = new StringBuilder()
    stringBuilder.Append(skybiometryUri) |> ignore
    stringBuilder.Append("/fc/faces/recognize.json?uids=") |> ignore
    stringBuilder.Append(uid) |> ignore
    stringBuilder.Append("&urls=") |> ignore
    stringBuilder.Append(imageUri) |> ignore
    stringBuilder.Append("&api_key=") |> ignore
    stringBuilder.Append(apiKey) |> ignore
    stringBuilder.Append("&api_secret=") |> ignore
    stringBuilder.Append(apiSecret) |> ignore
    let recognition = skybiometryFaceRecognition.Load(stringBuilder.ToString())
    if recognition.Photos.[0].Tags |> Seq.length > 0 then
        recognition.Photos.[0].Tags.[0].Attributes.Face.Confidence
    else
        0

member this.RemoveTag(tid:string) =
    let stringBuilder = new StringBuilder()
    stringBuilder.Append(skybiometryUri) |> ignore
    stringBuilder.Append("/fc/tags/remove.json?tids=") |> ignore
    stringBuilder.Append(tid) |> ignore
    stringBuilder.Append("&api_key=") |> ignore
    stringBuilder.Append(apiKey) |> ignore
    stringBuilder.Append("&api_secret=") |> ignore
    stringBuilder.Append(apiSecret) |> ignore
    let tagRemoval = skybiometryRemoveTags.Load(stringBuilder.ToString())
    tagRemoval.Status

The StringBuilder makes the code pretty verbose, but I understand that it is the most efficient way to aggregate strings, so I went with it.  Also note that this is happy-path programming with no error checking; I assume that the JSON coming back is well formed.
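
For what it's worth, here is a sketch of the same URI built with sprintf instead (my alternative, not what is in the project); for a handful of short strings, the readability may be worth more than StringBuilder's efficiency:

// The DetectFace URI from above, rewritten with a format string.
let detectUri =
    sprintf "%s/fc/faces/detect.json?urls=%s&api_key=%s&api_secret=%s"
        skybiometryUri imageUri apiKey apiSecret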

In any event, with the supporting methods done, it was just an exercise of calling each one in turn:

member this.CalculateFacialRecognitionConfidence(baseUri:string, comparisonUri:string) =
    let tid = this.DetectFace(baseUri)
    if this.SaveTag(tid) = "success" then
        if this.TrainFace() = "success" then
            let confidence = this.RecognizeFace(comparisonUri)
            this.RemoveTag(tid) |> ignore
            confidence
        else
            0
    else
        0

And hopping over to a C# unit test project, I can call the F# code and run some tests.  I created a test for each of the supporting methods and then three happy-path tests for CalculateFacialRecognitionConfidence: comparing the same image should be 100, comparing two completely unrelated images should be 0, and our exercise of identifying Sarah Connor should be better than 50/50.  Here is an example of the 100% match use case:

[image]

And here is the actual test of finding Sarah Connor:

var imageComparer = new SkyBiometryImageComparer(skyBiometryUri, uid, apiKey, apiSecret);
String basePhotoUri = "http://img2.wikia.nocookie.net/__cb20080226002749/terminator/images/thumb/7/75/T2_sarah_polaroid.jpg/300px-T2_sarah_polaroid.jpg";
String targetPhotoUri = "http://emandrews2.files.wordpress.com/2013/04/sarah-connor-scared.jpg";
Int32 confidence = imageComparer.CalculateFacialRecognitionConfidence(basePhotoUri, targetPhotoUri);

bool expected = true;
bool actual = confidence > 50;
Assert.AreEqual(expected, actual);
}

It runs green (the actual value is 58% confidence for identifying Sarah Connor).

[image]

My guess is that the Terminator would start shooting… first at Sarah Connor and then at C#.  The C# project that Sky Biometry provides has two classes of 1,600 and 1,420 lines of code to achieve functional equivalence with the 93 lines of F# that I wrote (and the vast majority of that F# is dealing with the StringBuilder).

 

TRINUG F# Analytics SIG Prep

I am putting the finishing touches on the lab that the F#/data analytics group will do on Wednesday. We are working through Johan Astborg's awesome book F# for Quantitative Finance.  Since the group is pretty much all C# with some bi-curious F#ers, I created a solution that mixes both C# and F#.  I put F# in the places where the language really shines: the data layer and the analytical layer.  I put C# in the places where C# has popular support: the UI and unit tests.

[image]

In this lab, I am introducing script files and the REPL for the first time as a way to 'prove out' concepts.  Once the idea is sufficiently hashed out in the REPL, the code is moved into a source file and corresponding unit tests are created.  For example, in the data layer, we use the CSV type provider to pull down the Yahoo data and save and retrieve it from disk:

#r @"C:\Users\Jamie\Desktop\NewCo.OptionsTradingProgram.Solution\packages\FSharp.Data.2.0.8\lib\portable-net40+sl5+wp8+win8\FSharp.Data.dll"
#r @"C:\Users\Jamie\Desktop\NewCo.OptionsTradingProgram.Solution\packages\Newtonsoft.Json.6.0.3\lib\net45\Newtonsoft.Json.dll"

open System
open FSharp.Data
open Newtonsoft.Json
open System.IO

type csvProvider = CsvProvider<"http://ichart.finance.yahoo.com/table.csv?s=MSFT">

type YahooStockProvider() =
    member this.GetData(stockSymbol: string) =
            csvProvider.Load("http://ichart.finance.yahoo.com/table.csv?s=" + stockSymbol).Rows

type FileSystemStockProvider() =
    member this.PutData(filePath:string, stockData) =
        let serializedData = stockData
                                |> Seq.map(fun row -> JsonConvert.SerializeObject(row))
        File.WriteAllLines(filePath, serializedData)

    member this.GetData(filePath:string) =
        let serializedData = File.ReadAllLines(filePath)
        serializedData
            |> Seq.map(fun row -> JsonConvert.DeserializeObject<(DateTime*float*float*float*float*int*float)>(row))
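
To give a feel for how the two providers compose, here is a hypothetical REPL session (the file path is made up):

// Round trip: pull MSFT rows from Yahoo, persist them as JSON lines, read them back.
let yahoo = YahooStockProvider()
let fileSystem = FileSystemStockProvider()
fileSystem.PutData(@"C:\Data\MSFT.txt", yahoo.GetData("MSFT"))
let restored = fileSystem.GetData(@"C:\Data\MSFT.txt")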

In the analytics module, we spend a good amount of time implementing mathematical calculations and getting comfortable with some of the differences between how imperative C# might implement a solution and how F# would:

open System.Collections.Generic

let testData = [1.0 .. 6.0]
let sum = testData |> Seq.sum
let average = testData |> Seq.average
let min = testData |> Seq.min
let max = testData |> Seq.max
let evens = testData |> Seq.filter(fun number -> number % 2. = 0.)
let addOne = testData |> Seq.map(fun number -> number + 1.)

//http://www.mathsisfun.com/data/standard-deviation.html
let variance (source:IEnumerable<double>) =
    let mean = Seq.average source
    let deltas = Seq.map(fun x -> pown (x - mean) 2) source
    Seq.average deltas

let standardDeviation(values:IEnumerable<double>) =
    sqrt(variance(values))

let movingAverage(values:IEnumerable<double>, windowSize:int) =
    values
        |> Seq.windowed (windowSize)
        |> Seq.map(fun window -> window |> Seq.average)

let movingStandardDeviation(values:IEnumerable<double>, windowSize:int) =
    values
        |> Seq.windowed (windowSize)
        |> Seq.map(fun window -> window |> standardDeviation)

let bollingerBands (values:IEnumerable<double>, windowSize:int) =
    let movingAverage = movingAverage(values, windowSize)
    let movingStandardDeviation = movingStandardDeviation(values, windowSize)
    let movingStandardDeviation' = movingStandardDeviation |> Seq.map(fun window -> window * 2.)
    Seq.zip movingAverage movingStandardDeviation'
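
As a quick REPL sanity check of the pipeline above (my own numbers, not part of the lab):

// With a window of 3 over 1.0 .. 10.0, the first tuple should be the moving
// average of [1.0; 2.0; 3.0] (2.0) paired with twice its standard deviation
// (about 1.63).
let testValues = [1.0 .. 10.0] |> List.toSeq
let bands = bollingerBands(testValues, 3)
bands |> Seq.iter(fun (average, band) -> printfn "%f +/- %f" average band)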

Finally, we have a WPF application that does some basic charting and graphing.  The point of this lab is not about building a snazzy UI.  The UI (and, in fact, the data layer) are just plug-ins to the heart of the application – the analytical module.  Also, a majority of TRINUG's content focuses on the UI, so there is plenty of that already.  In any event, here is the code-behind in C#:

private void GoButton_Click(object sender, RoutedEventArgs e)
{
    FileSystemStockProvider provider = new FileSystemStockProvider(@"C:\Data\TEST.txt");
    var stockPrices = provider.GetData().Take(20);
    this.StockPriceDataGrid.ItemsSource = stockPrices;

    var adjustedClosePrices = from stockPrice in stockPrices
                              select stockPrice.Item7;

    var dates = from stockPrice in stockPrices.Skip(2)
                select new { stockPrice.Item1 };

    var calculations = new Calculations();
    var movingAverage = calculations.MovingAverage(adjustedClosePrices, 3);
    var movingAverages = dates.Zip(movingAverage, (d, p) => new { date = d.Item1, price = p });

    var bollingerBands = calculations.BollingerBands(adjustedClosePrices, 3);
    var upperBandBands = dates.Zip(bollingerBands, (d, bb) => new { date = d.Item1, upperBand = bb.Item1 + (bb.Item2 * 2) });
    var lowerBandBands = dates.Zip(bollingerBands, (d, bb) => new { date = d.Item1, lowerBand = bb.Item1 + (bb.Item2 * 2) * -1 });

    this.stockPriceLineGraph.DependentValuePath = "price";
    this.stockPriceLineGraph.IndependentValuePath = "date";
    this.stockPriceLineGraph.ItemsSource = movingAverages;

    this.stockPriceLineGraph2.DependentValuePath = "upperBand";
    this.stockPriceLineGraph2.IndependentValuePath = "date";
    this.stockPriceLineGraph2.ItemsSource = upperBandBands;

    this.stockPriceLineGraph3.DependentValuePath = "lowerBand";
    this.stockPriceLineGraph3.IndependentValuePath = "date";
    this.stockPriceLineGraph3.ItemsSource = lowerBandBands;
}

 

I am toying with not using LINQ and doing this all via imperative code, which would really drive home the point and power of a functional approach.  Here is the UI when running:

[image]

This was a lot of fun to do, and I am looking forward to the next lab, which is implementing the remainder of Astborg's book.

 

 

F# For Quantitative Finance

I picked up Johan Astborg's F# for Quantitative Finance a couple of days ago and I thought it was great.  It is a practical, hands-on way of working with F#.  It is so good, I think I am going to use it as the basis for a couple of Triangle .NET User Group F# SIGs.  One of the great things about the book is the code samples – they expose features of the F# language in a way that allows beginners to understand.  So in the trade-off between clarity and cleverness, Astborg errs on the side of clarity.  The only thing I think I would change would be to remove Chapter 2.  Chapter 2 suffers the same problem that many "intro to X computer language" books have –> pages of language features with small, independent samples.  This kind of book does not help many (if any) people learn the language easily, and it reads about as well as the MSDN component docs.

I did notice that Astborg uses the typical scientific-computing approach to variable naming.  For example, on page 141 he creates a function like this:

member this.black_scholes call_put_flag s x t r v =
    let cnd x =
        let a1 = 0.3198253
        let a2 = -0.35653782
        let a3 = 1.781477937
        let a4 = -1.821255978
        let a5 = 1.330274429
        let pi = 3.141592654
        let l = abs(x)
        let k = 1.0/(1.0 + 0.2316419)
        let w = (1.0)
        if x < 0.0 then 1.0 - w else w

    let d1 = (log(s/x) + (r*v*v*0.5)*t)/(v*sqrt(t))
    let d2 = d1 - v*sqrt(t)
    match call_put_flag with
        | Put -> x*exp(-r*t)*cnd(-d2)-s*cnd(-d1)
        | Call -> s*cnd(d1)-x*exp(-r*t)*cnd(d2)

Concentrating only on the function's arguments, what are s, x, t, r, and v?  It is not immediately apparent, and you have to read the comments above the function to discover that s is the stock price, etc.  I think there is a better way, and I wonder if FxCop (if it ever comes to F#) would agree with me.  For example, this is a slightly better version:

member this.black_scholes call_put_flag stockPrice strikePriceOfOption timeToExpirationInYears riskFreeInterestRate volatility =
    let cnd x =

The problem is that the number of arguments makes the naming unwieldy.  A third, and much preferred, option is to pass in a data structure:

type PutCallFlag = Put | Call

type black_scholes_input_data =
    {stockPrice:float;
     strikePriceOfOption:float;
     timeToExpirationInYears:int;
     riskFreeInterestRate:float;
     volatility:float}

type StockAnalyzer() =
    member this.black_scholes (call_put_flag:PutCallFlag, inputData:black_scholes_input_data) =
        let cnd x =
            let a1 = 0.3198253
            let a2 = -0.35653782
            let a3 = 1.781477937
            let a4 = -1.821255978
            let a5 = 1.330274429
            let pi = 3.141592654
            let l = abs(x)
            let k = 1.0/(1.0 + 0.2316419)
            let w = (1.0)
            if x < 0.0 then 1.0 - w else w

        let d1 = (log(inputData.stockPrice/inputData.strikePriceOfOption)

I doubt there is a performance hit, and the code becomes much more readable.  It also forces the really long functions to be broken up, and each piece becomes more testable.  If there is not a corresponding domain for the input arguments, then that naming will have to do.
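
To make the point concrete, here is a hypothetical call site for the record-based signature (the numbers are made up):

// The field names travel with the values, so the call is self-documenting.
let input =
    {stockPrice = 60.0;
     strikePriceOfOption = 65.0;
     timeToExpirationInYears = 1;
     riskFreeInterestRate = 0.08;
     volatility = 0.3}
let analyzer = StockAnalyzer()
let price = analyzer.black_scholes (Call, input)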

 

Using Subsets for Association Rule Learning

I finished up writing the association rule program from MSDN in F# last week.  One of the things bothering me about the way I implemented the algorithms is that I hard-coded the combinations (antecedent and consequent) from the item-sets:

  1. static member GetCombinationsForDouble(itemSet: int[]) =
  2.     let combinations =  new List<int[]*int[]*int[]>()
  3.     combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1]|])
  4.     combinations
  5.  
  6. static member GetCombinationsForTriple(itemSet: int[]) =
  7.     let combinations =  new List<int[]*int[]*int[]>()
  8.     combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1];itemSet.[2]|])
  9.     combinations.Add(itemSet, [|itemSet.[1]|],[|itemSet.[0];itemSet.[2]|])
  10.     combinations.Add(itemSet, [|itemSet.[2]|],[|itemSet.[0];itemSet.[1]|])
  11.     combinations.Add(itemSet, [|itemSet.[0];itemSet.[1]|],[|itemSet.[2]|])
  12.     combinations.Add(itemSet, [|itemSet.[0];itemSet.[2]|],[|itemSet.[1]|])
  13.     combinations.Add(itemSet, [|itemSet.[1];itemSet.[2]|],[|itemSet.[0]|])
  14.     combinations

I thought it would be a fun exercise to make a function that returns the combinations for any number of item-sets.  My first several attempts failed because I started off with the wrong vocabulary.  I spent several days trying to determine how to create all of the combinations and/or permutations from the item-set.  It then hit me that what I was really looking for was all subsets, and what do you know, there are some excellent examples out there.

So if I was going to use the yield and yield-bang method of calculating the subsets in my class, I first needed to remove the rec and just let the class call itself:

static member Subsets s =
    set [
        yield s
        for e in s do
            yield! AssociationRuleProgram2.Subsets (Set.remove e s) ]
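
A quick REPL check (my own) shows the recursion terminating with all 2^n subsets, empty set included:

// set [1; 2; 3] comes back as 8 subsets.
let subsets = AssociationRuleProgram2.Subsets (set [1; 2; 3])
// val subsets : Set<Set<int>> = set [set []; set [1]; set [1; 2]; set [1; 2; 3]; ...]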

I then needed a way of translating the item-set, which is an int array, into a set and back again.  Fortunately, the Set module has ofArray and toArray functions, so I wrote my code exactly the way I just described the problem:

static member GetAntecedentAndConsequent(itemSet: int[]) =
    let combinations = new List<int[]*int[]*int[]>()
    let itemSet' = Set.ofArray itemSet
    let subSets = AssociationRuleProgram2.Subsets itemSet'
    let subSets' = Set.toArray subSets
    let subSets'' = subSets' |> Array.map(fun s -> Set.toArray s)
    let subSets''' = subSets'' |> Array.map(fun s -> Seq.toArray s, AssociationRuleProgram2.GetAntecedentAndConsequent s)

 

Note that I had to call toArray twice because Subsets returns a Set<Set<int>>.

In any event, I then needed a way of splitting the item-set into antecedents and consequents (called combinations) based on the current subset.  I toyed around with a couple of different ways of solving the problem when I stumbled upon a way that makes a lot of sense to me.  I changed the item-set from an array of int to an array of tuple<int*bool>.  If the element is in the subset, then the bool flag is true; if not, it is false.  Then I apply a Seq.filter to the array and separate it out into antecedents and consequents.

static member GetCombination array subArray =
    let array' = array |> Seq.map(fun i -> i, subArray |> Array.exists(fun j -> i = j))
    let antecedent = array' |> Seq.filter(fun (i,j) -> j = true) |> Seq.toArray
    let consequent = array' |> Seq.filter(fun (i,j) -> j = false) |> Seq.toArray
    let antecedent' = antecedent |> Seq.map(fun (i,j) -> i)
    let consequent' = consequent |> Seq.map(fun (i,j) -> i)
    Seq.toArray antecedent', Seq.toArray consequent'
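
A quick REPL check (my own) of the split:

// Splitting the item-set [|3; 4; 7|] on the subset [|4; 7|] yields the
// antecedent [|4; 7|] and the consequent [|3|].
let antecedent, consequent = AssociationRuleProgram2.GetCombination [|3; 4; 7|] [|4; 7|]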

The major downside of this approach is that I am using Array.exists for my filter flag, so if there is more than one of the same value in the item-set, it does not work.  However, the original example had each item-set being unique, so I think I am OK.

So with these two methods, I now have a way of dealing with any number of item-sets.  Interestingly, the amount of code (even with my verbose F#) is significantly less than the C# equivalent and closer to how I actually think.

 

 

 

Association Rule Problem: Part 3

After spending a couple of weeks working through the imperative code, I decided to approach the problem from an F#/functional point of view.  Going back to the original article, there are several steps that McCaffrey walks through:

  • Get a series of transactions
  • Get the frequent item-sets for the transactions
  • For each item-set, get all possible combinations.  Each combination is broken into an antecedent and consequent
  • Count the frequency of each antecedent in all of the transactions
  • If the ratio of the combination's frequency to the antecedent's frequency (the confidence) is greater than the minimum confidence level, include it in the final set

For the purposes of this article, Step #1 and Step #2 were already done.  My code starts with step #3.  Instead of for..eaching and if..thening my way through the item-sets, I decided to look at how permutations and combinations are done in F#.  Interestingly, one of the first articles on permutations and combinations on Google is from McCaffrey in MSDN from four years ago.  Unfortunately, that article was of limited use because the code is decidedly non-functional, so it might as well have been written in C# (this was pointed out in the comments).  Going to Stack Overflow, there are plenty of good examples of getting combinations in F#, there and elsewhere.  After playing with the code samples for a bit (my favorite one was this), it hit me that the ordinal positions are the same for an array of size X.  So going back to McCaffrey's example, there are only item-sets of length 2 and 3.  Therefore, I can hard-code the results and leave the actual calculation for another time.

static member GetCombinationsForDouble(itemSet: int[]) =
    let combinations = new List<int[]*int[]*int[]>()
    combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1]|])
    combinations

static member GetCombinationsForTriple(itemSet: int[]) =
    let combinations = new List<int[]*int[]*int[]>()
    combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1];itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[1]|],[|itemSet.[0];itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[2]|],[|itemSet.[0];itemSet.[1]|])
    combinations.Add(itemSet, [|itemSet.[0];itemSet.[1]|],[|itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[0];itemSet.[2]|],[|itemSet.[1]|])
    combinations.Add(itemSet, [|itemSet.[1];itemSet.[2]|],[|itemSet.[0]|])
    combinations

I used a tuple to represent the antecedent array and consequent array values.  I then spun up a unit test to compare results based on McCaffrey’s detailed example:

[TestMethod]
public void GetValuesForATriple_ReturnsExpectedValue()
{
    var expected = new List<Tuple<int[], int[]>>();
    expected.Add(Tuple.Create<int[], int[]>(new int[1] { 3 }, new int[2] { 4, 7 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[1] { 4 }, new int[2] { 3, 7 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[1] { 7 }, new int[2] { 3, 4 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[2] { 3, 4 }, new int[1] { 7 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[2] { 3, 7 }, new int[1] { 4 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[2] { 4, 7 }, new int[1] { 3 }));

    var itemSet = new int[3] { 3, 4, 7 };
    var actual = FS.AssociationRuleProgram2.GetCombinationsForTriple(itemSet);

    Assert.AreEqual(expected.Count, actual.Count);
}

A couple of things to note about the unit test:

1) The rules about variable naming and whatnot that apply in business application development quickly fall down when applied to scientific computing.  For example, there is no way that this

List<Tuple<int[], int[]>> expected = new List<Tuple<int[], int[]>>();

is more readable than this

var expected = new List<Tuple<int[], int[]>>();

In fact, it is less readable.  The use of complex data structures and algorithms forces a different set of naming conventions.  Applying FxCop or other framework naming conventions to scientific programming is as useful as applying scientific naming conventions to framework development.  If it is a screw, use a screwdriver.  If it is a nail, use a hammer…

2) I don't have a good way of comparing the results of a tuple of paired arrays for equivalence – there is certainly nothing out of the box in Microsoft.VisualStudio.TestTools.UnitTesting.  I toyed (briefly) with creating a method to compare equivalence in arrays, but I did not in the interest of time.  That would be a welcome addition to the testing namespace.
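
If I ever circle back to it, F#'s structural equality would make such a helper almost trivial; a hypothetical sketch:

// F#'s (=) is structural on arrays and tuples, so two lists of
// (antecedent, consequent) pairs can be compared element by element.
let pairsEquivalent (expected: (int[] * int[]) list) (actual: (int[] * int[]) list) =
    expected.Length = actual.Length &&
    List.forall2 (fun (a1, c1) (a2, c2) -> a1 = a2 && c1 = c2) expected actual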

Sure enough, the unit tests using McCaffrey's data all run green.

With step 3 knocked out, I now needed to determine the frequency of the antecedent in the transactions list.  This step is better broken down into a couple of sub-steps.  I used McCaffrey’s detailed example of 3,4,7 as proof of correctness in my unit tests:

[image]

I need a way of taking the antecedent of 3 and comparing it to all transactions (which are arrays) to see how often it appears.  As an additional layer of complexity, that 3 is not an int, it is an array (albeit an array of one).  I could not find an equivalent question on Stack Overflow (meaning I am probably asking the wrong question), so I went ahead and made a mental model where I would map the tryFindIndex function against each item of the subset and see if that value is in the original set.  The result is a tuple with the original value and the ordinal position in the set.  The key thing is that if the item was not found, it returns None.  So I just have to filter on that flag, and if the result of the filter is greater than zero, I know that something was not found and the function can return false.

[image]


In code, it pretty much looks like the way I just described it:

static member SetContainsSubset(set: int[], subset: int[]) =
    let notIncluded = subset
                        |> Seq.map(fun i -> i, set |> Seq.tryFindIndex(fun j -> j = i))
                        |> Seq.filter(fun (i,j) -> j = None)
                        |> Seq.toArray
    if notIncluded.Length > 0 then false else true

And I generated my unit tests out of the example too: 

[TestMethod]
public void SetContainsSubsetUsingMatched_ReturnsTrue()
{
    var set = new int[4] { 1, 3, 4, 7 };
    var subset = new int[3] { 3, 4, 7 };

    Boolean expected = true;
    Boolean actual = FS.AssociationRuleProgram2.SetContainsSubset(set, subset);

    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void SetContainsSubsetUsingUnMatched_ReturnsFalse()
{
    var set = new int[3] { 1, 4, 7 };
    var subset = new int[3] { 3, 4, 7 };

    Boolean expected = false;
    Boolean actual = FS.AssociationRuleProgram2.SetContainsSubset(set, subset);

    Assert.AreEqual(expected, actual);
}

With this supporting function ready, I can then apply it to an array and see how many trues I get.  That is the Count value in Figure 2 of the article.  Seq.map fits this task perfectly.

static member ItemSetCountInTransactions(itemSet: int[], transactions: List<int[]>) =
    transactions
        |> Seq.map(fun (t) -> t, AssociationRuleProgram2.SetContainsSubset(t, itemSet))
        |> Seq.filter(fun (t,f) -> f = true)
        |> Seq.length

And the subsequent unit test also runs green:

[TestMethod]
public void CountItemSetInTransactions_ReturnsExpected()
{
    List<int[]> transactions = new List<int[]>();
    transactions.Add(new int[] { 0, 3, 4, 11 });
    transactions.Add(new int[] { 1, 4, 5 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 0, 5 });
    transactions.Add(new int[] { 3, 5, 9 });
    transactions.Add(new int[] { 2, 3, 4, 7 });
    transactions.Add(new int[] { 2, 5, 8 });
    transactions.Add(new int[] { 0, 1, 2, 5, 10 });
    transactions.Add(new int[] { 2, 3, 5, 6, 7, 9 });

    var itemSet = new int[1] { 3 };

    Int32 expected = 6;
    Int32 actual = FS.AssociationRuleProgram2.ItemSetCountInTransactions(itemSet, transactions);

    Assert.AreEqual(expected, actual);
}

So with this in place, I am ready for the next column, the confidence column.  McCaffrey used a numerator of 3, which is shown here:

[image]

So I assume that this count is the number of times 3,4,7 shows up in the transaction set.  If so, the supporting function ItemSetCountInTransactions can also be used.  I created a unit test and it ran green:

[TestMethod]
public void CountItemSetInTransactionsUsing347_ReturnsThree()
{
    List<int[]> transactions = new List<int[]>();
    transactions.Add(new int[] { 0, 3, 4, 11 });
    transactions.Add(new int[] { 1, 4, 5 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 0, 5 });
    transactions.Add(new int[] { 3, 5, 9 });
    transactions.Add(new int[] { 2, 3, 4, 7 });
    transactions.Add(new int[] { 2, 5, 8 });
    transactions.Add(new int[] { 0, 1, 2, 5, 10 });
    transactions.Add(new int[] { 2, 3, 5, 6, 7, 9 });

    var itemSet = new int[3] { 3, 4, 7 };

    Int32 expected = 3;
    Int32 actual = FS.AssociationRuleProgram2.ItemSetCountInTransactions(itemSet, transactions);

    Assert.AreEqual(expected, actual);
}

So the last piece was to put it all together in the GetHighConfRules method.  I did not change the signature:

static member GetHighConfRules(frequentItemSets: List<int[]>, transactions: List<int[]>, minConfidencePct:float) =
    let returnValue = new List<Rule>()
    let combinations = frequentItemSets |> Seq.collect (fun (a) -> AssociationRuleProgram2.GetCombinations(a))
    combinations
        |> Seq.map(fun (i,a,c) -> i,a,c,AssociationRuleProgram2.ItemSetCountInTransactions(i,transactions))
        |> Seq.map(fun (i,a,c,fisc) -> a,c,fisc,AssociationRuleProgram2.ItemSetCountInTransactions(a,transactions))
        |> Seq.map(fun (a,c,fisc,cc) -> a,c,float fisc/float cc)
        |> Seq.filter(fun (a,c,cp) -> cp > minConfidencePct)
        |> Seq.iter(fun (a,c,cp) -> returnValue.Add(new Rule(a,c,cp)))
    returnValue

 

Note that I did add a helper function to get the combinations based on the length of the array:

static member GetCombinations(itemSet: int[]) =
    if itemSet.Length = 2 then AssociationRuleProgram2.GetCombinationsForDouble(itemSet)
    else AssociationRuleProgram2.GetCombinationsForTriple(itemSet)

And when I run that from the console:

[image]

So this is pretty close.  McCaffrey allows for inversion of the numbers in the array (3:4 is not the same as 4:3) and I do not – but his supporting detail does not show that, so I am not sure what the correct answer is.  In any event, this is pretty good.  The F# code can be refactored so that all combinations can be generated from an array.  In the meantime, here are all 43 lines of the program.

open System
open System.Collections.Generic

type AssociationRuleProgram2 =

    static member GetHighConfRules(frequentItemSets: List<int[]>, transactions: List<int[]>, minConfidencePct:float) =
        let returnValue = new List<Rule>()
        let combinations = frequentItemSets |> Seq.collect (fun (a) -> AssociationRuleProgram2.GetCombinations(a))
        combinations
            |> Seq.map(fun (i,a,c) -> i,a,c,AssociationRuleProgram2.ItemSetCountInTransactions(i,transactions))
            |> Seq.map(fun (i,a,c,fisc) -> a,c,fisc,AssociationRuleProgram2.ItemSetCountInTransactions(a,transactions))
            |> Seq.map(fun (a,c,fisc,cc) -> a,c,float fisc/float cc)
            |> Seq.filter(fun (a,c,cp) -> cp > minConfidencePct)
            |> Seq.iter(fun (a,c,cp) -> returnValue.Add(new Rule(a,c,cp)))
        returnValue

    static member ItemSetCountInTransactions(itemSet: int[], transactions: List<int[]>) =
        transactions
            |> Seq.map(fun (t) -> t, AssociationRuleProgram2.SetContainsSubset(t, itemSet))
            |> Seq.filter(fun (t,f) -> f = true)
            |> Seq.length

    static member SetContainsSubset(set: int[], subset: int[]) =
        let notIncluded = subset
                            |> Seq.map(fun i -> i, set |> Seq.tryFindIndex(fun j -> j = i))
                            |> Seq.filter(fun (i,j) -> j = None)
                            |> Seq.toArray
        if notIncluded.Length > 0 then false else true

    static member GetCombinations(itemSet: int[]) =
        if itemSet.Length = 2 then AssociationRuleProgram2.GetCombinationsForDouble(itemSet)
        else AssociationRuleProgram2.GetCombinationsForTriple(itemSet)

    static member GetCombinationsForDouble(itemSet: int[]) =
        let combinations = new List<int[]*int[]*int[]>()
        combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1]|])
        combinations

    static member GetCombinationsForTriple(itemSet: int[]) =
        let combinations = new List<int[]*int[]*int[]>()
        combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1];itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[1]|],[|itemSet.[0];itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[2]|],[|itemSet.[0];itemSet.[1]|])
        combinations.Add(itemSet, [|itemSet.[0];itemSet.[1]|],[|itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[0];itemSet.[2]|],[|itemSet.[1]|])
        combinations.Add(itemSet, [|itemSet.[1];itemSet.[2]|],[|itemSet.[0]|])
        combinations

Note how the code in the GetHighConfRules function matches almost one for one to the bullet points at the beginning of the post.  F# is a language where the code follows how you think, not the other way around.  Also note how the 43 lines of code compare to 136 lines of code in the C# example –> less noise, more signal.

 

Association Rule Learning Via F# (Part 2)

Continuing on the path of rewriting the association rule learning code found in this month's MSDN, I started with the next function on the list:

MakeAntecedent

Here is the original C# code:

public static int[] MakeAntecedent(int[] itemSet, int[] comb)
{
  // if item-set = (1 3 4 6 8) and combination = (0 2)
  // then antecedent = (1 4)
  int[] result = new int[comb.Length];
  for (int i = 0; i < comb.Length; ++i)
  {
    int idx = comb[i];
    result[i] = itemSet[idx];
  }
  return result;
}

 

and the F# code:

static member MakeAntecedent(itemSet:int[], comb:int[]) =
    comb |> Array.map(fun x -> itemSet.[x])

 

It is much easier to figure out what is going on via the F# code.  The function takes in two arrays.  Array #1 has values; array #2 has the indexes of array #1 that are needed.  Using Array.map, I return an array where the index number is swapped out for the actual value.  The unit tests run green:

[TestMethod]
public void MakeAntecedentCSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
    int[] combo = new int[2] { 0, 2 };
    int[] expected = new int[2] { 1, 4 };
    var actual = CS.AssociationRuleProgram.MakeAntecedent(itemSet, combo);
    Assert.AreEqual(expected.Length, actual.Length);
    Assert.AreEqual(expected[0], actual[0]);
    Assert.AreEqual(expected[1], actual[1]);
}

[TestMethod]
public void MakeAntecedentFSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
    int[] combo = new int[2] { 0, 2 };
    int[] expected = new int[2] { 1, 4 };
    var actual = FS.AssociationRuleProgram.MakeAntecedent(itemSet, combo);
    Assert.AreEqual(expected.Length, actual.Length);
    Assert.AreEqual(expected[0], actual[0]);
    Assert.AreEqual(expected[1], actual[1]);
}

 

MakeConsequent

Here is the original C# code:

public static int[] MakeConsequent(int[] itemSet, int[] comb)
{
  // if item-set = (1 3 4 6 8) and combination = (0 2)
  // then consequent = (3 6 8)
  int[] result = new int[itemSet.Length - comb.Length];
  int j = 0; // ptr into combination
  int p = 0; // ptr into result
  for (int i = 0; i < itemSet.Length; ++i)
  {
    if (j < comb.Length && i == comb[j]) // we are at an antecedent
      ++j; // so continue
    else
      result[p++] = itemSet[i]; // at a consequent so add it
  }
  return result;
}

 

Here is the F# code:

static member MakeConsequent(itemSet:int[], comb:int[]) =
    let isNotInComb x = not(Array.exists(fun elem -> elem = x) comb)
    itemSet
        |> Array.mapi(fun indexer value -> value, indexer)
        |> Array.filter(fun (value, indexer) -> isNotInComb indexer)
        |> Array.map(fun x -> fst x)

 

Again, it is easier to look at the F# code to figure out what is going on.  In this case, we have to take all of the items in the first array whose positions are not in the second array.  The trick is that the second array does not contain values to be checked, but rather index positions.  If you add the antecedent and the consequent, you have the total original array.
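
A quick REPL check (my own) against the example in the C# comments:

// Item-set (1 3 4 6 8) with combination (0 2) yields the consequent (3 6 8).
let consequent = AssociationRuleProgram.MakeConsequent([|1; 3; 4; 6; 8|], [|0; 2|])
// val consequent : int[] = [|3; 6; 8|]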

This code took me a bit of time to figure out because I kept trying to use the out-of-the-box Array features (including slicing) in F#, when it hit me that it would be much easier to create a tuple from the original array –> the value and the index.  I then look up the index in the second array to confirm it is not there, and filter down to the ones that are not.  The map function at the end removes the index part of the tuple because it is not needed anymore.

Sure enough, my unit tests ran green:

[TestMethod]
public void MakeConsequentCSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
    int[] combo = new int[2] { 0, 2 };
    int[] expected = new int[3] { 3, 6, 8 };
    var actual = CS.AssociationRuleProgram.MakeConsequent(itemSet, combo);
    Assert.AreEqual(expected.Length, actual.Length);
    Assert.AreEqual(expected[0], actual[0]);
    Assert.AreEqual(expected[1], actual[1]);
    Assert.AreEqual(expected[2], actual[2]);
}

[TestMethod]
public void MakeConsequentFSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
    int[] combo = new int[2] { 0, 2 };
    int[] expected = new int[3] { 3, 6, 8 };
    var actual = FS.AssociationRuleProgram.MakeConsequent(itemSet, combo);
    Assert.AreEqual(expected.Length, actual.Length);
    Assert.AreEqual(expected[0], actual[0]);
    Assert.AreEqual(expected[1], actual[1]);
    Assert.AreEqual(expected[2], actual[2]);
}

 

IndexOf

I then decided to tackle the remaining three functions in reverse order because they depend on each other (CountInTrans –> IsSubsetOf –> IndexOf).  IndexOf did not have any code comments or example cases, but the C# code is clear:

public static int IndexOf(int[] array, int item, int startIdx)
{
  for (int i = startIdx; i < array.Length; ++i)
  {
    if (i > item) return -1; // i is past where the target could possibly be
    if (array[i] == item) return i;
  }
  return -1;
}

 

What is even clearer is the F# code that does the same thing (yes, I am happy that FindIndex returns a -1 when not found, just as McCaffrey's does):

static member IndexOf(array:int[], item:int, startIdx:int) =
    Array.FindIndex(array, fun x -> x = item)

 

And I built some unit tests that run green and that I think reflect McCaffrey's intent:

[TestMethod]
public void IndexOfCSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[4] { 0, 1, 4, 5 };
    Int32 item = 1;
    Int32 startIndx = 1;

    int expected = 1;
    int actual = CS.AssociationRuleProgram.IndexOf(itemSet, item, startIndx);

    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void IndexOfFSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[4] { 0, 1, 4, 5 };
    Int32 item = 1;
    Int32 startIndx = 1;

    int expected = 1;
    int actual = FS.AssociationRuleProgram.IndexOf(itemSet, item, startIndx);

    Assert.AreEqual(expected, actual);
}

 

IsSubsetOf

In the C# implementation, IndexOf is called to keep track of where the search is currently pointed. 

public static bool IsSubsetOf(int[] itemSet, int[] trans)
{
  // 'trans' is an ordered transaction like [0 1 4 5 8]
  int foundIdx = -1;
  for (int j = 0; j < itemSet.Length; ++j)
  {
    foundIdx = IndexOf(trans, itemSet[j], foundIdx + 1);
    if (foundIdx == -1) return false;
  }
  return true;
}

In the F# one, that is not needed:

static member IsSubsetOf(itemSet:int[], trans:int[]) =
    let isInTrans x = (Array.exists(fun elem -> elem = x) trans)
    let filteredItemSet = itemSet
                            |> Array.map(fun value -> value, isInTrans value)
                            |> Array.filter(fun (value, trans) -> trans = false)
    if filteredItemSet.Length = 0 then true
        else false
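
As an aside, Array.forall would collapse the same check into a single expression; a minimal alternative sketch (mine, not part of the port):

// Every element of itemSet must appear somewhere in trans.
static member IsSubsetOf2(itemSet:int[], trans:int[]) =
    itemSet |> Array.forall(fun x -> trans |> Array.exists(fun elem -> elem = x))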

CountInTrans

Here is the original C# code, which uses the IsSubsetOf function:

public static int CountInTrans(int[] itemSet, List<int[]> trans, Dictionary<int[], int> countDict)
{
  // number of times itemSet occurs in transactions, using a lookup dict
  if (countDict.ContainsKey(itemSet) == true)
    return countDict[itemSet]; // use already computed count

  int ct = 0;
  for (int i = 0; i < trans.Count; ++i)
    if (IsSubsetOf(itemSet, trans[i]) == true)
      ++ct;
  countDict.Add(itemSet, ct);
  return ct;
}

And here is the F# code that also uses that subfunction:

static member CountInTrans(itemSet: int[], trans: List<int[]>, countDict: Dictionary<int[], int>) =
    let trans' = trans |> Seq.map(fun value -> value, AssociationRuleProgram.IsSubsetOf (itemSet, value))
    trans' |> Seq.filter(fun item -> snd item = true)
           |> Seq.length

 

 GetHighConfRules

With the subfunctions created and running green, I then tackled the point of the exercise –> GetHighConfRules.  The C# implementation is pretty verbose, and there are lots of things happening:

public static List<Rule> GetHighConfRules(List<int[]> freqItemSets, List<int[]> trans, double minConfidencePct)
{
  // generate candidate rules from freqItemSets, save rules that meet min confidence against transactions
  List<Rule> result = new List<Rule>();

  Dictionary<int[], int> itemSetCountDict = new Dictionary<int[], int>(); // count of item sets

  for (int i = 0; i < freqItemSets.Count; ++i) // each freq item-set generates multiple candidate rules
  {
    int[] currItemSet = freqItemSets[i]; // for clarity only
    int ctItemSet = CountInTrans(currItemSet, trans, itemSetCountDict); // needed for each candidate rule
    for (int len = 1; len <= currItemSet.Length - 1; ++len) // antecedent len = 1, 2, 3, . .
    {
      int[] c = NewCombination(len); // a mathematical combination

      while (c != null) // each combination makes a candidate rule
      {
        int[] ante = MakeAntecedent(currItemSet, c);
        int[] cons = MakeConsequent(currItemSet, c); // could defer this until known if needed

        int ctAntecendent = CountInTrans(ante, trans, itemSetCountDict); // use lookup if possible
        double confidence = (ctItemSet * 1.0) / ctAntecendent;

        if (confidence >= minConfidencePct) // we have a winner!
        {
          Rule r = new Rule(ante, cons, confidence);
          result.Add(r); // if freq item-sets are distinct, no dup rules ever created
        }
        c = NextCombination(c, currItemSet.Length);
      } // while each combination
    } // len each possible antecedent for curr item-set
  } // i each freq item-set

  return result;
} // GetHighConfRules

In the F# code, I decided to work inside out and get the rules for one item-set.  I think the code reads pretty clearly, with each step laid out:

static member GetHighConfRules(freqItemSets:List<int[]>, trans:List<int[]>, minConfidencePct:float) =
    let returnValue = new List<Rule>()
    freqItemSets
        |> Seq.map (fun i -> i, AssociationRuleProgram.CountInTrans'(i, trans))
        |> Seq.filter(fun (i,c) -> (float)c > minConfidencePct)
        |> Seq.map(fun (i,mcp) -> i, mcp, AssociationRuleProgram.MakeAntecedent(i, trans.[0]))
        |> Seq.map(fun (i,mcp,a) -> i, mcp, a, AssociationRuleProgram.MakeConsequent(i, trans.[0]))
        |> Seq.iter(fun (i,mcp,a,c) -> returnValue.Add(new Rule(a,c,mcp)))
    returnValue

I then attempted to put this block into a larger block (trans.[0]), but then I realized that I was going about this the wrong way.  Instead of using the C# code as my baseline, I needed to approach the problem from a functional viewpoint.  That will be the subject of my blog next week…

Association Rule Learning Via F# (Part 1)

I was reading the most recent MSDN when I came across this article.  How awesome is this?  McCaffrey did a great job explaining a really interesting area of analytics and I am loving the fact that MSDN is including articles about data analytics.  When I was reading the article, I ran across this sentence “The demo program is coded in C# but you should be able to refactor the code to other .NET languages such as Visual Basic or Iron Python without too much difficulty”  Iron Python?  Iron Python!  What about F#, the language that matches analytics the way peanut butter goes with chocolate?  Challenge accepted!

The first thing I did was to download his source code from here.  When I first opened it, I realized that the code would be a little hard to port because it is written from a scientific angle, not a business-application point of view.  34 FxCop errors in 259 lines of code confirmed this:

[image]

Also, there are tons of comments, which is very distracting. I generally hate comments, but I figure that since it is an MSDN article and it is supposed to explain what is going on, comments are OK. However, many of the comments could be refactored into more descriptive variable and method names. For example:

[image] [image]

In any event, let's look at the code. The first thing I did was change the C# project from a console app to a library and move the test data into another project. I then moved the console code to the UI. I also moved the Rule class into its own file, made sure the namespaces matched, and made the AssociationRuleProgram public.  Yup, it still runs:

[image] [image]

So then I created an F# library in the solution and set up the class with the single method:

[image]

A couple of things to note:

1) I left the parameter naming the same, even though it is not particularly intention-revealing

2) F# is type-inferred, so I don't have to assign types to the parameters

Next, I started looking at the supporting functions to GetHighConfRules.  Up first was the function called NewCombination.  Here is the side-by-side between the imperative style and the functional style:

[image] [image]

The next function, NextCombination, was more difficult for me to understand.  I stopped what I was doing and built a unit test project that proved correctness using the commented examples as the expected values.  I used one test project for both the C# and F# projects so I could see both side by side.  An interesting side note is that the unit test naming is different than usual –> instead of naming the class XXXXTests, where XXXX is the name of another class, XXXX is the function name that both classes are implementing:

So going back to the example,

[image]

I wrote two unit tests that match the two comments:

[image]

When I ran the tests, the first test passed but the second did not:

[image]

The problem with the failing test is that null is not being returned; rather, {3,4,6} is.  So now I have a problem: do I base the F# implementation on the code comments or the code itself?  I decided to base it on the code, because comments often lie but CODE DON'T LIE (thanks, 'Sheed).  I adjusted the unit test and got green.

One of the reasons the code is pretty hard to read/understand is the use of 'i', 'j', 'k', 'n' variables.  I went back to the article, and McCaffrey explains what is going on at the bottom left of page 60.  Another name for the function 'NextCombination' could be 'GetLexicographicalSuccessor', and the variable 'n' could be called 'numberOfPossibleItems'.  With that mental vocabulary in place, I went through the function and divided it into four parts:

1) Checking to see if the value of the first element is of a certain length

[image]

2) Creating a result array that is seeded with the values of the input array

[image]

3) Looping backwards to identify the 1st number in the array that will be adjusted

[image]

4) From that target element, looping forward and adjusting all subsequent items

[image]

#1 I will not worry about for now, and #2 is not needed in F#, so #3 is the first place to start.  What I need is a way of splitting the array into two parts.  Part 1 has the original values that will not change, and part 2 has the values that will change.  Seq.take and Seq.skip are perfect for this:

let i = Array.LastIndexOf(comb, n)
let i' = if i = -1 then 0 else i
let comb' = comb |> Seq.take(i') |> Seq.toArray
let comb'' = comb |> Seq.skip(i') |> Seq.toArray

Looking at #4, I now need to increment the values in part 2 by 1.  Seq.scan will work:

[image]

And then, putting part 1 and part 2 back together via Array.append, we have equivalence*:

[image] [image]

*Equivalence is defined by my unit tests, which both pass green.  I have no idea whether other inputs will work.  Note that the second unit test runs red, so I really think that the code is wrong and that the comment to return null is correct.  The value I am getting for (3;4;5)(5) is (3;4;1), which seems to make sense.

I am not crazy about these explanatory variables (comb', comb'', and comb''') but I am not sure how to combine them without sacrificing readability.  I definitely want to combine the i and i' into one statement…

I am not sure why Scan is returning four items in an array when I am passing in an array that has a length of three.  I am running out of time today, so I just hacked in a Seq.take.

I’ll continue this exercise in my blog next week.

 

Kaplan-Meier Survival Analysis Using F#

I was reading the most recent issue of MSDN a couple of days ago when I came across this article on doing a Kaplan-Meier survival analysis.  I thought the article was great, and I am excited that MSDN is starting to publish articles on data analytics.  However, I did notice that there wasn't any code in the article, which is odd, so I went to the online article, and others had a similar question:

[image]

I decided to implement a Kaplan-Meier survival (KMS) analysis using F#.  After reading the article a couple of times, I was still a bit unclear on how the KMS is implemented, and there does not seem to be a pre-rolled one in the standard .NET stat libraries out there.  I went on over to this site, where there was an excellent description of how the survival probability is calculated.  I went ahead and built an Excel spreadsheet to match the NIH one and then compared it to what Topol is doing:

[image]

Notice that Topol censored the data for the article.  If we only cared about the probability of crashes, then we would not censor the data for when the device was turned off.
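
For reference, the NIH description boils down to a running product: at each event time, multiply the current survival probability by (1 - events/at-risk).  Here is a minimal sketch of that core (my own, not from the article):

// Kaplan-Meier product-limit core: fold over (events, atRisk) pairs per
// event time. For app X the pairs are (1,9), (1,5), (1,3), (1,1), and the
// scan yields 1.0, 0.889, 0.711, 0.474, 0.0 - matching the article.
let survivalProbabilities (steps: (int * int) list) =
    steps |> List.scan (fun s (events, atRisk) -> s * (1.0 - float events / float atRisk)) 1.0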

So then I was ready to start coding, so I spun up a solution with an F# project for the analysis and a C# project for the testing.

[image]

I then loaded the datasets that Topol used into the unit test project:

[TestMethod]
public void EstimateForApplicationX_ReturnsExpected()
{
    var appX = new CrashMetaData[]
    {
        new CrashMetaData(0,1,false),
        new CrashMetaData(1,5,true),
        new CrashMetaData(2,5,false),
        new CrashMetaData(3,8,false),
        new CrashMetaData(4,10,false),
        new CrashMetaData(5,12,true),
        new CrashMetaData(6,15,false),
        new CrashMetaData(7,18,true),
        new CrashMetaData(8,21,false),
        new CrashMetaData(9,22,true),
    };
}

I could then wire up the unit tests to compare the output to the article and what I had come up with.

public void EstimateForApplicationX_ReturnsExpected()
{
    var appX = new CrashMetaData[]
    {
        new CrashMetaData(0,1,false),
        new CrashMetaData(1,5,true),
        new CrashMetaData(2,5,false),
        new CrashMetaData(3,8,false),
        new CrashMetaData(4,10,false),
        new CrashMetaData(5,12,true),
        new CrashMetaData(6,15,false),
        new CrashMetaData(7,18,true),
        new CrashMetaData(8,21,false),
        new CrashMetaData(9,22,true),
    };

    var expected = new SurvivalProbabilityData[]
    {
        new SurvivalProbabilityData(0,1.000),
        new SurvivalProbabilityData(5,.889),
        new SurvivalProbabilityData(12,.711),
        new SurvivalProbabilityData(18,.474),
        new SurvivalProbabilityData(22,.000)
    };

    KaplanMeierEstimator estimator = new KaplanMeierEstimator();
    var actual = estimator.CalculateSurvivalProbability(appX);

    Assert.AreSame(expected, actual);
}

 

However, one of the neat features of F# is the REPL, so I don’t need to keep running unit tests to prove correctness while I am proving out a concept.  I added equivalent test code at the beginning of the F# project so I could try out my ideas in the REPL:

type CrashMetaData = {userId: int; crashTime: int; crashed: bool}

type KaplanMeierAnalysis() =
    member this.GenerateXAppData () =
                    [|  {userId=0; crashTime=1; crashed=false};{userId=1; crashTime=5; crashed=true};
                        {userId=2; crashTime=5; crashed=false};{userId=3; crashTime=8; crashed=false};
                        {userId=4; crashTime=10; crashed=false};{userId=5; crashTime=12; crashed=true};
                        {userId=6; crashTime=15; crashed=false};{userId=7; crashTime=18; crashed=true};
                        {userId=8; crashTime=21; crashed=false};{userId=9; crashTime=22; crashed=true}|]

    member this.RunAnalysis(crashMetaData: array<CrashMetaData>) =

The first thing I did was duplicate the first three columns of the Excel spreadsheet:

let crashSequence = crashMetaData
                        |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                        | true -> 1
                                                                        | false -> 0),
                                                                 (match crash.crashed with
                                                                        | true -> 0
                                                                        | false -> 1))

 

In the REPL:

image
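(Reconstructing from the data above, the output is a sequence of (time, crashed, notCrashed) tuples: seq [(1, 0, 1); (5, 1, 0); (5, 0, 1); (8, 0, 1); …].)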

The fourth column is tricky because it is a cumulative calculation.  Instead of for..eaching in an imperative style, I took advantage of the functional language constructs to make the code much more readable.  Once I calculated that column outside of the base sequence, I added it back in via Seq.zip:

let cumulativeDevices = crashMetaData.Length

let crashSequence = crashMetaData
                        |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                        | true -> 1
                                                                        | false -> 0),
                                                                 (match crash.crashed with
                                                                        | true -> 0
                                                                        | false -> 1))
let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1) cumulativeDevices crashSequence

let crashSequence' = Seq.zip crashSequence availableDeviceSequence
                            |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)
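Note that Seq.scan yields one more element than its input because the seed comes first.  Seq.zip quietly handles that here: it stops at the end of the shorter sequence, so each row pairs with the device count at the start of its interval (10, 9, 8, …) and the trailing zero is dropped.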

 

In the REPL:

image

The next two columns were a snap: they were just calculations based on the existing values:

let cumulativeDevices = crashMetaData.Length

let crashSequence = crashMetaData
                        |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                        | true -> 1
                                                                        | false -> 0),
                                                                 (match crash.crashed with
                                                                        | true -> 0
                                                                        | false -> 1))
let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1) cumulativeDevices crashSequence

let crashSequence' = Seq.zip crashSequence availableDeviceSequence
                            |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)

let crashSequence'' = crashSequence'
                            |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c/ float cumld, 1.-(float c/ float cumld))
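For example, at time 5 one of the nine devices still running crashes, so the death probability is 1/9 = 0.111 and the interval survival probability is 1 - 1/9 = 0.889.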

 

The last column was another cumulative calculation, so I added another accumulator and used Seq.scan and Seq.zip.

let cumulativeDevices = crashMetaData.Length
let cumulativeSurvivalProbability = 1.

let crashSequence = crashMetaData
                        |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                        | true -> 1
                                                                        | false -> 0),
                                                                 (match crash.crashed with
                                                                        | true -> 0
                                                                        | false -> 1))
let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1) cumulativeDevices crashSequence

let crashSequence' = Seq.zip crashSequence availableDeviceSequence
                            |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)

let crashSequence'' = crashSequence'
                            |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c/ float cumld, 1.-(float c/ float cumld))

let survivalProbabilitySequence = Seq.scan(fun cumulativeSurvivalProbability (t,c,nc,cumld,dp,sp) -> cumulativeSurvivalProbability * sp) cumulativeSurvivalProbability crashSequence''
let survivalProbabilitySequence' = survivalProbabilitySequence
                                            |> Seq.skip 1
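The Seq.skip 1 drops the 1.0 seed that Seq.scan emits as its first element, so each row lines up with the cumulative probability through that row (see note 1 below).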

The last step was to map all of the columns and only output what was in the article.  The final answer is:

namespace ChickenSoftware.SurvivalAnalysis

type CrashMetaData = {userId: int; crashTime: int; crashed: bool}
type public SurvivalProbabilityData = {crashTime: int; survivalProbability: float}

type KaplanMeierEstimator() =
    member this.CalculateSurvivalProbability(crashMetaData: array<CrashMetaData>) =
            let cumulativeDevices = crashMetaData.Length
            let cumulativeSurvivalProbability = 1.

            let crashSequence = crashMetaData
                                    |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                                    | true -> 1
                                                                                    | false -> 0),
                                                                             (match crash.crashed with
                                                                                    | true -> 0
                                                                                    | false -> 1))
            let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1) cumulativeDevices crashSequence

            let crashSequence' = Seq.zip crashSequence availableDeviceSequence
                                        |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)

            let crashSequence'' = crashSequence'
                                        |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c/ float cumld, 1.-(float c/ float cumld))

            let survivalProbabilitySequence = Seq.scan(fun cumulativeSurvivalProbability (t,c,nc,cumld,dp,sp) -> cumulativeSurvivalProbability * sp) cumulativeSurvivalProbability crashSequence''
            let survivalProbabilitySequence' = survivalProbabilitySequence
                                                        |> Seq.skip 1

            let crashSequence''' = Seq.zip crashSequence'' survivalProbabilitySequence'
                                        |> Seq.map(fun ((t,c,nc,cumld,dp,sp),cumlsp) -> t,c,nc,cumld,dp,sp,cumlsp)
            crashSequence'''
                    |> Seq.filter(fun (t,c,nc,cumld,dp,sp,cumlsp) -> c=1)
                    |> Seq.map(fun (t,c,nc,cumld,dp,sp,cumlsp) -> t,System.Math.Round(cumlsp,3))
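To exercise it in the REPL, something like this works (a sketch, reusing the Application X data from the tests):

let appX =
    [|  {userId=0; crashTime=1; crashed=false}; {userId=1; crashTime=5; crashed=true};
        {userId=2; crashTime=5; crashed=false}; {userId=3; crashTime=8; crashed=false};
        {userId=4; crashTime=10; crashed=false}; {userId=5; crashTime=12; crashed=true};
        {userId=6; crashTime=15; crashed=false}; {userId=7; crashTime=18; crashed=true};
        {userId=8; crashTime=21; crashed=false}; {userId=9; crashTime=22; crashed=true} |]

KaplanMeierEstimator().CalculateSurvivalProbability(appX)
|> Seq.iter (printfn "%A")
// (5, 0.889); (12, 0.711); (18, 0.474); (22, 0.0)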

image

And this matches the article (almost exactly).  The article also has a row for iteration zero, which I did not bake in.  Instead of fixing my code, I changed the unit test and removed that first row.  In any event, I ran the test and it ran red, even though the values are identical.  The culprit is Assert.AreSame(), which checks reference equality rather than value equality; something like CollectionAssert.AreEqual() on matching types would be closer to what I want.  I would take the time to fix it, but it is 75 degrees on a Sunday afternoon and I want to go play catch with my kids…

image

Note it also matches the other data set Topol has in the article:

image

In any event, this code reads pretty much the way I was thinking about the problem: each column of the Excel spreadsheet has a one-to-one correspondence to an F# code block.  I did use explanatory variables liberally, which might offend the more advanced functional programmers, but taking each step in turn really helped me focus on getting each step correct before going to the next one.

1) I had to offset the cumulativeSurvivalProbability by one because the calculation is how many crashed on a day compared to how many were working at the start of the day.  Seq.scan emits the accumulator for the next row of the sequence and I need it for the current row; since Seq.scan always yields the seed as its first element and there is no overload that omits it, the Seq.skip 1 does the job.

2) I adopted the functional convention of using ticks to denote different physical manifestations of the same logical concept (crashSequence “became” crashSequence’, etc.).  Since everything is immutable by default in F#, this kind of naming convention makes a lot of sense to me.  However, I can see it quickly becoming unwieldy.

3) I could not figure out how to operate on the base tuple, so instead I used a couple of supporting sequences and then put everything together using Seq.zip.  I assume there is a more efficient way to do that.

4) One of the knocks against functional/scientific programming is that values are named poorly.  To combat that, I used full names in my tuples to start.  After a certain point, though, the names got too unwieldy, so I resorted to their initials.  I am not sure what the right answer is here, or even if there is a right answer.


Microsoft Language Stack Analogy

I am getting ready for my presentations at Charlotte Code Camp next Saturday.  My F# session is a business-case-driven one: reasons why the average C# developer might want to take a look at F#.  I break the session down into five sections: F# is integrated, fast, expressive, bug-resistant, and analytical.  In the “fast” piece, I am going to make the analogy of Visual Studio to a garage.

Consider a man who lives in a nice house in a suburban neighborhood with a three-car garage.  Every morning when he gets ready for his commute to work, he opens the door that goes from his house into the garage, and sitting there in the first bay is a minivan.

image

Now there is nothing wrong with the minivan: it is dependable, all of the neighbors drive one, and it does many things pretty well.  However, consider that right next to the minivan, never used, is a Ferrari.  Our suburban programmer has heard about the Ferrari, and has perhaps even glanced at it curiously as he pulls out in the morning, but he:

  • Doesn’t see the point of driving it because the minivan suits him just fine
  • Is afraid to try driving it because he doesn’t drive stick and taking the time to learn would slow him down
  • Doesn’t want to drive it because then he would have to explain to his project-manager wife why he is driving around town in such a car

So the Ferrari sits unused.  To round out the analogy, in the third bay is a helicopter that no one in their right mind will touch.  Finally, there is a junked car around back that no one uses anymore but that he has to keep because it is too expensive to haul to the junkyard.

image

 

So this is what happens to the majority of .NET developers when they open their garage called Visual Studio.  They go with the comfortable language of the C# minivan, ignoring the power and expressiveness of the F# Ferrari and certainly not touching the C++ helicopter.  I picked a helicopter for C++ because helicopters can go places cars cannot, are notoriously difficult to pilot, and when they crash, it is often spectacular and brings down others with them.  The junked car is VB.NET, which makes me sad on certain days…

Also, since C# 2.0, the minivan has tried to become more Ferrari-like.  It has added a turbo engine called LINQ, the var keyword, anonymous types, and the dynamic keyword, all in an attempt to become the one minivan to rule them all.

image

I don’t know much about Roslyn, but from what I have seen, I think I can take away language syntax and the code will still compile.  If so, I will try to write a C# program that removes all curly braces and semicolons and replaces the var keyword with let.  Is it still C# then?

OT: can you tell which session I am doing at the Hartford Code Camp in 2 weeks?

image

(And no, I did not submit in all caps.  I guess the organizer is very excited about the topic?)

F# and List manipulations

I am preparing for a Beginning F# dojo for TRINUG tomorrow, and I decided to do a presentation on Seq.groupBy, Seq.countBy, and Seq.sumBy for tuples.  The differences among these constructs are not apparent from the names alone, and I think having a working knowledge of them is indispensable when doing any kind of list analysis.

I started with a basic list like so:

let data = [("A",1);("A",3);("B",2);("C",1)]

I then ran a GroupBy through the REPL and got the following results:

let grouping = data
                |> Seq.groupBy(fun (letter,number) -> letter)
                |> Seq.iter (printfn "%A")

("A", seq [("A", 1); ("A", 3)])
("B", seq [("B", 2)])
("C", seq [("C", 1)])

I then ran a CountBy through the REPL and got the following results:

let counting = data
                |> Seq.countBy(fun (letter,number) -> letter)
                |> Seq.iter (printfn "%A")

("A", 2)
("B", 1)
("C", 1)

I then ran a SumBy through the REPL and got the following results:

let summing = data
                |> Seq.sumBy(fun (letter,number) -> number)
                |> printfn "%A"

7

Now the fun begins.  I combined a GroupBy and a CountBy through the REPL and got the following results:

let groupingAndCounting = data
                        |> Seq.groupBy(fun (letter,number) -> letter)
                        |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.countBy snd))
                        |> Seq.iter (printfn "%A")

("A", seq [(1, 1); (3, 1)])
("B", seq [(2, 1)])
("C", seq [(1, 1)])

Next I combined a GroupBy and a SumBy through the REPL and got the following results:

let groupingAndSumming = data
                            |> Seq.groupBy(fun (letter,number) -> letter)
                            |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.sumBy snd))
                            |> Seq.iter (printfn "%A")

("A", 4)
("B", 2)
("C", 1)

I then combined all three:

let groupingAndCountingSummed = data
                                |> Seq.groupBy(fun (letter,number) -> letter)
                                |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.countBy snd))
                                |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.sumBy snd))
                                |> Seq.iter (printfn "%A")

("A", 2)
("B", 1)
("C", 1)

With this in hand, I created a way of both counting and summing the second value of a tuple, which is a pretty common task:

let revisedData =
    let summed = data
                    |> Seq.groupBy(fun (letter,number) -> letter)
                    |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.sumBy snd))
    let counted = data
                    |> Seq.groupBy(fun (letter,number) -> letter)
                    |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.countBy snd))
                    |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.sumBy snd))
    Seq.zip summed counted
                    |> Seq.map(fun ((letter,s),(_,c)) -> letter,s,c)
                    |> Seq.iter (printfn "%A")

("A", 4, 2)
("B", 2, 1)
("C", 1, 1)

Finally, Mathias pointed out that I could use this as an entry point to Deedle, which is a really good idea…