Consuming Sky Biometry’s Image Recognition API

I was looking at Sky Biometry a couple of days ago to do some facial recognition.  I am not sure how many other companies out there do this, but working with their API from F# was a great experience. 

Sky Biometry's API speaks both JSON and XML, and they also provide an example client library in C#.  I decided to use F# and the JSON type provider to explore their API and the quality of their recognition algorithms.  They have a good documentation page, and it didn't take very long to get my account up and running and start making API calls.

I thought it would be fun to use an example from the first Terminator movie, when Ah-nold went around looking for Sarah Connor.  I picked two images from the movie series.  The first is the photograph of Sarah Connor taken at the end of the first movie.  I know that the Terminator didn't have this photograph in the movie but, working off-script, pretend that it did.  The second image is from the first movie, when she is in a bar.  So if the Terminator was powered by Sky Biometry and found Sarah in the bar, how closely would it match her to the first photo?

[image]

The first thing I did was to fire up an F# project in Visual Studio and start scripting the API calls that I would need to do facial recognition.

[image]

With all of the calls working in the REPL, I then moved the code into my module.  I declared the types at the top and then created a class that could be consumed by external projects.

[image]

You will notice that the type providers are using a local copy of the JSON to infer the types in the module.  I ran into the problem where the type provider using the web call was not capturing the full graph in the type definition, so I took the JSON and made it local.  This led to an interesting problem because an F# project out of the box in Visual Studio does not support adding folders.  To get around that, I went to my file system and added a folder.

[image]

I then created files in the folder that correspond to the type providers:

[image]

Next, I unloaded the F# project and edited it,

[image] [image]

and in the project file, I added those files to the ItemGroup that brings in all of the files from disk:

<ItemGroup>
  <Compile Include="SkyBiometryImageComparer.fs" />
  <None Include="Script.fsx" />
  <None Include="packages.config" />
  <None Include="app.config" />
  <None Include="SkyBiometryImageJson/AddTags.json" />
  <None Include="SkyBiometryImageJson/FaceDetection.json" />
  <None Include="SkyBiometryImageJson/FaceRecognition.json" />
  <None Include="SkyBiometryImageJson/FaceTraining.json" />
  <None Include="SkyBiometryImageJson/RemoveTags.json" />
</ItemGroup>

Once the project is reloaded, the folder and the files show up in Solution Explorer.

[image]

 

In each of the .json files, I pasted in the results from the service call that I did in the REPL.  For example,

[image]

So then I swapped them out in the module, and now I get the full graph:

[image]
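
Since the screenshots can be hard to read, here is a minimal sketch of what the swapped-out declarations might look like, assuming FSharp.Data's JsonProvider and the sample-file names from the ItemGroup above (the exact names in my module may differ):

open FSharp.Data

// Type providers pointed at the local sample files; the Load calls at runtime
// still hit the live service.
type skybiometryFaceDetection = JsonProvider<"SkyBiometryImageJson/FaceDetection.json">
type skybiometryAddTags = JsonProvider<"SkyBiometryImageJson/AddTags.json">
type skybiometryFaceTraining = JsonProvider<"SkyBiometryImageJson/FaceTraining.json">
type skybiometryFaceRecognition = JsonProvider<"SkyBiometryImageJson/FaceRecognition.json">
type skybiometryRemoveTags = JsonProvider<"SkyBiometryImageJson/RemoveTags.json">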

With that out of the way, I implemented a one-to-one mapping of supporting methods to the API calls:

member this.DetectFace(imageUri:string) =
    let stringBuilder = new StringBuilder()
    stringBuilder.Append(skybiometryUri) |> ignore
    stringBuilder.Append("/fc/faces/detect.json?urls=") |> ignore
    stringBuilder.Append(imageUri) |> ignore
    stringBuilder.Append("&api_key=") |> ignore
    stringBuilder.Append(apiKey) |> ignore
    stringBuilder.Append("&api_secret=") |> ignore
    stringBuilder.Append(apiSecret) |> ignore
    let faceDetection = skybiometryFaceDetection.Load(stringBuilder.ToString())
    faceDetection.Photos.[0].Tags.[0].Tid

member this.SaveTag(tid:string) =
    let stringBuilder = new StringBuilder()
    stringBuilder.Append(skybiometryUri) |> ignore
    stringBuilder.Append("/fc/tags/save.json?uid=") |> ignore
    stringBuilder.Append(uid) |> ignore
    stringBuilder.Append("&tids=") |> ignore
    stringBuilder.Append(tid) |> ignore
    stringBuilder.Append("&api_key=") |> ignore
    stringBuilder.Append(apiKey) |> ignore
    stringBuilder.Append("&api_secret=") |> ignore
    stringBuilder.Append(apiSecret) |> ignore
    let tags = skybiometryAddTags.Load(stringBuilder.ToString())
    tags.Status

member this.TrainFace() =
    let stringBuilder = new StringBuilder()
    stringBuilder.Append(skybiometryUri) |> ignore
    stringBuilder.Append("/fc/faces/train.json?uids=") |> ignore
    stringBuilder.Append(uid) |> ignore
    stringBuilder.Append("&api_key=") |> ignore
    stringBuilder.Append(apiKey) |> ignore
    stringBuilder.Append("&api_secret=") |> ignore
    stringBuilder.Append(apiSecret) |> ignore
    let training = skybiometryFaceTraining.Load(stringBuilder.ToString())
    training.Status

member this.RecognizeFace(imageUri:string) =
    let stringBuilder = new StringBuilder()
    stringBuilder.Append(skybiometryUri) |> ignore
    stringBuilder.Append("/fc/faces/recognize.json?uids=") |> ignore
    stringBuilder.Append(uid) |> ignore
    stringBuilder.Append("&urls=") |> ignore
    stringBuilder.Append(imageUri) |> ignore
    stringBuilder.Append("&api_key=") |> ignore
    stringBuilder.Append(apiKey) |> ignore
    stringBuilder.Append("&api_secret=") |> ignore
    stringBuilder.Append(apiSecret) |> ignore
    let recognition = skybiometryFaceRecognition.Load(stringBuilder.ToString())
    if recognition.Photos.[0].Tags |> Seq.length > 0 then
        recognition.Photos.[0].Tags.[0].Attributes.Face.Confidence
    else
        0

member this.RemoveTag(tid:string) =
    let stringBuilder = new StringBuilder()
    stringBuilder.Append(skybiometryUri) |> ignore
    stringBuilder.Append("/fc/tags/remove.json?tids=") |> ignore
    stringBuilder.Append(tid) |> ignore
    stringBuilder.Append("&api_key=") |> ignore
    stringBuilder.Append(apiKey) |> ignore
    stringBuilder.Append("&api_secret=") |> ignore
    stringBuilder.Append(apiSecret) |> ignore
    let tagRemoval = skybiometryRemoveTags.Load(stringBuilder.ToString())
    tagRemoval.Status

The StringBuilder makes the code pretty verbose, but I understand that it is the most efficient way to aggregate strings, so I went with it.  Also note that this is happy-path programming with no error checking; I assume that the JSON coming back is well formed.
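
For what it's worth, here is a sketch of the same URI built with sprintf instead (my alternative, not what is in the project); for a handful of short strings, the readability may be worth more than StringBuilder's efficiency:

// The DetectFace URI from above, rewritten with a format string.
let detectUri =
    sprintf "%s/fc/faces/detect.json?urls=%s&api_key=%s&api_secret=%s"
        skybiometryUri imageUri apiKey apiSecret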

In any event, with the supporting methods done, it was just an exercise of calling each one in turn:

member this.CalculateFacialRecognitionConfidence(baseUri:string, comparisonUri:string) =
    let tid = this.DetectFace(baseUri)
    if this.SaveTag(tid) = "success" then
        if this.TrainFace() = "success" then
            let confidence = this.RecognizeFace(comparisonUri)
            this.RemoveTag(tid) |> ignore
            confidence
        else
            0
    else
        0

And hopping over to a C# unit test project, I can call the F# code and run some tests.  I created a test for each of the supporting methods and then three happy-path tests for CalculateFacialRecognitionConfidence: comparing the same image should be 100, comparing two completely unrelated images should be 0, and our exercise of identifying Sarah Connor should be better than 50/50.  Here is an example of the 100% match use case:

[image]

And here is the actual test of finding Sarah Connor:

var imageComparer = new SkyBiometryImageComparer(skyBiometryUri, uid, apiKey, apiSecret);
String basePhotoUri = "http://img2.wikia.nocookie.net/__cb20080226002749/terminator/images/thumb/7/75/T2_sarah_polaroid.jpg/300px-T2_sarah_polaroid.jpg";
String targetPhotoUri = "http://emandrews2.files.wordpress.com/2013/04/sarah-connor-scared.jpg";
Int32 confidence = imageComparer.CalculateFacialRecognitionConfidence(basePhotoUri, targetPhotoUri);

bool expected = true;
bool actual = confidence > 50;
Assert.AreEqual(expected, actual);
}

It runs green (the actual value is 58% confidence for identifying Sarah Connor).

[image]

My guess is that the Terminator would start shooting… first at Sarah Connor and then at C#.  The C# project that Sky Biometry provides has two classes of 1,600 and 1,420 lines of code to achieve functional equivalence with the 93 lines of F# that I wrote (and the vast majority of that F# is dealing with the StringBuilder).

 

TRINUG F# Analytics SIG Prep

I am putting the finishing touches on the lab that the F#/data analytics group will do on Wednesday. We are working through Johan Astborg's awesome book F# for Quantitative Finance.  Since the group is pretty much all C# with some bi-curious F#ers, I created a solution that mixes both C# and F#.  I put F# in the places where the language really shines: the data layer and the analytical layer.  I put C# in the places where C# has popular support: the UI and unit tests.

[image]

In this lab, I am introducing script files and the REPL for the first time as a way to 'prove out' concepts.  Once the idea is sufficiently hashed out in the REPL, the code is moved into a source file and corresponding unit tests are created.  For example, in the data layer, we use the CSV type provider to pull down the Yahoo data and save and retrieve it from disk:

#r @"C:\Users\Jamie\Desktop\NewCo.OptionsTradingProgram.Solution\packages\FSharp.Data.2.0.8\lib\portable-net40+sl5+wp8+win8\FSharp.Data.dll"
#r @"C:\Users\Jamie\Desktop\NewCo.OptionsTradingProgram.Solution\packages\Newtonsoft.Json.6.0.3\lib\net45\Newtonsoft.Json.dll"

open System
open FSharp.Data
open Newtonsoft.Json
open System.IO

type csvProvider = CsvProvider<"http://ichart.finance.yahoo.com/table.csv?s=MSFT">

type YahooStockProvider() =
    member this.GetData(stockSymbol: string) =
            csvProvider.Load("http://ichart.finance.yahoo.com/table.csv?s=" + stockSymbol).Rows

type FileSystemStockProvider() =
    member this.PutData(filePath:string, stockData) =
        let serializedData = stockData
                                |> Seq.map(fun row -> JsonConvert.SerializeObject(row))
        File.WriteAllLines(filePath, serializedData)

    member this.GetData(filePath:string) =
        let serializedData = File.ReadAllLines(filePath)
        serializedData
            |> Seq.map(fun row -> JsonConvert.DeserializeObject<(DateTime*float*float*float*float*int*float)>(row))
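
To give a feel for how the two providers compose, here is a hypothetical REPL session (the file path is made up):

// Round trip: pull MSFT rows from Yahoo, persist them as JSON lines, read them back.
let yahoo = YahooStockProvider()
let fileSystem = FileSystemStockProvider()
fileSystem.PutData(@"C:\Data\MSFT.txt", yahoo.GetData("MSFT"))
let restored = fileSystem.GetData(@"C:\Data\MSFT.txt")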

In the analytics module, we spend a good amount of time implementing mathematical calculations and getting comfortable with some of the differences between how imperative C# might implement a solution and how F# would:

open System.Collections.Generic

let testData = [1.0 .. 6.0]
let sum = testData |> Seq.sum
let average = testData |> Seq.average
let min = testData |> Seq.min
let max = testData |> Seq.max
let evens = testData |> Seq.filter(fun number -> number % 2. = 0.)
let addOne = testData |> Seq.map(fun number -> number + 1.)

//http://www.mathsisfun.com/data/standard-deviation.html
let variance (source:IEnumerable<double>) =
    let mean = Seq.average source
    let deltas = Seq.map(fun x -> pown (x - mean) 2) source
    Seq.average deltas

let standardDeviation(values:IEnumerable<double>) =
    sqrt(variance(values))

let movingAverage(values:IEnumerable<double>, windowSize:int) =
    values
        |> Seq.windowed (windowSize)
        |> Seq.map(fun window -> window |> Seq.average)

let movingStandardDeviation(values:IEnumerable<double>, windowSize:int) =
    values
        |> Seq.windowed (windowSize)
        |> Seq.map(fun window -> window |> standardDeviation)

let bollingerBands (values:IEnumerable<double>, windowSize:int) =
    let movingAverage = movingAverage(values, windowSize)
    let movingStandardDeviation = movingStandardDeviation(values, windowSize)
    let movingStandardDeviation' = movingStandardDeviation |> Seq.map(fun window -> window * 2.)
    Seq.zip movingAverage movingStandardDeviation'
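
As a quick REPL sanity check of the pipeline above (my own numbers, not part of the lab):

// With a window of 3 over 1.0 .. 10.0, the first tuple should be the moving
// average of [1.0; 2.0; 3.0] (2.0) paired with twice its standard deviation
// (about 1.63).
let testValues = [1.0 .. 10.0] |> List.toSeq
let bands = bollingerBands(testValues, 3)
bands |> Seq.iter(fun (average, band) -> printfn "%f +/- %f" average band)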

Finally, we have a WPF application that does some basic charting and graphing.  The point of this lab is not about building a snazzy UI.  The UI (and, in fact, the data layer) are just plug-ins to the heart of the application – the analytical module.  Also, a majority of TRINUG's content focuses on the UI, so there is plenty of that already.  In any event, here is the code-behind in C#:

private void GoButton_Click(object sender, RoutedEventArgs e)
{
    FileSystemStockProvider provider = new FileSystemStockProvider(@"C:\Data\TEST.txt");
    var stockPrices = provider.GetData().Take(20);
    this.StockPriceDataGrid.ItemsSource = stockPrices;

    var adjustedClosePrices = from stockPrice in stockPrices
                              select stockPrice.Item7;

    var dates = from stockPrice in stockPrices.Skip(2)
                select new { stockPrice.Item1 };

    var calculations = new Calculations();
    var movingAverage = calculations.MovingAverage(adjustedClosePrices, 3);
    var movingAverages = dates.Zip(movingAverage, (d, p) => new { date = d.Item1, price = p });

    var bollingerBands = calculations.BollingerBands(adjustedClosePrices, 3);
    var upperBandBands = dates.Zip(bollingerBands, (d, bb) => new { date = d.Item1, upperBand = bb.Item1 + (bb.Item2 * 2) });
    var lowerBandBands = dates.Zip(bollingerBands, (d, bb) => new { date = d.Item1, lowerBand = bb.Item1 + (bb.Item2 * 2) * -1 });

    this.stockPriceLineGraph.DependentValuePath = "price";
    this.stockPriceLineGraph.IndependentValuePath = "date";
    this.stockPriceLineGraph.ItemsSource = movingAverages;

    this.stockPriceLineGraph2.DependentValuePath = "upperBand";
    this.stockPriceLineGraph2.IndependentValuePath = "date";
    this.stockPriceLineGraph2.ItemsSource = upperBandBands;

    this.stockPriceLineGraph3.DependentValuePath = "lowerBand";
    this.stockPriceLineGraph3.IndependentValuePath = "date";
    this.stockPriceLineGraph3.ItemsSource = lowerBandBands;
}

 

I am toying with not using LINQ and doing this all via imperative code, which would really drive home the point and power of a functional approach.  Here is the UI when running:

[image]

This was a lot of fun to do, and I am looking forward to the next lab, which is implementing the remainder of Astborg's book.

 

 

F# For Quantitative Finance

I picked up Johan Astborg's F# for Quantitative Finance a couple of days ago and I thought it was great.  It is a practical, hands-on way of working with F#.  It is so good, I think I am going to use it as the basis for a couple of Triangle .NET User Group F# SIGs.  One of the great things about the book is the code samples – they expose features of the F# language in a way that allows beginners to understand.  So in the trade-off between clarity and cleverness, Astborg errs on the side of clarity.  The only thing I think I would change would be to remove Chapter 2.  Chapter 2 suffers the same problem that many "intro to X computer language" books have –> pages of language features with small, independent samples.  This kind of book does not help many (if any) people learn the language easily, and it reads about as well as the MSDN component docs.

I did notice that Astborg uses the typical scientific-computing approach to variable naming.  For example, on page 141 he creates a function like this:

member this.black_scholes call_put_flag s x t r v =
    let cnd x =
        let a1 = 0.3198253
        let a2 = -0.35653782
        let a3 = 1.781477937
        let a4 = -1.821255978
        let a5 = 1.330274429
        let pi = 3.141592654
        let l = abs(x)
        let k = 1.0/(1.0 + 0.2316419)
        let w = (1.0)
        if x < 0.0 then 1.0 - w else w

    let d1 = (log(s/x) + (r*v*v*0.5)*t)/(v*sqrt(t))
    let d2 = d1 - v*sqrt(t)
    match call_put_flag with
        | Put -> x*exp(-r*t)*cnd(-d2)-s*cnd(-d1)
        | Call -> s*cnd(d1)-x*exp(-r*t)*cnd(d2)

Concentrating only on the function's arguments, what are s, x, t, r, and v?  It is not immediately apparent, and you have to read the comments above the function to discover that s is the stock price, etc.  I think there is a better way, and I wonder if FxCop (if it ever comes to F#) would agree with me.  For example, this is a slightly better version:

member this.black_scholes call_put_flag stockPrice strikePriceOfOption timeToExpirationInYears riskFreeInterestRate volatility =
    let cnd x =

The problem is that the number of arguments makes the naming unwieldy.  A third, and much preferred, option is to pass in a data structure:

type PutCallFlag = Put | Call

type black_scholes_input_data =
    {stockPrice:float;
     strikePriceOfOption:float;
     timeToExpirationInYears:int;
     riskFreeInterestRate:float;
     volatility:float}

type StockAnalyzer() =
    member this.black_scholes (call_put_flag:PutCallFlag, inputData:black_scholes_input_data) =
        let cnd x =
            let a1 = 0.3198253
            let a2 = -0.35653782
            let a3 = 1.781477937
            let a4 = -1.821255978
            let a5 = 1.330274429
            let pi = 3.141592654
            let l = abs(x)
            let k = 1.0/(1.0 + 0.2316419)
            let w = (1.0)
            if x < 0.0 then 1.0 - w else w

        let d1 = (log(inputData.stockPrice/inputData.strikePriceOfOption)

I doubt there is a performance hit, and the code becomes much more readable.  It also forces the really long functions to be broken up, and each piece becomes more testable.  If there is not a corresponding domain for the input arguments, then that naming will have to do.
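
To make the point concrete, here is a hypothetical call site for the record-based signature (the numbers are made up):

// The field names travel with the values, so the call is self-documenting.
let input =
    {stockPrice = 60.0;
     strikePriceOfOption = 65.0;
     timeToExpirationInYears = 1;
     riskFreeInterestRate = 0.08;
     volatility = 0.3}
let analyzer = StockAnalyzer()
let price = analyzer.black_scholes (Call, input)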

 

Using Subsets for Association Rule Learning

I finished up writing the association rule program from MSDN in F# last week.  One of the things bothering me about the way I implemented the algorithms is that I hard-coded the combinations (antecedent and consequent) from the item-sets:

  1. static member GetCombinationsForDouble(itemSet: int[]) =
  2.     let combinations =  new List<int[]*int[]*int[]>()
  3.     combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1]|])
  4.     combinations
  5.  
  6. static member GetCombinationsForTriple(itemSet: int[]) =
  7.     let combinations =  new List<int[]*int[]*int[]>()
  8.     combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1];itemSet.[2]|])
  9.     combinations.Add(itemSet, [|itemSet.[1]|],[|itemSet.[0];itemSet.[2]|])
  10.     combinations.Add(itemSet, [|itemSet.[2]|],[|itemSet.[0];itemSet.[1]|])
  11.     combinations.Add(itemSet, [|itemSet.[0];itemSet.[1]|],[|itemSet.[2]|])
  12.     combinations.Add(itemSet, [|itemSet.[0];itemSet.[2]|],[|itemSet.[1]|])
  13.     combinations.Add(itemSet, [|itemSet.[1];itemSet.[2]|],[|itemSet.[0]|])
  14.     combinations

I thought it would be a fun exercise to make a function that returns the combinations for any number of item-sets.  My first several attempts failed because I started off with the wrong vocabulary.  I spent several days trying to determine how to create all of the combinations and/or permutations from the item-set.  It then hit me that what I was really looking for was all subsets, and what do you know, there are some excellent examples out there.

So if I was going to use the yield and yield-bang method of calculating the subsets in my class, I first needed to remove the rec and just let the class call itself:

static member Subsets s =
    set [
        yield s
        for e in s do
            yield! AssociationRuleProgram2.Subsets (Set.remove e s) ]
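
A quick REPL check (my own) shows the recursion terminating with all 2^n subsets, empty set included:

// set [1; 2; 3] comes back as 8 subsets.
let subsets = AssociationRuleProgram2.Subsets (set [1; 2; 3])
// val subsets : Set<Set<int>> = set [set []; set [1]; set [1; 2]; set [1; 2; 3]; ...]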

I then needed a way of translating the item-set, which is an int array, into a set and back again.  Fortunately, the Set module has ofArray and toArray functions, so I wrote my code exactly the way I just described the problem:

static member GetAntecedentAndConsequent(itemSet: int[]) =
    let combinations = new List<int[]*int[]*int[]>()
    let itemSet' = Set.ofArray itemSet
    let subSets = AssociationRuleProgram2.Subsets itemSet'
    let subSets' = Set.toArray subSets
    let subSets'' = subSets' |> Array.map(fun s -> Set.toArray s)
    let subSets''' = subSets'' |> Array.map(fun s -> Seq.toArray s, AssociationRuleProgram2.GetAntecedentAndConsequent s)

 

Note that I had to call toArray twice because Subsets returns a Set<Set<int>>.

In any event, I then needed a way of splitting the item-set into antecedents and consequents (called combinations) based on the current subset.  I toyed around with a couple of different ways of solving the problem when I stumbled upon a way that makes a lot of sense to me.  I changed the item-set from an array of int to an array of tuple<int*bool>.  If the element is in the subset, then the bool flag is true; if not, it is false.  Then I apply a Seq.filter to the array and separate it out into antecedents and consequents.

static member GetCombination array subArray =
    let array' = array |> Seq.map(fun i -> i, subArray |> Array.exists(fun j -> i = j))
    let antecedent = array' |> Seq.filter(fun (i,j) -> j = true) |> Seq.toArray
    let consequent = array' |> Seq.filter(fun (i,j) -> j = false) |> Seq.toArray
    let antecedent' = antecedent |> Seq.map(fun (i,j) -> i)
    let consequent' = consequent |> Seq.map(fun (i,j) -> i)
    Seq.toArray antecedent', Seq.toArray consequent'
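
A quick REPL check (my own) of the split:

// Splitting the item-set [|3; 4; 7|] on the subset [|4; 7|] yields the
// antecedent [|4; 7|] and the consequent [|3|].
let antecedent, consequent = AssociationRuleProgram2.GetCombination [|3; 4; 7|] [|4; 7|]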

The major downside of this approach is that I am using Array.exists for my filter flag, so if there is more than one of the same value in the item-set, it does not work.  However, the original example had each item-set being unique, so I think I am OK.

So with these two methods, I now have a way of dealing with any number of item-sets.  Interestingly, the amount of code (even with my verbose F#) is significantly less than the C# equivalent and closer to how I actually think.

 

 

 

Association Rule Problem: Part 3

After spending a couple of weeks working through the imperative code, I decided to approach the problem from an F#/functional point of view.  Going back to the original article, there are several steps that McCaffrey walks through:

  • Get a series of transactions
  • Get the frequent item-sets for the transactions
  • For each item-set, get all possible combinations.  Each combination is broken into an antecedent and consequent
  • Count the frequency of each antecedent in all of the transactions
  • If the ratio of the combination's frequency to the antecedent's frequency (the confidence) is greater than the minimum confidence level, include it in the final set

For the purposes of this article, Step #1 and Step #2 were already done.  My code starts with step #3.  Instead of for..eaching and if..thening my way through the item-sets, I decided to look at how permutations and combinations are done in F#.  Interestingly, one of the first articles on permutations and combinations on Google is from McCaffrey in MSDN from four years ago.  Unfortunately, that article was of limited use because the code is decidedly non-functional, so it might as well have been written in C# (this was pointed out in the comments).  Going to Stack Overflow, there are plenty of good examples of getting combinations in F#, there and elsewhere.  After playing with the code samples for a bit (my favorite one was this), it hit me that the ordinal positions are the same for an array of size X.  So going back to McCaffrey's example, there are only item-sets of length 2 and 3.  Therefore, I can hard-code the results and leave the actual calculation for another time.

static member GetCombinationsForDouble(itemSet: int[]) =
    let combinations = new List<int[]*int[]*int[]>()
    combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1]|])
    combinations

static member GetCombinationsForTriple(itemSet: int[]) =
    let combinations = new List<int[]*int[]*int[]>()
    combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1];itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[1]|],[|itemSet.[0];itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[2]|],[|itemSet.[0];itemSet.[1]|])
    combinations.Add(itemSet, [|itemSet.[0];itemSet.[1]|],[|itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[0];itemSet.[2]|],[|itemSet.[1]|])
    combinations.Add(itemSet, [|itemSet.[1];itemSet.[2]|],[|itemSet.[0]|])
    combinations

I used a tuple to represent the antecedent array and consequent array values.  I then spun up a unit test to compare results based on McCaffrey’s detailed example:

[TestMethod]
public void GetValuesForATriple_ReturnsExpectedValue()
{
    var expected = new List<Tuple<int[], int[]>>();
    expected.Add(Tuple.Create<int[], int[]>(new int[1] { 3 }, new int[2] { 4, 7 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[1] { 4 }, new int[2] { 3, 7 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[1] { 7 }, new int[2] { 3, 4 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[2] { 3, 4 }, new int[1] { 7 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[2] { 3, 7 }, new int[1] { 4 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[2] { 4, 7 }, new int[1] { 3 }));

    var itemSet = new int[3] { 3, 4, 7 };
    var actual = FS.AssociationRuleProgram2.GetCombinationsForTriple(itemSet);

    Assert.AreEqual(expected.Count, actual.Count);
}

A couple of things to note about the unit test:

1) The rules about variable naming and whatnot that apply in business application development quickly fall down when applied to scientific computing.  For example, there is no way that this

List<Tuple<int[], int[]>> expected = new List<Tuple<int[], int[]>>();

is more readable than this

var expected = new List<Tuple<int[], int[]>>();

In fact, it is less readable.  The use of complex data structures and algorithms forces a different set of naming conventions.  Applying FxCop or other framework naming conventions to scientific programming is as useful as applying scientific naming conventions to framework development.  If it is a screw, use a screwdriver.  If it is a nail, use a hammer…

2) I don't have a good way of comparing the results of a tuple of paired arrays for equivalence – there is certainly nothing out of the box in Microsoft.VisualStudio.TestTools.UnitTesting.  I toyed (briefly) with creating a method to compare equivalence in arrays, but I did not in the interest of time.  That would be a welcome addition to the testing namespace.
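
If I ever circle back to it, F#'s structural equality would make such a helper almost trivial; a hypothetical sketch:

// F#'s (=) is structural on arrays and tuples, so two lists of
// (antecedent, consequent) pairs can be compared element by element.
let pairsEquivalent (expected: (int[] * int[]) list) (actual: (int[] * int[]) list) =
    expected.Length = actual.Length &&
    List.forall2 (fun (a1, c1) (a2, c2) -> a1 = a2 && c1 = c2) expected actual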

Sure enough, the unit tests using McCaffrey's data all run green.

With step 3 knocked out, I now needed to determine the frequency of the antecedent in the transactions list.  This step is better broken down into a couple of sub-steps.  I used McCaffrey’s detailed example of 3,4,7 as proof of correctness in my unit tests:

[image]

I need a way of taking the antecedent of 3 and comparing it to all transactions (which are arrays) to see how often it appears.  As an additional layer of complexity, that 3 is not an int, it is an array (albeit an array of one).  I could not find an equivalent question on Stack Overflow (meaning I am probably asking the wrong question), so I went ahead and made a mental model where I would map the tryFindIndex function against each item of the subset and see if that value is in the original set.  The result is a tuple with the original value and the ordinal position in the set.  The key thing is that if the item was not found, it returns None.  So I just have to filter on that flag, and if the result of the filter is greater than zero, I know that something was not found and the function can return false.

[image]


In code, it pretty much looks like the way I just described it:

static member SetContainsSubset(set: int[], subset: int[]) =
    let notIncluded = subset
                        |> Seq.map(fun i -> i, set |> Seq.tryFindIndex(fun j -> j = i))
                        |> Seq.filter(fun (i,j) -> j = None)
                        |> Seq.toArray
    if notIncluded.Length > 0 then false else true

And I generated my unit tests out of the example too: 

[TestMethod]
public void SetContainsSubsetUsingMatched_ReturnsTrue()
{
    var set = new int[4] { 1, 3, 4, 7 };
    var subset = new int[3] { 3, 4, 7 };

    Boolean expected = true;
    Boolean actual = FS.AssociationRuleProgram2.SetContainsSubset(set, subset);

    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void SetContainsSubsetUsingUnMatched_ReturnsFalse()
{
    var set = new int[3] { 1, 4, 7 };
    var subset = new int[3] { 3, 4, 7 };

    Boolean expected = false;
    Boolean actual = FS.AssociationRuleProgram2.SetContainsSubset(set, subset);

    Assert.AreEqual(expected, actual);
}

With this supporting function ready, I can then apply it to an array and see how many trues I get.  That is the Count value in Figure 2 of the article.  Seq.map fits this task perfectly.

static member ItemSetCountInTransactions(itemSet: int[], transactions: List<int[]>) =
    transactions
        |> Seq.map(fun (t) -> t, AssociationRuleProgram2.SetContainsSubset(t, itemSet))
        |> Seq.filter(fun (t,f) -> f = true)
        |> Seq.length

And the subsequent unit test also runs green:

[TestMethod]
public void CountItemSetInTransactions_ReturnsExpected()
{
    List<int[]> transactions = new List<int[]>();
    transactions.Add(new int[] { 0, 3, 4, 11 });
    transactions.Add(new int[] { 1, 4, 5 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 0, 5 });
    transactions.Add(new int[] { 3, 5, 9 });
    transactions.Add(new int[] { 2, 3, 4, 7 });
    transactions.Add(new int[] { 2, 5, 8 });
    transactions.Add(new int[] { 0, 1, 2, 5, 10 });
    transactions.Add(new int[] { 2, 3, 5, 6, 7, 9 });

    var itemSet = new int[1] { 3 };

    Int32 expected = 6;
    Int32 actual = FS.AssociationRuleProgram2.ItemSetCountInTransactions(itemSet, transactions);

    Assert.AreEqual(expected, actual);
}

So with this in place, I am ready for the next column, the confidence column.  McCaffrey used a numerator of 3, which is shown here:

[image]

So I assume that this count is the number of times 3,4,7 shows up in the transaction set.  If so, the supporting function ItemSetCountInTransactions can also be used.  I created a unit test and it ran green:

[TestMethod]
public void CountItemSetInTransactionsUsing347_ReturnsThree()
{
    List<int[]> transactions = new List<int[]>();
    transactions.Add(new int[] { 0, 3, 4, 11 });
    transactions.Add(new int[] { 1, 4, 5 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 0, 5 });
    transactions.Add(new int[] { 3, 5, 9 });
    transactions.Add(new int[] { 2, 3, 4, 7 });
    transactions.Add(new int[] { 2, 5, 8 });
    transactions.Add(new int[] { 0, 1, 2, 5, 10 });
    transactions.Add(new int[] { 2, 3, 5, 6, 7, 9 });

    var itemSet = new int[3] { 3, 4, 7 };

    Int32 expected = 3;
    Int32 actual = FS.AssociationRuleProgram2.ItemSetCountInTransactions(itemSet, transactions);

    Assert.AreEqual(expected, actual);
}

So the last piece was to put it all together in the GetHighConfRules method.  I did not change the signature:

static member GetHighConfRules(frequentItemSets: List<int[]>, transactions: List<int[]>, minConfidencePct:float) =
    let returnValue = new List<Rule>()
    let combinations = frequentItemSets |> Seq.collect (fun (a) -> AssociationRuleProgram2.GetCombinations(a))
    combinations
        |> Seq.map(fun (i,a,c) -> i,a,c,AssociationRuleProgram2.ItemSetCountInTransactions(i,transactions))
        |> Seq.map(fun (i,a,c,fisc) -> a,c,fisc,AssociationRuleProgram2.ItemSetCountInTransactions(a,transactions))
        |> Seq.map(fun (a,c,fisc,cc) -> a,c,float fisc/float cc)
        |> Seq.filter(fun (a,c,cp) -> cp > minConfidencePct)
        |> Seq.iter(fun (a,c,cp) -> returnValue.Add(new Rule(a,c,cp)))
    returnValue

 

Note that I did add a helper function to get the combinations based on the length of the array:

static member GetCombinations(itemSet: int[]) =
    if itemSet.Length = 2 then AssociationRuleProgram2.GetCombinationsForDouble(itemSet)
    else AssociationRuleProgram2.GetCombinationsForTriple(itemSet)

And when I run that from the console:

[image]

So this is pretty close.  McCaffrey allows for inversion of the numbers in the array (3:4 is not the same as 4:3) and I do not – but his supporting detail does not show that, so I am not sure what the correct answer is.  In any event, this is pretty good.  The F# code can be refactored so that all combinations can be generated from an array.  In the meantime, here are all 43 lines of the program.

open System
open System.Collections.Generic

type AssociationRuleProgram2 =

    static member GetHighConfRules(frequentItemSets: List<int[]>, transactions: List<int[]>, minConfidencePct:float) =
        let returnValue = new List<Rule>()
        let combinations = frequentItemSets |> Seq.collect (fun (a) -> AssociationRuleProgram2.GetCombinations(a))
        combinations
            |> Seq.map(fun (i,a,c) -> i,a,c,AssociationRuleProgram2.ItemSetCountInTransactions(i,transactions))
            |> Seq.map(fun (i,a,c,fisc) -> a,c,fisc,AssociationRuleProgram2.ItemSetCountInTransactions(a,transactions))
            |> Seq.map(fun (a,c,fisc,cc) -> a,c,float fisc/float cc)
            |> Seq.filter(fun (a,c,cp) -> cp > minConfidencePct)
            |> Seq.iter(fun (a,c,cp) -> returnValue.Add(new Rule(a,c,cp)))
        returnValue

    static member ItemSetCountInTransactions(itemSet: int[], transactions: List<int[]>) =
        transactions
            |> Seq.map(fun (t) -> t, AssociationRuleProgram2.SetContainsSubset(t, itemSet))
            |> Seq.filter(fun (t,f) -> f = true)
            |> Seq.length

    static member SetContainsSubset(set: int[], subset: int[]) =
        let notIncluded = subset
                            |> Seq.map(fun i -> i, set |> Seq.tryFindIndex(fun j -> j = i))
                            |> Seq.filter(fun (i,j) -> j = None)
                            |> Seq.toArray
        if notIncluded.Length > 0 then false else true

    static member GetCombinations(itemSet: int[]) =
        if itemSet.Length = 2 then AssociationRuleProgram2.GetCombinationsForDouble(itemSet)
        else AssociationRuleProgram2.GetCombinationsForTriple(itemSet)

    static member GetCombinationsForDouble(itemSet: int[]) =
        let combinations = new List<int[]*int[]*int[]>()
        combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1]|])
        combinations

    static member GetCombinationsForTriple(itemSet: int[]) =
        let combinations = new List<int[]*int[]*int[]>()
        combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1];itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[1]|],[|itemSet.[0];itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[2]|],[|itemSet.[0];itemSet.[1]|])
        combinations.Add(itemSet, [|itemSet.[0];itemSet.[1]|],[|itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[0];itemSet.[2]|],[|itemSet.[1]|])
        combinations.Add(itemSet, [|itemSet.[1];itemSet.[2]|],[|itemSet.[0]|])
        combinations

Note how the code in the GetHighConfRules function matches almost one for one to the bullet points at the beginning of the post.  F# is a language where the code follows how you think, not the other way around.  Also note how the 43 lines of code compare to 136 lines of code in the C# example –> less noise, more signal.

 

Association Rule Learning Via F# (Part 2)

Continuing on the path of rewriting the association rule learning code found in this month's MSDN, I started with the next function on the list:

MakeAntecedent

Here is the original C# code:

public static int[] MakeAntecedent(int[] itemSet, int[] comb)
{
  // if item-set = (1 3 4 6 8) and combination = (0 2)
  // then antecedent = (1 4)
  int[] result = new int[comb.Length];
  for (int i = 0; i < comb.Length; ++i)
  {
    int idx = comb[i];
    result[i] = itemSet[idx];
  }
  return result;
}

 

and the F# code:

static member MakeAntecedent(itemSet:int[], comb:int[]) =
    comb |> Array.map(fun x -> itemSet.[x])

 

It is much easier to figure out what is going on via the F# code.  The function takes in two arrays.  Array #1 has values; array #2 has the indexes of array #1 that are needed.  Using Array.map, I return an array where the index number is swapped out for the actual value.  The unit tests run green:

[TestMethod]
public void MakeAntecedentCSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
    int[] combo = new int[2] { 0, 2 };
    int[] expected = new int[2] { 1, 4 };
    var actual = CS.AssociationRuleProgram.MakeAntecedent(itemSet, combo);
    Assert.AreEqual(expected.Length, actual.Length);
    Assert.AreEqual(expected[0], actual[0]);
    Assert.AreEqual(expected[1], actual[1]);
}

[TestMethod]
public void MakeAntecedentFSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
    int[] combo = new int[2] { 0, 2 };
    int[] expected = new int[2] { 1, 4 };
    var actual = FS.AssociationRuleProgram.MakeAntecedent(itemSet, combo);
    Assert.AreEqual(expected.Length, actual.Length);
    Assert.AreEqual(expected[0], actual[0]);
    Assert.AreEqual(expected[1], actual[1]);
}

 

MakeConsequent

Here is the original C# code:

public static int[] MakeConsequent(int[] itemSet, int[] comb)
{
  // if item-set = (1 3 4 6 8) and combination = (0 2)
  // then consequent = (3 6 8)
  int[] result = new int[itemSet.Length - comb.Length];
  int j = 0; // ptr into combination
  int p = 0; // ptr into result
  for (int i = 0; i < itemSet.Length; ++i)
  {
    if (j < comb.Length && i == comb[j]) // we are at an antecedent
      ++j; // so continue
    else
      result[p++] = itemSet[i]; // at a consequent so add it
  }
  return result;
}

 

Here is the F# code:

static member MakeConsequent(itemSet:int[], comb:int[]) =
    let isNotInComb x = not(Array.exists(fun elem -> elem = x) comb)
    itemSet
        |> Array.mapi(fun indexer value -> value, indexer)
        |> Array.filter(fun (value, indexer) -> isNotInComb indexer)
        |> Array.map(fun x -> fst x)

 

Again, it is easier to look at the F# code to figure out what is going on.  In this case, we have to take all of the items in the first array whose positions are not in the second array.  The trick is that the second array does not contain values to be checked, but rather index positions.  If you add the antecedent and the consequent, you have the total original array.
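
A quick REPL check (my own) against the example in the C# comments:

// Item-set (1 3 4 6 8) with combination (0 2) yields the consequent (3 6 8).
let consequent = AssociationRuleProgram.MakeConsequent([|1; 3; 4; 6; 8|], [|0; 2|])
// val consequent : int[] = [|3; 6; 8|]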

This code took me a bit of time to figure out because I kept trying to use the out-of-the-box Array features (including slicing) in F#, when it hit me that it would be much easier to create a tuple from the original array –> the value and the index.  I then look up the index in the second array to confirm it is not there, and filter down to the ones that are not.  The map function at the end removes the index part of the tuple because it is not needed anymore.

Sure enough, my unit tests ran green:

[TestMethod]
public void MakeConsequentCSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
    int[] combo = new int[2] { 0, 2 };
    int[] expected = new int[3] { 3, 6, 8 };
    var actual = CS.AssociationRuleProgram.MakeConsequent(itemSet, combo);
    Assert.AreEqual(expected.Length, actual.Length);
    Assert.AreEqual(expected[0], actual[0]);
    Assert.AreEqual(expected[1], actual[1]);
    Assert.AreEqual(expected[2], actual[2]);
}

[TestMethod]
public void MakeConsequentFSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[5] { 1, 3, 4, 6, 8 };
    int[] combo = new int[2] { 0, 2 };
    int[] expected = new int[3] { 3, 6, 8 };
    var actual = FS.AssociationRuleProgram.MakeConsequent(itemSet, combo);
    Assert.AreEqual(expected.Length, actual.Length);
    Assert.AreEqual(expected[0], actual[0]);
    Assert.AreEqual(expected[1], actual[1]);
    Assert.AreEqual(expected[2], actual[2]);
}

 

IndexOf

I then decided to tackle the remaining three functions in reverse order because they depend on each other (CountInTrans –> IsSubsetOf –> IndexOf).  IndexOf did not have any code comments or example cases, but the C# code is clear:

public static int IndexOf(int[] array, int item, int startIdx)
{
  for (int i = startIdx; i < array.Length; ++i)
  {
    if (i > item) return -1; // i is past where the target could possibly be
    if (array[i] == item) return i;
  }
  return -1;
}

 

What is even clearer is the F# code that does the same thing (yes, I am happy that FindIndex returns a -1 when not found, just as McCaffrey's does):

static member IndexOf(array:int[], item:int, startIdx:int) =
    Array.FindIndex(array, fun x -> x = item)

 

And I built some unit tests that run green and that I think reflect McCaffrey's intent:

[TestMethod]
public void IndexOfCSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[4] { 0, 1, 4, 5 };
    Int32 item = 1;
    Int32 startIndx = 1;

    int expected = 1;
    int actual = CS.AssociationRuleProgram.IndexOf(itemSet, item, startIndx);

    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void IndexOfFSUsingExample_ReturnsExpectedValue()
{
    int[] itemSet = new int[4] { 0, 1, 4, 5 };
    Int32 item = 1;
    Int32 startIndx = 1;

    int expected = 1;
    int actual = FS.AssociationRuleProgram.IndexOf(itemSet, item, startIndx);

    Assert.AreEqual(expected, actual);
}

 

IsSubsetOf

In the C# implementation, IndexOf is called to keep track of where the search is currently pointed. 

public static bool IsSubsetOf(int[] itemSet, int[] trans)
{
  // 'trans' is an ordered transaction like [0 1 4 5 8]
  int foundIdx = -1;
  for (int j = 0; j < itemSet.Length; ++j)
  {
    foundIdx = IndexOf(trans, itemSet[j], foundIdx + 1);
    if (foundIdx == -1) return false;
  }
  return true;
}

In the F# one, that is not needed:

static member IsSubsetOf(itemSet:int[], trans:int[]) =
    let isInTrans x = (Array.exists(fun elem -> elem = x) trans)
    let filteredItemSet = itemSet
                            |> Array.map(fun value -> value, isInTrans value)
                            |> Array.filter(fun (value, trans) -> trans = false)
    if filteredItemSet.Length = 0 then true
        else false
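
As an aside, Array.forall would collapse the same check into a single expression; a minimal alternative sketch (mine, not part of the port):

// Every element of itemSet must appear somewhere in trans.
static member IsSubsetOf2(itemSet:int[], trans:int[]) =
    itemSet |> Array.forall(fun x -> trans |> Array.exists(fun elem -> elem = x))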

CountInTrans

Here is the original C# code, which uses the IsSubsetOf function:

public static int CountInTrans(int[] itemSet, List<int[]> trans, Dictionary<int[], int> countDict)
{
  // number of times itemSet occurs in transactions, using a lookup dict
  if (countDict.ContainsKey(itemSet) == true)
    return countDict[itemSet]; // use already computed count

  int ct = 0;
  for (int i = 0; i < trans.Count; ++i)
    if (IsSubsetOf(itemSet, trans[i]) == true)
      ++ct;
  countDict.Add(itemSet, ct);
  return ct;
}

And here is the F# code that also uses that subfunction:

static member CountInTrans(itemSet: int[], trans: List<int[]>, countDict: Dictionary<int[], int>) =
    let trans' = trans |> Seq.map(fun value -> value, AssociationRuleProgram.IsSubsetOf (itemSet, value))
    trans' |> Seq.filter(fun item -> snd item = true)
           |> Seq.length

 

 GetHighConfRules

With the subfunctions created and running green, I then tackled the point of the exercise –> GetHighConfRules.  The C# implementation is pretty verbose, and there are lots of things happening:

public static List<Rule> GetHighConfRules(List<int[]> freqItemSets, List<int[]> trans, double minConfidencePct)
{
  // generate candidate rules from freqItemSets, save rules that meet min confidence against transactions
  List<Rule> result = new List<Rule>();

  Dictionary<int[], int> itemSetCountDict = new Dictionary<int[], int>(); // count of item sets

  for (int i = 0; i < freqItemSets.Count; ++i) // each freq item-set generates multiple candidate rules
  {
    int[] currItemSet = freqItemSets[i]; // for clarity only
    int ctItemSet = CountInTrans(currItemSet, trans, itemSetCountDict); // needed for each candidate rule
    for (int len = 1; len <= currItemSet.Length - 1; ++len) // antecedent len = 1, 2, 3, . .
    {
      int[] c = NewCombination(len); // a mathematical combination

      while (c != null) // each combination makes a candidate rule
      {
        int[] ante = MakeAntecedent(currItemSet, c);
        int[] cons = MakeConsequent(currItemSet, c); // could defer this until known if needed

        int ctAntecendent = CountInTrans(ante, trans, itemSetCountDict); // use lookup if possible
        double confidence = (ctItemSet * 1.0) / ctAntecendent;

        if (confidence >= minConfidencePct) // we have a winner!
        {
          Rule r = new Rule(ante, cons, confidence);
          result.Add(r); // if freq item-sets are distinct, no dup rules ever created
        }
        c = NextCombination(c, currItemSet.Length);
      } // while each combination
    } // len each possible antecedent for curr item-set
  } // i each freq item-set

  return result;
} // GetHighConfRules

In the F# code, I decided to work inside out and get the rules for one item-set.  I think the code reads pretty clearly, with each step laid out:

static member GetHighConfRules(freqItemSets:List<int[]>, trans:List<int[]>, minConfidencePct:float) =
    let returnValue = new List<Rule>()
    freqItemSets
        |> Seq.map (fun i -> i, AssociationRuleProgram.CountInTrans'(i, trans))
        |> Seq.filter(fun (i,c) -> (float)c > minConfidencePct)
        |> Seq.map(fun (i,mcp) -> i, mcp, AssociationRuleProgram.MakeAntecedent(i, trans.[0]))
        |> Seq.map(fun (i,mcp,a) -> i, mcp, a, AssociationRuleProgram.MakeConsequent(i, trans.[0]))
        |> Seq.iter(fun (i,mcp,a,c) -> returnValue.Add(new Rule(a,c,mcp)))
    returnValue

I then attempted to put this block into a larger block (trans.[0]), but then I realized that I was going about this the wrong way.  Instead of using the C# code as my baseline, I needed to approach the problem from a functional viewpoint.  That will be the subject of my blog next week…

Association Rule Learning Via F# (Part 1)

I was reading the most recent MSDN when I came across this article.  How awesome is this?  McCaffrey did a great job explaining a really interesting area of analytics and I am loving the fact that MSDN is including articles about data analytics.  When I was reading the article, I ran across this sentence “The demo program is coded in C# but you should be able to refactor the code to other .NET languages such as Visual Basic or Iron Python without too much difficulty”  Iron Python?  Iron Python!  What about F#, the language that matches analytics the way peanut butter goes with chocolate?  Challenge accepted!

The first thing I did was to download his source code from here.  When I first opened it, I realized that the code would be a little hard to port because it is written from a scientific angle, not a business-application point of view.  34 FxCop errors in 259 lines of code confirmed this:

[image]

Also, there are tons of comments, which is very distracting. I generally hate comments, but I figure that since it is an MSDN article and it is supposed to explain what is going on, comments are OK. However, many of the comments could be refactored into more descriptive variable and method names. For example:

[image] [image]

In any event, let's look at the code. The first thing I did was change the C# project from a console app to a library and move the test data into another project. I then moved the console code to the UI. I also moved the Rule class into its own file, made sure the namespaces matched, and made the AssociationRuleProgram public.  Yup, it still runs:

[image] [image]

So then I created an F# library in the solution and set up the class with the single method:

[image]

A couple of things to note:

1) I left the parameter naming the same, even though it is not particularly intention-revealing

2) F# is type-inferred, so I don't have to assign types to the parameters

Next, I started looking at the supporting functions to GetHighConfRules.  Up first was the function called NewCombination.  Here is the side-by-side between the imperative style and the functional style:

[image] [image]

The next function, NextCombination, was more difficult for me to understand.  I stopped what I was doing and built a unit test project that proved correctness using the commented examples as the expected values.  I used one test project for both the C# and F# projects so I could see both side by side.  An interesting side note is that the unit test naming is different than usual –> instead of naming the class XXXXTests, where XXXX is the name of another class, XXXX is the function name that both classes are implementing:

So going back to the example,

[image]

I wrote two unit tests that match the two comments:

[image]

When I ran the tests, the first test passed but the second did not:

[image]

The problem with the failing test is that null is not being returned; rather, {3,4,6} is.  So now I have a problem: do I base the F# implementation on the code comments or the code itself?  I decided to base it on the code, because comments often lie but CODE DON'T LIE (thanks, 'Sheed).  I adjusted the unit test and got green.

One of the reasons the code is pretty hard to read/understand is the use of 'i', 'j', 'k', 'n' variables.  I went back to the article, and McCaffrey explains what is going on at the bottom left of page 60.  Another name for the function 'NextCombination' could be 'GetLexicographicalSuccessor', and the variable 'n' could be called 'numberOfPossibleItems'.  With that mental vocabulary in place, I went through the function and divided it into four parts:

1) Checking to see if the value of the first element is of a certain length

[image]

2) Creating a result array that is seeded with the values of the input array

[image]

3) Looping backwards to identify the 1st number in the array that will be adjusted

[image]

4) From that target element, looping forward and adjusting all subsequent items

[image]

#1 I will not worry about for now, and #2 is not needed in F#, so #3 is the first place to start.  What I need is a way of splitting the array into two parts.  Part 1 has the original values that will not change, and part 2 has the values that will change.  Seq.take and Seq.skip are perfect for this:

let i = Array.LastIndexOf(comb, n)
let i' = if i = -1 then 0 else i
let comb' = comb |> Seq.take(i') |> Seq.toArray
let comb'' = comb |> Seq.skip(i') |> Seq.toArray

Looking at #4, I now need to increment the values in part 2 by 1.  Seq.scan will work:

[image]

And then, putting part 1 and part 2 back together via Array.append, we have equivalence*:

[image] [image]

*Equivalence is defined by my unit tests, which both pass green.  I have no idea whether other inputs will work.  Note that the second unit test runs red, so I really think that the code is wrong and that the comment to return null is correct.  The value I am getting for (3;4;5)(5) is (3;4;1), which seems to make sense.

I am not crazy about these explanatory variables (comb', comb'', and comb''') but I am not sure how to combine them without sacrificing readability.  I definitely want to combine the i and i' into one statement…

I am not sure why Scan is returning four items in an array when I am passing in an array that has a length of three.  I am running out of time today, so I just hacked in a Seq.take.

I’ll continue this exercise in my blog next week.

 

Kaplan-Meier Survival Analysis Using F#

I was reading the most recent issue of MSDN a couple of days ago when I came across this article on doing a Kaplan-Meier survival analysis.  I thought the article was great, and I am excited that MSDN is starting to publish articles on data analytics.  However, I did notice that there wasn't any code in the article, which is odd, so I went to the online article, and others had a similar question:

[image]

I decided to implement a Kaplan-Meier survival (KMS) analysis using F#.  After reading the article a couple of times, I was still a bit unclear on how the KMS is implemented, and there does not seem to be a pre-rolled one in the standard .NET stat libraries out there.  I went on over to this site, where there was an excellent description of how the survival probability is calculated.  I went ahead and built an Excel spreadsheet to match the NIH one and then compared it to what Topol is doing:

[image]

Notice that Topol censored the data for the article.  If we only cared about the probability of crashes, then we would not censor the data for when the device was turned off.
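
For reference, the NIH description boils down to a running product: at each event time, multiply the current survival probability by (1 - events/at-risk).  Here is a minimal sketch of that core (my own, not from the article):

// Kaplan-Meier product-limit core: fold over (events, atRisk) pairs per
// event time. For app X the pairs are (1,9), (1,5), (1,3), (1,1), and the
// scan yields 1.0, 0.889, 0.711, 0.474, 0.0 - matching the article.
let survivalProbabilities (steps: (int * int) list) =
    steps |> List.scan (fun s (events, atRisk) -> s * (1.0 - float events / float atRisk)) 1.0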

So then I was ready to start coding, so I spun up a solution with an F# project for the analysis and a C# project for the testing.

[image]

I then loaded the datasets that Topol used into the unit test project:

[TestMethod]
public void EstimateForApplicationX_ReturnsExpected()
{
    var appX = new CrashMetaData[]
    {
        new CrashMetaData(0,1,false),
        new CrashMetaData(1,5,true),
        new CrashMetaData(2,5,false),
        new CrashMetaData(3,8,false),
        new CrashMetaData(4,10,false),
        new CrashMetaData(5,12,true),
        new CrashMetaData(6,15,false),
        new CrashMetaData(7,18,true),
        new CrashMetaData(8,21,false),
        new CrashMetaData(9,22,true),
    };
}

I could then wire up the unit tests to compare the output to the article and what I had come up with.

public void EstimateForApplicationX_ReturnsExpected()
{
    var appX = new CrashMetaData[]
    {
        new CrashMetaData(0,1,false),
        new CrashMetaData(1,5,true),
        new CrashMetaData(2,5,false),
        new CrashMetaData(3,8,false),
        new CrashMetaData(4,10,false),
        new CrashMetaData(5,12,true),
        new CrashMetaData(6,15,false),
        new CrashMetaData(7,18,true),
        new CrashMetaData(8,21,false),
        new CrashMetaData(9,22,true),
    };

    var expected = new SurvivalProbabilityData[]
    {
        new SurvivalProbabilityData(0,1.000),
        new SurvivalProbabilityData(5,.889),
        new SurvivalProbabilityData(12,.711),
        new SurvivalProbabilityData(18,.474),
        new SurvivalProbabilityData(22,.000)
    };

    KaplanMeierEstimator estimator = new KaplanMeierEstimator();
    var actual = estimator.CalculateSurvivalProbability(appX);

    Assert.AreSame(expected, actual);
}

 

However, one of the neat features of F# is the REPL, so I don’t need to keep running unit tests to prove correctness while I am proving out a concept.  I added equivalent test code at the beginning of the F# project so I could try out my ideas in the REPL:

type CrashMetaData = {userId: int; crashTime: int; crashed: bool}

type KaplanMeierAnalysis() =
    member this.GenerateXAppData () =
                    [|  {userId=0; crashTime=1; crashed=false};{userId=1; crashTime=5; crashed=true};
                        {userId=2; crashTime=5; crashed=false};{userId=3; crashTime=8; crashed=false};
                        {userId=4; crashTime=10; crashed=false};{userId=5; crashTime=12; crashed=true};
                        {userId=6; crashTime=15; crashed=false};{userId=7; crashTime=18; crashed=true};
                        {userId=8; crashTime=21; crashed=false};{userId=9; crashTime=22; crashed=true}|]

    member this.RunAnalysis(crashMetaData: array<CrashMetaData>) =

The first thing I did was duplicate the first three columns of the Excel spreadsheet:

let crashSequence = crashMetaData
                        |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                        | true -> 1
                                                                        | false -> 0),
                                                                 (match crash.crashed with
                                                                        | true -> 0
                                                                        | false -> 1))

 

In the REPL:

image
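(Reconstructing from the data above, the output is a sequence of (time, crashed, notCrashed) tuples: seq [(1, 0, 1); (5, 1, 0); (5, 0, 1); (8, 0, 1); …].)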

The fourth column is tricky because it is a cumulative calculation.  Instead of for..eaching in an imperative style, I took advantage of the functional language constructs to make the code much more readable.  Once I calculated that column outside of the base sequence, I added it back in via Seq.zip:

let cumulativeDevices = crashMetaData.Length

let crashSequence = crashMetaData
                        |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                        | true -> 1
                                                                        | false -> 0),
                                                                 (match crash.crashed with
                                                                        | true -> 0
                                                                        | false -> 1))
let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1) cumulativeDevices crashSequence

let crashSequence' = Seq.zip crashSequence availableDeviceSequence
                            |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)
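Note that Seq.scan yields one more element than its input because the seed comes first.  Seq.zip quietly handles that here: it stops at the end of the shorter sequence, so each row pairs with the device count at the start of its interval (10, 9, 8, …) and the trailing zero is dropped.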

 

In the REPL:

image

The next two columns were a snap: they were just calculations based on the existing values:

let cumulativeDevices = crashMetaData.Length

let crashSequence = crashMetaData
                        |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                        | true -> 1
                                                                        | false -> 0),
                                                                 (match crash.crashed with
                                                                        | true -> 0
                                                                        | false -> 1))
let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1) cumulativeDevices crashSequence

let crashSequence' = Seq.zip crashSequence availableDeviceSequence
                            |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)

let crashSequence'' = crashSequence'
                            |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c/ float cumld, 1.-(float c/ float cumld))
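For example, at time 5 one of the nine devices still running crashes, so the death probability is 1/9 = 0.111 and the interval survival probability is 1 - 1/9 = 0.889.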

 

The last column was another cumulative calculation, so I added another accumulator and used Seq.scan and Seq.zip.

let cumulativeDevices = crashMetaData.Length
let cumulativeSurvivalProbability = 1.

let crashSequence = crashMetaData
                        |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                        | true -> 1
                                                                        | false -> 0),
                                                                 (match crash.crashed with
                                                                        | true -> 0
                                                                        | false -> 1))
let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1) cumulativeDevices crashSequence

let crashSequence' = Seq.zip crashSequence availableDeviceSequence
                            |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)

let crashSequence'' = crashSequence'
                            |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c/ float cumld, 1.-(float c/ float cumld))

let survivalProbabilitySequence = Seq.scan(fun cumulativeSurvivalProbability (t,c,nc,cumld,dp,sp) -> cumulativeSurvivalProbability * sp) cumulativeSurvivalProbability crashSequence''
let survivalProbabilitySequence' = survivalProbabilitySequence
                                            |> Seq.skip 1
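The Seq.skip 1 drops the 1.0 seed that Seq.scan emits as its first element, so each row lines up with the cumulative probability through that row (see note 1 below).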

The last step was to map all of the columns and only output what was in the article.  The final answer is:

namespace ChickenSoftware.SurvivalAnalysis

type CrashMetaData = {userId: int; crashTime: int; crashed: bool}
type public SurvivalProbabilityData = {crashTime: int; survivalProbability: float}

type KaplanMeierEstimator() =
    member this.CalculateSurvivalProbability(crashMetaData: array<CrashMetaData>) =
            let cumulativeDevices = crashMetaData.Length
            let cumulativeSurvivalProbability = 1.

            let crashSequence = crashMetaData
                                    |> Seq.map(fun crash -> crash.crashTime, (match crash.crashed with
                                                                                    | true -> 1
                                                                                    | false -> 0),
                                                                             (match crash.crashed with
                                                                                    | true -> 0
                                                                                    | false -> 1))
            let availableDeviceSequence = Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1) cumulativeDevices crashSequence

            let crashSequence' = Seq.zip crashSequence availableDeviceSequence
                                        |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)

            let crashSequence'' = crashSequence'
                                        |> Seq.map(fun (t,c,nc,cumld) -> t,c,nc,cumld, float c/ float cumld, 1.-(float c/ float cumld))

            let survivalProbabilitySequence = Seq.scan(fun cumulativeSurvivalProbability (t,c,nc,cumld,dp,sp) -> cumulativeSurvivalProbability * sp) cumulativeSurvivalProbability crashSequence''
            let survivalProbabilitySequence' = survivalProbabilitySequence
                                                        |> Seq.skip 1

            let crashSequence''' = Seq.zip crashSequence'' survivalProbabilitySequence'
                                        |> Seq.map(fun ((t,c,nc,cumld,dp,sp),cumlsp) -> t,c,nc,cumld,dp,sp,cumlsp)
            crashSequence'''
                    |> Seq.filter(fun (t,c,nc,cumld,dp,sp,cumlsp) -> c=1)
                    |> Seq.map(fun (t,c,nc,cumld,dp,sp,cumlsp) -> t,System.Math.Round(cumlsp,3))
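To exercise it in the REPL, something like this works (a sketch, reusing the Application X data from the tests):

let appX =
    [|  {userId=0; crashTime=1; crashed=false}; {userId=1; crashTime=5; crashed=true};
        {userId=2; crashTime=5; crashed=false}; {userId=3; crashTime=8; crashed=false};
        {userId=4; crashTime=10; crashed=false}; {userId=5; crashTime=12; crashed=true};
        {userId=6; crashTime=15; crashed=false}; {userId=7; crashTime=18; crashed=true};
        {userId=8; crashTime=21; crashed=false}; {userId=9; crashTime=22; crashed=true} |]

KaplanMeierEstimator().CalculateSurvivalProbability(appX)
|> Seq.iter (printfn "%A")
// (5, 0.889); (12, 0.711); (18, 0.474); (22, 0.0)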

image

And this matches the article (almost exactly).  The article also has a row for iteration zero, which I did not bake in.  Instead of fixing my code, I changed the unit test and removed that first row.  In any event, I ran the test and it ran red, even though the values are identical.  The culprit is Assert.AreSame(), which checks reference equality rather than value equality; something like CollectionAssert.AreEqual() on matching types would be closer to what I want.  I would take the time to fix it, but it is 75 degrees on a Sunday afternoon and I want to go play catch with my kids…

image

Note it also matches the other data set Topol has in the article:

image

In any event, this code reads pretty much the way I was thinking about the problem: each column of the Excel spreadsheet has a one-to-one correspondence to an F# code block.  I did use explanatory variables liberally, which might offend the more advanced functional programmers, but taking each step in turn really helped me focus on getting each step correct before going to the next one.

1) I had to offset the cumulativeSurvivalProbability by one because the calculation is how many crashed on a day compared to how many were working at the start of the day.  Seq.scan emits the accumulator for the next row of the sequence and I need it for the current row; since Seq.scan always yields the seed as its first element and there is no overload that omits it, the Seq.skip 1 does the job.

2) I adopted the functional convention of using ticks to denote different physical manifestations of the same logical concept (crashSequence “became” crashSequence’, etc.).  Since everything is immutable by default in F#, this kind of naming convention makes a lot of sense to me.  However, I can see it quickly becoming unwieldy.

3) I could not figure out how to operate on the base tuple, so instead I used a couple of supporting sequences and then put everything together using Seq.zip.  I assume there is a more efficient way to do that.

4) One of the knocks against functional/scientific programming is that values are named poorly.  To combat that, I used full names in my tuples to start.  After a certain point, though, the names got too unwieldy, so I resorted to their initials.  I am not sure what the right answer is here, or even if there is a right answer.


Microsoft Language Stack Analogy

I am getting ready for my presentations at Charlotte Code Camp next Saturday.  My F# session is a business-case-driven one: reasons why the average C# developer might want to take a look at F#.  I break the session down into five sections: F# is integrated, fast, expressive, bug-resistant, and analytical.  In the “fast” piece, I am going to make the analogy of Visual Studio to a garage.

Consider a man who lives in a nice house in a suburban neighborhood with a three-car garage.  Every morning when he gets ready for his commute to work, he opens the door that goes from his house into the garage, and sitting there in the first bay is a minivan.

image

Now there is nothing wrong with the minivan: it is dependable, all of the neighbors drive one, and it does many things pretty well.  However, consider that right next to the minivan, never used, is a Ferrari.  Our suburban programmer has heard about the Ferrari, and has perhaps even glanced at it curiously as he pulls out in the morning, but he:

  • Doesn’t see the point of driving it because the minivan suits him just fine
  • Is afraid to try driving it because he doesn’t drive stick and taking the time to learn would slow him down
  • Doesn’t want to drive it because then he would have to explain to his project-manager wife why he is driving around town in such a car

So the Ferrari sits unused.  To round out the analogy, in the third bay is a helicopter that no one in their right mind will touch.  Finally, there is a junked car around back that no one uses anymore but that he has to keep because it is too expensive to haul to the junkyard.

image

 

So this is what happens to the majority of .NET developers when they open their garage called Visual Studio.  They go with the comfortable language of the C# minivan, ignoring the power and expressiveness of the F# Ferrari and certainly not touching the C++ helicopter.  I picked a helicopter for C++ because helicopters can go places cars cannot, are notoriously difficult to pilot, and when they crash, it is often spectacular and brings down others with them.  The junked car is VB.NET, which makes me sad on certain days…

Also, since C# 2.0, the minivan has tried to become more Ferrari-like.  It has added a turbo engine called LINQ, the var keyword, anonymous types, and the dynamic keyword, all in an attempt to become the one minivan to rule them all.

image

I don’t know much about Roslyn, but from what I have seen, I think I can take away language syntax and the code will still compile.  If so, I will try to write a C# program that removes all curly braces and semicolons and replaces the var keyword with let.  Is it still C# then?

OT: can you tell which session I am doing at the Hartford Code Camp in 2 weeks?

image

(And no, I did not submit in all caps.  I guess the organizer is very excited about the topic?)

F# and List manipulations

I am preparing for a Beginning F# dojo for TRINUG tomorrow, and I decided to do a presentation on Seq.groupBy, Seq.countBy, and Seq.sumBy for tuples.  The differences among these constructs are not apparent from the names alone, and I think having a working knowledge of them is indispensable when doing any kind of list analysis.

I started with a basic list like so:

let data = [("A",1);("A",3);("B",2);("C",1)]

I then ran a GroupBy through the REPL and got the following results:

let grouping = data
                |> Seq.groupBy(fun (letter,number) -> letter)
                |> Seq.iter (printfn "%A")

("A", seq [("A", 1); ("A", 3)])
("B", seq [("B", 2)])
("C", seq [("C", 1)])

I then ran a CountBy through the REPL and got the following results:

let counting = data
                |> Seq.countBy(fun (letter,number) -> letter)
                |> Seq.iter (printfn "%A")

("A", 2)
("B", 1)
("C", 1)

I then ran a SumBy through the REPL and got the following results:

let summing = data
                |> Seq.sumBy(fun (letter,number) -> number)
                |> printfn "%A"

7

Now the fun begins.  I combined a GroupBy and a CountBy through the REPL and got the following results:

let groupingAndCounting = data
                        |> Seq.groupBy(fun (letter,number) -> letter)
                        |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.countBy snd))
                        |> Seq.iter (printfn "%A")

("A", seq [(1, 1); (3, 1)])
("B", seq [(2, 1)])
("C", seq [(1, 1)])

Next I combined a GroupBy and a SumBy through the REPL and got the following results:

let groupingAndSumming = data
                            |> Seq.groupBy(fun (letter,number) -> letter)
                            |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.sumBy snd))
                            |> Seq.iter (printfn "%A")

("A", 4)
("B", 2)
("C", 1)

I then combined all three:

let groupingAndCountingSummed = data
                                |> Seq.groupBy(fun (letter,number) -> letter)
                                |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.countBy snd))
                                |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.sumBy snd))
                                |> Seq.iter (printfn "%A")

("A", 2)
("B", 1)
("C", 1)

With this in hand, I created a way of both counting and summing the second value of a tuple, which is a pretty common task:

let revisedData =
    let summed = data
                    |> Seq.groupBy(fun (letter,number) -> letter)
                    |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.sumBy snd))
    let counted = data
                    |> Seq.groupBy(fun (letter,number) -> letter)
                    |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.countBy snd))
                    |> Seq.map(fun (letter,sequence) -> (letter,sequence |> Seq.sumBy snd))
    Seq.zip summed counted
                    |> Seq.map(fun ((letter,s),(_,c)) -> letter,s,c)
                    |> Seq.iter (printfn "%A")

("A", 4, 2)
("B", 2, 1)
("C", 1, 1)

Finally, Mathias pointed out that I could use this as an entry point to Deedle, which is a really good idea…