Using IBM’s Watson With F#

 

I think everyone is aware of IBM’s Watson from its appearance on Jeopardy.  Apparently, IBM has made the Watson API available to developers if you sign up here.  Well, there goes my Sunday morning!  I signed up, and after one email confirmation, I was in.
IBM has tied Watson to something called “Bluemix”, which looks to be a full-service suite of applications from deployment to hosting.  When I looked at the API documentation here, I decided to use the language-translation service as a good “hello world” project.  Looking at the API help page, I was hoping just to make a request and get a response with an auth token in the header, like every other API in the world.  However, the documentation really leads you down a path of installing the Watson Explorer on your local machine, creating a Bluemix project, etc.
Fortunately, the documentation has some pointers to other projects where people have made their own apps.  I used this one as a model and set up Fiddler like so

image

The authorization token is the username and password separated by a colon, encoded to Base64.
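For reference, here is a minimal F# sketch of building that value (the helper name is mine, not from the Watson docs):

open System
open System.Text

// Basic auth token: Base64("userName:password")
let toBasicAuthToken (userName:string) (password:string) =
    Convert.ToBase64String(Encoding.UTF8.GetBytes(userName + ":" + password))

// toBasicAuthToken "you@aol.com" "secret" -> "eW91QGFvbC5jb206c2VjcmV0"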
Sure enough, a 200

image

Setting it up in F# was a snap:
#r @"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.5\System.Net.Http.dll"
#r @"..\packages\Microsoft.AspNet.WebApi.Client.5.2.2\lib\net45\System.Net.Http.Formatting.dll"

open System
open System.Net.Http
open System.Net.Http.Headers
open System.Net.Http.Formatting
open System.Collections.Generic

let serviceName = "machine_translation"
let baseUrl = "http://wex-mt.mybluemix.net/resources/translate"
let userName = "yourNameHere@aol.com"
let password = "yourPasswordHere"
// note: Basic auth expects this pair Base64-encoded (see the helper above)
let authKey = userName + ":" + password

let client = new HttpClient()
client.DefaultRequestHeaders.Authorization <- new AuthenticationHeaderValue("Basic",authKey)

let input = new Dictionary<string,string>()
input.Add("text","This is a test")
input.Add("sid","mt-enus-eses")
let content = new FormUrlEncodedContent(input)

let result = client.PostAsync(baseUrl,content).Result
let resultContent = result.Content.ReadAsStringAsync().Result

And sure enough

image

 

You can see the gist here.
So with that simple call/request under my belt, I decided to look at the API that everyone is talking about: the question/answer API.  I fired up Fiddler again and took a look at the docs.  After some tweaking of the Uri, I got a successful request/response:

image

image

The answers to an empty question are kind of interesting, if not head-scratching:

image

So passing in a question:

image

image

So we are cooking with gas.  Back into FSI

#r @"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.5\System.Net.Http.dll"
#r @"..\packages\Microsoft.AspNet.WebApi.Client.5.2.2\lib\net45\System.Net.Http.Formatting.dll"

open System
open System.Net.Http
open System.Net.Http.Headers
open System.Net.Http.Formatting
open System.Collections.Generic

let baseUrl = "http://wex-qa.mybluemix.net/resources/question"
let userName = "yourName@aol.com"
let password = "yourCreds"
let authKey = userName + ":" + password

let client = new HttpClient()
client.DefaultRequestHeaders.Authorization <- new AuthenticationHeaderValue("Basic",authKey)

let input = new Dictionary<string,string>()
input.Add("question","what time is it")
let content = new FormUrlEncodedContent(input)

let result = client.PostAsync(baseUrl,content).Result
let resultContent = result.Content.ReadAsStringAsync().Result

With the result like so

image

And since it is Json coming back, why not use the type provider?

// JsonProvider lives in FSharp.Data, which this listing assumes is already referenced
// (the version path is assumed from the packages folder used elsewhere in this post):
#r @"..\packages\FSharp.Data.2.0.14\lib\net40\FSharp.Data.dll"
open FSharp.Data

let client = new HttpClient()
client.DefaultRequestHeaders.Authorization <- new AuthenticationHeaderValue("Basic",authKey)

let input = new Dictionary<string,string>()
input.Add("question","How can I quit smoking")
let content = new FormUrlEncodedContent(input)

let result = client.PostAsync(baseUrl,content).Result
let resultContent = result.Content.ReadAsStringAsync().Result

type qaResponse = JsonProvider< @".\QAResponseJson.json" >
let qaAnswer = qaResponse.Parse(resultContent)

qaAnswer.Question.Answers
|> Seq.ofArray
|> Seq.iter(fun a -> printfn "(%s)" a.Text)

Here is Watson’s response:

image

You can see the gist here

Smart Nerd Dinner

I think there is general agreement that the age of the ASP.NET wire-framing post-back web dev is over.  If you are going to write web applications in 2015 on the .NET stack, you have to be able to use JavaScript and associated JavaScript frameworks like Angular.  Similarly, the full-stack developer needs to have a much deeper understanding of the data that is passing in and out of their application.  With the rise of analytics in applications, the developer needs different tools and approaches.  Just as you need to know JavaScript if you are going to be in the browser, you need to know F# if you are going to be building industrial-grade domain and data layers.

I decided to refactor an existing ASP.NET postback website to see how hard it would be to introduce F# to the project and apply some basic statistics to make the site smarter.  It was pretty easy and the payoffs were quite large.

If you are not familiar, Nerd Dinner is the canonical example of an MVC application that was created to show Microsoft web devs how to create a website using the .NET stack.  The original project was put into a book by the Mount Rushmore of MSFT uber-devs

image

The project was so successful that it actually was launched into a real website

image

and you can find the code on Codeplex here

image

When you download the source code from the repository, you will notice a couple of things:

1) It is not a very big project – with only 1100 lines of code

image

2) There are 191 FxCop violations

image

3) It does compile straight out of source control, but some of the unit tests fail

image

4) There is pretty low code coverage (21%)

image

Focusing on the code coverage issue, it makes sense that there is not much code coverage because there is not much code that can be covered.  There are maybe 15 lines of “business logic”, if the term business logic is expanded to include input validation.  Here is an example:

image

Also, there are maybe ten lines of code that do some basic filtering

image

So step one in the quest to refactor Nerd Dinner to be a bit smarter was to rename the projects.  Since MVC is a UI framework, it made sense to call it that.  I then changed the namespaces to reflect the new structure

image

The next step was to take the domain classes out of the UI and put them into the application.  First, I created another project

image

I then took all of the interfaces that were in the UI and placed them into the application

namespace NerdDinner.Models

open System
open System.Linq
open System.Linq.Expressions

type IRepository<'T> =
    abstract All : IQueryable<'T>
    abstract AllIncluding : [<ParamArray>] includeProperties:Expression<Func<'T, obj>>[] -> IQueryable<'T>
    abstract member Find: int -> 'T
    abstract member InsertOrUpdate: 'T -> unit
    abstract member Delete: int -> unit
    abstract member SubmitChanges: unit -> unit

type IDinnerRepository =
    inherit IRepository<Dinner>
    abstract member FindByLocation: float*float -> IQueryable<Dinner>
    abstract FindUpcomingDinners : unit -> IQueryable<Dinner>
    abstract FindDinnersByText : string -> IQueryable<Dinner>
    abstract member DeleteRsvp: 'T -> unit

I then took all of the data structures/models and placed them in the application.

namespace NerdDinner.Models

open System
open System.Web.Mvc
open System.Collections.Generic
open System.ComponentModel.DataAnnotations
open System.ComponentModel.DataAnnotations.Schema

type public LocationDetail (latitude,longitude,title,address) =
    let mutable latitude = latitude
    let mutable longitude = longitude
    let mutable title = title
    let mutable address = address

    member public this.Latitude
        with get() = latitude
        and set(value) = latitude <- value

    member public this.Longitude
        with get() = longitude
        and set(value) = longitude <- value

    member public this.Title
        with get() = title
        and set(value) = title <- value

    member public this.Address
        with get() = address
        and set(value) = address <- value

type public RSVP () =
    let mutable rsvpID = 0
    let mutable dinnerID = 0
    let mutable attendeeName = ""
    let mutable attendeeNameId = ""
    let mutable dinner = null

    member public self.RsvpID
        with get() = rsvpID
        and set(value) = rsvpID <- value

    member public self.DinnerID
        with get() = dinnerID
        and set(value) = dinnerID <- value

    member public self.AttendeeName
        with get() = attendeeName
        and set(value) = attendeeName <- value

    member public self.AttendeeNameId
        with get() = attendeeNameId
        and set(value) = attendeeNameId <- value

    member public self.Dinner
        with get() = dinner
        and set(value) = dinner <- value

and public Dinner () =
    let mutable dinnerID = 0
    let mutable title = ""
    let mutable eventDate = DateTime.MinValue
    let mutable description = ""
    let mutable hostedBy = ""
    let mutable contactPhone = ""
    let mutable address = ""
    let mutable country = ""
    let mutable latitude = 0.
    let mutable longitude = 0.
    let mutable hostedById = ""
    let mutable rsvps = List<RSVP>() :> ICollection<RSVP>

    [<HiddenInput(DisplayValue=false)>]
    member public self.DinnerID
        with get() = dinnerID
        and set(value) = dinnerID <- value

    [<Required(ErrorMessage="Title Is Required")>]
    [<StringLength(50,ErrorMessage="Title may not be longer than 50 characters")>]
    member public self.Title
        with get() = title
        and set(value) = title <- value

    [<Required(ErrorMessage="EventDate Is Required")>]
    [<Display(Name="Event Date")>]
    member public self.EventDate
        with get() = eventDate
        and set(value) = eventDate <- value

    [<Required(ErrorMessage="Description Is Required")>]
    [<StringLength(256,ErrorMessage="Description may not be longer than 256 characters")>]
    [<DataType(DataType.MultilineText)>]
    member public self.Description
        with get() = description
        and set(value) = description <- value

    [<StringLength(256,ErrorMessage="Hosted By may not be longer than 256 characters")>]
    [<Display(Name="Hosted By")>]
    member public self.HostedBy
        with get() = hostedBy
        and set(value) = hostedBy <- value

    [<Required(ErrorMessage="Contact Phone Is Required")>]
    [<StringLength(20,ErrorMessage="Contact Phone may not be longer than 20 characters")>]
    [<Display(Name="Contact Phone")>]
    member public self.ContactPhone
        with get() = contactPhone
        and set(value) = contactPhone <- value

    [<Required(ErrorMessage="Address Is Required")>]
    [<StringLength(20,ErrorMessage="Address may not be longer than 50 characters")>]
    [<Display(Name="Address")>]
    member public self.Address
        with get() = address
        and set(value) = address <- value

    [<UIHint("CountryDropDown")>]
    member public this.Country
        with get() = country
        and set(value) = country <- value

    [<HiddenInput(DisplayValue=false)>]
    member public self.Latitude
        with get() = latitude
        and set(value) = latitude <- value

    [<HiddenInput(DisplayValue=false)>]
    member public v.Longitude
        with get() = longitude
        and set(value) = longitude <- value

    [<HiddenInput(DisplayValue=false)>]
    member public self.HostedById
        with get() = hostedById
        and set(value) = hostedById <- value

    member public self.RSVPs
        with get() = rsvps
        and set(value) = rsvps <- value

    member public self.IsHostedBy (userName:string) =
        System.String.Equals(hostedBy,userName,System.StringComparison.Ordinal)

    member public self.IsUserRegistered(userName:string) =
        rsvps |> Seq.exists(fun r -> r.AttendeeName = userName)

    [<UIHint("Location Detail")>]
    [<NotMapped()>]
    member public self.Location
        with get() = new LocationDetail(self.Latitude,self.Longitude,self.Title,self.Address)
        and set(value:LocationDetail) =
            // the original binding read these into locals and discarded them; copy them back instead
            self.Latitude <- value.Latitude
            self.Longitude <- value.Longitude
            self.Title <- value.Title
            self.Address <- value.Address

Unlike C#, where there is typically a class per file, all of the related elements are placed in the same location.  Also, notice the absence of semi-colons, curly braces, and other distracting characters.  Finally, you can see that because we are in the .NET framework, all of the data annotations are the same.  Sure enough, pointing the MVC UI to the application and hitting run, the application just works.

image

With the separation complete, it was time to make our app much smarter.  The first thing that I thought of was that when a person creates an account, they enter their first and last name.

 

This seems like an excellent opportunity to add some personalization to our site.  Going back to this analysis of names given to newborns in the United States, if I know your first name, I have a pretty good chance of guessing your age, gender, and state of birth.  For example, ‘Jose’ is probably a male born in his twenties in either Texas or California.  ‘James’ is probably a male in his 40s or 50s.

I added 6 pictures to the site, for young, middleAged, and old males and females.

image

 

I then modified the logonStatus partial view like so

@using NerdDinner.UI;

@if(Request.IsAuthenticated) {
    <text>Welcome <b>@(((NerdIdentity)HttpContext.Current.User.Identity).FriendlyName)</b>!
    [ @Html.ActionLink("Log Off", "LogOff", "Account") ]</text>
}
else {
    @:[ @Html.ActionLink("Log On", "LogOn", new { controller = "Account", returnUrl = HttpContext.Current.Request.RawUrl }) ]
}

@if (Session["adUri"] != null)
{
    <img alt="product placement" title="product placement" src="@Session["adUri"]" height="40" />
}

Then, I created a session variable called adUri that the picture references, set in the LogOn controller:

public ActionResult LogOn(LogOnModel model, string returnUrl)
{
    if (ModelState.IsValid)
    {
        if (ValidateLogOn(model.UserName, model.Password))
        {
            // Make sure we have the username with the right capitalization
            // since we do case sensitive checks for OpenID Claimed Identifiers later.
            string userName = MembershipService.GetCanonicalUsername(model.UserName);

            FormsAuth.SignIn(userName, model.RememberMe);

            AdProvider adProvider = new AdProvider();
            String catagory = adProvider.GetCatagory(userName);
            Session["adUri"] = "/Content/images/" + catagory + ".png";
            // ...

And finally, I added an implementation of the AdProvider back in the application:

type AdProvider () =
    member this.GetCatagory personName: string =
        "middleAgedMale"

So running the app, we have a product placement for a Middle Aged Male

image

So the last thing to do is to turn names into those categories.  I thought of a couple of different implementations: loading the entire census data set into memory and searching it on demand; using Azure ML and making an API request each time; or just creating a lookup table that can be searched.  I went with the lookup table.  In any event, since I am using an interface, swapping out implementations is easy, and since I am using F#, creating implementations is easy.
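The interface itself isn’t shown in this post; a minimal sketch of what it might look like (the member name is taken from the AdProvider implementation above):

type IAdProvider =
    // takes a person's first name and returns an ad category such as "middleAgedMale"
    abstract member GetCatagory : personName:string -> string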

I went back to my script file that analyzed the baby names from the US census and created a new script.  I loaded the names into memory like before

#r "C:/Git/NerdChickenChicken/04_mvc3_Working/packages/FSharp.Data.2.0.14/lib/net40/FSharp.Data.dll"

open FSharp.Data

// the census files have no header row, so the CSV type provider names the columns
// after the first data row (hence r.Mary, r.F, r.``1910``, and r.``14`` below)
type censusDataContext = CsvProvider<"https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/AK.TXT">
type stateCodeContext = CsvProvider<"https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/states.csv">

let stateCodes = stateCodeContext.Load("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/states.csv")

let fetchStateData (stateCode:string) =
    let uri = System.String.Format("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/{0}.TXT",stateCode)
    censusDataContext.Load(uri)

let usaData = stateCodes.Rows
              |> Seq.collect(fun r -> fetchStateData(r.Abbreviation).Rows)
              |> Seq.toArray

I then created a function that computes the probability that a name is male:

let genderSearch name =
    // r.Mary is the name column, r.F the gender column, r.``14`` the count column
    let nameFilter = usaData
                     |> Seq.filter(fun r -> r.Mary = name)
                     |> Seq.groupBy(fun r -> r.F)
                     |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``))

    let nameSum = nameFilter |> Seq.sumBy(fun (n,c) -> c)
    nameFilter
    |> Seq.map(fun (n,c) -> n, c, float c/float nameSum)
    |> Seq.filter(fun (g,c,p) -> g = "M")
    |> Seq.map(fun (g,c,p) -> p)
    |> Seq.head

genderSearch "James"

image

I then created a function that calculates the last year the name was popular (using 1 standard deviation above the average as the cutoff):

let ageSearch name =
    // group the name's counts by year (r.``1910`` is the year column)
    let nameFilter = usaData
                     |> Seq.filter(fun r -> r.Mary = name)
                     |> Seq.groupBy(fun r -> r.``1910``)
                     |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``))
                     |> Seq.toArray
    let nameSum = nameFilter |> Seq.sumBy(fun (n,c) -> c)
    nameFilter
    |> Seq.map(fun (n,c) -> n, c, float c/float nameSum)
    |> Seq.toArray

let variance (source:float seq) =
    let mean = Seq.average source
    let deltas = Seq.map(fun x -> pown(x-mean) 2) source
    Seq.average deltas

let standardDeviation(values:float seq) =
    sqrt(variance(values))

let standardDeviation' name = ageSearch name
                              |> Seq.map(fun (y,c,p) -> float c)
                              |> standardDeviation

let average name = ageSearch name
                   |> Seq.map(fun (y,c,p) -> float c)
                   |> Seq.average

let attachmentPoint name = (average name) + (standardDeviation' name)

let popularYears name =
    let allYears = ageSearch name
    let attachmentPoint' = attachmentPoint name
    let filteredYears = allYears
                        |> Seq.filter(fun (y,c,p) -> float c > attachmentPoint')
                        |> Seq.sortBy(fun (y,c,p) -> y)
    filteredYears

let lastPopularYear name = popularYears name |> Seq.last
let firstPopularYear name = popularYears name |> Seq.head

lastPopularYear "James"

image

 

And then I created a function that takes in the probability of being male and the last year the name was popular and assigns the name to a category:

let nameAssignment (malePercent, lastYearPopular) =
    match malePercent > 0.75, malePercent < 0.75, lastYearPopular < 1945, lastYearPopular > 1980 with
    | true, false, true, false -> "oldMale"
    | true, false, false, false -> "middleAgedMale"
    | true, false, false, true -> "youngMale"
    | false, true, true, false -> "oldFemale"
    | false, true, false, false -> "middleAgedFemale"
    | false, true, false, true -> "youngFeMale"
    | _,_,_,_ -> "unknown"
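As an aside, the same assignment logic can be written with match guards, which reads a bit more directly than the four-boolean tuple (a sketch, intended to be equivalent to the function above):

let nameAssignment' (malePercent, lastYearPopular) =
    match malePercent, lastYearPopular with
    | p, y when p > 0.75 && y < 1945 -> "oldMale"
    | p, y when p > 0.75 && y > 1980 -> "youngMale"
    | p, _ when p > 0.75 -> "middleAgedMale"
    | p, y when p < 0.75 && y < 1945 -> "oldFemale"
    | p, y when p < 0.75 && y > 1980 -> "youngFeMale"
    | p, _ when p < 0.75 -> "middleAgedFemale"
    | _ -> "unknown"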

And then it was a matter of tying the functions together for each of the names in the master list:

let nameList = usaData
               |> Seq.map(fun r -> r.Mary)
               |> Seq.distinct

nameList
|> Seq.map(fun n -> n, genderSearch n)
|> Seq.map(fun (n,mp) -> n,mp, lastPopularYear n)
|> Seq.map(fun (n,mp,(y,c,p)) -> n, mp, y)

let nameList' = nameList
                |> Seq.map(fun n -> n, genderSearch n)
                |> Seq.map(fun (n,mp) -> n,mp, lastPopularYear n)
                |> Seq.map(fun (n,mp,(y,c,p)) -> n, mp, y)
                |> Seq.map(fun (n,mp,y) -> n,nameAssignment(mp,y))

image

And then I wrote the list out to a file:

open System.IO
let outFile = new StreamWriter(@"c:\data\nameList.csv")

nameList' |> Seq.iter(fun (n,c) -> outFile.WriteLine(sprintf "%s,%s" n c))
outFile.Flush()   // the original listing had outFile.Flush without parentheses, which never invokes it
outFile.Close()

Thanks to this Stack Overflow post for the file write (I wish the CSV type provider had this ability).  With the file created, I can then use the file as a lookup for my name function back in the MVC app using a CSV type provider:

type nameMappingContext = CsvProvider<"C:/data/nameList.csv">

type AdProvider () =
    member this.GetCatagory personName: string =
        // nameList.csv has no header row, so the provider names the columns after the
        // first data row (hence r.Annie for the name and r.oldFemale for the category)
        let nameList = nameMappingContext.Load("C:/data/nameList.csv")
        let foundName = nameList.Rows
                        |> Seq.filter(fun r -> r.Annie = personName)
                        |> Seq.map(fun r -> r.oldFemale)
                        |> Seq.toArray
        if foundName.Length > 0 then
            foundName.[0]
        else
            "middleAgedMale"
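As an aside, GetCatagory reloads the CSV on every call; here is a sketch of loading it once and caching the lookup (same assumed file path as above):

let nameLookup =
    lazy (nameMappingContext.Load("C:/data/nameList.csv").Rows
          |> Seq.map (fun r -> r.Annie, r.oldFemale)
          |> dict)

type AdProvider' () =
    member this.GetCatagory (personName: string) =
        // the dictionary is built on first use and reused afterwards
        if nameLookup.Value.ContainsKey personName then nameLookup.Value.[personName]
        else "middleAgedMale"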

And now I have some (basic) personalization in Nerd Dinner.  (Emma is a young female name, so they get a picture of a campground.)

image

So this is rather crude.  There is no provision for nicknames, case-sensitivity, etc.  But the site is on its way to becoming smarter…

The code can be found on github here.

Wake County Restaurant Inspection Data with Azure ML and F#

With Azure ML now available, I was thinking about some of the analysis I did last year and how I could do even more things with the same data set.  One such analysis that came to mind was the restaurant inspection data that I analyzed last year.  You can see the prior analysis here.

I uploaded the restaurant data into Azure and thought of a simple question –> can we predict inspection scores based on some easily available data?  This is an interesting dataset because there are some categorical data elements (zip code, restaurant type, etc…) and there are some continuous ones (priority foundation, etc…).

Here is the base dataset:

image

I created a new experiment and I used a boosted regression model and a neural network regression and used a 70/30 train/test split.

image

After running the models and inspecting the model evaluation, I don’t have a very good model

image

I then decided to go back and pull some of the X variables out of the dataset and concentrate on only a couple of variables.  I added a project column module and then selected Restaurant Type and Zip Code as the X variables and left the Inspection Score as the Y variable. 

image

With this done, I added a couple more models (Bayesian Linear Regression and a Decision Forest Regression) and gave it a whirl

image

image

Interestingly, adding these models did not give us any better a prediction, and dropping down to two variables made a less accurate model.  Without doing any more analysis, I picked the model with the lowest MAE (Boosted Decision Tree Regression) and published it as a web service:

image

I published it as a web service and now I can consume it from a client app.   I used the code from the voting analysis (found here) as a template, and sure enough:

["27519","Restaurant","0","96.0897827148438"]

["27612","Restaurant","0","95.5728530883789"]

So restaurants in Cary, NC have a higher predicted inspection score than the ones found in Northwest Raleigh.   However, before we start alerting the Cary Chamber of Commerce to create a marketing campaign (“Eat in Cary, we are safer”), note that the difference is within the MAE.
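For anyone following along, MAE is just the mean absolute error between predicted and actual scores; a quick F# definition (the function name and example values are mine):

// mean absolute error: the average of |actual - predicted|
let meanAbsoluteError (actual: float seq) (predicted: float seq) =
    Seq.map2 (fun a p -> abs (a - p)) actual predicted
    |> Seq.average

meanAbsoluteError [96.0; 95.5] [94.0; 97.0]   // 1.75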

In any event, it would be easy to create a phone app: if you don’t know a restaurant’s score, you can punch in the establishment type and the zip code and have a good idea about the score of the restaurant.

This is an academic exercise because the establishments have to show you their card and Yelp has their scores, but it was a fun exercise nonetheless.  Happy eating.

Consuming Azure ML With F#

(This post is a continuation of this one)

So with a model that works well enough,  I selected only that model and saved it

image

 

image

I created a new experiment and used that model with the base data.  I then marked the project columns as the input and the score as the output (green and blue circles, respectively)

image

After running it, I published it as a web service

image

And voila, an endpoint ready to go.  I then took the auto-generated script and opened up a new Visual Studio F# project to use it.  The problem was that this is the data structure the model needs

FeatureVector = new Dictionary<string, string>() {
    { "Precinct", "0" }, { "VRN", "0" }, { "VRstatus", "0" }, { "VRlastname", "0" },
    { "VRfirstname", "0" }, { "VRmiddlename", "0" }, { "VRnamesufx", "0" }, { "VRstreetnum", "0" },
    { "VRstreethalfcode", "0" }, { "VRstreetdir", "0" }, { "VRstreetname", "0" }, { "VRstreettype", "0" },
    { "VRstreetsuff", "0" }, { "VRstreetunit", "0" }, { "VRrescity", "0" }, { "VRstate", "0" },
    { "Zip Code", "0" }, { "VRfullresstreet", "0" }, { "VRrescsz", "0" }, { "VRmail1", "0" },
    { "VRmail2", "0" }, { "VRmail3", "0" }, { "VRmail4", "0" }, { "VRmailcsz", "0" },
    { "Race", "0" }, { "Party", "0" }, { "Gender", "0" }, { "Age", "0" },
    { "VRregdate", "0" }, { "VRmuni", "0" }, { "VRmunidistrict", "0" }, { "VRcongressional", "0" },
    { "VRsuperiorct", "0" }, { "VRjudicialdistrict", "0" }, { "VRncsenate", "0" }, { "VRnchouse", "0" },
    { "VRcountycomm", "0" }, { "VRschooldistrict", "0" }, { "11/6/2012", "0" }, { "Voted Ind", "0" },
},
GlobalParameters = new Dictionary<string, string>() { }

And since I am only using 6 of the columns, it made sense to reload the Wake County voter data with just the needed columns.  I went back to the original CSV and did that.  Interestingly, I could not set the original dataset as the publish input, so I added a project column module that does nothing

image

With that in place, I republished the service and opened Visual Studio.  I decided to start with a script.  I was struggling through the async when Tomas P helped me on Stack Overflow here.  I’ll say it again, the F# community is tops.  In any event, here is the initial script:

#r @"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.5\System.Net.Http.dll"
#r @"..\packages\Microsoft.AspNet.WebApi.Client.5.2.2\lib\net45\System.Net.Http.Formatting.dll"

open System
open System.Net.Http
open System.Net.Http.Headers
open System.Net.Http.Formatting
open System.Collections.Generic

type scoreData = {FeatureVector:Dictionary<string,string>;GlobalParameters:Dictionary<string,string>}
type scoreRequest = {Id:string; Instance:scoreData}

let invokeService () = async {
    let apiKey = ""
    let uri = "https://ussouthcentral.services.azureml.net/workspaces/19a2e623b6a944a3a7f07c74b31c3b6d/services/f51945a42efa42a49f563a59561f5014/score"
    use client = new HttpClient()
    client.DefaultRequestHeaders.Authorization <- new AuthenticationHeaderValue("Bearer",apiKey)
    client.BaseAddress <- new Uri(uri)

    let input = new Dictionary<string,string>()
    input.Add("Zip Code","27519")
    input.Add("Race","W")
    input.Add("Party","UNA")
    input.Add("Gender","M")
    input.Add("Age","45")
    input.Add("Voted Ind","1")

    let instance = {FeatureVector=input; GlobalParameters=new Dictionary<string,string>()}
    let scoreRequest = {Id="score00001";Instance=instance}

    let! response = client.PostAsJsonAsync("",scoreRequest) |> Async.AwaitTask
    let! result = response.Content.ReadAsStringAsync() |> Async.AwaitTask
    if response.IsSuccessStatusCode then
        printfn "%s" result
    else
        printfn "FAILED: %s" result
    response |> ignore }

invokeService() |> Async.RunSynchronously

 

Unfortunately, when I run it, it fails.  Below is the Fiddler trace:

image

 

So it looks like the Json serializer is appending the “@” symbol to the field names.  I changed the records to plain types and voila:

image
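For reference, I read “changed the records to types” as swapping the records for plain classes, whose property names serialize without the backing-field decoration; a sketch of that change, assuming the same shapes as the script above:

type ScoreData (featureVector:Dictionary<string,string>, globalParameters:Dictionary<string,string>) =
    member this.FeatureVector = featureVector
    member this.GlobalParameters = globalParameters

type ScoreRequest (id:string, instance:ScoreData) =
    member this.Id = id
    member this.Instance = instance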

You can see the final script here.

So then, throwing in some different numbers:

  • A millennial: ["27519","W","D","F","25","1","1","0.62500011920929"]
  • A senior citizen: ["27519","W","D","F","75","1","1","0.879632294178009"]

I wonder why social security never gets cut?

In any event, just to check the model:

  • A 15 year old: ["27519","W","D","F","15","1","0","0.00147285079583526"]

Azure ML and Wake County Election Data

I have been spending the last couple of weeks using Azure ML and I think it is one of the most exciting technologies for business developers and analysts since ODBC and F# type providers.   If you remember, when ODBC came out, every relational database in the world became accessible and therefore usable/analyzable.   When type providers came out, programming, exploring, and analyzing data sources became much easier, and the reach expanded from RDBMSs to all formats (notably Json).  So getting data was no longer a problem, but analyzing it still was.

Enter Azure ML. 

I downloaded the Wake County Voter History data from here.  I took the Excel spreadsheet and converted it to a .csv locally.  I then logged into Azure ML and imported the data

image

I then created an experiment and added the dataset to the canvas

image

 

And looked at the basic statistics of the data set

image

(Note that I find using the F# REPL a better way to explore the data, as I can just dot into each element I am interested in and view the results.)

In any event, the first question I want to answer is

“given a person’s ZipCode, Race, Party, Gender, and Age, can I predict if they will vote in November?”

To that end, I first narrowed down the columns using a Column Projection and picked only the columns I care about.  I picked “11/6/2012” as the Y variable because that was the last national election and that is what we are going to have in November.  I probably should have used 2010 because that is a national election without a President, but that can be analyzed at a later date.

image

image

I then ran my experiment so the data would be available in the Project Column step.

image

 

I then renamed the columns to make them a bit more readable by using a series of Metadata Editors.  (It does not look like you can do all of the renames in one step.  Equally annoying is that you have to add each module, run it, then add the next.)

image

(one example)

image

 

I then added a Missing Values scrubber for the voted column, so instead of a null field, people who didn’t vote get an “N”

image

The problem is that it doesn’t work –> looks like we can’t change the values per column.

image

I asked the question on the forum, but in the interest of time, I decided to change the voted column from a categorical column to an indicator so that I could do binary analysis.  That also failed.  I went back to the original spreadsheet, added an indicator column, and also renamed the column headers so I am not cluttering up my canvas with those metadata transforms.  Finally, I realized I want only active voters, but there does not seem to be a filtering ability (remove rows only works for missing values), so I removed those from the original dataset as well.  I think the ability to scrub and munge data is an area for improvement, but since this is release 1, I understand.

After re-importing the data, I changed my experiment like so

image

I then split the dataset into Training/Validation/And Testing using a 60/20/20 split

image

So the left point on the second split is 60% of the original dataset, and the right point on the second split is 20% of the original dataset (that is, 75%/25% of the 80% that came out of the first split).
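The arithmetic, spelled out (fractions taken from the text above):

let afterFirstSplit = 0.80                // the first split holds out 20% for final testing
let training = afterFirstSplit * 0.75     // = 0.60 of the original dataset
let validation = afterFirstSplit * 0.25   // = 0.20 of the original dataset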

I then added an SVM with a train and a score module.  Note that I am training with 60% of the original dataset and validating with 20%.

 

image

After it runs, there are 2 new columns in the dataset –> scored labels and probabilities, so each row now has a score.

 

image

With the model in place, I can then evaluate it using an Evaluate Model module

image

And we can see an AUC of .666, which immediately made me think of this

image

In any event, I added a Logistic Regression and a Boosted Decision Tree to the canvas and hooked them up to the training and validation sets

image

And this is what we have

image image

 

SVM: .666 AUC

Regression: .689 AUC

Boosted Decision Tree: .713 AUC

So with the Boosted Decision Tree ahead, I added a Sweep Parameters module to see if I could tune it more.  I am using AUC as the performance metric

image

image

So the best AUC I am going to get is .7134, with the highlighted parameters.  I then added one more model that uses those parameters against the entire training dataset (80% of the total) and then evaluates it against the remaining 20%.

image

With the final answer of

image

With that in hand, I can create a new experiment that will be the basis of a real-time voting app.

Fun with Statistics and Charts

I am preparing my Raleigh Code Camp submission “Nerd Dinner With Brains” this weekend.  If you are not familiar, Nerd Dinner is the canonical example of an MVC application and is very familiar to web devs who want to learn MVC the Microsoft way.  You can see the walkthrough here.   For everything that Nerd Dinner is, it is not … smart.  There are no business rules outside of some basic input validation, which is pretty representative of many “Boring Line Of Business Applications” (BLOBAs, according to Scott Wlaschin).  Not coincidentally, the lack of business logic is the biggest reason many BLOBAs don’t have many unit tests –> if all you are doing is wire-framing a database, what business logic needs to be tested?

The talk is going to take the Nerd Dinner wireframe and inject some analytics into the application.  To that end, I first considered the person who is attending the dinner.  All we know about them is their name and possibly their location.  So what can a name tell you?  Turns out, plenty.

As I showed in this post, there is a great source from the US census of the number of names given, by gender, yearOfBirth, and stateOfBirth.  Picking up where that post left off, I loaded the entire data set into memory.

My first question was, “given a name, can I tell what gender the person is?”  This is very straightforward to calculate:

let genderSearch name =
    let nameFilter = usaData
                     |> Seq.filter(fun r -> r.Mary = name)
                     |> Seq.groupBy(fun r -> r.F)
                     |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``))

    let nameSum = nameFilter |> Seq.sumBy(fun (n,c) -> c)
    nameFilter
    |> Seq.map(fun (n,c) -> n, c, float c/float nameSum)
    |> Seq.toArray

genderSearch "James"

And the REPL shows me that it is very likely that “James” is a male:

image

I can then set up in the web.config file a confidence point above which we treat a name as male/female; I am thinking 75%.  Once we have that, the app can respond differently.  Perhaps we have a product-placement advertisement that becomes male-focused if we are reasonably certain that the user is a male.  Perhaps we can be more subtle and change the theme of the site, or the page navigation, to induce the person to do additional things on the site.
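A minimal sketch of reading that confidence point from web.config (the appSetting key name here is hypothetical):

open System
open System.Configuration

// falls back to 0.75 if the setting is missing or unparseable
let genderConfidenceThreshold =
    match Double.TryParse(ConfigurationManager.AppSettings.["GenderConfidenceThreshold"]) with
    | true, v -> v
    | _ -> 0.75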

In any event, I then wanted to tackle age.  I spun up some code to isolate a person’s age

let ageSearch name =
    let nameFilter = usaData
                     |> Seq.filter(fun r -> r.Mary = name)
                     |> Seq.groupBy(fun r -> r.``1910``)
                     |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``))
                     |> Seq.toArray
    let nameSum = nameFilter |> Seq.sumBy(fun (n,c) -> c)
    nameFilter
    |> Seq.map(fun (n,c) -> n, c, float c/float nameSum)
    |> Seq.toArray

I had no idea if names have a certain age connotation, so I decided to do some basic charting.  Isaac Abraham pointed me to FSharp.Charting, which is a great way to do some basic charting for discovery:

let chartData = ageSearch "James"
                |> Seq.map(fun (y,c,p) -> y, c)
                |> Seq.sortBy(fun (y,c) -> y)

Chart.Line(chartData).ShowChart()

And sure enough, the name “James” has a real ebb and flow for its popularity.

image

So if the user has a name of “James”, you can make a reasonable assumption that they are male and probably born before 1975.  Cue up the Van Halen!

And yes, because I had to:

let chartData = ageSearch "Britney"
                |> Seq.map(fun (y,c,p) -> y, c)
                |> Seq.sortBy(fun (y,c) -> y)

image

Kinda does match her career, no?

Anyway, back to the task at hand.  In terms of analytics, I want to be a bit more precise than eyeballing a chart.  I started with the following code:

ageSearch "James"
|> Seq.map(fun (y,c,p) -> float c)
|> Seq.average

ageSearch "James"
|> Seq.map(fun (y,c,p) -> float c)
|> Seq.min

ageSearch "James"
|> Seq.map(fun (y,c,p) -> float c)
|> Seq.max

image

With these basic statistics out of the way, I then wanted to look at when the name was no longer popular.  I decided to use 1 standard deviation above the average to determine an outlier.  First, the standard deviation:

let variance (source:float seq) =
    let mean = Seq.average source
    let deltas = Seq.map(fun x -> pown(x-mean) 2) source
    Seq.average deltas

let standardDeviation(values:float seq) =
    sqrt(variance(values))

ageSearch "James"
|> Seq.map(fun (y,c,p) -> float c)
|> standardDeviation

let standardDeviation' = ageSearch "James"
                         |> Seq.map(fun (y,c,p) -> float c)
                         |> standardDeviation

let average = ageSearch "James"
              |> Seq.map(fun (y,c,p) -> float c)
              |> Seq.average

let attachmentPoint = average + standardDeviation'

image

And then I can get the last year that the name was more than 1 standard deviation above the average (greater than 71,180 names given):

let popularYears = ageSearch "James"
                   |> Seq.map(fun (y,c,p) -> y, float c)
                   |> Seq.filter(fun (y,c) -> c > attachmentPoint)
                   |> Seq.sortBy(fun (y,c) -> y)
                   |> Seq.last

image

So “James” is very likely a male and likely born before 1964.  Cue up the Pink Floyd!

The last piece was the state of birth –> can I guess the state of birth for a user?  I first looked at the states on a plot

let chartData' = stateSearch "James"
                 |> Seq.map(fun (s,c,p) -> s,c)

Chart.Column(chartData').ShowChart()

image

Nothing really stands out at me –> states with the most births have the most names.  I could do an academic exercise of seeing which states favor certain names, but that does not help me with Nerd Dinner in guessing the state of birth when given a name.

I pressed on to look at the top 10 states:

let topTenStates = stateSearch "James"
                   |> Seq.sortBy(fun (s,c,p) -> -c-1)
                   |> Seq.take 10

let topTenTotal = topTenStates
                  |> Seq.sumBy(fun (s,c,p) -> c)
let total = stateSearch "James"
            |> Seq.sumBy(fun (s,c,p) -> c)

float topTenTotal/float total

image

So 50% of “James” were born in 10 states.  Again, I am not sure there is any actionable information here.  For example, if a majority of “James” were born in MI, I might have something (cue up the Bob Seger). 

Interestingly, there are a certain number of names where the state of birth does matter.  For example, consider “Jose”:

image

Unsurprisingly, the two states are CA and TX.  Just using James and Jose as examples:

  • James is a male born before 1964
  • Jose is a male born before 2008 in either TX or CA

As an academic exercise, we could construct a random forest to find the names with the greatest state affinity.  However, that won’t help us on Nerd Dinner so I am leaving that out for another day.

This analysis does not account for a host of factors (person not born in the USA, nicknames, etc.), but it is still better than the nothing that Nerd Dinner currently has.  This analysis is not particularly sophisticated, but I often find that even the most basic statistics can be very powerful if used correctly.  That will be the next part of the talk…


Neural Networks

I picked up James McCaffrey’s Neural Networks Using C# a couple of weeks ago and decided to see if I could rewrite the code in F#.  Unfortunately, the source code is not available (as far as I could tell), so I did some C# then F# coding to see if I could get functional equivalence.

My first stop was chapter one.  I made the decision to get the F# code working for the sample data that McCaffrey provided first and then refactor it to a more general program that would work with inputs and values of different datasets.  My final upgrade will be use Deedle instead of any other data structure.  But first things first, I want to get the examples working so I fired up a script file and opened my REPL.

McCaffrey defines a sample dataset like this

string[] sourceData = new string[] {
    "Sex Age Locale Income Politics",
    "==============================================",
    "Male 25 Rural 63,000.00 Conservative",
    "Female 36 Suburban 55,000.00 Liberal",
    "Male 40 Urban 74,000.00 Moderate",
    "Female 23 Rural 28,000.00 Liberal" };

He then creates a parser for the comma-delimited string values into a double[][].  I just created the dataset as a List of tuples.

let chapter1TestData = [("Male",25.,"Rural",63000.00,"Conservative");
                        ("Female",36.,"Suburban",55000.00,"Liberal");
                        ("Male",40.,"Urban",74000.00,"Moderate");
                        ("Female",23.,"Rural",28000.00,"Liberal")]

 

I did try an implementation using a record type but, for reasons below, I am using tuples.  With the equivalent data loaded into the REPL, I tackled the first supporting function: MinMax.  Here is the C# code that McCaffrey wrote:

static void MinMaxNormal(double[][] data, int column)
{
    int j = column;
    double min = data[0][j];
    double max = data[0][j];
    for (int i = 0; i < data.Length; ++i)
    {
        if (data[i][j] < min) min = data[i][j];
        if (data[i][j] > max) max = data[i][j];
    }
    double range = max - min;
    if (range == 0.0) // ugly
    {
        for (int i = 0; i < data.Length; ++i)
            data[i][j] = 0.5;
        return;
    }
    for (int i = 0; i < data.Length; ++i)
        data[i][j] = (data[i][j] - min) / range;
}

and here is the equivalent F# code.

let minMax (fullSet, i) =
    let min = fullSet |> Seq.min
    let max = fullSet |> Seq.max
    (i-min)/(max-min)

 

Note that McCaffrey does not have any unit tests, but when I ran the dummy data through the F# implementation, the results matched his screen shots, so that will work well enough.  If you ever need a reason to use F#, consider those two code samples.  Granted, McCaffrey’s code is more abstract because it can normalize any column in the double array, but my counterpoint is that the function is really doing too much, and it is trivial in F# to pick a given column.  Is there any doubt about what the F# code is doing?  Is there any certainty about what the C# code is doing?
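To make that counterpoint concrete, picking a column out of a double[][] in F# is a one-liner (a sketch):

// pull column j out of a jagged array
let column j (data: float[][]) = data |> Array.map (fun row -> row.[j])

// e.g. normalizing column 1 with the minMax function above:
// let ages = column 1 data
// ages |> Array.map (fun a -> minMax(ages, a))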

In any event, moving along to the next functions, McCaffrey created two functions that do all of the encoding of the string values to appropriate numeric ones.  Depending on whether the value is an X value (independent) or a Y value (dependent), there is a different encoding scheme:

static string EffectsEncoding(int index, int N)
{
    // If N = 3 and index = 0 -> 1,0.
    // If N = 3 and index = 1 -> 0,1.
    // If N = 3 and index = 2 -> -1,-1.
    if (N == 2)
    // Special case.
    { if (index == 0) return "-1"; else if (index == 1) return "1"; }
    int[] values = new int[N - 1];
    if (index == N - 1)
    // Last item is all -1s.
    { for (int i = 0; i < values.Length; ++i) values[i] = -1; }
    else
    {
        values[index] = 1;
        // 0 values are already there.
    }
    string s = values[0].ToString();
    for (int i = 1; i < values.Length; ++i) s += "," + values[i];
    return s;
}

static string DummyEncoding(int index, int N)
{
    int[] values = new int[N]; values[index] = 1;
    string s = values[0].ToString();
    for (int i = 1; i < values.Length; ++i) s += "," + values[i];
    return s;
}

In my F# project, I decided to do domain-specific encoding.  I plan to refactor this to something more abstract; there is a sketch of one possible shape right after the listing below.

//Transform Sex
let testData' = chapter1TestData |> Seq.map(fun (s,a,l,i,p) -> match s with
                                                               | "Male" -> -1.0,a,l,i,p
                                                               | "Female" -> 1.0,a,l,i,p
                                                               | _ -> failwith "Invalid sex")
//Normalize Age
let testData'' =
    let fullSet = testData' |> Seq.map(fun (s,a,l,i,p) -> a)
    testData' |> Seq.map(fun (s,a,l,i,p) -> s,minMax(fullSet,a),l,i,p)

//Transform Locale
let testData''' = testData'' |> Seq.map(fun (s,a,l,i,p) -> match l with
                                                           | "Rural" -> s,a,1.,0.,i,p
                                                           | "Suburban" -> s,a,0.,1.,i,p
                                                           | "Urban" -> s,a,-1.,-1.,i,p
                                                           | _ -> failwith "Invalid locale")
//Transform and Normalize Income
let testData'''' =
    let fullSet = testData''' |> Seq.map(fun (s,a,l0,l1,i,p) -> i)
    testData''' |> Seq.map(fun (s,a,l0,l1,i,p) -> s,a,l0,l1,minMax(fullSet,i),p)

//Transform Politics
let testData''''' = testData'''' |> Seq.map(fun (s,a,l0,l1,i,p) -> match p with
                                                                   | "Conservative" -> s,a,l0,l1,i,1.,0.,0.
                                                                   | "Liberal" -> s,a,l0,l1,i,0.,1.,0.
                                                                   | "Moderate" -> s,a,l0,l1,i,0.,0.,1.
                                                                   | _ -> failwith "Invalid politics")
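As a first pass at that more abstract refactor, here is a sketch of a general effects encoder in the spirit of McCaffrey’s EffectsEncoding (the function is mine, not from the book); the domain-specific pipeline above is what the rest of this post actually uses:

let effectsEncode (categories: string list) (value: string) =
    let n = List.length categories
    match List.tryFindIndex ((=) value) categories with
    | None -> failwithf "Invalid value: %s" value
    | Some i when n = 2 -> if i = 0 then [-1.0] else [1.0]        // two-category special case
    | Some i when i = n - 1 -> List.replicate (n - 1) (-1.0)      // last category encodes as all -1s
    | Some i -> [for j in 0 .. n - 2 -> if j = i then 1.0 else 0.0]

// effectsEncode ["Rural";"Suburban";"Urban"] "Urban" -> [-1.0; -1.0]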

When I execute the script:

image

Which is the same as McCaffrey’s.

image

Note that he used Gaussian normalization on column 2 and I did Min/Max based on his advice in the book.

 

 

TRINUG F# Analytics Prep: Part 2

I finished up the second part of the F#/analytics lab scheduled for August.  It is a continuation of going through Astborg’s F# for Quantitative Finance that we started last month.  Here is my first blog post on it.

In this lab, we are going to tackle the more advanced statistical calculations: the Black-Scholes formula, the Greeks, and Monte Carlo simulation.  Using the same solution and projects, I started a script file to figure out the Black-Scholes formula.  Astborg uses a couple of supporting functions, which I knocked out first: power and cumulativeDistribution.  I first created his function verbatim like this:

let pow x n = exp(n*log(x))

and then refactored it to make it more readable like this

let power baseNumber exponent = exp(exponent * log(baseNumber))

and then I realized it is essentially the exponentiation that already ships with FSharp.Core: pown for integer exponents and the ( ** ) operator for float exponents.

image
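The identity behind pow is exp(n * log x) = x ** n, which is easy to check in the REPL (the values are just examples):

let power baseNumber exponent = exp(exponent * log(baseNumber))

power 2.0 3.0   // 8.0 (within floating-point error)
2.0 ** 3.0      // 8.0
pown 2.0 3      // 8.0; note pown takes an int exponent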

In any event, I then attacked the cumulativeDistribution function.  I downloaded the source from his website and then refactored it so that each step is clearly laid out.  Here is the refactored function:

// a polynomial approximation of the standard normal CDF
let cumulativeDistribution (x) =
    let a1 =  0.31938153
    let a2 = -0.356563782
    let a3 =  1.781477937
    let a4 = -1.821255978
    let a5 =  1.330274429
    let pi = 3.141592654
    let l  = abs(x)
    let k  = 1.0 / (1.0 + 0.2316419 * l)

    let a1' = a1*k
    let a2' = a2*k*k
    let a3' = a3*(power k 3.0)
    let a4' = a4*(power k 4.0)
    let a5' = a5*(power k 5.0)
    let w1 = 1.0/sqrt(2.0*pi)
    let w2 = exp(-l*l/2.0)
    let w3 = a1'+a2'+a3'+a4'+a5'
    let w  = 1.0-w1*w2*w3
    if x < 0.0 then 1.0 - w else w

And here are some test values from the REPL:

image

Finally, the Black Scholes formula.  I did create a separate POCO for the input data like this:

type putCallFlag = Put | Call

type blackScholesInputData =
    {stockPrice:float;
     strikePrice:float;
     timeToExpiry:float;
     interestRate:float;
     volatility:float}

And I refactored his code to make it more readable like this:

let blackScholes(inputData:blackScholesInputData, putCallFlag:putCallFlag) =
    let sx = log(inputData.stockPrice / inputData.strikePrice)
    let rv = inputData.interestRate+inputData.volatility*inputData.volatility*0.5
    let rvt = rv*inputData.timeToExpiry
    let vt = (inputData.volatility*sqrt(inputData.timeToExpiry))
    let d1 = (sx + rvt)/vt
    let d2 = d1-vt

    match putCallFlag with
    | Put ->
        let xrt = inputData.strikePrice*exp(-inputData.interestRate*inputData.timeToExpiry)
        let cdD1 = xrt*cumulativeDistribution(-d2)
        let cdD2 = inputData.stockPrice*cumulativeDistribution(-d1)
        cdD1-cdD2
    | Call ->
        let xrt = inputData.strikePrice*exp(-inputData.interestRate*inputData.timeToExpiry)
        let cdD1 = inputData.stockPrice*cumulativeDistribution(d1)
        let cdD2 = xrt*cumulativeDistribution(d2)
        cdD1-cdD2

And since I was in the script environment, I put in test data that matches the sample that Astborg used in the book:

let inputData = {stockPrice=58.60;strikePrice=60.;timeToExpiry=0.5;interestRate=0.01;volatility=0.3}
let runBSCall = blackScholes(inputData,Call)
let runBSPut = blackScholes(inputData,Put)

And voila, the results match the book:

image

With the Black-Scholes out of the way, I then implemented the Greeks.  Note that I did add helper functions for clarity, and the results match the book:

let blackScholesDelta (inputData:blackScholesInputData, putCallFlag:putCallFlag) =
    let sx = log(inputData.stockPrice / inputData.strikePrice)
    let rv = inputData.interestRate+inputData.volatility*inputData.volatility*0.5
    let rvt = rv*inputData.timeToExpiry
    let vt = (inputData.volatility*sqrt(inputData.timeToExpiry))
    let d1 = (sx + rvt)/vt
    match putCallFlag with
    | Put -> cumulativeDistribution(d1) - 1.0
    | Call -> cumulativeDistribution(d1)

let deltaPut = blackScholesDelta(inputData, Put)
let deltaCall = blackScholesDelta(inputData, Call)

// normalDistribution is defined elsewhere in the project (not shown in this post)
let blackScholesGamma (inputData:blackScholesInputData) =
    let sx = log(inputData.stockPrice / inputData.strikePrice)
    let rv = inputData.interestRate+inputData.volatility*inputData.volatility*0.5
    let rvt = rv*inputData.timeToExpiry
    let vt = (inputData.volatility*sqrt(inputData.timeToExpiry))
    let d1 = (sx + rvt)/vt
    normalDistribution.Density(d1)

let gamma = blackScholesGamma(inputData)

let blackScholesVega (inputData:blackScholesInputData) =
    let sx = log(inputData.stockPrice / inputData.strikePrice)
    let rv = inputData.interestRate+inputData.volatility*inputData.volatility*0.5
    let rvt = rv*inputData.timeToExpiry
    let vt = (inputData.volatility*sqrt(inputData.timeToExpiry))
    let d1 = (sx + rvt)/vt
    inputData.stockPrice*normalDistribution.Density(d1)*sqrt(inputData.timeToExpiry)

let vega = blackScholesVega(inputData)

let blackScholesTheta (inputData:blackScholesInputData, putCallFlag:putCallFlag) =
    let sx = log(inputData.stockPrice / inputData.strikePrice)
    let rv = inputData.interestRate+inputData.volatility*inputData.volatility*0.5
    let rvt = rv*inputData.timeToExpiry
    let vt = (inputData.volatility*sqrt(inputData.timeToExpiry))
    let d1 = (sx + rvt)/vt
    let d2 = d1-vt
    match putCallFlag with
    | Put ->
        let ndD1 = inputData.stockPrice*normalDistribution.Density(d1)*inputData.volatility
        let ndD1' = ndD1/(2.0*sqrt(inputData.timeToExpiry))
        let rx = inputData.interestRate*inputData.strikePrice
        let rt = exp(-inputData.interestRate*inputData.timeToExpiry)
        let cdD2 = rx*rt*cumulativeDistribution(-d2)
        -(ndD1')+cdD2
    | Call ->
        let ndD1 = inputData.stockPrice*normalDistribution.Density(d1)*inputData.volatility
        let ndD1' = ndD1/(2.0*sqrt(inputData.timeToExpiry))
        let rx = inputData.interestRate*inputData.strikePrice
        let rt = exp(-inputData.interestRate*inputData.timeToExpiry)
        let cdD2 = cumulativeDistribution(d2)
        -(ndD1')-rx*rt*cdD2

let thetaPut = blackScholesTheta(inputData, Put)
let thetaCall = blackScholesTheta(inputData, Call)

let blackScholesRho (inputData:blackScholesInputData, putCallFlag:putCallFlag) =
    let sx = log(inputData.stockPrice / inputData.strikePrice)
    let rv = inputData.interestRate+inputData.volatility*inputData.volatility*0.5
    let rvt = rv*inputData.timeToExpiry
    let vt = (inputData.volatility*sqrt(inputData.timeToExpiry))
    let d1 = (sx + rvt)/vt
    let d2 = d1-vt
    match putCallFlag with
    | Put ->
        let xt = inputData.strikePrice*inputData.timeToExpiry
        let rt = exp(-inputData.interestRate*inputData.timeToExpiry)
        -xt*rt*cumulativeDistribution(-d2)
    | Call ->
        let xt = inputData.strikePrice*inputData.timeToExpiry
        let rt = exp(-inputData.interestRate*inputData.timeToExpiry)
        xt*rt*cumulativeDistribution(d2)

let rhoPut = blackScholesRho(inputData, Put)
let rhoCall = blackScholesRho(inputData, Call)

 

image
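One thing that jumps out of the listing above is that every function recomputes d1 and d2; a small helper could factor that out (a sketch against the blackScholesInputData type above):

let dTerms (inputData:blackScholesInputData) =
    let sx = log(inputData.stockPrice / inputData.strikePrice)
    let rv = inputData.interestRate + inputData.volatility * inputData.volatility * 0.5
    let vt = inputData.volatility * sqrt(inputData.timeToExpiry)
    let d1 = (sx + rv * inputData.timeToExpiry) / vt
    d1, d1 - vt

// blackScholesDelta, for example, collapses to:
let blackScholesDelta' (inputData, putCallFlag) =
    let d1, _ = dTerms inputData
    match putCallFlag with
    | Put -> cumulativeDistribution(d1) - 1.0
    | Call -> cumulativeDistribution(d1)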

Finally, I threw in the Monte Carlo, which also used a POCO:

type monteCarloInputData =
    {stockPrice:float;
     strikePrice:float;
     timeToExpiry:float;
     interestRate:float;
     volatility:float}

let priceAtMaturity (inputData:monteCarloInputData, randomValue:float) =
    let s = inputData.stockPrice
    let rv = (inputData.interestRate-inputData.volatility*inputData.volatility/2.0)
    let rvt = rv*inputData.timeToExpiry
    let vr = inputData.volatility*randomValue
    let t = sqrt(inputData.timeToExpiry)
    s*exp(rvt+vr*t)

let maturityPriceInputData = {stockPrice=58.60;strikePrice=60.0;timeToExpiry=0.5;interestRate=0.01;volatility=0.3}
priceAtMaturity(maturityPriceInputData, 10.0)

let monteCarlo(inputData: monteCarloInputData, randomValues:seq<float>) =
    randomValues
    |> Seq.map(fun randomValue -> priceAtMaturity(inputData,randomValue) - inputData.strikePrice)
    |> Seq.average

let random = new System.Random()
let rnd() = random.NextDouble()
let data = [for i in 1 .. 1000 -> rnd() * 1.0]

let monteCarloInputData = {stockPrice=58.60;strikePrice=60.0;timeToExpiry=0.5;interestRate=0.01;volatility=0.3;}
monteCarlo(monteCarloInputData,data)

image

One thing I really like about Astborg’s approach is that the Monte Carlo function does not new up the array of random numbers; rather, they are passed in.  This makes the function much more testable and is the right way to write it (IMHO).  In fact, I think that seeing “new Random()” or “DateTime.Now” hard-coded into functions is an anti-pattern that is all too common.

With the last of the functions done in the script file, I moved them into the .fs file and created covering unit tests based on the sample data that I ran in the REPL:

[TestMethod]
public void PowerUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = 8;
    Double actual = Math.Round(calculations.Power(2.0, 3.0), 0);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void CumulativeDistributionUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = .84134;
    Double actual = Math.Round(calculations.CumulativeDistribution(1.0), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesCallUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = 4.4652;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholes(inputData, PutCallFlag.Call), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesPutUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = 5.56595;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholes(inputData, PutCallFlag.Put), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void DaysToYearsUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = .08214;
    Double actual = Math.Round(calculations.DaysToYears(30), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesDeltaCallUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = .50732;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholesDelta(inputData, PutCallFlag.Call), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesDeltaPutUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = -.49268;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholesDelta(inputData, PutCallFlag.Put), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesGammaUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = .39888;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholesGamma(inputData), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesVegaUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = 16.52798;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholesVega(inputData), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesThetaCallUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = -5.21103;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholesTheta(inputData, PutCallFlag.Call), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesThetaPutUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = -4.61402;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholesTheta(inputData, PutCallFlag.Put), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesRhoCallUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = 12.63174;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholesRho(inputData, PutCallFlag.Call), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesRhoPutUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = -17.21863;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholesRho(inputData, PutCallFlag.Put), 5);
    Assert.AreEqual(expected, actual);
}
  128.  
  129. [TestMethod]
  130. public void PriceAtMaturityUsingValidData_ReturnsExpected()
  131. {
  132.     var calculations = new Calculations();
  133.     Double expected = 480.36923;
  134.     var inputData = new MonteCarloInputData(58.6, 60.0, .5, .01, .3);
  135.     Double actual = Math.Round(calculations.PriceAtMaturity(inputData, 10.0), 5);
  136.     Assert.AreEqual(expected, actual);
  137. }
  138.  
  139. [TestMethod]
  140. public void MonteCarloUsingValidData_ReturnsExpected()
  141. {
  142.     var calculations = new Calculations();
  143.     var inputData = new MonteCarloInputData(58.6, 60.0, .5, .01, .3);
  144.     var random = new System.Random();
  145.     List<Double> randomData = new List<double>();
  146.     for (int i = 0; i < 1000; i++)
  147.     {
  148.         randomData.Add(random.NextDouble());
  149.     }
  150.  
  151.     Double actual = Math.Round(calculations.MonteCarlo(inputData, randomData), 5);
  152.     var greaterThanFour = actual > 4.0;
  153.     var lessThanFive = actual < 5.0;
  154.  
  155.     Assert.AreEqual(true, greaterThanFour);
  156.     Assert.AreEqual(true, lessThanFive);
  157. }

 

With all of the tests running green, I then turned my attention to the UI.  I created more real estate on the MainWindow and added some additional data structures for the results of the analytics that lend themselves to charting and graphing.  For example:

public class GreekData
{
    public Double StrikePrice { get; set; }
    public Double DeltaCall { get; set; }
    public Double DeltaPut { get; set; }
    public Double Gamma { get; set; }
    public Double Vega { get; set; }
    public Double ThetaCall { get; set; }
    public Double ThetaPut { get; set; }
    public Double RhoCall { get; set; }
    public Double RhoPut { get; set; }
}

And in the code behind of the MainWindow, I added some calcs based on the prior code that was already in it:

var theGreeks = new List<GreekData>();
for (int i = 0; i < 5; i++)
{
    var greekData = new GreekData();
    greekData.StrikePrice = closestDollar - i;
    theGreeks.Add(greekData);
    greekData = new GreekData();
    greekData.StrikePrice = closestDollar + i;
    theGreeks.Add(greekData);
}
theGreeks.Sort((greek1, greek2) => greek1.StrikePrice.CompareTo(greek2.StrikePrice));

foreach (var greekData in theGreeks)
{
    var inputData =
        new BlackScholesInputData(adjustedClose, greekData.StrikePrice, .5, .01, .3);
    greekData.DeltaCall = calculations.BlackScholesDelta(inputData, PutCallFlag.Call);
    greekData.DeltaPut = calculations.BlackScholesDelta(inputData, PutCallFlag.Put);
    greekData.Gamma = calculations.BlackScholesGamma(inputData);
    greekData.RhoCall = calculations.BlackScholesRho(inputData, PutCallFlag.Call);
    greekData.RhoPut = calculations.BlackScholesRho(inputData, PutCallFlag.Put);
    greekData.ThetaCall = calculations.BlackScholesTheta(inputData, PutCallFlag.Call);
    greekData.ThetaPut = calculations.BlackScholesTheta(inputData, PutCallFlag.Put);
    greekData.Vega = calculations.BlackScholesVega(inputData);
}

this.TheGreeksDataGrid.ItemsSource = theGreeks;

var blackScholes = new List<BlackScholesData>();
for (int i = 0; i < 5; i++)
{
    var blackScholesData = new BlackScholesData();
    blackScholesData.StrikePrice = closestDollar - i;
    blackScholes.Add(blackScholesData);
    blackScholesData = new BlackScholesData();
    blackScholesData.StrikePrice = closestDollar + i;
    blackScholes.Add(blackScholesData);
}
blackScholes.Sort((bsmc1, bsmc2) => bsmc1.StrikePrice.CompareTo(bsmc2.StrikePrice));

var random = new System.Random();
List<Double> randomData = new List<double>();
for (int i = 0; i < 1000; i++)
{
    randomData.Add(random.NextDouble());
}

foreach (var blackScholesMonteCarlo in blackScholes)
{
    var blackScholesInputData =
        new BlackScholesInputData(adjustedClose, blackScholesMonteCarlo.StrikePrice, .5, .01, .3);
    var monteCarloInputData =
        new MonteCarloInputData(adjustedClose, blackScholesMonteCarlo.StrikePrice, .5, .01, .3);

    blackScholesMonteCarlo.Call = calculations.BlackScholes(blackScholesInputData, PutCallFlag.Call);
    blackScholesMonteCarlo.Put = calculations.BlackScholes(blackScholesInputData, PutCallFlag.Put);
    blackScholesMonteCarlo.MonteCarlo = calculations.MonteCarlo(monteCarloInputData, randomData);
}

this.BlackScholesDataGrid.ItemsSource = blackScholes;

And Whammo, the UI.

 

image

Fortunately, Conrad D’Cruz, a member of TRINUG and an options trader, is going to explain what the heck we are looking at when the SIG gets together again.

 

Using Subsets for Association Rule Learning

I finished up writing the association rule program from MSDN in F# last week.  One of the things bothering me about the way I implemented the algorithms is that I hard-coded the combinations (antecedent and consequent) from the item-sets:

static member GetCombinationsForDouble(itemSet: int[]) =
    let combinations =  new List<int[]*int[]*int[]>()
    combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1]|])
    combinations

static member GetCombinationsForTriple(itemSet: int[]) =
    let combinations =  new List<int[]*int[]*int[]>()
    combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1];itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[1]|],[|itemSet.[0];itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[2]|],[|itemSet.[0];itemSet.[1]|])
    combinations.Add(itemSet, [|itemSet.[0];itemSet.[1]|],[|itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[0];itemSet.[2]|],[|itemSet.[1]|])
    combinations.Add(itemSet, [|itemSet.[1];itemSet.[2]|],[|itemSet.[0]|])
    combinations

I thought it would be a fun exercise to make a function that returns the combinations for an item-set of any length.  My first several attempts failed because I started off with the wrong vocabulary: I spent several days trying to figure out how to create all of the combinations and/or permutations from the itemSet.  It then hit me that what I really wanted was all subsets, and what do you know, there are some excellent examples out there.

Since I was going to use the yield and yield-bang (yield!) method of calculating the subsets inside my class, I first needed to remove the rec keyword and just let the class call itself.

static member Subsets s =
    set [
        yield s
        for e in s do
            yield! AssociationRuleProgram2.Subsets (Set.remove e s) ]
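A quick FSI sanity check (my own) of what Subsets returns:

AssociationRuleProgram2.Subsets (set [3; 4; 7])
// val it : Set<Set<int>> = all eight subsets of {3;4;7},
// including the empty set and the full set itself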

I then needed a way of translating the itemSet, which is an int array, into a set and back again.  Fortunately, the Set module has ofArray and toArray functions, so I wrote my code exactly the way I just described the problem:

static member GetAntcentAndConsequent(itemSet: int[]) =
    let combinations =  new List<int[]*int[]*int[]>()
    let itemSet' = Set.ofArray itemSet
    let subSets = AssociationRuleProgram2.Subsets itemSet'
    let subSets' = Set.toArray subSets
    let subSets'' = subSets' |> Array.map(fun s -> Set.toArray s)
    let subSets''' = subSets'' |> Array.map(fun s -> Seq.toArray s, AssociationRuleProgram2.GetAntcentAndConsequent s)

 

Note that I had to call toArray twice because Subsets returns a Set<Set<int>>.

In any event, I then needed a way of splitting the itemSet into antecedents and consequents (together called a combination) based on the current subset.  I toyed around with a couple of different ways of solving the problem before I stumbled upon one that makes a lot of sense to me.  I changed the itemSet from an array of int to an array of int*bool tuples: if the item is in the subset, the bool flag is true; if not, it is false.  Then I apply a Seq.filter to the array and separate it out into antecedents and consequents.

static member GetCombination array subArray =
    let array' = array |> Seq.map(fun i -> i, subArray |> Array.exists(fun j -> i = j))
    let antecedent = array' |> Seq.filter(fun (i,j) -> j = true) |> Seq.toArray
    let consquent = array' |> Seq.filter(fun (i,j) -> j = false) |> Seq.toArray
    let antecedent' = antecedent |> Seq.map(fun (i,j) -> i)
    let consquent' = consquent |> Seq.map(fun (i,j) -> i)
    Seq.toArray antecedent', Seq.toArray consquent'
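A quick FSI check (my own call) of the split, using McCaffrey’s 3,4,7 item-set with {4,7} as the subset:

AssociationRuleProgram2.GetCombination [|3; 4; 7|] [|4; 7|]
// val it : int [] * int [] = ([|4; 7|], [|3|])
// antecedent {4,7}, consequent {3}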

The major downside of this approach is that I am using Array.exists for my filter flag, so if the same value appears more than once in the itemSet, it does not work.  However, the original example had each itemSet containing unique values, so I think I am OK.

So with these two methods, I now have a way of dealing with item-sets of any length.  Interestingly, the amount of code (even with my verbose F#) is significantly less than the C# equivalent and much closer to how I actually think.
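As a sketch of how the two methods might compose (my own wiring, assuming both members live on AssociationRuleProgram2 as above): take every proper, non-empty subset of the item-set and feed it to GetCombination:

// For each proper, non-empty subset of the item-set, produce an
// antecedent/consequent pair. A sketch only; the name is my own.
let getAllCombinations (itemSet: int[]) =
    AssociationRuleProgram2.Subsets (Set.ofArray itemSet)
    |> Set.toArray
    |> Array.map Set.toArray
    |> Array.filter (fun s -> s.Length > 0 && s.Length < itemSet.Length)
    |> Array.map (fun s -> AssociationRuleProgram2.GetCombination itemSet s)

For [|3; 4; 7|], this yields the same six antecedent/consequent pairs that the hard-coded GetCombinationsForTriple produced.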

 

 

 

Association Rule Problem: Part 3

After spending a couple of weeks working through the imperative code, I decided to approach the problem from an F#/functional point of view.  Going back to the original article, there are several steps that McCaffrey walks through:

  • Get a series of transactions
  • Get the frequent item-sets for the transactions
  • For each item-set, get all possible combinations.  Each combination is broken into an antecedent and consequent
  • Count the frequency of each antecedent across all transactions
  • If the confidence of the combination (the item-set’s frequency divided by its antecedent’s frequency) is greater than the minimum confidence level, include it in the final set (see the worked check just below)
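
To make that last bullet concrete, here is a worked check using the counts from McCaffrey’s example (the same counts the unit tests later in this post verify): the item-set {3,4,7} appears in 3 of the 10 transactions, and the antecedent {3} appears in 6 of them, so the candidate rule {3} -> {4,7} carries a confidence of:

// item-set count divided by antecedent count, per the last bullet
let confidence = float 3 / float 6   // 0.5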

For the purposes of this article, Step #1 and Step #2 were already done; my code starts with Step #3.  Instead of for..eaching and if..thening my way through the item-sets, I decided to look at how permutations and combinations are done in F#.  Interestingly, one of the first articles on permutations and combinations that Google turns up is from McCaffrey in MSDN from four years ago.  Unfortunately, that article was of limited use because the code is decidedly non-functional, so it might as well have been written in C# (this was pointed out in the comments).  Going to Stack Overflow, there are plenty of good examples of getting combinations in F# there and elsewhere.  After playing with the code samples for a bit (my favorite one was this), it hit me that the ordinal positions are the same for any array of a given size.  Going back to McCaffrey’s example, there are only item-sets of length 2 and 3.  Therefore, I can hard-code the results and leave the general calculation for another time.

static member GetCombinationsForDouble(itemSet: int[]) =
    let combinations =  new List<int[]*int[]*int[]>()
    combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1]|])
    combinations

static member GetCombinationsForTriple(itemSet: int[]) =
    let combinations =  new List<int[]*int[]*int[]>()
    combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1];itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[1]|],[|itemSet.[0];itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[2]|],[|itemSet.[0];itemSet.[1]|])
    combinations.Add(itemSet, [|itemSet.[0];itemSet.[1]|],[|itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[0];itemSet.[2]|],[|itemSet.[1]|])
    combinations.Add(itemSet, [|itemSet.[1];itemSet.[2]|],[|itemSet.[0]|])
    combinations

I used a tuple to carry the item-set along with its antecedent array and consequent array.  I then spun up a unit test to compare results against McCaffrey’s detailed example:

[TestMethod]
public void GetValuesForATriple_ReturnsExpectedValue()
{
    var expected = new List<Tuple<int[], int[]>>();
    expected.Add(Tuple.Create<int[], int[]>(new int[1] { 3 }, new int[2] { 4, 7 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[1] { 4 }, new int[2] { 3, 7 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[1] { 7 }, new int[2] { 3, 4 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[2] { 3, 4 }, new int[1] { 7 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[2] { 3, 7 }, new int[1] { 4 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[2] { 4, 7 }, new int[1] { 3 }));

    var itemSet = new int[3] { 3, 4, 7 };
    var actual = FS.AssociationRuleProgram2.GetCombinationsForTriple(itemSet);

    Assert.AreEqual(expected.Count, actual.Count);
}

A couple of things to note about the unit test:

1) The rules about variable naming and whatnot that apply in business application development quickly fall down when applied to scientific computing.  For example, there is no way that this

List<Tuple<int[], int[]>> expected = new List<Tuple<int[], int[]>>();

is more readable than this

var expected = new List<Tuple<int[], int[]>>();

In fact, it is less readable.  The use of complex data structures and algorithms forces a different set of naming conventions.  Applying FxCop or other framework naming conventions to scientific programming is as useful as applying scientific naming conventions to framework development.  If it is a screw, use a screwdriver.  If it is a nail, use a hammer…

2) I don’t have a good way of comparing a tuple of paired arrays for equivalence; there is certainly nothing out of the box in Microsoft.VisualStudio.TestTools.UnitTesting.  I toyed (briefly) with creating a method to compare arrays for equivalence, but I did not in the interest of time.  That would be a welcome addition to the testing namespace.
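For what it is worth, F# itself side-steps this particular problem: tuples and arrays compare structurally, so had these tests been written in F#, a plain equality check would have sufficed.  A quick FSI check:

([|3; 4|], [|7|]) = ([|3; 4|], [|7|])   // true: F# tuples and arrays use structural equality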

Sure enough, the unit tests using McCaffrey’s data all ran green.

With step 3 knocked out, I now needed to determine the frequency of the antecedent in the transactions list.  This step is better broken down into a couple of sub-steps.  I used McCaffrey’s detailed example of 3,4,7 as proof of correctness in my unit tests:

image

I need a way of taking the antecedent of 3 and comparing it to all transactions (which are arrays) to see how often it appears.  As an additional layer of complexity, that 3 is not an int, it is an array (albeit an array of one).  I could not find an equivalent question on StackOverflow (meaning I am probably asking the wrong question), so I went ahead and made a mental model where I map the tryFindIndex function against each item of the subset to see if that value is in the original set.  The result is a tuple with the original value and the ordinal position in the set.  The key thing is that if the item was not found, it returns None.  So I just have to filter on that flag, and if the result of the filter has a length greater than zero, I know that something was not found and the function can return false.
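A quick FSI illustration (my own) of the Some/None behavior that drives the flag:

[|1; 3; 4; 7|] |> Seq.tryFindIndex (fun j -> j = 3)   // Some 1: found at ordinal position 1
[|1; 4; 7|] |> Seq.tryFindIndex (fun j -> j = 3)      // None: 3 is not in the set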

image


In code, it pretty much looks like the way I just described it:

static member SetContainsSubset(set: int[], subset: int[]) =
    let notIncluded = subset
                        |> Seq.map(fun i -> i, set |> Seq.tryFindIndex(fun j -> j = i))
                        |> Seq.filter(fun (i,j) -> j = None)
                        |> Seq.toArray
    if notIncluded.Length > 0 then false else true

And I generated my unit tests out of the example too: 

[TestMethod]
public void SetContainsSubsetUsingMatched_ReturnsTrue()
{
    var set = new int[4] { 1, 3, 4, 7 };
    var subset = new int[3] { 3, 4, 7 };

    Boolean expected = true;
    Boolean actual = FS.AssociationRuleProgram2.SetContainsSubset(set, subset);

    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void SetContainsSubsetUsingUnMatched_ReturnsFalse()
{
    var set = new int[3] { 1, 4, 7 };
    var subset = new int[3] { 3, 4, 7 };

    Boolean expected = false;
    Boolean actual = FS.AssociationRuleProgram2.SetContainsSubset(set, subset);

    Assert.AreEqual(expected, actual);
}

With this supporting function ready, I can then apply it to an array and see how many trues I get; that is the Count value in Figure 2 of the article.  Seq.map fits this task perfectly.

static member ItemSetCountInTransactions(itemSet: int[], transactions: List<int[]>) =
    transactions
        |> Seq.map(fun t -> t, AssociationRuleProgram2.SetContainsSubset(t,itemSet))
        |> Seq.filter(fun (t,f) -> f = true)
        |> Seq.length

And the subsequent unit test also runs green:

[TestMethod]
public void CountItemSetInTransactions_ReturnsExpected()
{
    List<int[]> transactions = new List<int[]>();
    transactions.Add(new int[] { 0, 3, 4, 11 });
    transactions.Add(new int[] { 1, 4, 5 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 0, 5 });
    transactions.Add(new int[] { 3, 5, 9 });
    transactions.Add(new int[] { 2, 3, 4, 7 });
    transactions.Add(new int[] { 2, 5, 8 });
    transactions.Add(new int[] { 0, 1, 2, 5, 10 });
    transactions.Add(new int[] { 2, 3, 5, 6, 7, 9 });

    var itemSet = new int[1] { 3 };

    Int32 expected = 6;
    Int32 actual = FS.AssociationRuleProgram2.ItemSetCountInTransactions(itemSet, transactions);

    Assert.AreEqual(expected, actual);
}

So with this in place, I am ready for the next column, the confidence column.  McCaffrey used a numerator of 3, which is shown here:

image

So I assume that this count is the number of times 3,4,7 shows up in the transaction set.  If so, the supporting function ItemSetCountInTransactions can be reused.  I created a unit test and it ran green:

[TestMethod]
public void CountItemSetInTransactionsUsing347_ReturnsThree()
{
    List<int[]> transactions = new List<int[]>();
    transactions.Add(new int[] { 0, 3, 4, 11 });
    transactions.Add(new int[] { 1, 4, 5 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 0, 5 });
    transactions.Add(new int[] { 3, 5, 9 });
    transactions.Add(new int[] { 2, 3, 4, 7 });
    transactions.Add(new int[] { 2, 5, 8 });
    transactions.Add(new int[] { 0, 1, 2, 5, 10 });
    transactions.Add(new int[] { 2, 3, 5, 6, 7, 9 });

    var itemSet = new int[3] { 3, 4, 7 };

    Int32 expected = 3;
    Int32 actual = FS.AssociationRuleProgram2.ItemSetCountInTransactions(itemSet, transactions);

    Assert.AreEqual(expected, actual);
}

So the last piece was to put it all together in the GetHighConfRules method.  I did not change the signature:

static member GetHighConfRules(frequentItemSets: List<int[]>, transactions: List<int[]>, minConfidencePct:float) =
    let returnValue = new List<Rule>()
    let combinations = frequentItemSets |> Seq.collect (fun a -> AssociationRuleProgram2.GetCombinations(a))
    combinations
        |> Seq.map(fun (i,a,c) -> i,a,c,AssociationRuleProgram2.ItemSetCountInTransactions(i,transactions))
        |> Seq.map(fun (i,a,c,fisc) -> a,c,fisc,AssociationRuleProgram2.ItemSetCountInTransactions(a,transactions))
        |> Seq.map(fun (a,c,fisc,cc) -> a,c,float fisc/float cc)
        |> Seq.filter(fun (a,c,cp) -> cp > minConfidencePct)
        |> Seq.iter(fun (a,c,cp) -> returnValue.Add(new Rule(a,c,cp)))
    returnValue
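
One note on that listing: the Rule type is not defined in this post; it comes from the surrounding project.  A minimal, hypothetical stand-in (my own, since GetHighConfRules only needs a constructor taking the antecedent, the consequent, and the confidence percentage) would be:

// Hypothetical stand-in for the project's Rule type; the real one lives elsewhere.
type Rule(antecedent: int[], consequent: int[], confidencePct: float) =
    member this.Antecedent = antecedent
    member this.Consequent = consequent
    member this.ConfidencePct = confidencePct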

 

Note that I did add a helper function to get the combinations based on the length of the array:

static member GetCombinations(itemSet: int[]) =
    if itemSet.Length = 2 then AssociationRuleProgram2.GetCombinationsForDouble(itemSet)
    else AssociationRuleProgram2.GetCombinationsForTriple(itemSet)

And when I run that from the console:

image

So this is pretty close.  McCaffrey allows for inversion of the numbers in the array (3:4 is not the same as 4:3) and I do not, but his supporting detail does not show that, so I am not sure which is the correct answer.  In any event, this is pretty good.  The F# code can be refactored so that all combinations can be generated from an array of any length.  In the meantime, here are all 43 lines of the program.

open System
open System.Collections.Generic

type AssociationRuleProgram2 =

    static member GetHighConfRules(frequentItemSets: List<int[]>, transactions: List<int[]>, minConfidencePct:float) =
        let returnValue = new List<Rule>()
        let combinations = frequentItemSets |> Seq.collect (fun a -> AssociationRuleProgram2.GetCombinations(a))
        combinations
            |> Seq.map(fun (i,a,c) -> i,a,c,AssociationRuleProgram2.ItemSetCountInTransactions(i,transactions))
            |> Seq.map(fun (i,a,c,fisc) -> a,c,fisc,AssociationRuleProgram2.ItemSetCountInTransactions(a,transactions))
            |> Seq.map(fun (a,c,fisc,cc) -> a,c,float fisc/float cc)
            |> Seq.filter(fun (a,c,cp) -> cp > minConfidencePct)
            |> Seq.iter(fun (a,c,cp) -> returnValue.Add(new Rule(a,c,cp)))
        returnValue

    static member ItemSetCountInTransactions(itemSet: int[], transactions: List<int[]>) =
        transactions
            |> Seq.map(fun t -> t, AssociationRuleProgram2.SetContainsSubset(t,itemSet))
            |> Seq.filter(fun (t,f) -> f = true)
            |> Seq.length

    static member SetContainsSubset(set: int[], subset: int[]) =
        let notIncluded = subset
                            |> Seq.map(fun i -> i, set |> Seq.tryFindIndex(fun j -> j = i))
                            |> Seq.filter(fun (i,j) -> j = None)
                            |> Seq.toArray
        if notIncluded.Length > 0 then false else true

    static member GetCombinations(itemSet: int[]) =
        if itemSet.Length = 2 then AssociationRuleProgram2.GetCombinationsForDouble(itemSet)
        else AssociationRuleProgram2.GetCombinationsForTriple(itemSet)

    static member GetCombinationsForDouble(itemSet: int[]) =
        let combinations =  new List<int[]*int[]*int[]>()
        combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1]|])
        combinations

    static member GetCombinationsForTriple(itemSet: int[]) =
        let combinations =  new List<int[]*int[]*int[]>()
        combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1];itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[1]|],[|itemSet.[0];itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[2]|],[|itemSet.[0];itemSet.[1]|])
        combinations.Add(itemSet, [|itemSet.[0];itemSet.[1]|],[|itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[0];itemSet.[2]|],[|itemSet.[1]|])
        combinations.Add(itemSet, [|itemSet.[1];itemSet.[2]|],[|itemSet.[0]|])
        combinations

Note how the code in the GetHighConfRules function matches almost one-for-one with the bullet points at the beginning of the post.  F# is a language where the code follows how you think, not the other way around.  Also note how the 43 lines of F# compare to the 136 lines of code in the C# example: less noise, more signal.