Using IBM’s Watson With F#

 

I think everyone is aware of IBM’s Watson from its appearance on Jeopardy.  Apparently, IBM has made the Watson API available to developers if you sign up here.  Well, there goes my Sunday morning!  I signed up, and after one email confirmation, I was in.
IBM has tied Watson to something called “Bluemix”, which looks to be a full-service suite of applications from deployment to hosting.  When I looked at the API documentation here, I decided to use the language-translation service as a good “hello world” project.  Looking at the API help page, I was hoping just to make a request and get a response with an auth token in the header, like every other API in the world.  However, the documentation really leads you down a path of installing the Watson Explorer on your local machine, creating a Bluemix project, etc.
Fortunately, the documentation has some pointers to other projects where people have made their own apps.  I used this one as a model and set up Fiddler like so

image

The authorization token is the username and password separated by a colon, encoded to Base64.
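For reference, here is a minimal F# sketch of building that value (the helper name is mine, not from the Watson docs):

open System
open System.Text

// Basic auth token: Base64("userName:password")
let toBasicAuthToken (userName:string) (password:string) =
    Convert.ToBase64String(Encoding.UTF8.GetBytes(userName + ":" + password))

// toBasicAuthToken "you@aol.com" "secret" -> "eW91QGFvbC5jb206c2VjcmV0"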
Sure enough, a 200

image

Setting it up in F# was a snap:
#r @"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.5\System.Net.Http.dll"
#r @"..\packages\Microsoft.AspNet.WebApi.Client.5.2.2\lib\net45\System.Net.Http.Formatting.dll"

open System
open System.Net.Http
open System.Net.Http.Headers
open System.Net.Http.Formatting
open System.Collections.Generic

let serviceName = "machine_translation"
let baseUrl = "http://wex-mt.mybluemix.net/resources/translate"
let userName = "yourNameHere@aol.com"
let password = "yourPasswordHere"
// note: Basic auth expects this pair Base64-encoded (see the helper above)
let authKey = userName + ":" + password

let client = new HttpClient()
client.DefaultRequestHeaders.Authorization <- new AuthenticationHeaderValue("Basic",authKey)

let input = new Dictionary<string,string>()
input.Add("text","This is a test")
input.Add("sid","mt-enus-eses")
let content = new FormUrlEncodedContent(input)

let result = client.PostAsync(baseUrl,content).Result
let resultContent = result.Content.ReadAsStringAsync().Result

And sure enough

image

 

You can see the gist here.
So with that simple call/request under my belt, I decided to look at the API that everyone is talking about: the question/answer API.  I fired up Fiddler again and took a look at the docs.  After some tweaking of the Uri, I got a successful request/response:

image

image

The answers to an empty question are kind of interesting, if not head-scratching:

image

So passing in a question:

image

image

So we are cooking with gas.  Back into FSI

#r @"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.5\System.Net.Http.dll"
#r @"..\packages\Microsoft.AspNet.WebApi.Client.5.2.2\lib\net45\System.Net.Http.Formatting.dll"

open System
open System.Net.Http
open System.Net.Http.Headers
open System.Net.Http.Formatting
open System.Collections.Generic

let baseUrl = "http://wex-qa.mybluemix.net/resources/question"
let userName = "yourName@aol.com"
let password = "yourCreds"
let authKey = userName + ":" + password

let client = new HttpClient()
client.DefaultRequestHeaders.Authorization <- new AuthenticationHeaderValue("Basic",authKey)

let input = new Dictionary<string,string>()
input.Add("question","what time is it")
let content = new FormUrlEncodedContent(input)

let result = client.PostAsync(baseUrl,content).Result
let resultContent = result.Content.ReadAsStringAsync().Result

With the result like so

image

And since it is Json coming back, why not use the type provider?

// JsonProvider lives in FSharp.Data, which this listing assumes is already referenced
// (the version path is assumed from the packages folder used elsewhere in this post):
#r @"..\packages\FSharp.Data.2.0.14\lib\net40\FSharp.Data.dll"
open FSharp.Data

let client = new HttpClient()
client.DefaultRequestHeaders.Authorization <- new AuthenticationHeaderValue("Basic",authKey)

let input = new Dictionary<string,string>()
input.Add("question","How can I quit smoking")
let content = new FormUrlEncodedContent(input)

let result = client.PostAsync(baseUrl,content).Result
let resultContent = result.Content.ReadAsStringAsync().Result

type qaResponse = JsonProvider< @".\QAResponseJson.json" >
let qaAnswer = qaResponse.Parse(resultContent)

qaAnswer.Question.Answers
|> Seq.ofArray
|> Seq.iter(fun a -> printfn "(%s)" a.Text)

Here is Watson’s response:

image

You can see the gist here

Smart Nerd Dinner

I think there is general agreement that the age of the ASP.NET wire-framing post-back web dev is over.  If you are going to write web applications in 2015 on the .NET stack, you have to be able to use JavaScript and associated JavaScript frameworks like Angular.  Similarly, the full-stack developer needs to have a much deeper understanding of the data that is passing in and out of their application.  With the rise of analytics in applications, the developer needs different tools and approaches.  Just as you need to know JavaScript if you are going to be in the browser, you need to know F# if you are going to be building industrial-grade domain and data layers.

I decided to refactor an existing ASP.NET postback website to see how hard it would be to introduce F# to the project and apply some basic statistics to make the site smarter.  It was pretty easy and the payoffs were quite large.

If you are not familiar, Nerd Dinner is the canonical example of an MVC application that was created to show Microsoft web devs how to create a website using the .NET stack.  The original project was put into a book by the Mount Rushmore of MSFT uber-devs

image

The project was so successful that it actually was launched into a real website

image

and you can find the code on Codeplex here

image

When you download the source code from the repository, you will notice a couple of things:

1) It is not a very big project – with only 1100 lines of code

image

2) There are 191 FxCop violations

image

3) It does compile straight out of source control, but some of the unit tests fail

image

4) There is pretty low code coverage (21%)

image

Focusing on the code coverage issue, it makes sense that there is not much code coverage because there is not much code that can be covered.  There are maybe 15 lines of “business logic”, if the term business logic is expanded to include input validation.  Here is an example:

image

Also, there are maybe ten lines of code that do some basic filtering

image

So step one in the quest to refactor Nerd Dinner to be a bit smarter was to rename the projects.  Since MVC is a UI framework, it made sense to call it that.  I then changed the namespaces to reflect the new structure

image

The next step was to take the domain classes out of the UI and put them into the application.  First, I created another project

image

I then took all of the interfaces that were in the UI and placed them into the application

namespace NerdDinner.Models

open System
open System.Linq
open System.Linq.Expressions

type IRepository<'T> =
    abstract All : IQueryable<'T>
    abstract AllIncluding : [<ParamArray>] includeProperties:Expression<Func<'T, obj>>[] -> IQueryable<'T>
    abstract member Find: int -> 'T
    abstract member InsertOrUpdate: 'T -> unit
    abstract member Delete: int -> unit
    abstract member SubmitChanges: unit -> unit

type IDinnerRepository =
    inherit IRepository<Dinner>
    abstract member FindByLocation: float*float -> IQueryable<Dinner>
    abstract FindUpcomingDinners : unit -> IQueryable<Dinner>
    abstract FindDinnersByText : string -> IQueryable<Dinner>
    abstract member DeleteRsvp: 'T -> unit

I then took all of the data structures/models and placed them in the application.

namespace NerdDinner.Models

open System
open System.Web.Mvc
open System.Collections.Generic
open System.ComponentModel.DataAnnotations
open System.ComponentModel.DataAnnotations.Schema

type public LocationDetail (latitude,longitude,title,address) =
    let mutable latitude = latitude
    let mutable longitude = longitude
    let mutable title = title
    let mutable address = address

    member public this.Latitude
        with get() = latitude
        and set(value) = latitude <- value

    member public this.Longitude
        with get() = longitude
        and set(value) = longitude <- value

    member public this.Title
        with get() = title
        and set(value) = title <- value

    member public this.Address
        with get() = address
        and set(value) = address <- value

type public RSVP () =
    let mutable rsvpID = 0
    let mutable dinnerID = 0
    let mutable attendeeName = ""
    let mutable attendeeNameId = ""
    let mutable dinner = null

    member public self.RsvpID
        with get() = rsvpID
        and set(value) = rsvpID <- value

    member public self.DinnerID
        with get() = dinnerID
        and set(value) = dinnerID <- value

    member public self.AttendeeName
        with get() = attendeeName
        and set(value) = attendeeName <- value

    member public self.AttendeeNameId
        with get() = attendeeNameId
        and set(value) = attendeeNameId <- value

    member public self.Dinner
        with get() = dinner
        and set(value) = dinner <- value

and public Dinner () =
    let mutable dinnerID = 0
    let mutable title = ""
    let mutable eventDate = DateTime.MinValue
    let mutable description = ""
    let mutable hostedBy = ""
    let mutable contactPhone = ""
    let mutable address = ""
    let mutable country = ""
    let mutable latitude = 0.
    let mutable longitude = 0.
    let mutable hostedById = ""
    let mutable rsvps = List<RSVP>() :> ICollection<RSVP>

    [<HiddenInput(DisplayValue=false)>]
    member public self.DinnerID
        with get() = dinnerID
        and set(value) = dinnerID <- value

    [<Required(ErrorMessage="Title Is Required")>]
    [<StringLength(50,ErrorMessage="Title may not be longer than 50 characters")>]
    member public self.Title
        with get() = title
        and set(value) = title <- value

    [<Required(ErrorMessage="EventDate Is Required")>]
    [<Display(Name="Event Date")>]
    member public self.EventDate
        with get() = eventDate
        and set(value) = eventDate <- value

    [<Required(ErrorMessage="Description Is Required")>]
    [<StringLength(256,ErrorMessage="Description may not be longer than 256 characters")>]
    [<DataType(DataType.MultilineText)>]
    member public self.Description
        with get() = description
        and set(value) = description <- value

    [<StringLength(256,ErrorMessage="Hosted By may not be longer than 256 characters")>]
    [<Display(Name="Hosted By")>]
    member public self.HostedBy
        with get() = hostedBy
        and set(value) = hostedBy <- value

    [<Required(ErrorMessage="Contact Phone Is Required")>]
    [<StringLength(20,ErrorMessage="Contact Phone may not be longer than 20 characters")>]
    [<Display(Name="Contact Phone")>]
    member public self.ContactPhone
        with get() = contactPhone
        and set(value) = contactPhone <- value

    [<Required(ErrorMessage="Address Is Required")>]
    [<StringLength(20,ErrorMessage="Address may not be longer than 50 characters")>]
    [<Display(Name="Address")>]
    member public self.Address
        with get() = address
        and set(value) = address <- value

    [<UIHint("CountryDropDown")>]
    member public this.Country
        with get() = country
        and set(value) = country <- value

    [<HiddenInput(DisplayValue=false)>]
    member public self.Latitude
        with get() = latitude
        and set(value) = latitude <- value

    [<HiddenInput(DisplayValue=false)>]
    member public v.Longitude
        with get() = longitude
        and set(value) = longitude <- value

    [<HiddenInput(DisplayValue=false)>]
    member public self.HostedById
        with get() = hostedById
        and set(value) = hostedById <- value

    member public self.RSVPs
        with get() = rsvps
        and set(value) = rsvps <- value

    member public self.IsHostedBy (userName:string) =
        System.String.Equals(hostedBy,userName,System.StringComparison.Ordinal)

    member public self.IsUserRegistered(userName:string) =
        rsvps |> Seq.exists(fun r -> r.AttendeeName = userName)

    [<UIHint("Location Detail")>]
    [<NotMapped()>]
    member public self.Location
        with get() = new LocationDetail(self.Latitude,self.Longitude,self.Title,self.Address)
        and set(value:LocationDetail) =
            // the original binding read these into locals and discarded them; copy them back instead
            self.Latitude <- value.Latitude
            self.Longitude <- value.Longitude
            self.Title <- value.Title
            self.Address <- value.Address

Unlike C#, where there is typically a class per file, all of the related elements are placed in the same location.  Also, notice the absence of semi-colons, curly braces, and other distracting characters.  Finally, you can see that because we are in the .NET framework, all of the data annotations are the same.  Sure enough, pointing the MVC UI to the application and hitting run, the application just works.

image

With the separation complete, it was time to make our app much smarter.  The first thing that I thought of was that when a person creates an account, they enter their first and last name.

 

This seems like an excellent opportunity to add some personalization to our site.  Going back to this analysis of names given to newborns in the United States, if I know your first name, I have a pretty good chance of guessing your age, gender, and state of birth.  For example, ‘Jose’ is probably a male born in his twenties in either Texas or California.  ‘James’ is probably a male in his 40s or 50s.

I added 6 pictures to the site, for young, middleAged, and old males and females.

image

 

I then modified the logonStatus partial view like so

@using NerdDinner.UI;

@if(Request.IsAuthenticated) {
    <text>Welcome <b>@(((NerdIdentity)HttpContext.Current.User.Identity).FriendlyName)</b>!
    [ @Html.ActionLink("Log Off", "LogOff", "Account") ]</text>
}
else {
    @:[ @Html.ActionLink("Log On", "LogOn", new { controller = "Account", returnUrl = HttpContext.Current.Request.RawUrl }) ]
}

@if (Session["adUri"] != null)
{
    <img alt="product placement" title="product placement" src="@Session["adUri"]" height="40" />
}

Then, I created a session variable called adUri that the picture references, set in the LogOn controller:

public ActionResult LogOn(LogOnModel model, string returnUrl)
{
    if (ModelState.IsValid)
    {
        if (ValidateLogOn(model.UserName, model.Password))
        {
            // Make sure we have the username with the right capitalization
            // since we do case sensitive checks for OpenID Claimed Identifiers later.
            string userName = MembershipService.GetCanonicalUsername(model.UserName);

            FormsAuth.SignIn(userName, model.RememberMe);

            AdProvider adProvider = new AdProvider();
            String catagory = adProvider.GetCatagory(userName);
            Session["adUri"] = "/Content/images/" + catagory + ".png";
            // ...

And finally, I added an implementation of the AdProvider back in the application:

type AdProvider () =
    member this.GetCatagory personName: string =
        "middleAgedMale"

So running the app, we have a product placement for a Middle Aged Male

image

So the last thing to do is to turn names into those categories.  I thought of a couple of different implementations: loading the entire census data set into memory and searching it on demand; using Azure ML and making an API request each time; or just creating a lookup table that can be searched.  I went with the lookup table.  In any event, since I am using an interface, swapping out implementations is easy, and since I am using F#, creating implementations is easy.
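The interface itself isn’t shown in this post; a minimal sketch of what it might look like (the member name is taken from the AdProvider implementation above):

type IAdProvider =
    // takes a person's first name and returns an ad category such as "middleAgedMale"
    abstract member GetCatagory : personName:string -> string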

I went back to my script file that analyzed the baby names from the US census and created a new script.  I loaded the names into memory like before

#r "C:/Git/NerdChickenChicken/04_mvc3_Working/packages/FSharp.Data.2.0.14/lib/net40/FSharp.Data.dll"

open FSharp.Data

// the census files have no header row, so the CSV type provider names the columns
// after the first data row (hence r.Mary, r.F, r.``1910``, and r.``14`` below)
type censusDataContext = CsvProvider<"https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/AK.TXT">
type stateCodeContext = CsvProvider<"https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/states.csv">

let stateCodes = stateCodeContext.Load("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/states.csv")

let fetchStateData (stateCode:string) =
    let uri = System.String.Format("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/{0}.TXT",stateCode)
    censusDataContext.Load(uri)

let usaData = stateCodes.Rows
              |> Seq.collect(fun r -> fetchStateData(r.Abbreviation).Rows)
              |> Seq.toArray

I then created a function that computes the probability that a name is male:

let genderSearch name =
    // r.Mary is the name column, r.F the gender column, r.``14`` the count column
    let nameFilter = usaData
                     |> Seq.filter(fun r -> r.Mary = name)
                     |> Seq.groupBy(fun r -> r.F)
                     |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``))

    let nameSum = nameFilter |> Seq.sumBy(fun (n,c) -> c)
    nameFilter
    |> Seq.map(fun (n,c) -> n, c, float c/float nameSum)
    |> Seq.filter(fun (g,c,p) -> g = "M")
    |> Seq.map(fun (g,c,p) -> p)
    |> Seq.head

genderSearch "James"

image

I then created a function that calculates the last year the name was popular (using 1 standard deviation above the average as the cutoff):

let ageSearch name =
    // group the name's counts by year (r.``1910`` is the year column)
    let nameFilter = usaData
                     |> Seq.filter(fun r -> r.Mary = name)
                     |> Seq.groupBy(fun r -> r.``1910``)
                     |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``))
                     |> Seq.toArray
    let nameSum = nameFilter |> Seq.sumBy(fun (n,c) -> c)
    nameFilter
    |> Seq.map(fun (n,c) -> n, c, float c/float nameSum)
    |> Seq.toArray

let variance (source:float seq) =
    let mean = Seq.average source
    let deltas = Seq.map(fun x -> pown(x-mean) 2) source
    Seq.average deltas

let standardDeviation(values:float seq) =
    sqrt(variance(values))

let standardDeviation' name = ageSearch name
                              |> Seq.map(fun (y,c,p) -> float c)
                              |> standardDeviation

let average name = ageSearch name
                   |> Seq.map(fun (y,c,p) -> float c)
                   |> Seq.average

let attachmentPoint name = (average name) + (standardDeviation' name)

let popularYears name =
    let allYears = ageSearch name
    let attachmentPoint' = attachmentPoint name
    let filteredYears = allYears
                        |> Seq.filter(fun (y,c,p) -> float c > attachmentPoint')
                        |> Seq.sortBy(fun (y,c,p) -> y)
    filteredYears

let lastPopularYear name = popularYears name |> Seq.last
let firstPopularYear name = popularYears name |> Seq.head

lastPopularYear "James"

image

 

And then I created a function that takes in the probability of being male and the last year the name was popular and assigns the name to a category:

let nameAssignment (malePercent, lastYearPopular) =
    match malePercent > 0.75, malePercent < 0.75, lastYearPopular < 1945, lastYearPopular > 1980 with
    | true, false, true, false -> "oldMale"
    | true, false, false, false -> "middleAgedMale"
    | true, false, false, true -> "youngMale"
    | false, true, true, false -> "oldFemale"
    | false, true, false, false -> "middleAgedFemale"
    | false, true, false, true -> "youngFeMale"
    | _,_,_,_ -> "unknown"
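As an aside, the same assignment logic can be written with match guards, which reads a bit more directly than the four-boolean tuple (a sketch, intended to be equivalent to the function above):

let nameAssignment' (malePercent, lastYearPopular) =
    match malePercent, lastYearPopular with
    | p, y when p > 0.75 && y < 1945 -> "oldMale"
    | p, y when p > 0.75 && y > 1980 -> "youngMale"
    | p, _ when p > 0.75 -> "middleAgedMale"
    | p, y when p < 0.75 && y < 1945 -> "oldFemale"
    | p, y when p < 0.75 && y > 1980 -> "youngFeMale"
    | p, _ when p < 0.75 -> "middleAgedFemale"
    | _ -> "unknown"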

And then it was a matter of tying the functions together for each of the names in the master list:

let nameList = usaData
               |> Seq.map(fun r -> r.Mary)
               |> Seq.distinct

nameList
|> Seq.map(fun n -> n, genderSearch n)
|> Seq.map(fun (n,mp) -> n,mp, lastPopularYear n)
|> Seq.map(fun (n,mp,(y,c,p)) -> n, mp, y)

let nameList' = nameList
                |> Seq.map(fun n -> n, genderSearch n)
                |> Seq.map(fun (n,mp) -> n,mp, lastPopularYear n)
                |> Seq.map(fun (n,mp,(y,c,p)) -> n, mp, y)
                |> Seq.map(fun (n,mp,y) -> n,nameAssignment(mp,y))

image

And then I wrote the list out to a file:

open System.IO
let outFile = new StreamWriter(@"c:\data\nameList.csv")

nameList' |> Seq.iter(fun (n,c) -> outFile.WriteLine(sprintf "%s,%s" n c))
outFile.Flush()   // the original listing had outFile.Flush without parentheses, which never invokes it
outFile.Close()

Thanks to this Stack Overflow post for the file write (I wish the CSV type provider had this ability).  With the file created, I can then use the file as a lookup for my name function back in the MVC app using a CSV type provider:

type nameMappingContext = CsvProvider<"C:/data/nameList.csv">

type AdProvider () =
    member this.GetCatagory personName: string =
        // nameList.csv has no header row, so the provider names the columns after the
        // first data row (hence r.Annie for the name and r.oldFemale for the category)
        let nameList = nameMappingContext.Load("C:/data/nameList.csv")
        let foundName = nameList.Rows
                        |> Seq.filter(fun r -> r.Annie = personName)
                        |> Seq.map(fun r -> r.oldFemale)
                        |> Seq.toArray
        if foundName.Length > 0 then
            foundName.[0]
        else
            "middleAgedMale"
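As an aside, GetCatagory reloads the CSV on every call; here is a sketch of loading it once and caching the lookup (same assumed file path as above):

let nameLookup =
    lazy (nameMappingContext.Load("C:/data/nameList.csv").Rows
          |> Seq.map (fun r -> r.Annie, r.oldFemale)
          |> dict)

type AdProvider' () =
    member this.GetCatagory (personName: string) =
        // the dictionary is built on first use and reused afterwards
        if nameLookup.Value.ContainsKey personName then nameLookup.Value.[personName]
        else "middleAgedMale"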

And now I have some (basic) personalization in Nerd Dinner.  (Emma is a young female name, so they get a picture of a campground.)

image

So this is rather crude.  There is no provision for nicknames, case-sensitivity, etc.  But the site is on its way to becoming smarter…

The code can be found on github here.

Wake County Restaurant Inspection Data with Azure ML and F#

With Azure ML now available, I was thinking about some of the analysis I did last year and how I could do even more things with the same data set.  One such analysis that came to mind was the restaurant inspection data that I analyzed last year.  You can see the prior analysis here.

I uploaded the restaurant data into Azure and thought of a simple question –> can we predict inspection scores based on some easily available data?  This is an interesting dataset because there are some categorical data elements (zip code, restaurant type, etc…) and there are some continuous ones (priority foundation, etc…).

Here is the base dataset:

image

I created a new experiment and I used a boosted regression model and a neural network regression and used a 70/30 train/test split.

image

After running the models and inspecting the model evaluation, I don’t have a very good model

image

I then decided to go back and pull some of the X variables out of the dataset and concentrate on only a couple of variables.  I added a project column module and then selected Restaurant Type and Zip Code as the X variables and left the Inspection Score as the Y variable. 

image

With this done, I added a couple more models (Bayesian Linear Regression and a Decision Forest Regression) and gave it a whirl

image

image

Interestingly, adding these models did not give us any better a prediction, and dropping down to two variables made a less accurate model.  Without doing any more analysis, I picked the model with the lowest MAE (Boosted Decision Tree Regression) and published it as a web service:

image

I published it as a web service and now I can consume it from a client app.   I used the code from the voting analysis (found here) as a template, and sure enough:

["27519","Restaurant","0","96.0897827148438"]

["27612","Restaurant","0","95.5728530883789"]

So restaurants in Cary, NC have a higher predicted inspection score than the ones found in Northwest Raleigh.   However, before we start alerting the Cary Chamber of Commerce to create a marketing campaign (“Eat in Cary, we are safer”), note that the difference is within the MAE.
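For anyone following along, MAE is just the mean absolute error between predicted and actual scores; a quick F# definition (the function name and example values are mine):

// mean absolute error: the average of |actual - predicted|
let meanAbsoluteError (actual: float seq) (predicted: float seq) =
    Seq.map2 (fun a p -> abs (a - p)) actual predicted
    |> Seq.average

meanAbsoluteError [96.0; 95.5] [94.0; 97.0]   // 1.75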

In any event, it would be easy to create a phone app: if you don’t know a restaurant’s score, you can punch in the establishment type and the zip code and have a good idea about the score of the restaurant.

This is an academic exercise because the establishments have to show you their card and Yelp has their scores, but it was a fun exercise nonetheless.  Happy eating.

Consuming Azure ML With F#

(This post is a continuation of this one)

So with a model that works well enough,  I selected only that model and saved it

image

 

image

I created a new experiment and used that model with the base data.  I then marked the project columns as the input and the score as the output (green and blue circles, respectively)

image

After running it, I published it as a web service

image

And voila, an endpoint ready to go.  I then took the auto-generated script and opened up a new Visual Studio F# project to use it.  The problem was that this is the data structure the model needs

FeatureVector = new Dictionary<string, string>() {
    { "Precinct", "0" }, { "VRN", "0" }, { "VRstatus", "0" }, { "VRlastname", "0" },
    { "VRfirstname", "0" }, { "VRmiddlename", "0" }, { "VRnamesufx", "0" }, { "VRstreetnum", "0" },
    { "VRstreethalfcode", "0" }, { "VRstreetdir", "0" }, { "VRstreetname", "0" }, { "VRstreettype", "0" },
    { "VRstreetsuff", "0" }, { "VRstreetunit", "0" }, { "VRrescity", "0" }, { "VRstate", "0" },
    { "Zip Code", "0" }, { "VRfullresstreet", "0" }, { "VRrescsz", "0" }, { "VRmail1", "0" },
    { "VRmail2", "0" }, { "VRmail3", "0" }, { "VRmail4", "0" }, { "VRmailcsz", "0" },
    { "Race", "0" }, { "Party", "0" }, { "Gender", "0" }, { "Age", "0" },
    { "VRregdate", "0" }, { "VRmuni", "0" }, { "VRmunidistrict", "0" }, { "VRcongressional", "0" },
    { "VRsuperiorct", "0" }, { "VRjudicialdistrict", "0" }, { "VRncsenate", "0" }, { "VRnchouse", "0" },
    { "VRcountycomm", "0" }, { "VRschooldistrict", "0" }, { "11/6/2012", "0" }, { "Voted Ind", "0" },
},
GlobalParameters = new Dictionary<string, string>() { }

And since I am only using 6 of the columns, it made sense to reload the Wake County voter data with just the needed columns.  I went back to the original CSV and did that.  Interestingly, I could not set the original dataset as the publish input, so I added a project column module that does nothing

image

With that in place, I republished the service and opened Visual Studio.  I decided to start with a script.  I was struggling through the async when Tomas P helped me on Stack Overflow here.  I’ll say it again, the F# community is tops.  In any event, here is the initial script:

#r @"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.5\System.Net.Http.dll"
#r @"..\packages\Microsoft.AspNet.WebApi.Client.5.2.2\lib\net45\System.Net.Http.Formatting.dll"

open System
open System.Net.Http
open System.Net.Http.Headers
open System.Net.Http.Formatting
open System.Collections.Generic

type scoreData = {FeatureVector:Dictionary<string,string>;GlobalParameters:Dictionary<string,string>}
type scoreRequest = {Id:string; Instance:scoreData}

let invokeService () = async {
    let apiKey = ""
    let uri = "https://ussouthcentral.services.azureml.net/workspaces/19a2e623b6a944a3a7f07c74b31c3b6d/services/f51945a42efa42a49f563a59561f5014/score"
    use client = new HttpClient()
    client.DefaultRequestHeaders.Authorization <- new AuthenticationHeaderValue("Bearer",apiKey)
    client.BaseAddress <- new Uri(uri)

    let input = new Dictionary<string,string>()
    input.Add("Zip Code","27519")
    input.Add("Race","W")
    input.Add("Party","UNA")
    input.Add("Gender","M")
    input.Add("Age","45")
    input.Add("Voted Ind","1")

    let instance = {FeatureVector=input; GlobalParameters=new Dictionary<string,string>()}
    let scoreRequest = {Id="score00001";Instance=instance}

    let! response = client.PostAsJsonAsync("",scoreRequest) |> Async.AwaitTask
    let! result = response.Content.ReadAsStringAsync() |> Async.AwaitTask
    if response.IsSuccessStatusCode then
        printfn "%s" result
    else
        printfn "FAILED: %s" result
    response |> ignore }

invokeService() |> Async.RunSynchronously

 

Unfortunately, when I run it, it fails.  Below is the Fiddler trace:

image

 

So it looks like the Json serializer is appending the “@” symbol to the field names.  I changed the records to plain types and voila:

image
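For reference, I read “changed the records to types” as swapping the records for plain classes, whose property names serialize without the backing-field decoration; a sketch of that change, assuming the same shapes as the script above:

type ScoreData (featureVector:Dictionary<string,string>, globalParameters:Dictionary<string,string>) =
    member this.FeatureVector = featureVector
    member this.GlobalParameters = globalParameters

type ScoreRequest (id:string, instance:ScoreData) =
    member this.Id = id
    member this.Instance = instance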

You can see the final script here.

So then, throwing in some different numbers:

  • A millennial: ["27519","W","D","F","25","1","1","0.62500011920929"]
  • A senior citizen: ["27519","W","D","F","75","1","1","0.879632294178009"]

I wonder why social security never gets cut?

In any event, just to check the model:

  • A 15 year old: ["27519","W","D","F","15","1","0","0.00147285079583526"]

Azure ML and Wake County Election Data

I have been spending the last couple of weeks using Azure ML and I think it is one of the most exciting technologies for business developers and analysts since ODBC and F# type providers.   If you remember, when ODBC came out, every relational database in the world became accessible and therefore usable/analyzable.   When type providers came out, programming, exploring, and analyzing data sources became much easier, and the reach expanded from RDBMSs to all formats (notably Json).  So getting data was no longer a problem, but analyzing it still was.

Enter Azure ML. 

I downloaded the Wake County Voter History data from here.  I took the Excel spreadsheet and converted it to a .csv locally.  I then logged into Azure ML and imported the data

image

I then created an experiment and added the dataset to the canvas

image

 

And looked at the basic statistics of the data set

image

(Note that I find using the F# REPL a better way to explore the data, as I can just dot into each element I am interested in and view the results.)

In any event, the first question I want to answer is

“given a person’s ZipCode, Race, Party, Gender, and Age, can I predict if they will vote in November?”

To that end, I first narrowed down the columns using a Column Projection and picked only the columns I care about.  I picked “11/6/2012” as the Y variable because that was the last national election and that is what we are going to have in November.  I probably should have used 2010 because that is a national election without a President, but that can be analyzed at a later date.

image

image

I then ran my experiment so the data would be available in the Project Column step.

image

 

I then renamed the columns to make them a bit more readable by using a series of Metadata Editors.  (It does not look like you can do all of the renames in one step.  Equally annoying is that you have to add each module, run it, then add the next.)

image

(one example)

image

 

I then added a Missing Values scrubber for the voted column, so instead of a null field, people who didn’t vote get an “N”

image

The problem is that it doesn’t work –> looks like we can’t change the values per column.

image

I asked the question on the forum, but in the interest of time, I decided to change the voted column from a categorical column to an indicator so that I could do binary analysis.  That also failed.  I went back to the original spreadsheet, added an indicator column, and also renamed the column headers so I am not cluttering up my canvas with those metadata transforms.  Finally, I realized I want only active voters, but there does not seem to be a filtering ability (remove rows only works for missing values), so I removed those from the original dataset as well.  I think the ability to scrub and munge data is an area for improvement, but since this is release 1, I understand.

After re-importing the data, I changed my experiment like so

image

I then split the dataset into Training/Validation/And Testing using a 60/20/20 split

image

So the left point on the second split is 60% of the original dataset, and the right point on the second split is 20% of the original dataset (that is, 75%/25% of the 80% that came out of the first split).
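The arithmetic, spelled out (fractions taken from the text above):

let afterFirstSplit = 0.80                // the first split holds out 20% for final testing
let training = afterFirstSplit * 0.75     // = 0.60 of the original dataset
let validation = afterFirstSplit * 0.25   // = 0.20 of the original dataset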

I then added an SVM with a train and a score module.  Note that I am training with 60% of the original dataset and validating with 20%.

 

image

After it runs, there are 2 new columns in the dataset –> scored labels and probabilities, so each row now has a score.

 

image

With the model in place, I can then evaluate it using an Evaluate Model module

image

And we can see an AUC of .666, which immediately made me think of this

image

In any event, I added a Logistic Regression and a Boosted Decision Tree to the canvas and hooked them up to the training and validation sets

image

And this is what we have

image image

 

SVM: .666 AUC

Regression: .689 AUC

Boosted Decision Tree: .713 AUC

So with the Boosted Decision Tree ahead, I added a Sweep Parameters module to see if I could tune it more.  I am using AUC as the performance metric

image

image

So the best AUC I am going to get is .7134, with the highlighted parameters.  I then added one more model that uses those parameters against the entire training dataset (80% of the total) and then evaluates it against the remaining 20%.

image

With the final answer of

image

With that in hand, I can create a new experiment that will be the basis of a real-time voting app.

Fun with Statistics and Charts

I am preparing my Raleigh Code Camp submission “Nerd Dinner With Brains” this weekend.  If you are not familiar, Nerd Dinner is the canonical example of an MVC application and is very familiar to web devs who want to learn MVC the Microsoft way.  You can see the walkthrough here.   For everything that Nerd Dinner is, it is not … smart.  There are no business rules outside of some basic input validation, which is pretty representative of many “Boring Line Of Business Applications” (BLOBAs, according to Scott Wlaschin).  Not coincidentally, the lack of business logic is the biggest reason many BLOBAs don’t have many unit tests –> if all you are doing is wire-framing a database, what business logic needs to be tested?

The talk is going to take the Nerd Dinner wireframe and inject some analytics into the application.  To that end, I first considered the person who is attending the dinner.  All we know about them is their name and possibly their location.  So what can a name tell you?  Turns out, plenty.

As I showed in this post, there is a great source from the US census of the number of names given, by gender, yearOfBirth, and stateOfBirth.  Picking up where that post left off, I loaded the entire data set into memory.

My first question was, “given a name, can I tell what gender the person is?”  This is very straightforward to calculate:

let genderSearch name =
    let nameFilter = usaData
                     |> Seq.filter(fun r -> r.Mary = name)
                     |> Seq.groupBy(fun r -> r.F)
                     |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``))

    let nameSum = nameFilter |> Seq.sumBy(fun (n,c) -> c)
    nameFilter
    |> Seq.map(fun (n,c) -> n, c, float c/float nameSum)
    |> Seq.toArray

genderSearch "James"

And the REPL shows me that it is very likely that “James” is a male:

image

I can then set up in the web.config file a confidence point above which we treat a name as male/female; I am thinking 75%.  Once we have that, the app can respond differently.  Perhaps we have a product-placement advertisement that becomes male-focused if we are reasonably certain that the user is a male.  Perhaps we can be more subtle and change the theme of the site, or the page navigation, to induce the person to do additional things on the site.
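A minimal sketch of reading that confidence point from web.config (the appSetting key name here is hypothetical):

open System
open System.Configuration

// falls back to 0.75 if the setting is missing or unparseable
let genderConfidenceThreshold =
    match Double.TryParse(ConfigurationManager.AppSettings.["GenderConfidenceThreshold"]) with
    | true, v -> v
    | _ -> 0.75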

In any event, I then wanted to tackle age.  I spun up some code to isolate a person’s age

let ageSearch name =
    let nameFilter = usaData
                     |> Seq.filter(fun r -> r.Mary = name)
                     |> Seq.groupBy(fun r -> r.``1910``)
                     |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``))
                     |> Seq.toArray
    let nameSum = nameFilter |> Seq.sumBy(fun (n,c) -> c)
    nameFilter
    |> Seq.map(fun (n,c) -> n, c, float c/float nameSum)
    |> Seq.toArray

I had no idea if names have a certain age connotation, so I decided to do some basic charting.  Isaac Abraham pointed me to FSharp.Charting, which is a great way to do some basic charting for discovery:

let chartData = ageSearch "James"
                |> Seq.map(fun (y,c,p) -> y, c)
                |> Seq.sortBy(fun (y,c) -> y)

Chart.Line(chartData).ShowChart()

And sure enough, the name “James” has a real ebb and flow for its popularity.

image

So if the user has a name of “James”, you can make a reasonable assumption that they are male and probably born before 1975.  Cue up the Van Halen!

And yes, because I had to:

let chartData = ageSearch "Britney"
                |> Seq.map(fun (y,c,p) -> y, c)
                |> Seq.sortBy(fun (y,c) -> y)

image

Kinda does match her career, no?

Anyway, back to the task at hand.  In terms of analytics, I want to be a bit more precise than eyeballing a chart.  I started with the following code:

ageSearch "James"
|> Seq.map(fun (y,c,p) -> float c)
|> Seq.average

ageSearch "James"
|> Seq.map(fun (y,c,p) -> float c)
|> Seq.min

ageSearch "James"
|> Seq.map(fun (y,c,p) -> float c)
|> Seq.max

image

With these basic statistics out of the way, I then wanted to look at when the name was no longer popular.  I decided to use 1 standard deviation above the average to determine an outlier.  First, the standard deviation:

let variance (source:float seq) =
    let mean = Seq.average source
    let deltas = Seq.map(fun x -> pown(x-mean) 2) source
    Seq.average deltas

let standardDeviation(values:float seq) =
    sqrt(variance(values))

ageSearch "James"
|> Seq.map(fun (y,c,p) -> float c)
|> standardDeviation

let standardDeviation' = ageSearch "James"
                         |> Seq.map(fun (y,c,p) -> float c)
                         |> standardDeviation

let average = ageSearch "James"
              |> Seq.map(fun (y,c,p) -> float c)
              |> Seq.average

let attachmentPoint = average + standardDeviation'

image

And then I can get the last year that the name was more than 1 standard deviation above the average (greater than 71,180 names given):

let popularYears = ageSearch "James"
                   |> Seq.map(fun (y,c,p) -> y, float c)
                   |> Seq.filter(fun (y,c) -> c > attachmentPoint)
                   |> Seq.sortBy(fun (y,c) -> y)
                   |> Seq.last

image

So “James” is very likely a male and likely born before 1964.  Cue up the Pink Floyd!

The last piece was the state of birth –> can I guess the state of birth for a user?  I first looked at the states on a plot

let chartData' = stateSearch "James"
                 |> Seq.map(fun (s,c,p) -> s,c)

Chart.Column(chartData').ShowChart()

image

Nothing really stands out at me –> states with the most births have the most names.  I could do an academic exercise of seeing which states favor certain names, but that does not help me with Nerd Dinner in guessing the state of birth when given a name.

I pressed on to look at the top 10 states:

let topTenStates = stateSearch "James"
                   |> Seq.sortBy(fun (s,c,p) -> -c-1)
                   |> Seq.take 10

let topTenTotal = topTenStates
                  |> Seq.sumBy(fun (s,c,p) -> c)
let total = stateSearch "James"
            |> Seq.sumBy(fun (s,c,p) -> c)

float topTenTotal/float total

image

So 50% of “James” were born in 10 states.  Again, I am not sure there is any actionable information here.  For example, if a majority of “James” were born in MI, I might have something (cue up the Bob Seger). 

Interestingly, there are a certain number of names where the state of birth does matter.  For example, consider “Jose”:

image

Unsurprisingly, the two states are CA and TX.  Just using James and Jose as examples:

  • James is a male born before 1964
  • Jose is a male born before 2008 in either TX or CA

As an academic exercise, we could construct a random forest to find the names with the greatest state affinity.  However, that won’t help us on Nerd Dinner so I am leaving that out for another day.

This analysis does not account for a host of factors (person not born in the USA, nicknames, etc.), but it is still better than the nothing that Nerd Dinner currently has.  This analysis is not particularly sophisticated, but I often find that even the most basic statistics can be very powerful if used correctly.  That will be the next part of the talk…


Neural Networks

I picked up James McCaffrey’s Neural Networks Using C# a couple of weeks ago and decided to see if I could rewrite the code in F#.  Unfortunately, the source code is not available (as far as I could tell), so I did some C# then F# coding to see if I could get functional equivalence.

My first stop was chapter one.  I made the decision to get the F# code working for the sample data that McCaffrey provided first and then refactor it to a more general program that would work with inputs and values of different datasets.  My final upgrade will be use Deedle instead of any other data structure.  But first things first, I want to get the examples working so I fired up a script file and opened my REPL.

McCaffrey defines a sample dataset like this

string[] sourceData = new string[] {
    "Sex Age Locale Income Politics",
    "==============================================",
    "Male 25 Rural 63,000.00 Conservative",
    "Female 36 Suburban 55,000.00 Liberal",
    "Male 40 Urban 74,000.00 Moderate",
    "Female 23 Rural 28,000.00 Liberal" };

He then creates a parser for the comma-delimited string values into a double[][].  I just created the dataset as a List of tuples.

let chapter1TestData = [("Male",25.,"Rural",63000.00,"Conservative");
                        ("Female",36.,"Suburban",55000.00,"Liberal");
                        ("Male",40.,"Urban",74000.00,"Moderate");
                        ("Female",23.,"Rural",28000.00,"Liberal")]

 

I did try an implementation using a record type but, for reasons below, I am using tuples.  With the equivalent data loaded into the REPL, I tackled the first supporting function: MinMax.  Here is the C# code that McCaffrey wrote:

static void MinMaxNormal(double[][] data, int column)
{
    int j = column;
    double min = data[0][j];
    double max = data[0][j];
    for (int i = 0; i < data.Length; ++i)
    {
        if (data[i][j] < min) min = data[i][j];
        if (data[i][j] > max) max = data[i][j];
    }
    double range = max - min;
    if (range == 0.0) // ugly
    {
        for (int i = 0; i < data.Length; ++i)
            data[i][j] = 0.5;
        return;
    }
    for (int i = 0; i < data.Length; ++i)
        data[i][j] = (data[i][j] - min) / range;
}

and here is the equivalent F# code.

let minMax (fullSet, i) =
    let min = fullSet |> Seq.min
    let max = fullSet |> Seq.max
    (i-min)/(max-min)

 

Note that McCaffrey does not have any unit tests, but when I ran the dummy data through the F# implementation, the results matched his screen shots, so that will work well enough.  If you ever need a reason to use F#, consider those two code samples.  Granted, McCaffrey’s code is more abstract because it can normalize any column in the double array, but my counterpoint is that the function is really doing too much, and it is trivial in F# to pick a given column.  Is there any doubt about what the F# code is doing?  Is there any certainty about what the C# code is doing?
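To make that counterpoint concrete, picking a column out of a double[][] in F# is a one-liner (a sketch):

// pull column j out of a jagged array
let column j (data: float[][]) = data |> Array.map (fun row -> row.[j])

// e.g. normalizing column 1 with the minMax function above:
// let ages = column 1 data
// ages |> Array.map (fun a -> minMax(ages, a))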

In any event, moving along to the next functions, McCaffrey created two functions that do all of the encoding of the string values to appropriate numeric ones.  Depending on whether the value is an X value (independent) or a Y value (dependent), there is a different encoding scheme:

static string EffectsEncoding(int index, int N)
{
    // If N = 3 and index = 0 -> 1,0.
    // If N = 3 and index = 1 -> 0,1.
    // If N = 3 and index = 2 -> -1,-1.
    if (N == 2)
    // Special case.
    { if (index == 0) return "-1"; else if (index == 1) return "1"; }
    int[] values = new int[N - 1];
    if (index == N - 1)
    // Last item is all -1s.
    { for (int i = 0; i < values.Length; ++i) values[i] = -1; }
    else
    {
        values[index] = 1;
        // 0 values are already there.
    }
    string s = values[0].ToString();
    for (int i = 1; i < values.Length; ++i) s += "," + values[i];
    return s;
}

static string DummyEncoding(int index, int N)
{
    int[] values = new int[N]; values[index] = 1;
    string s = values[0].ToString();
    for (int i = 1; i < values.Length; ++i) s += "," + values[i];
    return s;
}

In my F# project, I decided to do domain-specific encoding.  I plan to refactor this to something more abstract; there is a sketch of one possible shape right after the listing below.

//Transform Sex
let testData' = chapter1TestData |> Seq.map(fun (s,a,l,i,p) -> match s with
                                                               | "Male" -> -1.0,a,l,i,p
                                                               | "Female" -> 1.0,a,l,i,p
                                                               | _ -> failwith "Invalid sex")
//Normalize Age
let testData'' =
    let fullSet = testData' |> Seq.map(fun (s,a,l,i,p) -> a)
    testData' |> Seq.map(fun (s,a,l,i,p) -> s,minMax(fullSet,a),l,i,p)

//Transform Locale
let testData''' = testData'' |> Seq.map(fun (s,a,l,i,p) -> match l with
                                                           | "Rural" -> s,a,1.,0.,i,p
                                                           | "Suburban" -> s,a,0.,1.,i,p
                                                           | "Urban" -> s,a,-1.,-1.,i,p
                                                           | _ -> failwith "Invalid locale")
//Transform and Normalize Income
let testData'''' =
    let fullSet = testData''' |> Seq.map(fun (s,a,l0,l1,i,p) -> i)
    testData''' |> Seq.map(fun (s,a,l0,l1,i,p) -> s,a,l0,l1,minMax(fullSet,i),p)

//Transform Politics
let testData''''' = testData'''' |> Seq.map(fun (s,a,l0,l1,i,p) -> match p with
                                                                   | "Conservative" -> s,a,l0,l1,i,1.,0.,0.
                                                                   | "Liberal" -> s,a,l0,l1,i,0.,1.,0.
                                                                   | "Moderate" -> s,a,l0,l1,i,0.,0.,1.
                                                                   | _ -> failwith "Invalid politics")
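As a first pass at that more abstract refactor, here is a sketch of a general effects encoder in the spirit of McCaffrey’s EffectsEncoding (the function is mine, not from the book); the domain-specific pipeline above is what the rest of this post actually uses:

let effectsEncode (categories: string list) (value: string) =
    let n = List.length categories
    match List.tryFindIndex ((=) value) categories with
    | None -> failwithf "Invalid value: %s" value
    | Some i when n = 2 -> if i = 0 then [-1.0] else [1.0]        // two-category special case
    | Some i when i = n - 1 -> List.replicate (n - 1) (-1.0)      // last category encodes as all -1s
    | Some i -> [for j in 0 .. n - 2 -> if j = i then 1.0 else 0.0]

// effectsEncode ["Rural";"Suburban";"Urban"] "Urban" -> [-1.0; -1.0]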

When I execute the script:

image

Which is the same as McCaffrey’s.

image

Note that he used Gaussian normalization on column 2 and I did Min/Max based on his advice in the book.

 

 

TRINUG F# Analytics Prep: Part 2

I finished up the second part of the F#/analytics lab scheduled for August.  It is a continuation of going through Astborg’s F# for Quantitative Finance that we started last month.  Here is my first blog post on it.

In this lab, we are going to tackle the more advanced statistical calculations: the Black-Scholes formula, the Greeks, and Monte Carlo simulation.  Using the same solution and projects, I started a script file to figure out the Black-Scholes formula.  Astborg uses a couple of supporting functions, which I knocked out first: power and cumulativeDistribution.  I first created his function verbatim like this:

let pow x n = exp(n*log(x))

and then refactored it to make it more readable like this

let power baseNumber exponent = exp(exponent * log(baseNumber))

and then I realized it is essentially the exponentiation that already ships with FSharp.Core: pown for integer exponents and the ( ** ) operator for float exponents.

image
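The identity behind pow is exp(n * log x) = x ** n, which is easy to check in the REPL (the values are just examples):

let power baseNumber exponent = exp(exponent * log(baseNumber))

power 2.0 3.0   // 8.0 (within floating-point error)
2.0 ** 3.0      // 8.0
pown 2.0 3      // 8.0; note pown takes an int exponent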

In any event, I then attacked the cumulativeDistribution function.  I downloaded the source from his website and then refactored it so that each step is clearly laid out.  Here is the refactored function:

// a polynomial approximation of the standard normal CDF
let cumulativeDistribution (x) =
    let a1 =  0.31938153
    let a2 = -0.356563782
    let a3 =  1.781477937
    let a4 = -1.821255978
    let a5 =  1.330274429
    let pi = 3.141592654
    let l  = abs(x)
    let k  = 1.0 / (1.0 + 0.2316419 * l)

    let a1' = a1*k
    let a2' = a2*k*k
    let a3' = a3*(power k 3.0)
    let a4' = a4*(power k 4.0)
    let a5' = a5*(power k 5.0)
    let w1 = 1.0/sqrt(2.0*pi)
    let w2 = exp(-l*l/2.0)
    let w3 = a1'+a2'+a3'+a4'+a5'
    let w  = 1.0-w1*w2*w3
    if x < 0.0 then 1.0 - w else w

And here are some test values from the REPL:

image

Finally, the Black Scholes formula.  I did create a separate POCO for the input data like this:

type putCallFlag = Put | Call

type blackScholesInputData =
    {stockPrice:float;
     strikePrice:float;
     timeToExpiry:float;
     interestRate:float;
     volatility:float}

And I refactored his code to make it more readable like this:

let blackScholes(inputData:blackScholesInputData, putCallFlag:putCallFlag) =
    let sx = log(inputData.stockPrice / inputData.strikePrice)
    let rv = inputData.interestRate+inputData.volatility*inputData.volatility*0.5
    let rvt = rv*inputData.timeToExpiry
    let vt = (inputData.volatility*sqrt(inputData.timeToExpiry))
    let d1 = (sx + rvt)/vt
    let d2 = d1-vt

    match putCallFlag with
    | Put ->
        let xrt = inputData.strikePrice*exp(-inputData.interestRate*inputData.timeToExpiry)
        let cdD1 = xrt*cumulativeDistribution(-d2)
        let cdD2 = inputData.stockPrice*cumulativeDistribution(-d1)
        cdD1-cdD2
    | Call ->
        let xrt = inputData.strikePrice*exp(-inputData.interestRate*inputData.timeToExpiry)
        let cdD1 = inputData.stockPrice*cumulativeDistribution(d1)
        let cdD2 = xrt*cumulativeDistribution(d2)
        cdD1-cdD2

And since I was in the script environment, I put in test data that matches the sample that Astborg used in the book:

let inputData = {stockPrice=58.60;strikePrice=60.;timeToExpiry=0.5;interestRate=0.01;volatility=0.3}
let runBSCall = blackScholes(inputData,Call)
let runBSPut = blackScholes(inputData,Put)

And voila, the results match the book:

image

With the Black-Scholes out of the way, I then implemented the Greeks.  Note that I did add helper functions for clarity, and the results match the book:

let blackScholesDelta (inputData:blackScholesInputData, putCallFlag:putCallFlag) =
    let sx = log(inputData.stockPrice / inputData.strikePrice)
    let rv = inputData.interestRate+inputData.volatility*inputData.volatility*0.5
    let rvt = rv*inputData.timeToExpiry
    let vt = (inputData.volatility*sqrt(inputData.timeToExpiry))
    let d1 = (sx + rvt)/vt
    match putCallFlag with
    | Put -> cumulativeDistribution(d1) - 1.0
    | Call -> cumulativeDistribution(d1)

let deltaPut = blackScholesDelta(inputData, Put)
let deltaCall = blackScholesDelta(inputData, Call)

// normalDistribution is defined elsewhere in the project (not shown in this post)
let blackScholesGamma (inputData:blackScholesInputData) =
    let sx = log(inputData.stockPrice / inputData.strikePrice)
    let rv = inputData.interestRate+inputData.volatility*inputData.volatility*0.5
    let rvt = rv*inputData.timeToExpiry
    let vt = (inputData.volatility*sqrt(inputData.timeToExpiry))
    let d1 = (sx + rvt)/vt
    normalDistribution.Density(d1)

let gamma = blackScholesGamma(inputData)

let blackScholesVega (inputData:blackScholesInputData) =
    let sx = log(inputData.stockPrice / inputData.strikePrice)
    let rv = inputData.interestRate+inputData.volatility*inputData.volatility*0.5
    let rvt = rv*inputData.timeToExpiry
    let vt = (inputData.volatility*sqrt(inputData.timeToExpiry))
    let d1 = (sx + rvt)/vt
    inputData.stockPrice*normalDistribution.Density(d1)*sqrt(inputData.timeToExpiry)

let vega = blackScholesVega(inputData)

let blackScholesTheta (inputData:blackScholesInputData, putCallFlag:putCallFlag) =
    let sx = log(inputData.stockPrice / inputData.strikePrice)
    let rv = inputData.interestRate+inputData.volatility*inputData.volatility*0.5
    let rvt = rv*inputData.timeToExpiry
    let vt = (inputData.volatility*sqrt(inputData.timeToExpiry))
    let d1 = (sx + rvt)/vt
    let d2 = d1-vt
    match putCallFlag with
    | Put ->
        let ndD1 = inputData.stockPrice*normalDistribution.Density(d1)*inputData.volatility
        let ndD1' = ndD1/(2.0*sqrt(inputData.timeToExpiry))
        let rx = inputData.interestRate*inputData.strikePrice
        let rt = exp(-inputData.interestRate*inputData.timeToExpiry)
        let cdD2 = rx*rt*cumulativeDistribution(-d2)
        -(ndD1')+cdD2
    | Call ->
        let ndD1 = inputData.stockPrice*normalDistribution.Density(d1)*inputData.volatility
        let ndD1' = ndD1/(2.0*sqrt(inputData.timeToExpiry))
        let rx = inputData.interestRate*inputData.strikePrice
        let rt = exp(-inputData.interestRate*inputData.timeToExpiry)
        let cdD2 = cumulativeDistribution(d2)
        -(ndD1')-rx*rt*cdD2

let thetaPut = blackScholesTheta(inputData, Put)
let thetaCall = blackScholesTheta(inputData, Call)

let blackScholesRho (inputData:blackScholesInputData, putCallFlag:putCallFlag) =
    let sx = log(inputData.stockPrice / inputData.strikePrice)
    let rv = inputData.interestRate+inputData.volatility*inputData.volatility*0.5
    let rvt = rv*inputData.timeToExpiry
    let vt = (inputData.volatility*sqrt(inputData.timeToExpiry))
    let d1 = (sx + rvt)/vt
    let d2 = d1-vt
    match putCallFlag with
    | Put ->
        let xt = inputData.strikePrice*inputData.timeToExpiry
        let rt = exp(-inputData.interestRate*inputData.timeToExpiry)
        -xt*rt*cumulativeDistribution(-d2)
    | Call ->
        let xt = inputData.strikePrice*inputData.timeToExpiry
        let rt = exp(-inputData.interestRate*inputData.timeToExpiry)
        xt*rt*cumulativeDistribution(d2)

let rhoPut = blackScholesRho(inputData, Put)
let rhoCall = blackScholesRho(inputData, Call)

 

image
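One thing that jumps out of the listing above is that every function recomputes d1 and d2; a small helper could factor that out (a sketch against the blackScholesInputData type above):

let dTerms (inputData:blackScholesInputData) =
    let sx = log(inputData.stockPrice / inputData.strikePrice)
    let rv = inputData.interestRate + inputData.volatility * inputData.volatility * 0.5
    let vt = inputData.volatility * sqrt(inputData.timeToExpiry)
    let d1 = (sx + rv * inputData.timeToExpiry) / vt
    d1, d1 - vt

// blackScholesDelta, for example, collapses to:
let blackScholesDelta' (inputData, putCallFlag) =
    let d1, _ = dTerms inputData
    match putCallFlag with
    | Put -> cumulativeDistribution(d1) - 1.0
    | Call -> cumulativeDistribution(d1)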

Finally, I threw in the Monte Carlo, which also used a POCO:

type monteCarloInputData =
    {stockPrice:float;
     strikePrice:float;
     timeToExpiry:float;
     interestRate:float;
     volatility:float}

let priceAtMaturity (inputData:monteCarloInputData, randomValue:float) =
    let s = inputData.stockPrice
    let rv = (inputData.interestRate-inputData.volatility*inputData.volatility/2.0)
    let rvt = rv*inputData.timeToExpiry
    let vr = inputData.volatility*randomValue
    let t = sqrt(inputData.timeToExpiry)
    s*exp(rvt+vr*t)

let maturityPriceInputData = {stockPrice=58.60;strikePrice=60.0;timeToExpiry=0.5;interestRate=0.01;volatility=0.3}
priceAtMaturity(maturityPriceInputData, 10.0)

let monteCarlo(inputData: monteCarloInputData, randomValues:seq<float>) =
    randomValues
    |> Seq.map(fun randomValue -> priceAtMaturity(inputData,randomValue) - inputData.strikePrice)
    |> Seq.average

let random = new System.Random()
let rnd() = random.NextDouble()
let data = [for i in 1 .. 1000 -> rnd() * 1.0]

let monteCarloInputData = {stockPrice=58.60;strikePrice=60.0;timeToExpiry=0.5;interestRate=0.01;volatility=0.3;}
monteCarlo(monteCarloInputData,data)

image

One thing I really like about Astborg’s approach is that the Monte Carlo function does not new up the array of random numbers; rather, they are passed in.  This makes the function much more testable and is the right way to write it (IMHO).  In fact, I think that seeing “new Random()” or “DateTime.Now” hard-coded into functions is an anti-pattern that is all too common.

With the last of the functions done in the script file, I moved them into the .fs file and created covering unit tests based on the sample data that I ran in the REPL:

[TestMethod]
public void PowerUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = 8;
    Double actual = Math.Round(calculations.Power(2.0, 3.0), 0);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void CumulativeDistributionUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = .84134;
    Double actual = Math.Round(calculations.CumulativeDistribution(1.0), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesCallUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = 4.4652;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholes(inputData, PutCallFlag.Call), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesPutUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = 5.56595;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholes(inputData, PutCallFlag.Put), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void DaysToYearsUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = .08214;
    Double actual = Math.Round(calculations.DaysToYears(30), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesDeltaCallUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = .50732;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholesDelta(inputData, PutCallFlag.Call), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesDeltaPutUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = -.49268;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholesDelta(inputData, PutCallFlag.Put), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesGammaUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = .39888;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholesGamma(inputData), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesVegaUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = 16.52798;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholesVega(inputData), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesThetaCallUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = -5.21103;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholesTheta(inputData, PutCallFlag.Call), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesThetaPutUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = -4.61402;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholesTheta(inputData, PutCallFlag.Put), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesRhoCallUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = 12.63174;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholesRho(inputData, PutCallFlag.Call), 5);
    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void BlackScholesRhoPutUsingValidData_ReturnsExpected()
{
    var calculations = new Calculations();
    Double expected = -17.21863;
    var inputData = new BlackScholesInputData(58.6, 60.0, .5, .01, .3);
    Double actual = Math.Round(calculations.BlackScholesRho(inputData, PutCallFlag.Put), 5);
    Assert.AreEqual(expected, actual);
}
  128.  
  129. [TestMethod]
  130. public void PriceAtMaturityUsingValidData_ReturnsExpected()
  131. {
  132.     var calculations = new Calculations();
  133.     Double expected = 480.36923;
  134.     var inputData = new MonteCarloInputData(58.6, 60.0, .5, .01, .3);
  135.     Double actual = Math.Round(calculations.PriceAtMaturity(inputData, 10.0), 5);
  136.     Assert.AreEqual(expected, actual);
  137. }
  138.  
  139. [TestMethod]
  140. public void MonteCarloUsingValidData_ReturnsExpected()
  141. {
  142.     var calculations = new Calculations();
  143.     var inputData = new MonteCarloInputData(58.6, 60.0, .5, .01, .3);
  144.     var random = new System.Random();
  145.     List<Double> randomData = new List<double>();
  146.     for (int i = 0; i < 1000; i++)
  147.     {
  148.         randomData.Add(random.NextDouble());
  149.     }
  150.  
  151.     Double actual = Math.Round(calculations.MonteCarlo(inputData, randomData), 5);
  152.     var greaterThanFour = actual > 4.0;
  153.     var lessThanFive = actual < 5.0;
  154.  
  155.     Assert.AreEqual(true, greaterThanFour);
  156.     Assert.AreEqual(true, lessThanFive);
  157. }

 

With all of the tests running green, I then turned my attention to the UI.  I created more real estate on the MainWindow and added some additional data structures for the results of the analytics that lend themselves to charting and graphing.  For example:

public class GreekData
{
    public Double StrikePrice { get; set; }
    public Double DeltaCall { get; set; }
    public Double DeltaPut { get; set; }
    public Double Gamma { get; set; }
    public Double Vega { get; set; }
    public Double ThetaCall { get; set; }
    public Double ThetaPut { get; set; }
    public Double RhoCall { get; set; }
    public Double RhoPut { get; set; }
}

And in the code behind of the MainWindow, I added some calcs based on the prior code that was already in it:

var theGreeks = new List<GreekData>();
for (int i = 0; i < 5; i++)
{
    var greekData = new GreekData();
    greekData.StrikePrice = closestDollar - i;
    theGreeks.Add(greekData);
    greekData = new GreekData();
    greekData.StrikePrice = closestDollar + i;
    theGreeks.Add(greekData);
}
theGreeks.Sort((greek1, greek2) => greek1.StrikePrice.CompareTo(greek2.StrikePrice));

foreach (var greekData in theGreeks)
{
    var inputData =
        new BlackScholesInputData(adjustedClose, greekData.StrikePrice, .5, .01, .3);
    greekData.DeltaCall = calculations.BlackScholesDelta(inputData, PutCallFlag.Call);
    greekData.DeltaPut = calculations.BlackScholesDelta(inputData, PutCallFlag.Put);
    greekData.Gamma = calculations.BlackScholesGamma(inputData);
    greekData.RhoCall = calculations.BlackScholesRho(inputData, PutCallFlag.Call);
    greekData.RhoPut = calculations.BlackScholesRho(inputData, PutCallFlag.Put);
    greekData.ThetaCall = calculations.BlackScholesTheta(inputData, PutCallFlag.Call);
    greekData.ThetaPut = calculations.BlackScholesTheta(inputData, PutCallFlag.Put);
    greekData.Vega = calculations.BlackScholesVega(inputData);
}

this.TheGreeksDataGrid.ItemsSource = theGreeks;

var blackScholes = new List<BlackScholesData>();
for (int i = 0; i < 5; i++)
{
    var blackScholesData = new BlackScholesData();
    blackScholesData.StrikePrice = closestDollar - i;
    blackScholes.Add(blackScholesData);
    blackScholesData = new BlackScholesData();
    blackScholesData.StrikePrice = closestDollar + i;
    blackScholes.Add(blackScholesData);
}
blackScholes.Sort((bsmc1, bsmc2) => bsmc1.StrikePrice.CompareTo(bsmc2.StrikePrice));

var random = new System.Random();
List<Double> randomData = new List<double>();
for (int i = 0; i < 1000; i++)
{
    randomData.Add(random.NextDouble());
}

foreach (var blackScholesMonteCarlo in blackScholes)
{
    var blackScholesInputData =
        new BlackScholesInputData(adjustedClose, blackScholesMonteCarlo.StrikePrice, .5, .01, .3);
    var monteCarloInputData =
        new MonteCarloInputData(adjustedClose, blackScholesMonteCarlo.StrikePrice, .5, .01, .3);

    blackScholesMonteCarlo.Call = calculations.BlackScholes(blackScholesInputData, PutCallFlag.Call);
    blackScholesMonteCarlo.Put = calculations.BlackScholes(blackScholesInputData, PutCallFlag.Put);
    blackScholesMonteCarlo.MonteCarlo = calculations.MonteCarlo(monteCarloInputData, randomData);
}

this.BlackScholesDataGrid.ItemsSource = blackScholes;

And Whammo, the UI.

 

image

Fortunately, Conrad D’Cruz, a member of TRINUG and an options trader, is going to explain what the heck we are looking at when the SIG gets together again.

 

Using Subsets for Association Rule Learning

I finished up writing the association rule program from MSDN in F# last week.  One of the things bothering me about the way I implemented the algorithms is that I hard-coded the combinations (antecedent and consequent) from the item-sets:

static member GetCombinationsForDouble(itemSet: int[]) =
    let combinations =  new List<int[]*int[]*int[]>()
    combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1]|])
    combinations

static member GetCombinationsForTriple(itemSet: int[]) =
    let combinations =  new List<int[]*int[]*int[]>()
    combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1];itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[1]|],[|itemSet.[0];itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[2]|],[|itemSet.[0];itemSet.[1]|])
    combinations.Add(itemSet, [|itemSet.[0];itemSet.[1]|],[|itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[0];itemSet.[2]|],[|itemSet.[1]|])
    combinations.Add(itemSet, [|itemSet.[1];itemSet.[2]|],[|itemSet.[0]|])
    combinations

I thought it would be a fun exercise to make a function that returns the combinations for an item-set of any length.  My first several attempts failed because I started off with the wrong vocabulary: I spent several days trying to figure out how to create all of the combinations and/or permutations from the itemSet.  It then hit me that what I really wanted was all subsets, and what do you know, there are some excellent examples out there.

Since I was going to use the yield and yield-bang (yield!) method of calculating the subsets inside my class, I first needed to remove the rec keyword and just let the class call itself.

static member Subsets s =
    set [
        yield s
        for e in s do
            yield! AssociationRuleProgram2.Subsets (Set.remove e s) ]
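A quick FSI sanity check (my own) of what Subsets returns:

AssociationRuleProgram2.Subsets (set [3; 4; 7])
// val it : Set<Set<int>> = all eight subsets of {3;4;7},
// including the empty set and the full set itself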

I then needed a way of translating the itemSet, which is an int array, into a set and back again.  Fortunately, the Set module has ofArray and toArray functions, so I wrote my code exactly the way I just described the problem:

static member GetAntcentAndConsequent(itemSet: int[]) =
    let combinations =  new List<int[]*int[]*int[]>()
    let itemSet' = Set.ofArray itemSet
    let subSets = AssociationRuleProgram2.Subsets itemSet'
    let subSets' = Set.toArray subSets
    let subSets'' = subSets' |> Array.map(fun s -> Set.toArray s)
    let subSets''' = subSets'' |> Array.map(fun s -> Seq.toArray s, AssociationRuleProgram2.GetAntcentAndConsequent s)

 

Note that I had to call toArray twice because Subsets returns a Set<Set<int>>.

In any event, I then needed a way of splitting the itemSet into antecedents and consequents (together called a combination) based on the current subset.  I toyed around with a couple of different ways of solving the problem before I stumbled upon one that makes a lot of sense to me.  I changed the itemSet from an array of int to an array of int*bool tuples: if the item is in the subset, the bool flag is true; if not, it is false.  Then I apply a Seq.filter to the array and separate it out into antecedents and consequents.

static member GetCombination array subArray =
    let array' = array |> Seq.map(fun i -> i, subArray |> Array.exists(fun j -> i = j))
    let antecedent = array' |> Seq.filter(fun (i,j) -> j = true) |> Seq.toArray
    let consquent = array' |> Seq.filter(fun (i,j) -> j = false) |> Seq.toArray
    let antecedent' = antecedent |> Seq.map(fun (i,j) -> i)
    let consquent' = consquent |> Seq.map(fun (i,j) -> i)
    Seq.toArray antecedent', Seq.toArray consquent'
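A quick FSI check (my own call) of the split, using McCaffrey’s 3,4,7 item-set with {4,7} as the subset:

AssociationRuleProgram2.GetCombination [|3; 4; 7|] [|4; 7|]
// val it : int [] * int [] = ([|4; 7|], [|3|])
// antecedent {4,7}, consequent {3}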

The major downside of this approach is that I am using Array.exists for my filter flag, so if the same value appears more than once in the itemSet, it does not work.  However, the original example had each itemSet containing unique values, so I think I am OK.

So with these two methods, I now have a way of dealing with item-sets of any length.  Interestingly, the amount of code (even with my verbose F#) is significantly less than the C# equivalent and much closer to how I actually think.
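As a sketch of how the two methods might compose (my own wiring, assuming both members live on AssociationRuleProgram2 as above): take every proper, non-empty subset of the item-set and feed it to GetCombination:

// For each proper, non-empty subset of the item-set, produce an
// antecedent/consequent pair. A sketch only; the name is my own.
let getAllCombinations (itemSet: int[]) =
    AssociationRuleProgram2.Subsets (Set.ofArray itemSet)
    |> Set.toArray
    |> Array.map Set.toArray
    |> Array.filter (fun s -> s.Length > 0 && s.Length < itemSet.Length)
    |> Array.map (fun s -> AssociationRuleProgram2.GetCombination itemSet s)

For [|3; 4; 7|], this yields the same six antecedent/consequent pairs that the hard-coded GetCombinationsForTriple produced.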

 

 

 

Association Rule Problem: Part 3

After spending a couple of weeks working through the imperative code, I decided to approach the problem from an F#/functional point of view.  Going back to the original article, there are several steps that McCaffrey walks through:

  • Get a series of transactions
  • Get the frequent item-sets for the transactions
  • For each item-set, get all possible combinations.  Each combination is broken into an antecedent and consequent
  • Count the frequency of each antecedent across all transactions
  • If the confidence of the combination (the item-set’s frequency divided by its antecedent’s frequency) is greater than the minimum confidence level, include it in the final set (see the worked check just below)
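
To make that last bullet concrete, here is a worked check using the counts from McCaffrey’s example (the same counts the unit tests later in this post verify): the item-set {3,4,7} appears in 3 of the 10 transactions, and the antecedent {3} appears in 6 of them, so the candidate rule {3} -> {4,7} carries a confidence of:

// item-set count divided by antecedent count, per the last bullet
let confidence = float 3 / float 6   // 0.5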

For the purposes of this article, Step #1 and Step #2 were already done; my code starts with Step #3.  Instead of for..eaching and if..thening my way through the item-sets, I decided to look at how permutations and combinations are done in F#.  Interestingly, one of the first articles on permutations and combinations that Google turns up is from McCaffrey in MSDN from four years ago.  Unfortunately, that article was of limited use because the code is decidedly non-functional, so it might as well have been written in C# (this was pointed out in the comments).  Going to Stack Overflow, there are plenty of good examples of getting combinations in F# there and elsewhere.  After playing with the code samples for a bit (my favorite one was this), it hit me that the ordinal positions are the same for any array of a given size.  Going back to McCaffrey’s example, there are only item-sets of length 2 and 3.  Therefore, I can hard-code the results and leave the general calculation for another time.

static member GetCombinationsForDouble(itemSet: int[]) =
    let combinations =  new List<int[]*int[]*int[]>()
    combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1]|])
    combinations

static member GetCombinationsForTriple(itemSet: int[]) =
    let combinations =  new List<int[]*int[]*int[]>()
    combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1];itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[1]|],[|itemSet.[0];itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[2]|],[|itemSet.[0];itemSet.[1]|])
    combinations.Add(itemSet, [|itemSet.[0];itemSet.[1]|],[|itemSet.[2]|])
    combinations.Add(itemSet, [|itemSet.[0];itemSet.[2]|],[|itemSet.[1]|])
    combinations.Add(itemSet, [|itemSet.[1];itemSet.[2]|],[|itemSet.[0]|])
    combinations

I used a tuple to carry the item-set along with its antecedent array and consequent array.  I then spun up a unit test to compare results against McCaffrey’s detailed example:

[TestMethod]
public void GetValuesForATriple_ReturnsExpectedValue()
{
    var expected = new List<Tuple<int[], int[]>>();
    expected.Add(Tuple.Create<int[], int[]>(new int[1] { 3 }, new int[2] { 4, 7 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[1] { 4 }, new int[2] { 3, 7 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[1] { 7 }, new int[2] { 3, 4 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[2] { 3, 4 }, new int[1] { 7 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[2] { 3, 7 }, new int[1] { 4 }));
    expected.Add(Tuple.Create<int[], int[]>(new int[2] { 4, 7 }, new int[1] { 3 }));

    var itemSet = new int[3] { 3, 4, 7 };
    var actual = FS.AssociationRuleProgram2.GetCombinationsForTriple(itemSet);

    Assert.AreEqual(expected.Count, actual.Count);
}

A couple of things to note about the unit test:

1) The rules about variable naming and whatnot that apply in business application development quickly fall down when applied to scientific computing.  For example, there is no way that this

List<Tuple<int[], int[]>> expected = new List<Tuple<int[], int[]>>();

is more readable than this

var expected = new List<Tuple<int[], int[]>>();

In fact, it is less readable.  The use of complex data structures and algorithms forces a different set of naming conventions.  Applying FxCop or other framework naming conventions to scientific programming is as useful as applying scientific naming conventions to framework development.  If it is a screw, use a screwdriver.  If it is a nail, use a hammer…

2) I don’t have a good way of comparing a tuple of paired arrays for equivalence; there is certainly nothing out of the box in Microsoft.VisualStudio.TestTools.UnitTesting.  I toyed (briefly) with creating a method to compare arrays for equivalence, but I did not in the interest of time.  That would be a welcome addition to the testing namespace.
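For what it is worth, F# itself side-steps this particular problem: tuples and arrays compare structurally, so had these tests been written in F#, a plain equality check would have sufficed.  A quick FSI check:

([|3; 4|], [|7|]) = ([|3; 4|], [|7|])   // true: F# tuples and arrays use structural equality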

Sure enough, the unit tests using McCaffrey’s data all ran green.

With step 3 knocked out, I now needed to determine the frequency of the antecedent in the transactions list.  This step is better broken down into a couple of sub-steps.  I used McCaffrey’s detailed example of 3,4,7 as proof of correctness in my unit tests:

image

I need a way of taking the antecedent of 3 and comparing it to all transactions (which are arrays) to see how often it appears.  As an additional layer of complexity, that 3 is not an int, it is an array (albeit an array of one).  I could not find an equivalent question on StackOverflow (meaning I am probably asking the wrong question), so I went ahead and made a mental model where I map the tryFindIndex function against each item of the subset to see if that value is in the original set.  The result is a tuple with the original value and the ordinal position in the set.  The key thing is that if the item was not found, it returns None.  So I just have to filter on that flag, and if the result of the filter has a length greater than zero, I know that something was not found and the function can return false.
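A quick FSI illustration (my own) of the Some/None behavior that drives the flag:

[|1; 3; 4; 7|] |> Seq.tryFindIndex (fun j -> j = 3)   // Some 1: found at ordinal position 1
[|1; 4; 7|] |> Seq.tryFindIndex (fun j -> j = 3)      // None: 3 is not in the set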

image


In code, it pretty much looks like the way I just described it:

static member SetContainsSubset(set: int[], subset: int[]) =
    let notIncluded = subset
                        |> Seq.map(fun i -> i, set |> Seq.tryFindIndex(fun j -> j = i))
                        |> Seq.filter(fun (i,j) -> j = None)
                        |> Seq.toArray
    if notIncluded.Length > 0 then false else true

And I generated my unit tests out of the example too: 

[TestMethod]
public void SetContainsSubsetUsingMatched_ReturnsTrue()
{
    var set = new int[4] { 1, 3, 4, 7 };
    var subset = new int[3] { 3, 4, 7 };

    Boolean expected = true;
    Boolean actual = FS.AssociationRuleProgram2.SetContainsSubset(set, subset);

    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void SetContainsSubsetUsingUnMatched_ReturnsFalse()
{
    var set = new int[3] { 1, 4, 7 };
    var subset = new int[3] { 3, 4, 7 };

    Boolean expected = false;
    Boolean actual = FS.AssociationRuleProgram2.SetContainsSubset(set, subset);

    Assert.AreEqual(expected, actual);
}

With this supporting function ready, I can then apply it to an array and see how many trues I get; that is the Count value in Figure 2 of the article.  Seq.map fits this task perfectly.

static member ItemSetCountInTransactions(itemSet: int[], transactions: List<int[]>) =
    transactions
        |> Seq.map(fun t -> t, AssociationRuleProgram2.SetContainsSubset(t,itemSet))
        |> Seq.filter(fun (t,f) -> f = true)
        |> Seq.length

And the subsequent unit test also runs green:

[TestMethod]
public void CountItemSetInTransactions_ReturnsExpected()
{
    List<int[]> transactions = new List<int[]>();
    transactions.Add(new int[] { 0, 3, 4, 11 });
    transactions.Add(new int[] { 1, 4, 5 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 0, 5 });
    transactions.Add(new int[] { 3, 5, 9 });
    transactions.Add(new int[] { 2, 3, 4, 7 });
    transactions.Add(new int[] { 2, 5, 8 });
    transactions.Add(new int[] { 0, 1, 2, 5, 10 });
    transactions.Add(new int[] { 2, 3, 5, 6, 7, 9 });

    var itemSet = new int[1] { 3 };

    Int32 expected = 6;
    Int32 actual = FS.AssociationRuleProgram2.ItemSetCountInTransactions(itemSet, transactions);

    Assert.AreEqual(expected, actual);
}

So with this in place, I am ready for the next column, the confidence column.  McCaffrey used a numerator of 3, which is shown here:

image

So I assume that this count is the number of times 3,4,7 shows up in the transaction set.  If so, the supporting function ItemSetCountInTransactions can be reused.  I created a unit test and it ran green:

[TestMethod]
public void CountItemSetInTransactionsUsing347_ReturnsThree()
{
    List<int[]> transactions = new List<int[]>();
    transactions.Add(new int[] { 0, 3, 4, 11 });
    transactions.Add(new int[] { 1, 4, 5 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 3, 4, 6, 7 });
    transactions.Add(new int[] { 0, 5 });
    transactions.Add(new int[] { 3, 5, 9 });
    transactions.Add(new int[] { 2, 3, 4, 7 });
    transactions.Add(new int[] { 2, 5, 8 });
    transactions.Add(new int[] { 0, 1, 2, 5, 10 });
    transactions.Add(new int[] { 2, 3, 5, 6, 7, 9 });

    var itemSet = new int[3] { 3, 4, 7 };

    Int32 expected = 3;
    Int32 actual = FS.AssociationRuleProgram2.ItemSetCountInTransactions(itemSet, transactions);

    Assert.AreEqual(expected, actual);
}

So the last piece was to put it all together in the GetHighConfRules method.  I did not change the signature:

static member GetHighConfRules(frequentItemSets: List<int[]>, transactions: List<int[]>, minConfidencePct:float) =
    let returnValue = new List<Rule>()
    let combinations = frequentItemSets |> Seq.collect (fun a -> AssociationRuleProgram2.GetCombinations(a))
    combinations
        |> Seq.map(fun (i,a,c) -> i,a,c,AssociationRuleProgram2.ItemSetCountInTransactions(i,transactions))
        |> Seq.map(fun (i,a,c,fisc) -> a,c,fisc,AssociationRuleProgram2.ItemSetCountInTransactions(a,transactions))
        |> Seq.map(fun (a,c,fisc,cc) -> a,c,float fisc/float cc)
        |> Seq.filter(fun (a,c,cp) -> cp > minConfidencePct)
        |> Seq.iter(fun (a,c,cp) -> returnValue.Add(new Rule(a,c,cp)))
    returnValue
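
One note on that listing: the Rule type is not defined in this post; it comes from the surrounding project.  A minimal, hypothetical stand-in (my own, since GetHighConfRules only needs a constructor taking the antecedent, the consequent, and the confidence percentage) would be:

// Hypothetical stand-in for the project's Rule type; the real one lives elsewhere.
type Rule(antecedent: int[], consequent: int[], confidencePct: float) =
    member this.Antecedent = antecedent
    member this.Consequent = consequent
    member this.ConfidencePct = confidencePct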

 

Note that I did add a helper function to get the combinations based on the length of the array:

static member GetCombinations(itemSet: int[]) =
    if itemSet.Length = 2 then AssociationRuleProgram2.GetCombinationsForDouble(itemSet)
    else AssociationRuleProgram2.GetCombinationsForTriple(itemSet)

And when I run that from the console:

image

So this is pretty close.  McCaffrey allows for inversion of the numbers in the array (3:4 is not the same as 4:3) and I do not, but his supporting detail does not show that, so I am not sure which is the correct answer.  In any event, this is pretty good.  The F# code can be refactored so that all combinations can be generated from an array of any length.  In the meantime, here are all 43 lines of the program.

open System
open System.Collections.Generic

type AssociationRuleProgram2 =

    static member GetHighConfRules(frequentItemSets: List<int[]>, transactions: List<int[]>, minConfidencePct:float) =
        let returnValue = new List<Rule>()
        let combinations = frequentItemSets |> Seq.collect (fun a -> AssociationRuleProgram2.GetCombinations(a))
        combinations
            |> Seq.map(fun (i,a,c) -> i,a,c,AssociationRuleProgram2.ItemSetCountInTransactions(i,transactions))
            |> Seq.map(fun (i,a,c,fisc) -> a,c,fisc,AssociationRuleProgram2.ItemSetCountInTransactions(a,transactions))
            |> Seq.map(fun (a,c,fisc,cc) -> a,c,float fisc/float cc)
            |> Seq.filter(fun (a,c,cp) -> cp > minConfidencePct)
            |> Seq.iter(fun (a,c,cp) -> returnValue.Add(new Rule(a,c,cp)))
        returnValue

    static member ItemSetCountInTransactions(itemSet: int[], transactions: List<int[]>) =
        transactions
            |> Seq.map(fun t -> t, AssociationRuleProgram2.SetContainsSubset(t,itemSet))
            |> Seq.filter(fun (t,f) -> f = true)
            |> Seq.length

    static member SetContainsSubset(set: int[], subset: int[]) =
        let notIncluded = subset
                            |> Seq.map(fun i -> i, set |> Seq.tryFindIndex(fun j -> j = i))
                            |> Seq.filter(fun (i,j) -> j = None)
                            |> Seq.toArray
        if notIncluded.Length > 0 then false else true

    static member GetCombinations(itemSet: int[]) =
        if itemSet.Length = 2 then AssociationRuleProgram2.GetCombinationsForDouble(itemSet)
        else AssociationRuleProgram2.GetCombinationsForTriple(itemSet)

    static member GetCombinationsForDouble(itemSet: int[]) =
        let combinations =  new List<int[]*int[]*int[]>()
        combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1]|])
        combinations

    static member GetCombinationsForTriple(itemSet: int[]) =
        let combinations =  new List<int[]*int[]*int[]>()
        combinations.Add(itemSet, [|itemSet.[0]|],[|itemSet.[1];itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[1]|],[|itemSet.[0];itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[2]|],[|itemSet.[0];itemSet.[1]|])
        combinations.Add(itemSet, [|itemSet.[0];itemSet.[1]|],[|itemSet.[2]|])
        combinations.Add(itemSet, [|itemSet.[0];itemSet.[2]|],[|itemSet.[1]|])
        combinations.Add(itemSet, [|itemSet.[1];itemSet.[2]|],[|itemSet.[0]|])
        combinations

Note how the code in the GetHighConfRules function matches almost one-for-one with the bullet points at the beginning of the post.  F# is a language where the code follows how you think, not the other way around.  Also note how the 43 lines of F# compare to the 136 lines of code in the C# example: less noise, more signal.