Combining Wake County Real Estate Lookup with Wake County School Assignment

As a follow-up to this post and this post, I want to combine looking up Wake County real estate valuations with Wake County school assignments.  The matching value between the two datasets is the house address.

The first thing I did was create a new script file in the project.  I then added a reference to the script that does the WCPSS lookup.  I then added a JSON type provider that will serve as the type for the Wake County real estate valuation data that was stored previously in a DocumentDb instance.

1 #r "../packages/FSharp.Data.2.1.1/lib/net40/FSharp.Data.dll" 2 #r "../packages/Microsoft.Azure.Documents.Client.0.9.2-preview/lib/net40/Microsoft.Azure.Documents.Client.dll" 3 #r "../packages/Newtonsoft.Json.4.5.11/lib/net40/Newtonsoft.Json.dll" 4 5 #load "SchoolAssignments.fsx" 6 7 open System 8 open System.IO 9 open FSharp.Data 10 open System.Linq 11 open SchoolAssignments 12 open Microsoft.Azure.Documents 13 open Microsoft.Azure.Documents.Client 14 open Microsoft.Azure.Documents.Linq 15 16 type HouseValuation = JsonProvider<"../data/HouseValuationSample.json">

The house valuation json looks like this:

{
  "index": 1,
  "addressOne": "1506 WAKE FOREST RD ",
  "addressTwo": "RALEIGH NC 27604-1331",
  "addressThree": " ",
  "assessedValue": "$34,848",
  "id": "c0e931de-68b8-452e-8365-66d3a4a93483",
  "_rid": "pmVVALZMZAEBAAAAAAAAAA==",
  "_ts": 1423934277,
  "_self": "dbs/pmVVAA==/colls/pmVVALZMZAE=/docs/pmVVALZMZAEBAAAAAAAAAA==/",
  "_etag": "\"0000c100-0000-0000-0000-54df83450000\"",
  "_attachments": "attachments/"
}

 

The first function pulls the data from DocumentDb and deserializes it into an instance of that type:

let getPropertyValue(id: int)=
    let endpointUrl = ""
    let authKey = ""
    let client = new DocumentClient(new Uri(endpointUrl), authKey)
    let database = client.CreateDatabaseQuery().Where(fun db -> db.Id = "wakecounty").ToArray().FirstOrDefault()
    let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "taxinformation").ToArray().FirstOrDefault()
    let documentLink = collection.SelfLink
    let queryString = "SELECT * FROM taxinformation WHERE taxinformation.index = " + id.ToString()
    let query = client.CreateDocumentQuery(documentLink,queryString)
    let firstValue = query |> Seq.head
    //wrapped in Some so the result lines up with the option-taking functions below
    Some (HouseValuation.Parse(firstValue.ToString()))

The next function uses the school lookup script to pull the data from the WCPSS site.  The only real gotcha was that the space delimiter (char 32) was not the only character splitting the address: the WCPSS site also adds in a hard break (char 160).  It took me about an hour to figure out why the address was not breaking into an array of words when splitting on " ".  <sigh>
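To make the gotcha concrete, here is a tiny repro that is not from the original post (the address value is made up):

//char 160 is the non-breaking space the WCPSS page inserts
let nbsp = (char)160
let address = "1506" + string nbsp + "WAKE FOREST RD"

let broken = address.Split(' ')                         //3 tokens; "1506" and "WAKE" stay glued together
let correct = address.Split([|(char)32;(char)160|])     //4 tokens: "1506"; "WAKE"; "FOREST"; "RD"

With both characters in the delimiter array, the function below splits cleanly.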

let createSchoolAssignmentSearchCriteria(houseValuation: option<HouseValuation.Root>) =
    match houseValuation.IsSome with
    | true -> let deliminators = [|(char)32;(char)160|]
              let addressOneTokens = houseValuation.Value.AddressOne.Split(deliminators)
              let streetNumber = addressOneTokens.[0]
              let streetTemplateValue = addressOneTokens.[1]
              let streetName = addressOneTokens.[1..] |> Array.reduce(fun acc t -> acc + "+" + t)
              let addressTwoTokens = houseValuation.Value.AddressTwo.Split(deliminators)
              let city = addressTwoTokens.[0]
              let streetName' = streetName + city
              Some {SearchCriteria.streetTemplateValue=streetTemplateValue;
                    streetName=streetName';
                    streetNumber=streetNumber;}
    | false -> None

In any event, the last piece was to take the value and push it back up to another DocumentDb collection:

let writeSchoolAssignmentToDocumentDb(houseAssignment:option<HouseAssignment>) =
    match houseAssignment.IsSome with
    | true ->
        let endpointUrl = ""
        let authKey = ""
        let client = new DocumentClient(new Uri(endpointUrl), authKey)
        let database = client.CreateDatabaseQuery().Where(fun db -> db.Id = "wakecounty").ToArray().FirstOrDefault()
        let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "houseassignment").ToArray().FirstOrDefault()
        let documentLink = collection.SelfLink
        client.CreateDocumentAsync(documentLink, houseAssignment.Value) |> ignore
    | false -> ()

With that in place, the final function puts it all together:

let createHouseAssignment(id:int)=
    let houseValuation = getPropertyValue(id)
    let schools = houseValuation
                  |> createSchoolAssignmentSearchCriteria
                  |> createSearchCriteria'
                  |> createPage2QueryString
                  |> getSchoolData
    match schools.IsSome with
    | true -> Some {houseIndex=houseValuation.Value.Index; schools=schools.Value}
    | false -> None
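The driver at the end of this post calls a generateHouseAssignment function that is not shown; presumably it just composes createHouseAssignment with writeSchoolAssignmentToDocumentDb.  A minimal sketch of that assumption:

//assumed wiring - the post never shows generateHouseAssignment
let generateHouseAssignment(id:int) =
    id
    |> createHouseAssignment
    |> writeSchoolAssignmentToDocumentDb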

and now we have an end-to-end way of combining the content on two different sites:

//#time
//[1..100] |> Seq.iter(fun id -> generateHouseAssignment id)

gives this:

imageimage

You can see the gist here

Parsing Wake County School System Attendance Assignment Site With F#

As a follow-up to this post, I then turned my attention to parsing the Wake County Public School System attendance assignment site.  If you are not familiar, large school districts in America have a concept of 'nodes', where a child is assigned to a school pyramid (elementary, middle, and high school) based on their home address.  This gives the school assignment tremendous power, because a house's value is directly tied to how "good" (real or perceived) its assigned school pyramid is.  WCPSS has a site here where you can enter your address and find out the school pyramid.

Since there is not a public API or even a publicly available dataset, I decided to see if I could screen-scrape the site.  The first challenge is that you need to navigate through two pages to get to your answer.  Here is the Fiddler trace:

image

The first mistake you will notice is that they are using PHP.  The second is that they are using the same URI and parameterizing the requests via the form values:

image

Finally, their third mistake is that the pages come back in an inconsistent way, making the DOM traversal more challenging.

Undaunted, I fired up Visual Studio. Because there are two pages that need to be used, I imported both of them as models for the Html type provider:

image

I then pulled out the form query strings and placed them into some values.  The code so far:

1 #r "../packages/FSharp.Data.2.1.1/lib/net40/FSharp.Data.dll" 2 3 open System.Net 4 open FSharp.Data 5 6 type context = HtmlProvider<"../data/HouseSearchSample01.html"> 7 type context' = HtmlProvider<"../data/HouseSearchSample02.html"> 8 9 let uri = "http://wwwgis2.wcpss.net/addressLookup/index.php" 10 let streetLookup = "StreetTemplateValue=STRATH&StreetName=Strathorn+Dr+Cary&StreetNumber=904&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage" 11 let streetLookup' = "SelectAssignment%7C2014%7CCURRENT=2014-15&DefaultAction=SelectAssignment%7C2014%7CCURRENT&DefaultAction=SelectAssignment%7C2015%7CCURRENT&CatchmentCode=CA+0198.2&StreetName=Strathorn+Dr+Cary&StreetTemplateValue=STRATH&StreetNumber=904&StreetZipCode=27519" 12

Skipping the first page, I decided to make a request and see if I could get the school information out of the DOM.  It worked well enough, but you can see the immediate problem: the page's structure varies, so just targeting the nth element of the table will not work.

let webClient = new WebClient()
webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded")
let result = webClient.UploadString(uri,"POST",streetLookup')
let body = context'.Parse(result).Html.Body()

let tables = body.Descendants("TABLE") |> Seq.toList
let schoolTable = tables.[0]
let schoolRows = schoolTable.Descendants("TR") |> Seq.toList
let elementaryDatas = schoolRows.[0].Descendants("TD") |> Seq.toList
let elementarySchool = elementaryDatas.[1].InnerText()
let middleSchoolDatas = schoolRows.[1].Descendants("TD") |> Seq.toList
let middleSchool = middleSchoolDatas.[1].InnerText()
//Need to skip for the enrollment cap message
let highSchoolDatas = schoolRows.[3].Descendants("TD") |> Seq.toList
let highSchool = highSchoolDatas.[1].InnerText()

 

image

I decided to take the dog for a walk, and that time away from the keyboard was very helpful because I realized that although the table is not consistent, I don't need it to be for my purposes.  All I need are the school names for a given address.  What I need to do is remove all of the noise and just find the rows of the table with useful data:

let webClient = new WebClient()
webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded")
let result = webClient.UploadString(uri,"POST",streetLookup')
let body = context'.Parse(result).Html.Body()

let tables = body.Descendants("TABLE") |> Seq.toList
let schoolTable = tables.[0]
let schoolRows = schoolTable.Descendants("TR") |> Seq.toList
let schoolData = schoolRows |> Seq.collect(fun r -> r.Descendants("TD")) |> Seq.toList
let schoolData' = schoolData |> Seq.map(fun d -> d.InnerText().Trim())
let schoolData'' = schoolData' |> Seq.filter(fun s -> s <> System.String.Empty)

//Strip out noise
let removeNonEssentialData (s:string) =
    let markerPosition = s.IndexOf('(')
    match markerPosition with
    | -1 -> s
    | _ -> s.Substring(0,markerPosition).Trim()

let schoolData''' = schoolData'' |> Seq.map(fun s -> removeNonEssentialData(s))

let unimportantPhrases = [|"Neighborhood Busing";"This school has an enrollment cap"|]
let containsUnimportantPhrase (s:string) =
    unimportantPhrases |> Seq.exists(fun p -> s.Contains(p))

let schoolData'''' = schoolData''' |> Seq.filter(fun s -> containsUnimportantPhrase(s) = false)

schoolData''''

And Boom goes the dynamite:

image

So working backwards, I need to parse the first page to get the CatchmentCode for an address, build the second page's form data, and then parse the results.  Parsing the first page for the CatchmentCode was very straightforward:

let result = webClient.UploadString(uri,"POST",streetLookup)
let body = context.Parse(result).Html.Body()
let inputs = body.Descendants("INPUT") |> Seq.toList

image

let catchmentCode = inputs' |> Seq.filter(fun (n,v) -> n = "CatchmentCode")
                            |> Seq.map(fun (n,v) -> v)
                            |> Seq.head
let streetName = inputs' |> Seq.filter(fun (n,v) -> n = "StreetName")
                         |> Seq.map(fun (n,v) -> v)
                         |> Seq.head
let streetTemplateValue = inputs' |> Seq.filter(fun (n,v) -> n = "StreetTemplateValue")
                                  |> Seq.map(fun (n,v) -> v)
                                  |> Seq.head
let streetNumber = inputs' |> Seq.filter(fun (n,v) -> n = "StreetNumber")
                           |> Seq.map(fun (n,v) -> v)
                           |> Seq.head
let streetZipCode = inputs' |> Seq.filter(fun (n,v) -> n = "StreetZipCode")
                            |> Seq.map(fun (n,v) -> v)
                            |> Seq.head

 

image

So the answer is there; the code just sucks.  I refactored it into a single function:

let getValueFromInput(nameToFind:string) =
    inputs' |> Seq.filter(fun (n,v) -> n = nameToFind)
            |> Seq.map(fun (n,v) -> v)
            |> Seq.head

let catchmentCode = getValueFromInput("CatchmentCode")
let streetName = getValueFromInput("StreetName")
let streetTemplateValue = getValueFromInput("StreetTemplateValue")
let streetNumber = getValueFromInput("StreetNumber")
let streetZipCode = getValueFromInput("StreetZipCode")

With page 1 out of the way, I was ready to start altering the form query string.  I pulled the values out of the string and set them up like this:

1 let streetTemplateValue = "STRAT" 2 let street = "Strathorn" 3 let suffix = "Dr" 4 let city = "Cary" 5 let streetNumber = "904" 6 let streetName = street+"+"+suffix+"+"+city 7 let streetLookup = "StreetTemplateValue="+streetTemplateValue+"&StreetName="+streetName+"&StreetNumber="+streetNumber+"&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage" 8

1 let streetLookup' = "SelectAssignment%7C2014%7CCURRENT=2014-15&DefaultAction=SelectAssignment%7C2014%7CCURRENT&DefaultAction=SelectAssignment%7C2015%7CCURRENT&CatchmentCode="+catchmentCode+"&StreetName="+streetName+"&StreetTemplateValue="+streetTemplateValue+"&StreetNumber="+streetNumber+"&StreetZipCode="+streetZipCode 2

So now it was just a matter of creating some data structures to pass into the first query string:

type SearchCriteria = {streetTemplateValue:string;street:string;suffix:string;city:string;streetNumber:string;}

let searchCriteria = {streetTemplateValue="STRAT";street="Strathorn";suffix="Dr";city="Cary";streetNumber="904"}

//Page1 Query String
let streetName = searchCriteria.street+"+"+searchCriteria.suffix+"+"+searchCriteria.city
let streetLookup = "StreetTemplateValue="+searchCriteria.streetTemplateValue+"&StreetName="+streetName+"&StreetNumber="+searchCriteria.streetNumber+"&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage"

and we now have the basis for a series of functions to do the school lookup.  You can see the gist here.

Parsing Wake County Tax Site With F#

Based on the response to my last post on Wake County school scores, I decided to look at each school's revenue base.  Instead of looking at free and reduced lunch as a correlating factor for school scores, I wanted to look at the aggregate home valuations of each school's population.

To do that, I thought of the Wake County Tax Department's web site, found here, where you can look up an address and see the tax value of the property.  Although they don't have an API, their web site's search result page has a predictable URI like this: http://services.wakegov.com/realestate/Account.asp?id=0000001, so by plugging in a seven-digit integer I could theoretically look at all of the tax records for the county.  Also, the HTML of the result page is standardized, so parsing it should be fairly straightforward.
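Since the id is a zero-padded, seven-digit number, generating the URI for any integer is a one-liner.  This helper is not from the original post, just a sketch of the idea:

//e.g. buildAccountUri 1 -> "http://services.wakegov.com/realestate/Account.asp?id=0000001"
let buildAccountUri (id:int) =
    sprintf "http://services.wakegov.com/realestate/Account.asp?id=%07d" id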

So I fired up Visual Studio and opened up the F# REPL. The first thing I did was to bring in the Html type provider and wire up a standard page for the type.

1 #r "../packages/FSharp.Data.2.1.1/lib/net40/FSharp.Data.dll" 2 open FSharp.Data 3 type context = HtmlProvider<"../data/RealEstateSample.html"> 4

I could then bring down all of the DOM elements for the page and find all of the <TABLE> elements:

1 let uri = "http://services.wakegov.com/realestate/Account.asp?id=0000001" 2 let body = context.Load(uri).Html.Body() 3 let tables = body.Descendants("TABLE") |> Seq.toList 4 tables |> Seq.length 5

image

So there are 14 tables on the page.  After some manual inspection, the table that holds the address information is table number 7:

let addressTable = tables.[7]

image

My first thought was to parse the text to see if there are key words that I can search on

let baseText = taxTable.ToString()
let marker = baseText.IndexOf("Total Value Assessed")
let remainingText = baseText.Substring(marker)
let marker' = remainingText.IndexOf("$")
let remainingText' = remainingText.Substring(marker')
let marker'' = remainingText'.IndexOf("<")
let finalText = remainingText'.Substring(0,marker'')

I then thought, “Jamie you are being stupid”.  Since the DOM is structured consistently,  I can just use the type provider and search on tags:

let addressTable = tables.[7]
let fonts = addressTable.Descendants("font") |> Seq.toList
let addressOne = fonts.[1].InnerText()
let addressTwo = fonts.[2].InnerText()
let addressThree = fonts.[3].InnerText()

and sure enough

image

And then going to table number 11, I can get the assessed value:

let taxTable = tables.[11]
let fonts' = taxTable.Descendants("font") |> Seq.toList
let assessedValue = fonts'.[3].InnerText()

and how cool is this?

image

So with the data elements in place, I need a way of saving the data.  Fortunately, the Json type provider is also in FSharp.Data so I could do this:

let valuation = JsonValue.Record [|
                    "addressOne", JsonValue.String addressOne
                    "addressTwo", JsonValue.String addressTwo
                    "addressThree", JsonValue.String addressThree
                    "assessedValue", JsonValue.String assessedValue |]

open System.IO
File.AppendAllText(@"C:\Data\dataTest.json",valuation.ToString())

And in the file:

image

So now I have the pieces to make requests to the Wake County site and put the values into a JSON file.  I decided to push the data to the file after each request so that if there is a fault partway through, I would not lose everything.  So here is the gist and here are the results:

image
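The gist itself is not reproduced here, but a doValuation function presumably just chains the steps shown above: load the page for an id, pull the address and assessed value out of tables 7 and 11, build the JSON record, and append it to the file.  A rough sketch of that assumption (the index field is an assumption so each record can be tied back to its id):

//assumed composition of the snippets above - the real code is in the gist
let doValuation (id:int) =
    let uri = sprintf "http://services.wakegov.com/realestate/Account.asp?id=%07d" id
    let body = context.Load(uri).Html.Body()
    let tables = body.Descendants("TABLE") |> Seq.toList
    let fonts = tables.[7].Descendants("font") |> Seq.toList
    let fonts' = tables.[11].Descendants("font") |> Seq.toList
    let valuation = JsonValue.Record [|
                        "index", JsonValue.Number (decimal id)
                        "addressOne", JsonValue.String (fonts.[1].InnerText())
                        "addressTwo", JsonValue.String (fonts.[2].InnerText())
                        "addressThree", JsonValue.String (fonts.[3].InnerText())
                        "assessedValue", JsonValue.String (fonts'.[3].InnerText()) |]
    File.AppendAllText(@"C:\Data\dataTest.json", valuation.ToString())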

I then decided to see how long it would take to download the first 1,000 ids.

#time
[1..100] |> Seq.iter(fun id -> doValuation id)

and with fiddler running

image

It took about 5 minutes for 1,000 ints

image

so extrapolating the max possible (9,999,999), it would take 83 hours.

image

Two thoughts come to mind for the next step

1) Use MBrace with some VMs on Azure to do the requests in parallel

2) Do a binary search to find the actual highest id for Wake County (a rough sketch of the idea is below).
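For the binary search, the hasRecord predicate is an assumption (for example, a check that the returned page actually contains the address table), and this sketch assumes valid ids form one contiguous block:

//finds the highest id for which hasRecord is true
let rec findUpperBound (hasRecord:int -> bool) low high =
    if low >= high then low
    else
        let mid = (low + high + 1) / 2
        if hasRecord mid then findUpperBound hasRecord mid high
        else findUpperBound hasRecord low (mid - 1)

//findUpperBound hasRecord 1 9999999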

Tune in next week to see if that works.

Wake County School Report Cards Using R

Recently the Wake County School System released a "school report card" that can be used to compare how well a school is doing relative to the other schools in the state. As expected, it made front-page news in our local newspaper here.  The key theme was that schools with kids from poorer families have worse results than schools with kids from more affluent families.  Though this shouldn't come as a surprise, the follow-up op-eds were equally predictable: more money for poorer schools, changing the rating scale, etc.

I thought it would be an interesting dataset to analyze, to see if the conclusion that the N&O came up with was, in fact, the only conclusion you can get out of the data.  Since they did a simple crosstab analysis, perhaps there was some other analysis that could be done?  Newspaper articles are written at a pretty low reading level, and perhaps they are written at a low analytical level too.  I went to the website to download the data here and quickly ran into two items:

1) The dataset is very limited: there are only three credible variables in it (county, free and reduced lunch percent, and the school score).  It is almost as if the dataset was purposely limited to only support the conclusion.

2) The dataset is presented in a way that makes alternative analysis very hard.  You have to install Tableau if you want to look at the data yourself.  Parsing Tableau was a pain because, even with Fiddler, they don't render the results as HTML with tags but as images.

Side note: my guess is that Tableau is trying to be the Flash of the analytics space.  I find it curious that companies and organizations think they are just "one tool away" from good analytics.   Even in the age of Watson, it is never the tooling; it is always the analyst that determines the usefulness of a dataset.  It would be much better if WCPSS embraced open data and had higher expectations of the people using the datasets.

In any event, with the 14-day trial of Tableau, I could download the data into Access.  I then exported it into a .txt file (where it should have been in the first place).  I then pumped it into RStudio like so:

image

I then created two variables from the FreeAndReducedLunch and SchoolScores vectors.  When I ran the correlation the first time, I got an NA, meaning there is some malformed data.

image

I re-ran the correlation using only complete observations and, sure enough, there is a credible correlation: the higher the percentage of free and reduced lunch, the lower the score.  The N&O is right.

image

I then added a filter to only look at Wake County, and there is an even stronger correlation in Wake County than in the state as a whole:

image

As I mentioned earlier, the dataset was set up for a pre-decided conclusion by limiting the number of independent variables and by the choice of Tableau as the reporting mechanism.  I decided to augment the dataset with additional information.  My son plays in TYO, and I unsuccessfully tried to set up an orchestra at our local elementary school 8 years ago.  I also thought of this article, where some families tried to get more orchestras into Wake County schools.  Fortunately, the list of schools with an orchestra can be found here, and it did not take very long to add a "HasAnStringsProgram" field to the dataset.

image

Running a correlation for just the WCPSS schools shows that there is no relationship between a school having an orchestra and its performance grade.

image

So the statement by the parents in the N&O like this

… that music students have higher graduation rates, grades and test scores …

might be true for music in general, but a specialized strings program does not seem to impact the school's score, at least with this data.

Apriori Algorithm and F# Using Elevator Inspection Data

Now that I have the elevator dataset in a workable state, I wanted to see what I could see in the data.  I was reading Machine Learning in Action, and the authors suggested the Apriori algorithm as a way to quantify associations among data points.  I read both Harrington's code and Wikipedia's description and found both to be impenetrable: the former because the code was unreadable and the latter because the mathematical formulas depended on a level of algebra that I don't have.

Fortunately, I found a C# project on Codeproject that had both an excellent example/introduction and C# code.  I used the examples on the website to formulate my F# implementation.

The first thing I did was create a class that matched the 1st grid in the example

image

  1. namespace ChickenSoftware.ElevatorChicken.Analysis
  2.  
  3. open System.Collections.Generic
  4.  
  5. type Transaction = {TID: string; Items: List<string> }
  6.  
  7. type Apriori(database: List<Transaction>, support: float, confidence: float) =
  8.     member this.Database = database
  9.     member this.Support = support
  10.     member this.Confidence = confidence

Note that because F# is immutable by default, the properties are read-only.  I then created a unit test project that makes sure the constructor works without exceptions.  The data matches the example:

  1. public AprioriTests()
  2. {
  3.     var database = new List<Transaction>();
  4.     database.Add(new Transaction("100", new List<string>() { "A", "C", "D" }));
  5.     database.Add(new Transaction("200", new List<string>() { "B", "C", "E" }));
  6.     database.Add(new Transaction("300", new List<string>() { "A", "B", "C", "E" }));
  7.     database.Add(new Transaction("400", new List<string>() { "B", "E" }));
  8.  
  9.     _apriori = new Apriori(database, .5, .80);
  10.  
  11. }
  12.  
  13. [TestMethod]
  14. public void ConstructorUsingValidArguments_ReturnsExpected()
  15. {
  16.     Assert.IsNotNull(_apriori);
  17. }

I then needed a function to count up all of the items in the itemsets.  I refused to use loops, so I first started with Seq.fold, but I was having zero luck because I was trying to fold a Seq of List.  I then started experimenting with other functions and found Seq.collect, which was perfect.  A quick illustration of what Seq.collect buys me is below, followed by the function I created:
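Using the sample transactions from the unit test above, Seq.collect flattens each transaction's item list into one sequence, which Seq.countBy can then tally.  This snippet is an illustration only, not code from the project:

//sample itemsets from the example database
let itemsets = [ ["A";"C";"D"]; ["B";"C";"E"]; ["A";"B";"C";"E"]; ["B";"E"] ]

let counts =
    itemsets
    |> Seq.collect id      //"A";"C";"D";"B";"C";"E";"A";"B";"C";"E";"B";"E"
    |> Seq.countBy id      //("A",2); ("C",3); ("D",1); ("B",3); ("E",3)
    |> Seq.toList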

  1. member this.GetC1() =
  2.     database
  3.  
  4. member this.GetL1() =
  5.     let numberOfTransactions = this.GetC1().Count
  6.  
  7.     this.GetC1()
  8.         |> Seq.collect(fun d -> d.Items)
  9.         |> Seq.countBy(fun i -> i)
  10.         |> Seq.map(fun (t,i) -> t, i, float i/ float numberOfTransactions)
  11.         |> Seq.filter(fun (t,i,p) -> p >= support)
  12.         |> Seq.map(fun (t,i,p) -> t,i)
  13.         |> Seq.sort
  14.         |> Seq.toList

Note that the numberOfTransactions is for the database, not the individual items in the List<Item>.  And the results match the example:

imageimage

So this is great.  My next step was to build a list of pair combinations of the remaining values:

image

The trick is that it is not a Cartesian join of the original sets; only the surviving sets are needed.  My first attempt looked like this:

  1. let C1 = database
  2.  
  3. let L1 = C1
  4.         |> Seq.map(fun t -> t.Items)
  5.         |> Seq.collect(fun i -> i)
  6.         |> Seq.countBy(fun i -> i)
  7.         |> Seq.map(fun (t,i) -> t, i, float i/ float numberOftransactions)
  8.         |> Seq.filter(fun (t,i,p) -> p >= support)
  9.         |> Seq.toArray
  10. let C2A = L1
  11.             |> Seq.map(fun (x,y,z) -> x)
  12.             |> Seq.toArray
  13. let C2B = L1
  14.             |> Seq.map(fun (x,y,z) -> x)
  15.             |> Seq.toArray
  16. let C2 = C2A |> Seq.collect(fun x -> C2B |> Seq.map(fun y -> x+y))
  17. C2   

With the output like this:

image

I was running out of Saturday morning, so I went over to Stack Overflow and got a couple of responses.  I was on the right track with the concat, but I didn't think about the List.Filter(), which would prune my list.  With this in mind, I copied Mark's code and got what I was looking for:

  1. member this.GetC2() =
  2.     let l1Itemset = this.GetL1()
  3.                     |> Seq.map(fun (i,s) -> i)
  4.  
  5.     let itemset =
  6.         l1Itemset
  7.             |> Seq.map(fun x -> l1Itemset |> Seq.map(fun y -> (x,y)))
  8.             |> Seq.concat
  9.             |> Seq.filter(fun (x,y) -> x < y)
  10.             |> Seq.sort
  11.             |> Seq.toList         
  12.     
  13.     let listContainsItem(l:List<string>, a,b) =
  14.             l.Contains(a) && l.Contains(b)
  15.     
  16.     let someFunctionINeedToRename(l1:List<string>, l2)=
  17.             l2 |> Seq.map(fun (x,y) -> listContainsItem(l1,x,y))
  18.  
  19.     let itemsetMatches = this.GetC1()
  20.                             |> Seq.map(fun t -> t.Items)
  21.                             |> Seq.map(fun i -> someFunctionINeedToRename(i,itemset))
  22.  
  23.     let itemSupport = itemsetMatches
  24.                             |> Seq.map(Seq.map(fun i -> if i then 1 else 0))
  25.                             |> Seq.reduce(Seq.map2(+))
  26.  
  27.     itemSupport
  28.         |> Seq.zip(itemset)
  29.         |> Seq.toList

So now I have C2 filling correctly:

image

 

Taking the results, I needed to get L2.

image

That was much simpler than getting C2; here is the code:

  1. member this.GetL2() =
  2.     let numberOfTransactions = this.GetC1().Count
  3.     
  4.     this.GetC2()
  5.             |> Seq.map(fun (i,n) -> i,n,float n/float numberOfTransactions)
  6.             |> Seq.filter(fun (i,n,p) -> p >= support)
  7.             |> Seq.map(fun (t,i,p) -> t,i)
  8.             |> Seq.sort
  9.             |> Seq.toList    

And when I run it – it matches this example exactly:

image

Finally, I added in a C3 and L3.  This code is identical to the C2/L2 code with one exception: it maps a triple and not a tuple.  The C2 code maps like this:

  1. let itemset =
  2.     l1Itemset
  3.         |> Seq.map(fun x -> l1Itemset |> Seq.map(fun y -> (x,y)))
  4.         |> Seq.concat
  5.         |> Seq.filter(fun (x,y) -> x < y)
  6.         |> Seq.sort
  7.         |> Seq.toList     

and the C3 code looks like this (took me 15 minutes to figure out line 3 below):

  1. let itemset =
  2.     l2Itemset
  3.         |> Seq.map(fun x -> l2Itemset |> Seq.map(fun y-> l2Itemset |> Seq.map(fun z->(fst x,fst y,snd z))))
  4.         |> Seq.concat
  5.         |> Seq.collect(fun d -> d)
  6.         |> Seq.filter(fun (x,y,z) -> x < y && y < z)
  7.         |> Seq.distinct
  8.         |> Seq.sort
  9.         |> Seq.toList    

With the C3 and L3 matching the example also:

image

image

 

I was now ready to put the elevator data into the analysis.  I think I am getting better at F# because I did the mapping, filtering, and transformation of the data from the server without looking at any other material, and it took only 15 minutes.

  1. type public ElevatorBuilder() =
  2.     let connectionString = ConfigurationManager.ConnectionStrings.["localData2"].ConnectionString;
  3.  
  4.     member public this.GetElevatorTransactions() =
  5.         let transactions = this.GetElevators()
  6.                               |> Seq.map(fun e ->this.ConvertElevatorToTransaction(e))
  7.         let transactionsList = new System.Collections.Generic.List<Transaction>(transactions)
  8.         transactionsList
  9.  
  10.     member public this.ConvertElevatorToTransaction(i: string, t:string, c:string, s:string) =
  11.         let items = new System.Collections.Generic.List<String>()
  12.         items.Add(t)
  13.         items.Add(c)
  14.         items.Add(s)
  15.         let transaction = {TID=i; Items=items}
  16.         transaction
  17.  
  18.     member public this.GetElevators () =
  19.         SqlConnection.GetDataContext(connectionString).ElevatorData201402
  20.             |> Seq.map(fun e -> e.ID, e.EquipType,e.Capacity,e.Speed)
  21.             |> Seq.filter(fun (i,et,c,s) -> not(String.IsNullOrEmpty(et)))
  22.             |> Seq.filter(fun (i,et,c,s) -> c.HasValue)
  23.             |> Seq.filter(fun (i,et,c,s) -> s.HasValue)
  24.             |> Seq.map(fun (i,t,c,s) -> i, this.CatagorizeEquipmentType(t),c,s)
  25.             |> Seq.map(fun (i,t,c,s) -> i,t,this.CatagorizeCapacity(c.Value),s)
  26.             |> Seq.map(fun (i,t,c,s) -> i,t,c,this.CatagorizeSpeed(s.Value))
  27.             |> Seq.map(fun (i,t,c,s) -> i.ToString(),t,c,s)

The longest part was aggregating the free-form text of the Equipment Type field (here is a partial snippet; you get the idea…):

  1. member public this.CatagorizeEquipmentType(et: string) =
  2.     match et.Trim() with
  3.         | "OTIS" -> "OTIS"
  4.         | "OTIS (1-2)" -> "OTIS"
  5.         | "OTIS (2-1)" -> "OTIS"
  6.         | "OTIS hydro" -> "OTIS"
  7.         | "OTIS, HYD" -> "OTIS"
  8.         | "OTIS/ ASHEVILLE " -> "OTIS"
  9.         | "OTIS/ MOUNTAIN " -> "OTIS"
  10.         | "OTIS/#1" -> "OTIS"
  11.         | "OTIS/#19 " -> "OTIS"

Assigning categories for speed and capacity was a snap using F#

  1. member public this.CatagorizeCapacity(c: int) =
  2.     let lowerBound = (c/25 * 25) + 1
  3.     let upperBound = lowerBound + 24
  4.     lowerBound.ToString() + "-" + upperBound.ToString()        
  5.  
  6. member public this.CatagorizeSpeed(s: int) =
  7.     let lowerBound = (s/50 * 50) + 1
  8.     let upperBound = lowerBound + 49
  9.     lowerBound.ToString() + "-" + upperBound.ToString()    
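As a quick sanity check of the binning (the values here are made up), a capacity of 1213 lands in the 1201-1225 bucket and a speed of 349 lands in the 301-350 bucket:

//made-up values to show the bucket arithmetic
let builder = new ElevatorBuilder()
let capacityBucket = builder.CatagorizeCapacity(1213)   //"1201-1225" because (1213/25)*25 + 1 = 1201
let speedBucket = builder.CatagorizeSpeed(349)          //"301-350"   because (349/50)*50 + 1 = 301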

With this in hand, I created a console app that takes the 27K records and pushes them through the Apriori algorithm:

  1. private static void RunElevatorAnalysis()
  2. {
  3.     Stopwatch stopwatch = new Stopwatch();
  4.     stopwatch.Start();
  5.     ElevatorBuilder builder = new ElevatorBuilder();
  6.     var transactions = builder.GetElevatorTransactions();
  7.     stopwatch.Stop();
  8.     Console.WriteLine("Building " + transactions.Count + " transactions took: " + stopwatch.Elapsed.TotalSeconds);
  9.     var apriori = new Apriori(transactions, .1, .75);
  10.     var c2 = apriori.GetC2();
  11.     stopwatch.Reset();
  12.     stopwatch.Start();
  13.     var l1 = apriori.GetL1();
  14.     Console.WriteLine("Getting L1 took: " + stopwatch.Elapsed.TotalSeconds);
  15.     var l2 = apriori.GetL2();
  16.     Console.WriteLine("Getting L2 took: " + stopwatch.Elapsed.TotalSeconds);
  17.     var l3 = apriori.GetL3();
  18.     Console.WriteLine("Getting L3 took: " + stopwatch.Elapsed.TotalSeconds);
  19.     stopwatch.Stop();
  20.     Console.WriteLine("–L1");
  21.     foreach (var t in l1)
  22.     {
  23.         Console.WriteLine(t.Item1 + ":" + t.Item2);
  24.     }
  25.     Console.WriteLine("–L2");
  26.     foreach (var t in l2)
  27.     {
  28.         Console.WriteLine(t.Item1 + ":" + t.Item2);
  29.     }
  30.     Console.WriteLine("–L3");
  31.     foreach (var t in l3)
  32.     {
  33.         Console.WriteLine(t.Item1 + ":" + t.Item2);
  34.     }
  35. }

I then made an offering to the F# Gods and hit F5:

image

Doh!  The gods were not pleased.  I then went back to my initial filtering function and added a Seq.take(25000), and here are the results:

image

So there are a couple of things to draw from this exercise.

1) The Apriori algorithm is the wrong technique for this dataset.  I had to bring the support way down (10%) to even get any readings.  Also, there is too much dispersion among the values.  This kind of algorithm works much better with a variable number of items drawn from a small set of possible values, versus a fixed number of items drawn from a large set of values.

2) Even so, how cool is this?  Compare the number of files just to make the C#/OO version work versus the F# version:

imageimage

And the total LOC is 539 for C# versus 120 for F#, and the F# can be optimized with a better way to create the search itemsets.  Hard-coding each level was a hack I did to get things working and to give me an understanding of how the algorithm works.  I bet this can be consolidated to well under 75 lines without sacrificing readability.

3) I think the StackOverflowException is because I am doing a Cartesian join and then paring down the result.  Using one of the other techniques suggested on SO should give much better results.

In any event, what a fun project!  I can't wait to optimize this and perhaps throw a different algorithm at the dataset in the coming weeks.

 

 

 

Elevator App: Part 1 – Data Layer Using F#

 

At Open Data Day, fellow TRINUGer Elaine Cahill told me about a website where you can get all of the elevator inspection data for the state; it is found here.  She went ahead and put the Wake County data onto Socrata.  I wanted to look at the entire state, so I went to the report page like so:

 

image

Unfortunately, when you try and pull down the entire state, you cause a server exception:

 

image

 

So I split the download in half.  I then imported it into Access and SSISed it into Azure SQL.  I then created a project to serve the data, and I decided to use F# type providers as a replacement for Entity Framework as my ORM.  I could use either the SqlEntityConnection TP or the SqlDataConnection TP to access the SQL database on Azure.  Neither works out of the box.
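For reference, the two declarations look roughly like this; the connection string name is an assumption, and both need a reference to FSharp.Data.TypeProviders:

open Microsoft.FSharp.Data.TypeProviders

//LINQ to SQL based
type SqlData = SqlDataConnection<ConnectionStringName="azureData">

//Entity Framework based
type SqlEntity = SqlEntityConnection<ConnectionStringName="azureData">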

SqlDataConnection

I could not get  SqlDataConnection to work at all.  When I hooked it up to a standard connection string in the config file, I got:

image

So when I copy and paste the connection string into the TP directly, it does make the connection to Azure, but then it comes back with this exception:

image

Without looking at the source, my guess is that the TP has a hard-coded reference to 'syscomments' and, alas, Azure does not have that table.

SqlEntity

I then headed over to the SqlEntityConnection TP to see if I could have better luck.  Fortunately, SqlEntityConnection does work with an Azure connection string in the .config file and can make a connection to an Azure database.

The problem I ran into was when I wanted to expose the SqlConnection to the WebAPI project that I wrote in C#.  You cannot mark SqlEntityConnection TPs as public:

image

Note that the SqlDataConnection can be marked as public. <sigh>.  I marked the SqlEntityTP as internal and then created a POCO to map between the SqlEntity type and a type that can be consumed by the outside world:

  1. type public Elevator ={
  2.         ID: int
  3.         County: string
  4.         StateId: string
  5.         Type: string
  6.         Operation: string
  7.         Owner: string
  8.         O_Address1: string
  9.         O_Address2: string
  10.         O_City: string
  11.         O_State: string
  12.         O_Zip: string
  13.         User: string
  14.         U_Address1: string
  15.         U_Address2: string
  16.         U_City: string
  17.         U_State: string
  18.         U_Zip: string
  19.         U_Lat: double
  20.         U_Long: double
  21.         Installed: DateTime
  22.         Complied: DateTime
  23.         Capacity: int
  24.         CertStatus: int
  25.         EquipType: string
  26.         Drive: string
  27.         Volts: string
  28.         Speed: int
  29.         FloorTo: string
  30.         FloorFrom: string
  31.         Landing: string
  32.         Entrances: string
  33.         Ropes: string
  34.         RopeSize: string
  35.     }
  36.  
  37. type public DataRepository() =
  38.     let connectionString = ConfigurationManager.ConnectionStrings.["azureData"].ConnectionString;
  39.  
  40.     member public this.GetElevators () =
  41.         SqlConnection.GetDataContext(connectionString).ElevatorData201402
  42.         |> Seq.map(fun x -> this.GetElevatorFromElevatorData(x))
  43.  
  44.     member public this.GetElevator (id: int) =
  45.         SqlConnection.GetDataContext(connectionString).ElevatorData201402
  46.         |> Seq.where(fun x -> x.ID = id)
  47.         |> Seq.map(fun x -> this.GetElevatorFromElevatorData(x))
  48.         |> Seq.head
  49.  
  50.     member internal this.GetElevatorFromElevatorData(elevatorData: SqlConnection.ServiceTypes.ElevatorData201402) =
  51.         let elevator = {ID= elevatorData.ID;
  52.             County=elevatorData.County;
  53.             StateId=elevatorData.StateID;
  54.             Type=elevatorData.Type;
  55.             Operation=elevatorData.Operation;
  56.             Owner=elevatorData.Owner;
  57.             O_Address1=elevatorData.O_Address1;
  58.             O_Address2=elevatorData.O_Address2;
  59.             O_City=elevatorData.O_City;
  60.             O_State=elevatorData.O_St;
  61.             O_Zip=elevatorData.O_Zip;
  62.             User=elevatorData.User;
  63.             U_Address1=elevatorData.U_Address1;
  64.             U_Address2=elevatorData.U_Address2;
  65.             U_City=elevatorData.U_City;
  66.             U_State=elevatorData.U_St;
  67.             U_Zip=elevatorData.U_Zip;
  68.             U_Lat=elevatorData.U_lat;
  69.             U_Long=elevatorData.U_long;
  70.             Installed=elevatorData.Installed.Value;
  71.             Complied=elevatorData.Complied.Value;
  72.             Capacity=elevatorData.Capacity.Value;
  73.             CertStatus=elevatorData.CertStatus.Value;
  74.             EquipType=elevatorData.EquipType;
  75.             Drive=elevatorData.Drive;
  76.             Volts=elevatorData.Volts;
  77.             Speed=int elevatorData.Speed;
  78.             FloorTo=elevatorData.FloorTo;
  79.             FloorFrom=elevatorData.FloorFrom;
  80.             Landing=elevatorData.Landing;
  81.             Entrances=elevatorData.Entrances;
  82.             Ropes=elevatorData.Ropes;
  83.             RopeSize=elevatorData.RopeSize
  84.         }
  85.         elevator

I am not happy about writing any of this code.  I have 84 lines of code for a single class; I might as well have used the code gen of EF.  I could have taken the performance hit and used System.Reflection to map fields with the same name (I have done that on other projects), but that also feels like a hack.   In any event, I then added a reference to my F# project in my C# WebAPI project.  I did have to add a reference to FSharp.Core in the C# project (which further vexed me), but then I created a couple of GET methods to expose the data:

 

  1. public class ElevatorController : ApiController
  2. {
  3.     // GET api/Elevator
  4.     public IEnumerable<Elevator> Get()
  5.     {
  6.         DataRepository repository = new DataRepository();
  7.         return repository.GetElevators();
  8.     }
  9.  
  10.     // GET api/Elevator/5
  11.     public Elevator Get(int id)
  12.     {
  13.         DataRepository repository = new DataRepository();
  14.         return repository.GetElevator(id);
  15.     }
  16.  
  17. }

 

When I viewed the JSON from a handy browser, it looked like, well, junk:

image

So now I have to get rid of those random characters (the x0040 suffix), which means yet a third POCO, this one in C#:

  1. public class ElevatorController : ApiController
  2. {
  3.     // GET api/Elevator
  4.     public IEnumerable<CS.Elevator> Get()
  5.     {
  6.         List<CS.Elevator> elevators = new List<CS.Elevator>();
  7.         FS.DataRepository repository = new FS.DataRepository();
  8.         var fsElevators = repository.GetElevators();
  9.         foreach (var fsElevator in fsElevators)
  10.         {
  11.             elevators.Add(GetElevatorFromFSharpElevator(fsElevator));
  12.         }
  13.         return elevators;
  14.     }
  15.  
  16.     // GET api/Elevator/5
  17.     public CS.Elevator Get(int id)
  18.     {
  19.         FS.DataRepository repository = new FS.DataRepository();
  20.         return GetElevatorFromFSharpElevator(repository.GetElevator(id));
  21.     }
  22.  
  23.     internal CS.Elevator GetElevatorFromFSharpElevator(FS.Elevator fsElevator)
  24.     {
  25.         CS.Elevator elevator = new CS.Elevator();
  26.         elevator.ID = fsElevator.ID;
  27.         elevator.County = fsElevator.County;
  28.         elevator.StateId = fsElevator.StateId;
  29.         elevator.Type = fsElevator.Type;
  30.         elevator.Operation = fsElevator.Operation;
  31.         elevator.Owner = fsElevator.Owner;
  32.         elevator.O_Address1 = fsElevator.O_Address1;
  33.         elevator.O_Address2 = fsElevator.O_Address2;
  34.         elevator.O_City = fsElevator.O_City;
  35.         elevator.O_State = fsElevator.O_State;
  36.         elevator.O_Zip = fsElevator.O_Zip;
  37.         elevator.User = fsElevator.User;
  38.         elevator.U_Address1 = fsElevator.U_Address1;
  39.         elevator.U_Address2 = fsElevator.U_Address2;
  40.         elevator.U_City = fsElevator.U_City;
  41.         elevator.U_State = fsElevator.U_State;
  42.         elevator.U_Zip = fsElevator.U_Zip;
  43.         elevator.Installed = fsElevator.Installed;
  44.         elevator.Complied = fsElevator.Complied;
  45.         elevator.Capacity = fsElevator.Capacity;
  46.         elevator.CertStatus = fsElevator.CertStatus;
  47.         elevator.EquipType = fsElevator.EquipType;
  48.         elevator.Drive = fsElevator.Drive;
  49.         elevator.Volts = fsElevator.Volts;
  50.         elevator.Speed = fsElevator.Speed;
  51.         elevator.FloorTo = fsElevator.FloorTo;
  52.         elevator.FloorFrom = fsElevator.FloorFrom;
  53.         elevator.Landing = fsElevator.Landing;
  54.         elevator.Entrances = fsElevator.Entrances;
  55.         elevator.Ropes = fsElevator.Ropes;
  56.         elevator.RopeSize = fsElevator.RopeSize;
  57.         return elevator;
  58.     }
  59.  
  60. }

 

So that gives me what I want…

image

As a side note, I learned the hard way that the only way to force the SqlEntityConnection TP to update based on a schema change in the DB is to change the connection string in the .config file.

Finally, when I published the WebAPI project to Azure, I got an exception. 

<Error><Message>An error has occurred.</Message><ExceptionMessage>Could not load file or assembly 'FSharp.Core, Version=4.3.1.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a' or one of its dependencies. The system cannot find the file specified.</ExceptionMessage><ExceptionType>System.IO.FileNotFoundException</ExceptionType><StackTrace> at System.Web.Http.ApiController.<InvokeActionWithExceptionFilters>d__1.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at System.Web.Http.Dispatcher.HttpControllerDispatcher.<SendAsync>d__0.MoveNext()</StackTrace

It turns out you need to not only add a reference to the F# project and FSharp.Core, you also have to deploy the .dlls to Azure.  Thanks to hocho on SO for that one.

In conclusion, I love the promise of TPs.  I want nothing more than to throw away all of the EF code-gen, .tt files, seeding-for-code-first nonsense, etc., and replace it with a single-line TP.  I have done this on a local project, but when I did it with Azure, things were harder than they should be.  Since it is easier to throw hand grenades than to catch them, I made a list of the things I want to help the open-source FSharp.Data project accomplish in the coming months:

1) SqlDatabaseConnection working with Azure Sql Storage

2) MSAccessConnection needed

3) ActiveDirectoryConnection needed

4) Json and WsdlService ability to handle proxies

5) SqlEntityConnection exposing classes publicly

Regardless of what the open-source community does, MSFT will still have to make a better commitment to F# on Azure, IMHO…

Restaurant Classification Via the Yellow Pages API Using F#

As part of the restaurant analysis I did for open data day, I built a crude classifier to identify Chinese restaurants.  The classifier looked at the name of the establishment and if certain key words were in the name, it was tagged as a Chinese restaurant.

  1. member public x.IsEstablishmentAChineseRestraurant (establishmentName:string) =
  2.     let upperCaseEstablishmentName = establishmentName.ToUpper()
  3.     let numberOfMatchedWords = upperCaseEstablishmentName.Split(' ')
  4.                                 |> Seq.map(fun x -> match x with
  5.                                                         | "ASIA" -> 1
  6.                                                         | "ASIAN" -> 1
  7.                                                         | "CHINA" -> 1
  8.                                                         | "CHINESE" -> 1
  9.                                                         | "PANDA" -> 1
  10.                                                         | "PEKING" -> 1
  11.                                                         | "WOK" -> 1
  12.                                                         | _ -> 0)
  13.                                 |> Seq.sum
  14.     match numberOfMatchedWords with
  15.         | 0 -> false
  16.         | _ -> true

Although this worked well enough for the analysis, I was interested in seeing if there was a way of using something more precise.  To that end, I thought of the Yellow Pages: they classify restaurants into categories, and assuming the restaurant is in the Yellow Pages, that is a better way to determine the restaurant's category than just a name search.

The first thing I did was head over to the Yellow Pages (YP.com) website and sure enough, they have an API and a developers program.  I signed up and had an API key within a couple of minutes.

I then tried to search for a restaurant in the browser.  I picked the first restaurant I came across in the dataset, Jumbo China #5, and created a request URI based on their API like so:

http://pubapi.atti.com/search-api/search/devapi/search?term=Jumbo+China+5&searchloc=6108+Falls+Of+Neuse+Rd+27609&format=json&key=XXXXXXXXXX

When I plugged the name into the browser, I got this:

image

After screwing around with the code for about ten minutes thinking it was my API key (an "Invalid Key" response would lead you to believe that, no?), Mike Thomas came over and told me that the URL encoding was messing with my request, specifically the '#' in Jumbo China #5.  When I removed the # symbol, I got JSON back:

image
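As an aside, an alternative to stripping the '#' would be to URL-encode the search term before it goes into the query string.  This is just a sketch of that alternative, not what the post actually did:

open System

//"#" becomes "%23" and spaces become "%20", so the term survives the trip intact
let encodedTerm = Uri.EscapeDataString("Jumbo China #5")   //"Jumbo%20China%20%235"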

Throwing the Json into Json2CSharp, the results look great:

image

I then took this URL and tried to load it into an F# type provider, and I couldn't understand why I was getting a red squiggly line of disapproval (for both the Json and Xml providers):

image

 

so I pulled out Fiddler and saw I was getting a 400.  Digging into the response, I found that "User Agent" was a required header.

image

The problem was then compounded because the FSharp JSON type provider does not allow you to specify a user agent in the constructor.  I headed over to Stack Overflow, where Tomas Petricek was kind enough to answer the question: basically, you have to use the FSharp Http class to make the request (to which you can add the user agent) and then parse the response via the JsonProvider using the "Parse" method rather than the "Load" method.  So spinning up the method like so:

image
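The screenshot above is the only record of that method in the post; its shape, based on the Http call used in the finished module below, is roughly this (treat it as a sketch):

//request with an explicit user agent, then hand the raw string to the provider's Parse
type ypProvider = JsonProvider< @"YP.txt">

let getListing (uri:string) =
    let response = FSharp.Net.Http.Request(uri, headers=["user-agent", "None"])
    ypProvider.Parse(response)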

This gave me the results I wanted.  I then created a couple of methods to clean up any characters that might mess up the URL encoding, added some argument validation, and I had a pretty good module to consume the YP.com listings:

  1. namespace ChickenSoftware.RestaurantClassifier
  2.  
  3. open System
  4. open FSharp.Data
  5. open FSharp.Net
  6.  
  7. type ypProvider = JsonProvider< @"YP.txt">
  8.  
  9. type RestaurantCatagoryRepository() =
  10.    member this.GetCatagories(restaurantName: string, restaurantAddress: string) =
  11.         if(String.IsNullOrEmpty(restaurantName)) then
  12.             failwith("restaurantName cannot be null or empty.")
  13.         if(String.IsNullOrEmpty(restaurantAddress)) then
  14.             failwith("restaurantAddress cannot be null or empty.")
  15.         let cleanedName = this.CleanName(restaurantName)
  16.         let cleanedAddress = this.CleanAddress(restaurantAddress);
  17.         let uri = "http://pubapi.atti.com/search-api/search/devapi/search?term=&quot;+cleanedName+"&searchloc="+cleanedAddress+"&format=json&key=XXXXXX"
  18.         let response = FSharp.Net.Http.Request(uri, headers=["user-agent", "None"])
  19.         let ypResult = ypProvider.Parse(response)
  20.         try
  21.             ypResult.SearchResult.SearchListings.SearchListing.[0].Categories
  22.         with
  23.             | ex -> String.Empty
  24.  
  25.     member this.CleanName(name: string) =
  26.                 name.Replace("#","").Replace(" ","+")
  27.  
  28.     member this.CleanAddress(address: string)=
  29.                 address.Replace("#","").Replace(" ","+")
  30.     
  31.     member this.IsCatagoryInCatagories(catagories: string, catagory: string) =
  32.         if(String.IsNullOrEmpty(catagories)) then false
  33.         else if (String.IsNullOrEmpty(catagory)) then false
  34.         else catagories.Contains(catagory)
  35.  
  36.     member this.IsRestaurantInCatagory(restaurantName: string, restaurantAddress: string, restaurantCatagory: string) =
  37.         if(String.IsNullOrEmpty(restaurantName)) then
  38.             failwith("restaurantName cannot be null or empty.")
  39.         if(String.IsNullOrEmpty(restaurantAddress)) then
  40.             failwith("restaurantAddress cannot be null or empty.")
  41.         if(String.IsNullOrEmpty(restaurantCatagory)) then
  42.             failwith("restaurantCatagory cannot be null or empty.")
  43.  
  44.         System.Threading.Thread.Sleep(new System.TimeSpan(0,0,1))
  45.         let catagories = this.GetCatagories(restaurantName, restaurantAddress)
  46.         if(String.IsNullOrEmpty(catagories)) then false
  47.         else this.IsCatagoryInCatagories(catagories,restaurantCatagory)
  48.  
  49.     member this.IsRestaurantInCatagoryAsync(restaurantName: string, restaurantAddress: string, restaurantCatagory: string) =
  50.         async {
  51.             if(String.IsNullOrEmpty(restaurantName)) then
  52.                 failwith("restaurantName cannot be null or empty.")
  53.             if(String.IsNullOrEmpty(restaurantAddress)) then
  54.                 failwith("restaurantAddress cannot be null or empty.")
  55.             if(String.IsNullOrEmpty(restaurantCatagory)) then
  56.                 failwith("restaurantCatagory cannot be null or empty.")
  57.  
  58.             let catagories = this.GetCatagories(restaurantName, restaurantAddress)
  59.             if(String.IsNullOrEmpty(catagories)) then return false
  60.             else return this.IsCatagoryInCatagories(catagories,restaurantCatagory)
  61.         }

The associated unit and integration tests that I made in building this module look like this:

  1. [TestClass]
  2. public class CatagoryBuilderTests
  3. {
  4.  
  5.     [TestMethod]
  6.     public void CleanName_ReturnsExpectedValue()
  7.     {
  8.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  9.         String restaurantName = "Jumbo China #5";
  10.  
  11.         String expected = "Jumbo+China+5";
  12.         String actual = repository.CleanName(restaurantName);
  13.         Assert.AreEqual(expected, actual);
  14.     }
  15.  
  16.     [TestMethod]
  17.     public void CleanAddress_ReturnsExpectedValue()
  18.     {
  19.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  20.         String restaurantAddress = "6108 Falls Of Neuse Rd 27609";
  21.  
  22.         String expected = "6108+Falls+Of+Neuse+Rd+27609";
  23.         String actual = repository.CleanAddress(restaurantAddress);
  24.         Assert.AreEqual(expected, actual);
  25.     }
  26.  
  27.  
  28.     [TestMethod]
  29.     public void GetCatagories_ReturnsExpectedValue()
  30.     {
  31.         string restaurantName = "Jumbo China #5";
  32.         String restaurantAddress = "6108 Falls Of Neuse Rd 27609";
  33.  
  34.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  35.         var result = repository.GetCatagories(restaurantName, restaurantAddress);
  36.         Assert.IsNotNull(result);
  37.     }
  38.  
  39.     [TestMethod]
  40.     public void CatagoryIsContainedInCatagoriesUsingValidTrueData_ReturnsExpectedValue()
  41.     {
  42.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  43.  
  44.         String catagories = "Chinese Restaurants|Restaurants|";
  45.         String catagory = "Chinese";
  46.  
  47.         Boolean expected = true;
  48.         Boolean actual = repository.IsCatagoryInCatagories(catagories, catagory);
  49.  
  50.         Assert.AreEqual(expected, actual);
  51.     }
  52.  
  53.     [TestMethod]
  54.     public void CatagoryIsContainedInCatagoriesUsingValidFalseData_ReturnsExpectedValue()
  55.     {
  56.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  57.  
  58.         String catagories = "Chinese Restaurants|Restaurants|";
  59.         String catagory = "Seafood";
  60.  
  61.         Boolean expected = false;
  62.         Boolean actual = repository.IsCatagoryInCatagories(catagories, catagory);
  63.  
  64.         Assert.AreEqual(expected, actual);
  65.     }
  66.  
  67.     [TestMethod]
  68.     public void IsJumboChinaAChineseRestaurant_ReturnsTrue()
  69.     {
  70.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  71.  
  72.         string restaurantName = "Jumbo China #5";
  73.         String restaurantAddress = "6108 Falls Of Neuse Rd 27609";
  74.         String restaurantCatagory = "Chinese";
  75.  
  76.         Boolean expected = true;
  77.         Boolean actual = repository.IsRestaurantInCatagory(restaurantName, restaurantAddress, restaurantCatagory);
  78.  
  79.         Assert.AreEqual(expected, actual);
  80.     }
  81.  
  82.     [TestMethod]
  83.     public void IsJumboChinaAnItalianRestaurant_ReturnsFalse()
  84.     {
  85.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  86.  
  87.         string restaurantName = "Jumbo China #5";
  88.         String restaurantAddress = "6108 Falls Of Neuse Rd 27609";
  89.         String restaurantCatagory = "Italian";
  90.  
  91.         Boolean expected = false;
  92.         Boolean actual = repository.IsRestaurantInCatagory(restaurantName, restaurantAddress, restaurantCatagory);
  93.  
  94.         Assert.AreEqual(expected, actual);
  95.     }
  96.  
  97.     [TestMethod]
  98.     public void IsUnknownAnItalianRestaurant_ReturnsFalse()
  99.     {
  100.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  101.  
  102.         string restaurantName = "Some Unknown Restaurant";
  103.         String restaurantAddress = "Some Address";
  104.         String restaurantCatagory = "Italian";
  105.  
  106.         Boolean expected = false;
  107.         Boolean actual = repository.IsRestaurantInCatagory(restaurantName, restaurantAddress, restaurantCatagory);
  108.  
  109.         Assert.AreEqual(expected, actual);
  110.     }
  111.  
  112.  
  113.  
  114.     [TestMethod]
  115.     public void CatagoryIsContainedInCatagoriesUsingEmptyCatagory_ReturnsExpectedValue()
  116.     {
  117.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  118.  
  119.         String catagories = "Chinese Restaurants|Restaurants|";
  120.         String catagory = String.Empty;
  121.  
  122.         Boolean expected = false;
  123.         Boolean actual = repository.IsCatagoryInCatagories(catagories, catagory);
  124.  
  125.         Assert.AreEqual(expected, actual);
  126.     }

The hardest test to get to green was the negative test: passing in a restaurant name that is not recognized.

  1. [TestMethod]
  2. public void IsUnknownAnItalianRestaurant_ReturnsFalse()
  3. {
  4.     RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  5.  
  6.     string restaurantName = "Some Unknown Restaurant";
  7.     String restaurantAddress = "Some Address";
  8.     String restaurantCatagory = "Italian";
  9.  
  10.     Boolean expected = false;
  11.     Boolean actual = repository.IsRestaurantInCatagory(restaurantName, restaurantAddress, restaurantCatagory);
  12.  
  13.     Assert.AreEqual(expected, actual);
  14. }

To code around the fact that a different set of JSON comes back while the original code expects a specific structure, I finally resorted to a try…with:

  1. try
  2.     ypResult.SearchResult.SearchListings.SearchListing.[0].Categories
  3. with
  4.     | ex -> String.Empty

I feel dirty, but I don’t know how else to get around it.  In any event, I then coded up a module that pulls the list of restaurants from Azure and puts them through the classifier.

  1. namespace ChickenSoftware.RestaurantClassifier
  2.  
  3. open FSharp.Data
  4. open System.Linq
  5. open System.Configuration
  6. open Microsoft.FSharp.Linq
  7. open Microsoft.FSharp.Data.TypeProviders
  8.  
  9. type internal SqlConnection = SqlEntityConnection<ConnectionStringName="azureData">
  10.  
  11. type public RestaurantBuilder () =
  12.     
  13.     let connectionString = ConfigurationManager.ConnectionStrings.["azureData"].ConnectionString;
  14.     
  15.     member public this.GetRestaurants () =
  16.         SqlConnection.GetDataContext(connectionString).Restaurants
  17.             |> Seq.map(fun x -> x.EstablishmentName, x.EstablishmentAddress + " " + x.EstablishmnetZipCode)
  18.             |> Seq.toArray
  19.             
  20.     member public this.GetChineseRestaurants () =
  21.         let catagoryRepository = new RestaurantCatagoryRepository()
  22.         let catagory = "Chinese"
  23.         this.GetRestaurants()
  24.                 |> Seq.filter(fun (name, address) -> catagoryRepository.IsRestaurantInCatagory(name, address,catagory))
  25.                 |> Seq.toList

This code is almost identical to the code I posted two weeks ago.  Sure enough, when I threw my integration tests at the functions, check out Fiddler:

image

I was getting responses.  I ran into a problem on the 50th request, though.

image

To get around this occasional timeout issue, I threw in a one-second delay between each request, which seemed to solve the problem.

  1. System.Threading.Thread.Sleep(new System.TimeSpan(0,0,1))
  2. let catagories = this.GetCatagories(restaurantName, restaurantAddress)
  3. if(String.IsNullOrEmpty(catagories)) then false
  4. else this.IsCatagoryInCatagories(catagories,restaurantCatagory)

However, this then introduced a new problem.  There are 4,000 or so restaurants, so that is over 66 minutes of running.  Not good.  Next week, I hope to add some parallelism to speed things up…
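One possible direction for that parallelism, sketched here as an assumption rather than anything from the post, is to run the category lookups through Async.Parallel using the IsRestaurantInCatagoryAsync member defined earlier:

//assumed sketch - a real run would still need throttling to avoid the timeouts above
let getChineseRestaurantsParallel (builder:RestaurantBuilder) =
    let repository = new RestaurantCatagoryRepository()
    let catagory = "Chinese"
    builder.GetRestaurants()
    |> Seq.map (fun (name, address) ->
        async {
            let! isMatch = repository.IsRestaurantInCatagoryAsync(name, address, catagory)
            return (name, address, isMatch) })
    |> Async.Parallel
    |> Async.RunSynchronously
    |> Seq.filter (fun (_, _, isMatch) -> isMatch)
    |> Seq.map (fun (name, address, _) -> (name, address))
    |> Seq.toList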