Combining Wake County Real Estate Lookup with Wake County School Assignment

As a follow-up to this post and this post, I want to combine looking up Wake County real estate valuations with Wake County school assignments.  The matching value between the two datasets is the house address.

The first thing I did was create a new script file in the project.  I then added a reference to the script that does the WCPSS lookup.  I then added a JSON type provider that will serve as the type for the Wake County real estate valuation data that was stored previously in a DocumentDb instance.

1 #r "../packages/FSharp.Data.2.1.1/lib/net40/FSharp.Data.dll" 2 #r "../packages/Microsoft.Azure.Documents.Client.0.9.2-preview/lib/net40/Microsoft.Azure.Documents.Client.dll" 3 #r "../packages/Newtonsoft.Json.4.5.11/lib/net40/Newtonsoft.Json.dll" 4 5 #load "SchoolAssignments.fsx" 6 7 open System 8 open System.IO 9 open FSharp.Data 10 open System.Linq 11 open SchoolAssignments 12 open Microsoft.Azure.Documents 13 open Microsoft.Azure.Documents.Client 14 open Microsoft.Azure.Documents.Linq 15 16 type HouseValuation = JsonProvider<"../data/HouseValuationSample.json">

The house valuation json looks like this:

{
  "index": 1,
  "addressOne": "1506 WAKE FOREST RD ",
  "addressTwo": "RALEIGH NC 27604-1331",
  "addressThree": " ",
  "assessedValue": "$34,848",
  "id": "c0e931de-68b8-452e-8365-66d3a4a93483",
  "_rid": "pmVVALZMZAEBAAAAAAAAAA==",
  "_ts": 1423934277,
  "_self": "dbs/pmVVAA==/colls/pmVVALZMZAE=/docs/pmVVALZMZAEBAAAAAAAAAA==/",
  "_etag": "\"0000c100-0000-0000-0000-54df83450000\"",
  "_attachments": "attachments/"
}

 

The first function pulls the data from DocumentDb and deserializes it into an instance of that type:

let getPropertyValue(id: int)=
    let endpointUrl = ""
    let authKey = ""
    let client = new DocumentClient(new Uri(endpointUrl), authKey)
    let database = client.CreateDatabaseQuery().Where(fun db -> db.Id = "wakecounty").ToArray().FirstOrDefault()
    let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "taxinformation").ToArray().FirstOrDefault()
    let documentLink = collection.SelfLink
    let queryString = "SELECT * FROM taxinformation WHERE taxinformation.index = " + id.ToString()
    let query = client.CreateDocumentQuery(documentLink,queryString)
    let firstValue = query |> Seq.head
    //wrapped in Some so the result lines up with the option-taking functions below
    Some (HouseValuation.Parse(firstValue.ToString()))

The next function uses the school lookup script to pull the data from the WCPSS site.  The only real gotcha was that the space delimiter (char 32) was not the only character splitting the address: the WCPSS site also adds in a hard break (char 160).  It took me about an hour to figure out why the address was not breaking into an array of words when splitting on " ".  <sigh>
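To make the gotcha concrete, here is a tiny repro that is not from the original post (the address value is made up):

//char 160 is the non-breaking space the WCPSS page inserts
let nbsp = (char)160
let address = "1506" + string nbsp + "WAKE FOREST RD"

let broken = address.Split(' ')                         //3 tokens; "1506" and "WAKE" stay glued together
let correct = address.Split([|(char)32;(char)160|])     //4 tokens: "1506"; "WAKE"; "FOREST"; "RD"

With both characters in the delimiter array, the function below splits cleanly.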

let createSchoolAssignmentSearchCriteria(houseValuation: option<HouseValuation.Root>) =
    match houseValuation.IsSome with
    | true -> let deliminators = [|(char)32;(char)160|]
              let addressOneTokens = houseValuation.Value.AddressOne.Split(deliminators)
              let streetNumber = addressOneTokens.[0]
              let streetTemplateValue = addressOneTokens.[1]
              let streetName = addressOneTokens.[1..] |> Array.reduce(fun acc t -> acc + "+" + t)
              let addressTwoTokens = houseValuation.Value.AddressTwo.Split(deliminators)
              let city = addressTwoTokens.[0]
              let streetName' = streetName + city
              Some {SearchCriteria.streetTemplateValue=streetTemplateValue;
                    streetName=streetName';
                    streetNumber=streetNumber;}
    | false -> None

In any event, the last piece was to take the value and push it back up to another DocumentDb collection:

let writeSchoolAssignmentToDocumentDb(houseAssignment:option<HouseAssignment>) =
    match houseAssignment.IsSome with
    | true ->
        let endpointUrl = ""
        let authKey = ""
        let client = new DocumentClient(new Uri(endpointUrl), authKey)
        let database = client.CreateDatabaseQuery().Where(fun db -> db.Id = "wakecounty").ToArray().FirstOrDefault()
        let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "houseassignment").ToArray().FirstOrDefault()
        let documentLink = collection.SelfLink
        client.CreateDocumentAsync(documentLink, houseAssignment.Value) |> ignore
    | false -> ()

With that in place, the final function puts it all together:

let createHouseAssignment(id:int)=
    let houseValuation = getPropertyValue(id)
    let schools = houseValuation
                  |> createSchoolAssignmentSearchCriteria
                  |> createSearchCriteria'
                  |> createPage2QueryString
                  |> getSchoolData
    match schools.IsSome with
    | true -> Some {houseIndex=houseValuation.Value.Index; schools=schools.Value}
    | false -> None
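The driver at the end of this post calls a generateHouseAssignment function that is not shown; presumably it just composes createHouseAssignment with writeSchoolAssignmentToDocumentDb.  A minimal sketch of that assumption:

//assumed wiring - the post never shows generateHouseAssignment
let generateHouseAssignment(id:int) =
    id
    |> createHouseAssignment
    |> writeSchoolAssignmentToDocumentDb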

and now we have an end-to-end way of combining the content on two different sites:

//#time
//[1..100] |> Seq.iter(fun id -> generateHouseAssignment id)

gives this:

imageimage

You can see the gist here

Parsing Wake County School System Attendance Assignment Site With F#

As a follow-up to this post, I then turned my attention to parsing the Wake County Public School System attendance assignment site.  If you are not familiar, large school districts in America have a concept of 'nodes', where a child is assigned to a school pyramid (elementary, middle, and high school) based on their home address.  This gives the school assignment tremendous power, because a house's value is directly tied to how "good" (real or perceived) its assigned school pyramid is.  WCPSS has a site here where you can enter your address and find out the school pyramid.

Since there is not a public API or even a publicly available dataset, I decided to see if I could screen-scrape the site.  The first challenge is that you need to navigate through two pages to get to your answer.  Here is the Fiddler trace:

image

The first mistake you will notice is that they are using PHP.  The second is that they are using the same URI and parameterizing the requests via the form values:

image

Finally, their third mistake is that the pages come back in an inconsistent way, making the DOM traversal more challenging.

Undaunted, I fired up Visual Studio. Because there are two pages that need to be used, I imported both of them as models for the Html type provider:

image

I then pulled out the form query strings and placed them into some values.  The code so far:

1 #r "../packages/FSharp.Data.2.1.1/lib/net40/FSharp.Data.dll" 2 3 open System.Net 4 open FSharp.Data 5 6 type context = HtmlProvider<"../data/HouseSearchSample01.html"> 7 type context' = HtmlProvider<"../data/HouseSearchSample02.html"> 8 9 let uri = "http://wwwgis2.wcpss.net/addressLookup/index.php" 10 let streetLookup = "StreetTemplateValue=STRATH&StreetName=Strathorn+Dr+Cary&StreetNumber=904&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage" 11 let streetLookup' = "SelectAssignment%7C2014%7CCURRENT=2014-15&DefaultAction=SelectAssignment%7C2014%7CCURRENT&DefaultAction=SelectAssignment%7C2015%7CCURRENT&CatchmentCode=CA+0198.2&StreetName=Strathorn+Dr+Cary&StreetTemplateValue=STRATH&StreetNumber=904&StreetZipCode=27519" 12

Skipping the first page, I decided to make a request and see if I could get the school information out of the DOM.  It worked well enough, but you can see the immediate problem: the page's structure varies, so just targeting the nth element of the table will not work.

let webClient = new WebClient()
webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded")
let result = webClient.UploadString(uri,"POST",streetLookup')
let body = context'.Parse(result).Html.Body()

let tables = body.Descendants("TABLE") |> Seq.toList
let schoolTable = tables.[0]
let schoolRows = schoolTable.Descendants("TR") |> Seq.toList
let elementaryDatas = schoolRows.[0].Descendants("TD") |> Seq.toList
let elementarySchool = elementaryDatas.[1].InnerText()
let middleSchoolDatas = schoolRows.[1].Descendants("TD") |> Seq.toList
let middleSchool = middleSchoolDatas.[1].InnerText()
//Need to skip for the enrollment cap message
let highSchoolDatas = schoolRows.[3].Descendants("TD") |> Seq.toList
let highSchool = highSchoolDatas.[1].InnerText()

 

image

I decided to take the dog for a walk, and that time away from the keyboard was very helpful because I realized that although the table is not consistent, I don't need it to be for my purposes.  All I need are the school names for a given address.  What I need to do is remove all of the noise and just find the rows of the table with useful data:

let webClient = new WebClient()
webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded")
let result = webClient.UploadString(uri,"POST",streetLookup')
let body = context'.Parse(result).Html.Body()

let tables = body.Descendants("TABLE") |> Seq.toList
let schoolTable = tables.[0]
let schoolRows = schoolTable.Descendants("TR") |> Seq.toList
let schoolData = schoolRows |> Seq.collect(fun r -> r.Descendants("TD")) |> Seq.toList
let schoolData' = schoolData |> Seq.map(fun d -> d.InnerText().Trim())
let schoolData'' = schoolData' |> Seq.filter(fun s -> s <> System.String.Empty)

//Strip out noise
let removeNonEssentialData (s:string) =
    let markerPosition = s.IndexOf('(')
    match markerPosition with
    | -1 -> s
    | _ -> s.Substring(0,markerPosition).Trim()

let schoolData''' = schoolData'' |> Seq.map(fun s -> removeNonEssentialData(s))

let unimportantPhrases = [|"Neighborhood Busing";"This school has an enrollment cap"|]
let containsUnimportantPhrase (s:string) =
    unimportantPhrases |> Seq.exists(fun p -> s.Contains(p))

let schoolData'''' = schoolData''' |> Seq.filter(fun s -> containsUnimportantPhrase(s) = false)

schoolData''''

And Boom goes the dynamite:

image

So working backwards, I need to parse the first page to get the CatchmentCode for an address, build the second page's form data, and then parse the results.  Parsing the first page for the CatchmentCode was very straightforward:

let result = webClient.UploadString(uri,"POST",streetLookup)
let body = context.Parse(result).Html.Body()
let inputs = body.Descendants("INPUT") |> Seq.toList

image

let catchmentCode = inputs' |> Seq.filter(fun (n,v) -> n = "CatchmentCode")
                            |> Seq.map(fun (n,v) -> v)
                            |> Seq.head
let streetName = inputs' |> Seq.filter(fun (n,v) -> n = "StreetName")
                         |> Seq.map(fun (n,v) -> v)
                         |> Seq.head
let streetTemplateValue = inputs' |> Seq.filter(fun (n,v) -> n = "StreetTemplateValue")
                                  |> Seq.map(fun (n,v) -> v)
                                  |> Seq.head
let streetNumber = inputs' |> Seq.filter(fun (n,v) -> n = "StreetNumber")
                           |> Seq.map(fun (n,v) -> v)
                           |> Seq.head
let streetZipCode = inputs' |> Seq.filter(fun (n,v) -> n = "StreetZipCode")
                            |> Seq.map(fun (n,v) -> v)
                            |> Seq.head

 

image

So the answer is there; the code just sucks.  I refactored it into a single function:

let getValueFromInput(nameToFind:string) =
    inputs' |> Seq.filter(fun (n,v) -> n = nameToFind)
            |> Seq.map(fun (n,v) -> v)
            |> Seq.head

let catchmentCode = getValueFromInput("CatchmentCode")
let streetName = getValueFromInput("StreetName")
let streetTemplateValue = getValueFromInput("StreetTemplateValue")
let streetNumber = getValueFromInput("StreetNumber")
let streetZipCode = getValueFromInput("StreetZipCode")

With page 1 out of the way, I was ready to start altering the form query string.  I pulled the values out of the string and set them up like this:

1 let streetTemplateValue = "STRAT" 2 let street = "Strathorn" 3 let suffix = "Dr" 4 let city = "Cary" 5 let streetNumber = "904" 6 let streetName = street+"+"+suffix+"+"+city 7 let streetLookup = "StreetTemplateValue="+streetTemplateValue+"&StreetName="+streetName+"&StreetNumber="+streetNumber+"&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage" 8

1 let streetLookup' = "SelectAssignment%7C2014%7CCURRENT=2014-15&DefaultAction=SelectAssignment%7C2014%7CCURRENT&DefaultAction=SelectAssignment%7C2015%7CCURRENT&CatchmentCode="+catchmentCode+"&StreetName="+streetName+"&StreetTemplateValue="+streetTemplateValue+"&StreetNumber="+streetNumber+"&StreetZipCode="+streetZipCode 2

So now it was just a matter of creating some data structures to pass into the first query string:

type SearchCriteria = {streetTemplateValue:string;street:string;suffix:string;city:string;streetNumber:string;}

let searchCriteria = {streetTemplateValue="STRAT";street="Strathorn";suffix="Dr";city="Cary";streetNumber="904"}

//Page1 Query String
let streetName = searchCriteria.street+"+"+searchCriteria.suffix+"+"+searchCriteria.city
let streetLookup = "StreetTemplateValue="+searchCriteria.streetTemplateValue+"&StreetName="+streetName+"&StreetNumber="+searchCriteria.streetNumber+"&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage"

and we now have the basis for a series of functions to do the school lookup.  You can see the gist here.

Parsing Wake County Tax Site With F#

Based on the response to my last post on Wake County school scores, I decided to look at each school's revenue base.  Instead of looking at free and reduced lunch as a correlating factor for school scores, I wanted to look at the aggregate home valuations of each school's population.

To do that, I thought of the Wake County Tax Department's web site, found here, where you can look up an address and see the tax value of the property.  Although they don't have an API, their web site's search result page has a predictable URI like this: http://services.wakegov.com/realestate/Account.asp?id=0000001, so by plugging in a seven-digit integer I could theoretically look at all of the tax records for the county.  Also, the HTML of the result page is standardized, so parsing it should be fairly straightforward.
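Since the id is a zero-padded, seven-digit number, generating the URI for any integer is a one-liner.  This helper is not from the original post, just a sketch of the idea:

//e.g. buildAccountUri 1 -> "http://services.wakegov.com/realestate/Account.asp?id=0000001"
let buildAccountUri (id:int) =
    sprintf "http://services.wakegov.com/realestate/Account.asp?id=%07d" id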

So I fired up Visual Studio and opened up the F# REPL. The first thing I did was to bring in the Html type provider and wire up a standard page for the type.

1 #r "../packages/FSharp.Data.2.1.1/lib/net40/FSharp.Data.dll" 2 open FSharp.Data 3 type context = HtmlProvider<"../data/RealEstateSample.html"> 4

I could then bring down all of the DOM elements for the page and find all of the <TABLE> elements:

1 let uri = "http://services.wakegov.com/realestate/Account.asp?id=0000001" 2 let body = context.Load(uri).Html.Body() 3 let tables = body.Descendants("TABLE") |> Seq.toList 4 tables |> Seq.length 5

image

So there are 14 tables on the page.  After some manual inspection, the table that holds the address information is table number 7:

let addressTable = tables.[7]

image

My first thought was to parse the text to see if there are key words that I can search on

let baseText = taxTable.ToString()
let marker = baseText.IndexOf("Total Value Assessed")
let remainingText = baseText.Substring(marker)
let marker' = remainingText.IndexOf("$")
let remainingText' = remainingText.Substring(marker')
let marker'' = remainingText'.IndexOf("<")
let finalText = remainingText'.Substring(0,marker'')

I then thought, “Jamie you are being stupid”.  Since the DOM is structured consistently,  I can just use the type provider and search on tags:

let addressTable = tables.[7]
let fonts = addressTable.Descendants("font") |> Seq.toList
let addressOne = fonts.[1].InnerText()
let addressTwo = fonts.[2].InnerText()
let addressThree = fonts.[3].InnerText()

and sure enough

image

And then going to table number 11, I can get the assessed value:

let taxTable = tables.[11]
let fonts' = taxTable.Descendants("font") |> Seq.toList
let assessedValue = fonts'.[3].InnerText()

and how cool is this?

image

So with the data elements in place, I need a way of saving the data.  Fortunately, the Json type provider is also in FSharp.Data so I could do this:

let valuation = JsonValue.Record [|
                    "addressOne", JsonValue.String addressOne
                    "addressTwo", JsonValue.String addressTwo
                    "addressThree", JsonValue.String addressThree
                    "assessedValue", JsonValue.String assessedValue |]

open System.IO
File.AppendAllText(@"C:\Data\dataTest.json",valuation.ToString())

And in the file:

image

So now I have the pieces to make requests to the Wake County site and put the values into a JSON file.  I decided to push the data to the file after each request so that if there is a fault partway through, I would not lose everything.  So here is the gist and here are the results:

image
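The gist itself is not reproduced here, but a doValuation function presumably just chains the steps shown above: load the page for an id, pull the address and assessed value out of tables 7 and 11, build the JSON record, and append it to the file.  A rough sketch of that assumption (the index field is an assumption so each record can be tied back to its id):

//assumed composition of the snippets above - the real code is in the gist
let doValuation (id:int) =
    let uri = sprintf "http://services.wakegov.com/realestate/Account.asp?id=%07d" id
    let body = context.Load(uri).Html.Body()
    let tables = body.Descendants("TABLE") |> Seq.toList
    let fonts = tables.[7].Descendants("font") |> Seq.toList
    let fonts' = tables.[11].Descendants("font") |> Seq.toList
    let valuation = JsonValue.Record [|
                        "index", JsonValue.Number (decimal id)
                        "addressOne", JsonValue.String (fonts.[1].InnerText())
                        "addressTwo", JsonValue.String (fonts.[2].InnerText())
                        "addressThree", JsonValue.String (fonts.[3].InnerText())
                        "assessedValue", JsonValue.String (fonts'.[3].InnerText()) |]
    File.AppendAllText(@"C:\Data\dataTest.json", valuation.ToString())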

I then decided to see how long it would take to download the first 1,000 ids.

#time
[1..100] |> Seq.iter(fun id -> doValuation id)

and with fiddler running

image

It took about 5 minutes for 1,000 ints

image

so extrapolating the max possible (9,999,999), it would take 83 hours.

image

Two thoughts come to mind for the next step

1) Use MBrace with some VMs on Azure to do the requests in parallel

2) Do a binary search to find the actual highest id for Wake County (a rough sketch of the idea is below).
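For the binary search, the hasRecord predicate is an assumption (for example, a check that the returned page actually contains the address table), and this sketch assumes valid ids form one contiguous block:

//finds the highest id for which hasRecord is true
let rec findUpperBound (hasRecord:int -> bool) low high =
    if low >= high then low
    else
        let mid = (low + high + 1) / 2
        if hasRecord mid then findUpperBound hasRecord mid high
        else findUpperBound hasRecord low (mid - 1)

//findUpperBound hasRecord 1 9999999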

Tune in next week to see if that works.

Wake County School Report Cards Using R

Recently the Wake County School System released a "school report card" that can be used to compare how well a school is doing relative to the other schools in the state. As expected, it made front-page news in our local newspaper here.  The key theme was that schools with kids from poorer families have worse results than schools with kids from more affluent families.  Though this shouldn't come as a surprise, the follow-up op-eds were equally predictable: more money for poorer schools, changing the rating scale, etc.

I thought it would be an interesting dataset to analyze, to see if the conclusion that the N&O came up with was, in fact, the only conclusion you can get out of the data.  Since they did a simple crosstab analysis, perhaps there was some other analysis that could be done?  Newspaper articles are written at a pretty low reading level, and perhaps they are written at a low analytical level too.  I went to the website to download the data here and quickly ran into two items:

1) The dataset is very limited: there are only three credible variables in it (county, free and reduced lunch percent, and the school score).  It is almost as if the dataset was purposely limited to only support the conclusion.

2) The dataset is presented in a way that makes alternative analysis very hard.  You have to install Tableau if you want to look at the data yourself.  Parsing Tableau was a pain because, even with Fiddler, they don't render the results as HTML with tags but as images.

Side note: my guess is that Tableau is trying to be the Flash of the analytics space.  I find it curious that companies and organizations think they are just "one tool away" from good analytics.   Even in the age of Watson, it is never the tooling; it is always the analyst that determines the usefulness of a dataset.  It would be much better if WCPSS embraced open data and had higher expectations of the people using the datasets.

In any event, with the 14-day trial of Tableau, I could download the data into Access.  I then exported it into a .txt file (where it should have been in the first place).  I then pumped it into RStudio like so:

image

I then created two variables from the FreeAndReducedLunch and SchoolScores vectors.  When I ran the correlation the first time, I got an NA, meaning there is some malformed data.

image

I re-ran the correlation using only complete observations and, sure enough, there is a credible correlation: the higher the percentage of free and reduced lunch, the lower the score.  The N&O is right.

image

I then added a filter to only look at Wake County, and there is an even stronger correlation in Wake County than in the state as a whole:

image

As I mentioned earlier, the dataset was set up for a pre-decided conclusion by limiting the number of independent variables and by the choice of Tableau as the reporting mechanism.  I decided to augment the dataset with additional information.  My son plays in TYO, and I unsuccessfully tried to set up an orchestra at our local elementary school 8 years ago.  I also thought of this article, where some families tried to get more orchestras into Wake County schools.  Fortunately, the list of schools with an orchestra can be found here, and it did not take very long to add a "HasAnStringsProgram" field to the dataset.

image

Running a correlation for just the WCPSS schools shows that there is no relationship between a school having an orchestra and its performance grade.

image

So the statement by the parents in the N&O like this

… that music students have higher graduation rates, grades and test scores …

might be true for music in general, but a specialized strings program does not seem to impact the school's score, at least with this data.

Apriori Algorithm and F# Using Elevator Inspection Data

Now that I have the elevator dataset in a workable state, I wanted to see what I could see in the data.  I was reading Machine Learning in Action, and the authors suggested the Apriori algorithm as a way to quantify associations among data points.  I read both Harrington's code and Wikipedia's description and found both to be impenetrable: the former because the code was unreadable and the latter because the mathematical formulas depended on a level of algebra that I don't have.

Fortunately, I found a C# project on Codeproject that had both an excellent example/introduction and C# code.  I used the examples on the website to formulate my F# implementation.

The first thing I did was create a class that matched the 1st grid in the example

image

  1. namespace ChickenSoftware.ElevatorChicken.Analysis
  2.  
  3. open System.Collections.Generic
  4.  
  5. type Transaction = {TID: string; Items: List<string> }
  6.  
  7. type Apriori(database: List<Transaction>, support: float, confidence: float) =
  8.     member this.Database = database
  9.     member this.Support = support
  10.     member this.Confidence = confidence

Note that because F# is immutable by default, the properties are read-only.  I then created a unit test project that makes sure the constructor works without exceptions.  The data matches the example:

  1. public AprioriTests()
  2. {
  3.     var database = new List<Transaction>();
  4.     database.Add(new Transaction("100", new List<string>() { "A", "C", "D" }));
  5.     database.Add(new Transaction("200", new List<string>() { "B", "C", "E" }));
  6.     database.Add(new Transaction("300", new List<string>() { "A", "B", "C", "E" }));
  7.     database.Add(new Transaction("400", new List<string>() { "B", "E" }));
  8.  
  9.     _apriori = new Apriori(database, .5, .80);
  10.  
  11. }
  12.  
  13. [TestMethod]
  14. public void ConstructorUsingValidArguments_ReturnsExpected()
  15. {
  16.     Assert.IsNotNull(_apriori);
  17. }

I then needed a function to count up all of the items in the itemsets.  I refused to use loops, so I first started with Seq.fold, but I was having zero luck because I was trying to fold a Seq of List.  I then started experimenting with other functions and found Seq.collect, which was perfect.  A quick illustration of what Seq.collect buys me is below, followed by the function I created:
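Using the sample transactions from the unit test above, Seq.collect flattens each transaction's item list into one sequence, which Seq.countBy can then tally.  This snippet is an illustration only, not code from the project:

//sample itemsets from the example database
let itemsets = [ ["A";"C";"D"]; ["B";"C";"E"]; ["A";"B";"C";"E"]; ["B";"E"] ]

let counts =
    itemsets
    |> Seq.collect id      //"A";"C";"D";"B";"C";"E";"A";"B";"C";"E";"B";"E"
    |> Seq.countBy id      //("A",2); ("C",3); ("D",1); ("B",3); ("E",3)
    |> Seq.toList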

  1. member this.GetC1() =
  2.     database
  3.  
  4. member this.GetL1() =
  5.     let numberOfTransactions = this.GetC1().Count
  6.  
  7.     this.GetC1()
  8.         |> Seq.collect(fun d -> d.Items)
  9.         |> Seq.countBy(fun i -> i)
  10.         |> Seq.map(fun (t,i) -> t, i, float i/ float numberOfTransactions)
  11.         |> Seq.filter(fun (t,i,p) -> p >= support)
  12.         |> Seq.map(fun (t,i,p) -> t,i)
  13.         |> Seq.sort
  14.         |> Seq.toList

Note that the numberOfTransactions is for the database, not the individual items in the List<Item>.  And the results match the example:

imageimage

So this is great.  My next step was to build a list of pair combinations of the remaining values:

image

The trick is that it is not a Cartesian join of the original sets; only the surviving sets are needed.  My first attempt looked like this:

  1. let C1 = database
  2.  
  3. let L1 = C1
  4.         |> Seq.map(fun t -> t.Items)
  5.         |> Seq.collect(fun i -> i)
  6.         |> Seq.countBy(fun i -> i)
  7.         |> Seq.map(fun (t,i) -> t, i, float i/ float numberOftransactions)
  8.         |> Seq.filter(fun (t,i,p) -> p >= support)
  9.         |> Seq.toArray
  10. let C2A = L1
  11.             |> Seq.map(fun (x,y,z) -> x)
  12.             |> Seq.toArray
  13. let C2B = L1
  14.             |> Seq.map(fun (x,y,z) -> x)
  15.             |> Seq.toArray
  16. let C2 = C2A |> Seq.collect(fun x -> C2B |> Seq.map(fun y -> x+y))
  17. C2   

With the output like this:

image

I was running out of Saturday morning, so I went over to Stack Overflow and got a couple of responses.  I was on the right track with the concat, but I didn't think about the List.Filter(), which would prune my list.  With this in mind, I copied Mark's code and got what I was looking for:

  1. member this.GetC2() =
  2.     let l1Itemset = this.GetL1()
  3.                     |> Seq.map(fun (i,s) -> i)
  4.  
  5.     let itemset =
  6.         l1Itemset
  7.             |> Seq.map(fun x -> l1Itemset |> Seq.map(fun y -> (x,y)))
  8.             |> Seq.concat
  9.             |> Seq.filter(fun (x,y) -> x < y)
  10.             |> Seq.sort
  11.             |> Seq.toList         
  12.     
  13.     let listContainsItem(l:List<string>, a,b) =
  14.             l.Contains(a) && l.Contains(b)
  15.     
  16.     let someFunctionINeedToRename(l1:List<string>, l2)=
  17.             l2 |> Seq.map(fun (x,y) -> listContainsItem(l1,x,y))
  18.  
  19.     let itemsetMatches = this.GetC1()
  20.                             |> Seq.map(fun t -> t.Items)
  21.                             |> Seq.map(fun i -> someFunctionINeedToRename(i,itemset))
  22.  
  23.     let itemSupport = itemsetMatches
  24.                             |> Seq.map(Seq.map(fun i -> if i then 1 else 0))
  25.                             |> Seq.reduce(Seq.map2(+))
  26.  
  27.     itemSupport
  28.         |> Seq.zip(itemset)
  29.         |> Seq.toList

So now I have C2 filling correctly:

image

 

Taking the results, I needed to get L2.

image

That was much simpler than getting C2; here is the code:

  1. member this.GetL2() =
  2.     let numberOfTransactions = this.GetC1().Count
  3.     
  4.     this.GetC2()
  5.             |> Seq.map(fun (i,n) -> i,n,float n/float numberOfTransactions)
  6.             |> Seq.filter(fun (i,n,p) -> p >= support)
  7.             |> Seq.map(fun (t,i,p) -> t,i)
  8.             |> Seq.sort
  9.             |> Seq.toList    

And when I run it – it matches this example exactly:

image

Finally, I added in a C3 and L3.  This code is identical to the C2/L2 code with one exception: it maps a triple and not a tuple.  The C2 code maps like this:

  1. let itemset =
  2.     l1Itemset
  3.         |> Seq.map(fun x -> l1Itemset |> Seq.map(fun y -> (x,y)))
  4.         |> Seq.concat
  5.         |> Seq.filter(fun (x,y) -> x < y)
  6.         |> Seq.sort
  7.         |> Seq.toList     

and the C3 code looks like this (took me 15 minutes to figure out line 3 below):

  1. let itemset =
  2.     l2Itemset
  3.         |> Seq.map(fun x -> l2Itemset |> Seq.map(fun y-> l2Itemset |> Seq.map(fun z->(fst x,fst y,snd z))))
  4.         |> Seq.concat
  5.         |> Seq.collect(fun d -> d)
  6.         |> Seq.filter(fun (x,y,z) -> x < y && y < z)
  7.         |> Seq.distinct
  8.         |> Seq.sort
  9.         |> Seq.toList    

With the C3 and L3 matching the example also:

image

image

 

I was now ready to put the elevator data into the analysis.  I think I am getting better at F# because I did the mapping, filtering, and transformation of the data from the server without looking at any other material, and it took only 15 minutes.

  1. type public ElevatorBuilder() =
  2.     let connectionString = ConfigurationManager.ConnectionStrings.["localData2"].ConnectionString;
  3.  
  4.     member public this.GetElevatorTransactions() =
  5.         let transactions = this.GetElevators()
  6.                               |> Seq.map(fun e ->this.ConvertElevatorToTransaction(e))
  7.         let transactionsList = new System.Collections.Generic.List<Transaction>(transactions)
  8.         transactionsList
  9.  
  10.     member public this.ConvertElevatorToTransaction(i: string, t:string, c:string, s:string) =
  11.         let items = new System.Collections.Generic.List<String>()
  12.         items.Add(t)
  13.         items.Add(c)
  14.         items.Add(s)
  15.         let transaction = {TID=i; Items=items}
  16.         transaction
  17.  
  18.     member public this.GetElevators () =
  19.         SqlConnection.GetDataContext(connectionString).ElevatorData201402
  20.             |> Seq.map(fun e -> e.ID, e.EquipType,e.Capacity,e.Speed)
  21.             |> Seq.filter(fun (i,et,c,s) -> not(String.IsNullOrEmpty(et)))
  22.             |> Seq.filter(fun (i,et,c,s) -> c.HasValue)
  23.             |> Seq.filter(fun (i,et,c,s) -> s.HasValue)
  24.             |> Seq.map(fun (i,t,c,s) -> i, this.CatagorizeEquipmentType(t),c,s)
  25.             |> Seq.map(fun (i,t,c,s) -> i,t,this.CatagorizeCapacity(c.Value),s)
  26.             |> Seq.map(fun (i,t,c,s) -> i,t,c,this.CatagorizeSpeed(s.Value))
  27.             |> Seq.map(fun (i,t,c,s) -> i.ToString(),t,c,s)

The longest part was aggregating the free-form text of the Equipment Type field (here is a partial snippet; you get the idea…):

  1. member public this.CatagorizeEquipmentType(et: string) =
  2.     match et.Trim() with
  3.         | "OTIS" -> "OTIS"
  4.         | "OTIS (1-2)" -> "OTIS"
  5.         | "OTIS (2-1)" -> "OTIS"
  6.         | "OTIS hydro" -> "OTIS"
  7.         | "OTIS, HYD" -> "OTIS"
  8.         | "OTIS/ ASHEVILLE " -> "OTIS"
  9.         | "OTIS/ MOUNTAIN " -> "OTIS"
  10.         | "OTIS/#1" -> "OTIS"
  11.         | "OTIS/#19 " -> "OTIS"

Assigning categories for speed and capacity was a snap using F#

  1. member public this.CatagorizeCapacity(c: int) =
  2.     let lowerBound = (c/25 * 25) + 1
  3.     let upperBound = lowerBound + 24
  4.     lowerBound.ToString() + "-" + upperBound.ToString()        
  5.  
  6. member public this.CatagorizeSpeed(s: int) =
  7.     let lowerBound = (s/50 * 50) + 1
  8.     let upperBound = lowerBound + 49
  9.     lowerBound.ToString() + "-" + upperBound.ToString()    
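As a quick sanity check of the binning (the values here are made up), a capacity of 1213 lands in the 1201-1225 bucket and a speed of 349 lands in the 301-350 bucket:

//made-up values to show the bucket arithmetic
let builder = new ElevatorBuilder()
let capacityBucket = builder.CatagorizeCapacity(1213)   //"1201-1225" because (1213/25)*25 + 1 = 1201
let speedBucket = builder.CatagorizeSpeed(349)          //"301-350"   because (349/50)*50 + 1 = 301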

With this in hand, I created a console app that takes the 27K records and pushes them through the Apriori algorithm:

  1. private static void RunElevatorAnalysis()
  2. {
  3.     Stopwatch stopwatch = new Stopwatch();
  4.     stopwatch.Start();
  5.     ElevatorBuilder builder = new ElevatorBuilder();
  6.     var transactions = builder.GetElevatorTransactions();
  7.     stopwatch.Stop();
  8.     Console.WriteLine("Building " + transactions.Count + " transactions took: " + stopwatch.Elapsed.TotalSeconds);
  9.     var apriori = new Apriori(transactions, .1, .75);
  10.     var c2 = apriori.GetC2();
  11.     stopwatch.Reset();
  12.     stopwatch.Start();
  13.     var l1 = apriori.GetL1();
  14.     Console.WriteLine("Getting L1 took: " + stopwatch.Elapsed.TotalSeconds);
  15.     var l2 = apriori.GetL2();
  16.     Console.WriteLine("Getting L2 took: " + stopwatch.Elapsed.TotalSeconds);
  17.     var l3 = apriori.GetL3();
  18.     Console.WriteLine("Getting L3 took: " + stopwatch.Elapsed.TotalSeconds);
  19.     stopwatch.Stop();
  20.     Console.WriteLine("–L1");
  21.     foreach (var t in l1)
  22.     {
  23.         Console.WriteLine(t.Item1 + ":" + t.Item2);
  24.     }
  25.     Console.WriteLine("–L2");
  26.     foreach (var t in l2)
  27.     {
  28.         Console.WriteLine(t.Item1 + ":" + t.Item2);
  29.     }
  30.     Console.WriteLine("–L3");
  31.     foreach (var t in l3)
  32.     {
  33.         Console.WriteLine(t.Item1 + ":" + t.Item2);
  34.     }
  35. }

I then made an offering to the F# Gods and hit F5:

image

Doh!  The gods were not pleased.  I then went back to my initial filtering function and added a Seq.take(25000), and here are the results:

image

So there are a couple of things to draw from this exercise.

1) The Apriori algorithm is the wrong technique for this dataset.  I had to bring the support way down (10%) to even get any readings.  Also, there is too much dispersion among the values.  This kind of algorithm works much better with a variable number of items drawn from a small set of possible values, versus a fixed number of items drawn from a large set of values.

2) Even so, how cool is this?  Compare the number of files just to make the C#/OO version work versus the F# version:

imageimage

And the total LOC is 539 for C# versus 120 for F#, and the F# can be optimized with a better way to create the search itemsets.  Hard-coding each level was a hack I did to get things working and to give me an understanding of how the algorithm works.  I bet this can be consolidated to well under 75 lines without sacrificing readability.

3) I think the StackOverflowException is because I am doing a Cartesian join and then paring down the result.  Using one of the other techniques suggested on SO should give much better results.

In any event, what a fun project!  I can't wait to optimize this and perhaps throw a different algorithm at the dataset in the coming weeks.

 

 

 

Elevator App: Part 1 – Data Layer Using F#

 

At Open Data Day, fellow TRINUGer Elaine Cahill told me about a website where you can get all of the elevator inspection data for the state; it is found here.  She went ahead and put the Wake County data onto Socrata.  I wanted to look at the entire state, so I went to the report page like so:

 

image

Unfortunately, when you try and pull down the entire state, you cause a server exception:

 

image

 

So I split the download in half.  I then imported it into Access and SSISed it into Azure SQL.  I then created a project to serve the data, and I decided to use F# type providers as a replacement for Entity Framework as my ORM.  I could use either the SqlEntityConnection TP or the SqlDataConnection TP to access the SQL database on Azure.  Neither works out of the box.
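For reference, the two declarations look roughly like this; the connection string name is an assumption, and both need a reference to FSharp.Data.TypeProviders:

open Microsoft.FSharp.Data.TypeProviders

//LINQ to SQL based
type SqlData = SqlDataConnection<ConnectionStringName="azureData">

//Entity Framework based
type SqlEntity = SqlEntityConnection<ConnectionStringName="azureData">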

SqlDataConnection

I could not get  SqlDataConnection to work at all.  When I hooked it up to a standard connection string in the config file, I got:

image

So when I copy and paste the connection string into the TP directly, it does make the connection to Azure, but then it comes back with this exception:

image

Without looking at the source, my guess is that the TP has a hard-coded reference to 'syscomments' and, alas, Azure does not have that table.

SqlEntity

I then headed over to the SqlEntityConnection TP to see if I could have better luck.  Fortunately, SqlEntityConnection does work with an Azure connection string in the .config file and can make a connection to an Azure database.

The problem I ran into was when I wanted to expose the SqlConnection to the WebAPI project that I wrote in C#.  You cannot mark SqlEntityConnection TPs as public:

image

Note that the SqlDataConnection can be marked as public. <sigh>.  I marked the SqlEntityTP as internal and then created a POCO to map between the SqlEntity type and a type that can be consumed by the outside world:

  1. type public Elevator ={
  2.         ID: int
  3.         County: string
  4.         StateId: string
  5.         Type: string
  6.         Operation: string
  7.         Owner: string
  8.         O_Address1: string
  9.         O_Address2: string
  10.         O_City: string
  11.         O_State: string
  12.         O_Zip: string
  13.         User: string
  14.         U_Address1: string
  15.         U_Address2: string
  16.         U_City: string
  17.         U_State: string
  18.         U_Zip: string
  19.         U_Lat: double
  20.         U_Long: double
  21.         Installed: DateTime
  22.         Complied: DateTime
  23.         Capacity: int
  24.         CertStatus: int
  25.         EquipType: string
  26.         Drive: string
  27.         Volts: string
  28.         Speed: int
  29.         FloorTo: string
  30.         FloorFrom: string
  31.         Landing: string
  32.         Entrances: string
  33.         Ropes: string
  34.         RopeSize: string
  35.     }
  36.  
  37. type public DataRepository() =
  38.     let connectionString = ConfigurationManager.ConnectionStrings.["azureData"].ConnectionString;
  39.  
  40.     member public this.GetElevators () =
  41.         SqlConnection.GetDataContext(connectionString).ElevatorData201402
  42.         |> Seq.map(fun x -> this.GetElevatorFromElevatorData(x))
  43.  
  44.     member public this.GetElevator (id: int) =
  45.         SqlConnection.GetDataContext(connectionString).ElevatorData201402
  46.         |> Seq.where(fun x -> x.ID = id)
  47.         |> Seq.map(fun x -> this.GetElevatorFromElevatorData(x))
  48.         |> Seq.head
  49.  
  50.     member internal this.GetElevatorFromElevatorData(elevatorData: SqlConnection.ServiceTypes.ElevatorData201402) =
  51.         let elevator = {ID= elevatorData.ID;
  52.             County=elevatorData.County;
  53.             StateId=elevatorData.StateID;
  54.             Type=elevatorData.Type;
  55.             Operation=elevatorData.Operation;
  56.             Owner=elevatorData.Owner;
  57.             O_Address1=elevatorData.O_Address1;
  58.             O_Address2=elevatorData.O_Address2;
  59.             O_City=elevatorData.O_City;
  60.             O_State=elevatorData.O_St;
  61.             O_Zip=elevatorData.O_Zip;
  62.             User=elevatorData.User;
  63.             U_Address1=elevatorData.U_Address1;
  64.             U_Address2=elevatorData.U_Address2;
  65.             U_City=elevatorData.U_City;
  66.             U_State=elevatorData.U_St;
  67.             U_Zip=elevatorData.U_Zip;
  68.             U_Lat=elevatorData.U_lat;
  69.             U_Long=elevatorData.U_long;
  70.             Installed=elevatorData.Installed.Value;
  71.             Complied=elevatorData.Complied.Value;
  72.             Capacity=elevatorData.Capacity.Value;
  73.             CertStatus=elevatorData.CertStatus.Value;
  74.             EquipType=elevatorData.EquipType;
  75.             Drive=elevatorData.Drive;
  76.             Volts=elevatorData.Volts;
  77.             Speed=int elevatorData.Speed;
  78.             FloorTo=elevatorData.FloorTo;
  79.             FloorFrom=elevatorData.FloorFrom;
  80.             Landing=elevatorData.Landing;
  81.             Entrances=elevatorData.Entrances;
  82.             Ropes=elevatorData.Ropes;
  83.             RopeSize=elevatorData.RopeSize
  84.         }
  85.         elevator

I am not happy about writing any of this code.  I have 84 lines of code for a single class; I might as well have used the code gen of EF.  I could have taken the performance hit and used System.Reflection to map fields with the same name (I have done that on other projects), but that also feels like a hack.   In any event, I then added a reference to my F# project in my C# WebAPI project.  I did have to add a reference to FSharp.Core in the C# project (which further vexed me), but then I created a couple of GET methods to expose the data:

 

  1. public class ElevatorController : ApiController
  2. {
  3.     // GET api/Elevator
  4.     public IEnumerable<Elevator> Get()
  5.     {
  6.         DataRepository repository = new DataRepository();
  7.         return repository.GetElevators();
  8.     }
  9.  
  10.     // GET api/Elevator/5
  11.     public Elevator Get(int id)
  12.     {
  13.         DataRepository repository = new DataRepository();
  14.         return repository.GetElevator(id);
  15.     }
  16.  
  17. }

 

When I viewed the JSON from a handy browser, it looked like, well, junk:

image

So now I have to get rid of those random characters (the x0040 suffix), which means yet a third POCO, this one in C#:

  1. public class ElevatorController : ApiController
  2. {
  3.     // GET api/Elevator
  4.     public IEnumerable<CS.Elevator> Get()
  5.     {
  6.         List<CS.Elevator> elevators = new List<CS.Elevator>();
  7.         FS.DataRepository repository = new FS.DataRepository();
  8.         var fsElevators = repository.GetElevators();
  9.         foreach (var fsElevator in fsElevators)
  10.         {
  11.             elevators.Add(GetElevatorFromFSharpElevator(fsElevator));
  12.         }
  13.         return elevators;
  14.     }
  15.  
  16.     // GET api/Elevator/5
  17.     public CS.Elevator Get(int id)
  18.     {
  19.         FS.DataRepository repository = new FS.DataRepository();
  20.         return GetElevatorFromFSharpElevator(repository.GetElevator(id));
  21.     }
  22.  
  23.     internal CS.Elevator GetElevatorFromFSharpElevator(FS.Elevator fsElevator)
  24.     {
  25.         CS.Elevator elevator = new CS.Elevator();
  26.         elevator.ID = fsElevator.ID;
  27.         elevator.County = fsElevator.County;
  28.         elevator.StateId = fsElevator.StateId;
  29.         elevator.Type = fsElevator.Type;
  30.         elevator.Operation = fsElevator.Operation;
  31.         elevator.Owner = fsElevator.Owner;
  32.         elevator.O_Address1 = fsElevator.O_Address1;
  33.         elevator.O_Address2 = fsElevator.O_Address2;
  34.         elevator.O_City = fsElevator.O_City;
  35.         elevator.O_State = fsElevator.O_State;
  36.         elevator.O_Zip = fsElevator.O_Zip;
  37.         elevator.User = fsElevator.User;
  38.         elevator.U_Address1 = fsElevator.U_Address1;
  39.         elevator.U_Address2 = fsElevator.U_Address2;
  40.         elevator.U_City = fsElevator.U_City;
  41.         elevator.U_State = fsElevator.U_State;
  42.         elevator.U_Zip = fsElevator.U_Zip;
  43.         elevator.Installed = fsElevator.Installed;
  44.         elevator.Complied = fsElevator.Complied;
  45.         elevator.Capacity = fsElevator.Capacity;
  46.         elevator.CertStatus = fsElevator.CertStatus;
  47.         elevator.EquipType = fsElevator.EquipType;
  48.         elevator.Drive = fsElevator.Drive;
  49.         elevator.Volts = fsElevator.Volts;
  50.         elevator.Speed = fsElevator.Speed;
  51.         elevator.FloorTo = fsElevator.FloorTo;
  52.         elevator.FloorFrom = fsElevator.FloorFrom;
  53.         elevator.Landing = fsElevator.Landing;
  54.         elevator.Entrances = fsElevator.Entrances;
  55.         elevator.Ropes = fsElevator.Ropes;
  56.         elevator.RopeSize = fsElevator.RopeSize;
  57.         return elevator;
  58.     }
  59.  
  60. }

 

So that gives me what I want…

image

As a side note, I learned the hard way that the only way to force the SqlEntityConnection TP to update based on a schema change in the DB is to change the connection string in the .config file.

Finally, when I published the WebAPI project to Azure, I got an exception. 

<Error><Message>An error has occurred.</Message><ExceptionMessage>Could not load file or assembly 'FSharp.Core, Version=4.3.1.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a' or one of its dependencies. The system cannot find the file specified.</ExceptionMessage><ExceptionType>System.IO.FileNotFoundException</ExceptionType><StackTrace> at System.Web.Http.ApiController.<InvokeActionWithExceptionFilters>d__1.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at System.Web.Http.Dispatcher.HttpControllerDispatcher.<SendAsync>d__0.MoveNext()</StackTrace

It turns out you need to not only add a reference to the F# project and FSharp.Core, you also have to deploy the .dlls to Azure.  Thanks to hocho on SO for that one.

In conclusion, I love the promise of TPs.  I want nothing more than to throw away all of the EF code-gen, .tt files, seeding-for-code-first nonsense, etc., and replace it with a single-line TP.  I have done this on a local project, but when I did it with Azure, things were harder than they should be.  Since it is easier to throw hand grenades than to catch them, I made a list of the things I want to help the open-source FSharp.Data project accomplish in the coming months:

1) SqlDatabaseConnection working with Azure Sql Storage

2) MSAccessConnection needed

3) ActiveDirectoryConnection needed

4) Json and WsdlService ability to handle proxies

5) SqlEntityConnection exposing classes publicly

Regardless of what the open-source community does, MSFT will still have to make a better commitment to F# on Azure, IMHO…

Restaurant Classification Via the Yellow Pages API Using F#

As part of the restaurant analysis I did for open data day, I built a crude classifier to identify Chinese restaurants.  The classifier looked at the name of the establishment and if certain key words were in the name, it was tagged as a Chinese restaurant.

  1. member public x.IsEstablishmentAChineseRestraurant (establishmentName:string) =
  2.     let upperCaseEstablishmentName = establishmentName.ToUpper()
  3.     let numberOfMatchedWords = upperCaseEstablishmentName.Split(' ')
  4.                                 |> Seq.map(fun x -> match x with
  5.                                                         | "ASIA" -> 1
  6.                                                         | "ASIAN" -> 1
  7.                                                         | "CHINA" -> 1
  8.                                                         | "CHINESE" -> 1
  9.                                                         | "PANDA" -> 1
  10.                                                         | "PEKING" -> 1
  11.                                                         | "WOK" -> 1
  12.                                                         | _ -> 0)
  13.                                 |> Seq.sum
  14.     match numberOfMatchedWords with
  15.         | 0 -> false
  16.         | _ -> true

Although this worked well enough for the analysis, I was interested in seeing if there was a way of using something more precise.  To that end, I thought of the Yellow Pages: they classify restaurants into categories, and assuming the restaurant is in the Yellow Pages, that is a better way to determine the restaurant's category than just a name search.

The first thing I did was head over to the Yellow Pages (YP.com) website and sure enough, they have an API and a developers program.  I signed up and had an API key within a couple of minutes.

I then tried to search for a restaurant in the browser.  I picked the first restaurant I came across in the dataset, Jumbo China #5, and created a request URI based on their API like so:

http://pubapi.atti.com/search-api/search/devapi/search?term=Jumbo+China+5&searchloc=6108+Falls+Of+Neuse+Rd+27609&format=json&key=XXXXXXXXXX

When I plugged the name into the browser, I got this:

image

After screwing around with the code for about ten minutes thinking it was my API key (an "Invalid Key" response would lead you to believe that, no?), Mike Thomas came over and told me that the URL encoding was messing with my request, specifically the '#' in Jumbo China #5.  When I removed the # symbol, I got JSON back:

image
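As an aside, an alternative to stripping the '#' would be to URL-encode the search term before it goes into the query string.  This is just a sketch of that alternative, not what the post actually did:

open System

//"#" becomes "%23" and spaces become "%20", so the term survives the trip intact
let encodedTerm = Uri.EscapeDataString("Jumbo China #5")   //"Jumbo%20China%20%235"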

Throwing the Json into Json2CSharp, the results look great:

image

I then took this URL and tried to load it into an F# type provider, and I couldn't understand why I was getting a red squiggly line of disapproval (for both the Json and Xml providers):

image

 

so I pulled out Fiddler and saw I was getting a 400.  Digging into the response, I found that "User Agent" was a required header.

image

The problem was then compounded because the FSharp JSON type provider does not allow you to specify a user agent in the constructor.  I headed over to Stack Overflow, where Tomas Petricek was kind enough to answer the question: basically, you have to use the FSharp Http class to make the request (to which you can add the user agent) and then parse the response via the JsonProvider using the "Parse" method rather than the "Load" method.  So spinning up the method like so:

image
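The screenshot above is the only record of that method in the post; its shape, based on the Http call used in the finished module below, is roughly this (treat it as a sketch):

//request with an explicit user agent, then hand the raw string to the provider's Parse
type ypProvider = JsonProvider< @"YP.txt">

let getListing (uri:string) =
    let response = FSharp.Net.Http.Request(uri, headers=["user-agent", "None"])
    ypProvider.Parse(response)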

This gave me the results I wanted.  I then created a couple of methods to clean up any characters that might mess up the URL encoding, added some argument validation, and I had a pretty good module to consume the YP.com listings:

  1. namespace ChickenSoftware.RestaurantClassifier
  2.  
  3. open System
  4. open FSharp.Data
  5. open FSharp.Net
  6.  
  7. type ypProvider = JsonProvider< @"YP.txt">
  8.  
  9. type RestaurantCatagoryRepository() =
  10.    member this.GetCatagories(restaurantName: string, restaurantAddress: string) =
  11.         if(String.IsNullOrEmpty(restaurantName)) then
  12.             failwith("restaurantName cannot be null or empty.")
  13.         if(String.IsNullOrEmpty(restaurantAddress)) then
  14.             failwith("restaurantAddress cannot be null or empty.")
  15.         let cleanedName = this.CleanName(restaurantName)
  16.         let cleanedAddress = this.CleanAddress(restaurantAddress);
  17.         let uri = "http://pubapi.atti.com/search-api/search/devapi/search?term=&quot;+cleanedName+"&searchloc="+cleanedAddress+"&format=json&key=XXXXXX"
  18.         let response = FSharp.Net.Http.Request(uri, headers=["user-agent", "None"])
  19.         let ypResult = ypProvider.Parse(response)
  20.         try
  21.             ypResult.SearchResult.SearchListings.SearchListing.[0].Categories
  22.         with
  23.             | ex -> String.Empty
  24.  
  25.     member this.CleanName(name: string) =
  26.                 name.Replace("#","").Replace(" ","+")
  27.  
  28.     member this.CleanAddress(address: string)=
  29.                 address.Replace("#","").Replace(" ","+")
  30.     
  31.     member this.IsCatagoryInCatagories(catagories: string, catagory: string) =
  32.         if(String.IsNullOrEmpty(catagories)) then false
  33.         else if (String.IsNullOrEmpty(catagory)) then false
  34.         else catagories.Contains(catagory)
  35.  
  36.     member this.IsRestaurantInCatagory(restaurantName: string, restaurantAddress: string, restaurantCatagory: string) =
  37.         if(String.IsNullOrEmpty(restaurantName)) then
  38.             failwith("restaurantName cannot be null or empty.")
  39.         if(String.IsNullOrEmpty(restaurantAddress)) then
  40.             failwith("restaurantAddress cannot be null or empty.")
  41.         if(String.IsNullOrEmpty(restaurantCatagory)) then
  42.             failwith("restaurantCatagory cannot be null or empty.")
  43.  
  44.         System.Threading.Thread.Sleep(new System.TimeSpan(0,0,1))
  45.         let catagories = this.GetCatagories(restaurantName, restaurantAddress)
  46.         if(String.IsNullOrEmpty(catagories)) then false
  47.         else this.IsCatagoryInCatagories(catagories,restaurantCatagory)
  48.  
  49.     member this.IsRestaurantInCatagoryAsync(restaurantName: string, restaurantAddress: string, restaurantCatagory: string) =
  50.         async {
  51.             if(String.IsNullOrEmpty(restaurantName)) then
  52.                 failwith("restaurantName cannot be null or empty.")
  53.             if(String.IsNullOrEmpty(restaurantAddress)) then
  54.                 failwith("restaurantAddress cannot be null or empty.")
  55.             if(String.IsNullOrEmpty(restaurantCatagory)) then
  56.                 failwith("restaurantCatagory cannot be null or empty.")
  57.  
  58.             let catagories = this.GetCatagories(restaurantName, restaurantAddress)
  59.             if(String.IsNullOrEmpty(catagories)) then return false
  60.             else return this.IsCatagoryInCatagories(catagories,restaurantCatagory)
  61.         }

The associated unit and integration tests that I made in building this module look like this:

  1. [TestClass]
  2. public class CatagoryBuilderTests
  3. {
  4.  
  5.     [TestMethod]
  6.     public void CleanName_ReturnsExpectedValue()
  7.     {
  8.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  9.         String restaurantName = "Jumbo China #5";
  10.  
  11.         String expected = "Jumbo+China+5";
  12.         String actual = repository.CleanName(restaurantName);
  13.         Assert.AreEqual(expected, actual);
  14.     }
  15.  
  16.     [TestMethod]
  17.     public void CleanAddress_ReturnsExpectedValue()
  18.     {
  19.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  20.         String restaurantAddress = "6108 Falls Of Neuse Rd 27609";
  21.  
  22.         String expected = "6108+Falls+Of+Neuse+Rd+27609";
  23.         String actual = repository.CleanAddress(restaurantAddress);
  24.         Assert.AreEqual(expected, actual);
  25.     }
  26.  
  27.  
  28.     [TestMethod]
  29.     public void GetCatagories_ReturnsExpectedValue()
  30.     {
  31.         string restaurantName = "Jumbo China #5";
  32.         String restaurantAddress = "6108 Falls Of Neuse Rd 27609";
  33.  
  34.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  35.         var result = repository.GetCatagories(restaurantName, restaurantAddress);
  36.         Assert.IsNotNull(result);
  37.     }
  38.  
  39.     [TestMethod]
  40.     public void CatagoryIsContainedInCatagoriesUsingValidTrueData_ReturnsExpectedValue()
  41.     {
  42.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  43.  
  44.         String catagories = "Chinese Restaurants|Restaurants|";
  45.         String catagory = "Chinese";
  46.  
  47.         Boolean expected = true;
  48.         Boolean actual = repository.IsCatagoryInCatagories(catagories, catagory);
  49.  
  50.         Assert.AreEqual(expected, actual);
  51.     }
  52.  
  53.     [TestMethod]
  54.     public void CatagoryIsContainedInCatagoriesUsingValidFalseData_ReturnsExpectedValue()
  55.     {
  56.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  57.  
  58.         String catagories = "Chinese Restaurants|Restaurants|";
  59.         String catagory = "Seafood";
  60.  
  61.         Boolean expected = false;
  62.         Boolean actual = repository.IsCatagoryInCatagories(catagories, catagory);
  63.  
  64.         Assert.AreEqual(expected, actual);
  65.     }
  66.  
  67.     [TestMethod]
  68.     public void IsJumboChinaAChineseRestaurant_ReturnsTrue()
  69.     {
  70.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  71.  
  72.         string restaurantName = "Jumbo China #5";
  73.         String restaurantAddress = "6108 Falls Of Neuse Rd 27609";
  74.         String restaurantCatagory = "Chinese";
  75.  
  76.         Boolean expected = true;
  77.         Boolean actual = repository.IsRestaurantInCatagory(restaurantName, restaurantAddress, restaurantCatagory);
  78.  
  79.         Assert.AreEqual(expected, actual);
  80.     }
  81.  
  82.     [TestMethod]
  83.     public void IsJumboChinaAnItalianRestaurant_ReturnsFalse()
  84.     {
  85.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  86.  
  87.         string restaurantName = "Jumbo China #5";
  88.         String restaurantAddress = "6108 Falls Of Neuse Rd 27609";
  89.         String restaurantCatagory = "Italian";
  90.  
  91.         Boolean expected = false;
  92.         Boolean actual = repository.IsRestaurantInCatagory(restaurantName, restaurantAddress, restaurantCatagory);
  93.  
  94.         Assert.AreEqual(expected, actual);
  95.     }
  96.  
  97.     [TestMethod]
  98.     public void IsUnknownAnItalianRestaurant_ReturnsFalse()
  99.     {
  100.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  101.  
  102.         string restaurantName = "Some Unknown Restaurant";
  103.         String restaurantAddress = "Some Address";
  104.         String restaurantCatagory = "Italian";
  105.  
  106.         Boolean expected = false;
  107.         Boolean actual = repository.IsRestaurantInCatagory(restaurantName, restaurantAddress, restaurantCatagory);
  108.  
  109.         Assert.AreEqual(expected, actual);
  110.     }
  111.  
  112.  
  113.  
  114.     [TestMethod]
  115.     public void CatagoryIsContainedInCatagoriesUsingEmptyCatagory_ReturnsExpectedValue()
  116.     {
  117.         RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  118.  
  119.         String catagories = "Chinese Restaurants|Restaurants|";
  120.         String catagory = String.Empty;
  121.  
  122.         Boolean expected = false;
  123.         Boolean actual = repository.IsCatagoryInCatagories(catagories, catagory);
  124.  
  125.         Assert.AreEqual(expected, actual);
  126.     }

The hardest test to get to green was the negative test: passing in a restaurant name that is not recognized.

  1. [TestMethod]
  2. public void IsUnknownAnItalianRestaurant_ReturnsFalse()
  3. {
  4.     RestaurantCatagoryRepository repository = new RestaurantCatagoryRepository();
  5.  
  6.     string restaurantName = "Some Unknown Restaurant";
  7.     String restaurantAddress = "Some Address";
  8.     String restaurantCatagory = "Italian";
  9.  
  10.     Boolean expected = false;
  11.     Boolean actual = repository.IsRestaurantInCatagory(restaurantName, restaurantAddress, restaurantCatagory);
  12.  
  13.     Assert.AreEqual(expected, actual);
  14. }

To code around the fact that a different set of JSON comes back while the original code expects a specific structure, I finally resorted to a try…with:

  1. try
  2.     ypResult.SearchResult.SearchListings.SearchListing.[0].Categories
  3. with
  4.     | ex -> String.Empty

I feel dirty, but I don’t know how else to get around it.  In any event, I then coded up a module that pulls the list of restaurants from Azure and puts them through the classifier.

  1. namespace ChickenSoftware.RestaurantClassifier
  2.  
  3. open FSharp.Data
  4. open System.Linq
  5. open System.Configuration
  6. open Microsoft.FSharp.Linq
  7. open Microsoft.FSharp.Data.TypeProviders
  8.  
  9. type internal SqlConnection = SqlEntityConnection<ConnectionStringName="azureData">
  10.  
  11. type public RestaurantBuilder () =
  12.     
  13.     let connectionString = ConfigurationManager.ConnectionStrings.["azureData"].ConnectionString;
  14.     
  15.     member public this.GetRestaurants () =
  16.         SqlConnection.GetDataContext(connectionString).Restaurants
  17.             |> Seq.map(fun x -> x.EstablishmentName, x.EstablishmentAddress + " " + x.EstablishmnetZipCode)
  18.             |> Seq.toArray
  19.             
  20.     member public this.GetChineseRestaurants () =
  21.         let catagoryRepository = new RestaurantCatagoryRepository()
  22.         let catagory = "Chinese"
  23.         this.GetRestaurants()
  24.                 |> Seq.filter(fun (name, address) -> catagoryRepository.IsRestaurantInCatagory(name, address,catagory))
  25.                 |> Seq.toList

This code is almost identical to the code I posted two weeks ago.  Sure enough, when I threw my integration tests at the functions, check out Fiddler:

image

I was getting responses.  I ran into a problem on the 50th request, though.

image

To get around this occasional timeout issue, I threw in a one-second delay between each request, which seemed to solve the problem.

  1. System.Threading.Thread.Sleep(new System.TimeSpan(0,0,1))
  2. let catagories = this.GetCatagories(restaurantName, restaurantAddress)
  3. if(String.IsNullOrEmpty(catagories)) then false
  4. else this.IsCatagoryInCatagories(catagories,restaurantCatagory)

However, this then introduced a new problem.  There are 4,000 or so restaurants, so that is over 66 minutes of running.  Not good.  Next week, I hope to add some parallelism to speed things up…
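One possible direction for that parallelism, sketched here as an assumption rather than anything from the post, is to run the category lookups through Async.Parallel using the IsRestaurantInCatagoryAsync member defined earlier:

//assumed sketch - a real run would still need throttling to avoid the timeouts above
let getChineseRestaurantsParallel (builder:RestaurantBuilder) =
    let repository = new RestaurantCatagoryRepository()
    let catagory = "Chinese"
    builder.GetRestaurants()
    |> Seq.map (fun (name, address) ->
        async {
            let! isMatch = repository.IsRestaurantInCatagoryAsync(name, address, catagory)
            return (name, address, isMatch) })
    |> Async.Parallel
    |> Async.RunSynchronously
    |> Seq.filter (fun (_, _, isMatch) -> isMatch)
    |> Seq.map (fun (name, address, _) -> (name, address))
    |> Seq.toList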