Analytics in the Microsoft Stack

Disclaimer:  I really don’t know what I am talking about

I received an email from a coworker/friend yesterday with this in the body:

So, I have a friend who works for a major supermarket chain. In IT, they are straight out of the year 2000. They have tons and tons of data in SQL Server and I think Oracle. The industrial engineers (who do all of the planning) ask the IT group to run queries throughout the day, which takes hours to run. They use Excel for most of their processing. On the weekends, they run reporting queries which take hours and hours to run – all to get just basic information.

This got my wheels spinning about how I would approach the problem with the analytics toolset that I know is available.  The supermarket chain has a couple of problems:

  • Lots of data that takes too long to munge through
  • The planners are dependent on IT group for processing the data

I would expect the official Microsoft answer is that they should implement SQL Server Analysis Services with Power BI.  I assume that if the group threw enough resources at this solution, it would work.  I then thought of a couple of alternative paths:

The first thing that comes to mind is HDInsight (Microsoft’s Hadoop offering) on Azure.  That way the queries can run in a distributed manner, they can provision machines as they need them, and when they are not running their queries, they can de-allocate the machines.

The second thought is using Azure ML to do their model generation.  However, depending on the size of the datasets, Azure ML may not be able to scale.  I have only used Azure ML on smaller datasets.

The third thought was R.  I don’t think R is the best answer here.  Everything I know about R says it is designed for data exploration and analysis of datasets that comfortably fit into the local machine’s memory.  Out of the box, R is single-threaded and memory-bound, so performance on big datasets is poor and scaling R out is a real challenge.

What about F#?  This might be a good answer.  If you use the Hive type provider, you can get the benefits of HDInsight for the processing and still have the goodness of the language syntax and REPL for data exploration.  Also, the group could look at MBrace for some kick-butt distributed processing that can scale on Azure.  Finally, if they do come up with some kind of insight that lends itself to building analytics or models into an app, you can take the code out of the script file and stick it into a compilable assembly, all within Visual Studio.
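To make the MBrace idea concrete, here is a minimal sketch.  It assumes an already-connected cluster handle (e.g. from MBrace.Azure), and runReport is a stand-in I made up for one of their real long-running queries:

open MBrace.Core

//stand-in for one of the planners' long-running queries
let runReport (storeId:int) =
    sprintf "report for store %i" storeId

//build one cloud computation per store and fan them out across the cluster
let reportsJob =
    [| 1 .. 100 |]
    |> Array.map (fun id -> cloud { return runReport id })
    |> Cloud.Parallel

//let results = cluster.Run reportsJob   //cluster is the connected MBrace runtime handle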

What about Python?  No idea; I don’t know enough about it.

What about MATLAB, SAS, etc.?  No idea.  I stopped using those tools when R showed up.

What about Watson?  No idea.  I think I will have a better idea once I go to this.

Parsing Wake County School System Attendance Assignment Site With F#

As a follow-up to this post, I then turned my attention to parsing the Wake County Public School System assignment site.  If you are not familiar, large school districts in America have a concept of ‘nodes’, where a child is assigned to a school pyramid (elementary, middle, and high school) based on their home address.  This gives the school assignment tremendous power, because a house’s value is directly tied to how “good” (real or perceived) its assigned school pyramid is.  WCPSS has a site here where you can enter your address and find out the school pyramid.

Since there is not a public API or even a publicly available dataset, I decided to see if I could screen-scrape the site.  The first challenge is that you need to navigate through 2 pages to get to your answer.  Here is the Fiddler trace:

[screenshot: Fiddler trace of the two-page flow]

The first mistake you will notice is that they are using PHP.  The second is that they are using the same uri and parameterizing the requests via the form values:

[screenshot: the form values in Fiddler]

Finally, their third mistake is that the pages come back in an inconsistent way, making the DOM traversal more challenging.

Undaunted, I fired up Visual Studio.  Because there are 2 pages that need to be used, I imported both of them as models for the HtmlProvider:

[screenshot: the two sample pages added to the project]

I then pulled out the form query strings and placed them into values.  The code so far:

#r "../packages/FSharp.Data.2.1.1/lib/net40/FSharp.Data.dll"

open System.Net
open FSharp.Data

type context = HtmlProvider<"../data/HouseSearchSample01.html">
type context' = HtmlProvider<"../data/HouseSearchSample02.html">

let uri = "http://wwwgis2.wcpss.net/addressLookup/index.php"
let streetLookup = "StreetTemplateValue=STRATH&StreetName=Strathorn+Dr+Cary&StreetNumber=904&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage"
let streetLookup' = "SelectAssignment%7C2014%7CCURRENT=2014-15&DefaultAction=SelectAssignment%7C2014%7CCURRENT&DefaultAction=SelectAssignment%7C2015%7CCURRENT&CatchmentCode=CA+0198.2&StreetName=Strathorn+Dr+Cary&StreetTemplateValue=STRATH&StreetNumber=904&StreetZipCode=27519"

Skipping the 1st page, I decided to make a request and see if I could get the school information out of the DOM.  It worked well enough, but you can see the immediate problem –> the page’s structure varies, so just grabbing the nth element of the table will not work:

let webClient = new WebClient()
webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded")
let result = webClient.UploadString(uri,"POST",streetLookup')
let body = context'.Parse(result).Html.Body()

let tables = body.Descendants("TABLE") |> Seq.toList
let schoolTable = tables.[0]
let schoolRows = schoolTable.Descendants("TR") |> Seq.toList
let elementaryDatas = schoolRows.[0].Descendants("TD") |> Seq.toList
let elementarySchool = elementaryDatas.[1].InnerText()
let middleSchoolDatas = schoolRows.[1].Descendants("TD") |> Seq.toList
let middleSchool = middleSchoolDatas.[1].InnerText()
//Need to skip a row for the enrollment cap message
let highSchoolDatas = schoolRows.[3].Descendants("TD") |> Seq.toList
let highSchool = highSchoolDatas.[1].InnerText()

 

[screenshot: REPL output for the school rows]

I decided to take the dog for a walk, and that time away from the keyboard was very helpful because I realized that although the table is not consistent, I don’t need it to be for my purposes.  All I need are the school names for a given address.  What I need to do is remove all of the noise and just find the rows of the table with useful data:

let webClient = new WebClient()
webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded")
let result = webClient.UploadString(uri,"POST",streetLookup')
let body = context'.Parse(result).Html.Body()

let tables = body.Descendants("TABLE") |> Seq.toList
let schoolTable = tables.[0]
let schoolRows = schoolTable.Descendants("TR") |> Seq.toList
let schoolData = schoolRows |> Seq.collect(fun r -> r.Descendants("TD")) |> Seq.toList
let schoolData' = schoolData |> Seq.map(fun d -> d.InnerText().Trim())
let schoolData'' = schoolData' |> Seq.filter(fun s -> s <> System.String.Empty)

//Strip out noise
let removeNonEssentialData (s:string) =
    let markerPosition = s.IndexOf('(')
    match markerPosition with
    | -1 -> s
    | _ -> s.Substring(0,markerPosition).Trim()

let schoolData''' = schoolData'' |> Seq.map(fun s -> removeNonEssentialData(s))

let unimportantPhrases = [|"Neighborhood Busing";"This school has an enrollment cap"|]
let containsUnimportantPhrase (s:string) =
    unimportantPhrases |> Seq.exists(fun p -> s.Contains(p))

let schoolData'''' = schoolData''' |> Seq.filter(fun s -> containsUnimportantPhrase(s) = false)

schoolData''''

And Boom goes the dynamite:

[screenshot: REPL output with the school names]

So, working backwards, I need to parse the 1st page to get the CatchmentCode for an address, build the second page’s form data, and then parse the results.  Parsing the 1st page for the CatchmentCode was very straightforward:

let result = webClient.UploadString(uri,"POST",streetLookup)
let body = context.Parse(result).Html.Body()
let inputs = body.Descendants("INPUT") |> Seq.toList
//pair each input's name with its value (a step the original screenshots elide)
let inputs' = inputs |> List.map(fun i -> i.AttributeValue("name"), i.AttributeValue("value"))

[screenshot: the INPUT elements in the REPL]

let catchmentCode = inputs' |> Seq.filter(fun (n,v) -> n = "CatchmentCode")
                            |> Seq.map(fun (n,v) -> v)
                            |> Seq.head
let streetName = inputs' |> Seq.filter(fun (n,v) -> n = "StreetName")
                         |> Seq.map(fun (n,v) -> v)
                         |> Seq.head
let streetTemplateValue = inputs' |> Seq.filter(fun (n,v) -> n = "StreetTemplateValue")
                                  |> Seq.map(fun (n,v) -> v)
                                  |> Seq.head
let streetNumber = inputs' |> Seq.filter(fun (n,v) -> n = "StreetNumber")
                           |> Seq.map(fun (n,v) -> v)
                           |> Seq.head
let streetZipCode = inputs' |> Seq.filter(fun (n,v) -> n = "StreetZipCode")
                            |> Seq.map(fun (n,v) -> v)
                            |> Seq.head

 

[screenshot: the extracted form values in the REPL]

So the answer is there; the code just sucks.  I refactored it into a single function:

let getValueFromInput(nameToFind:string) =
    inputs' |> Seq.filter(fun (n,v) -> n = nameToFind)
            |> Seq.map(fun (n,v) -> v)
            |> Seq.head

let catchmentCode = getValueFromInput("CatchmentCode")
let streetName = getValueFromInput("StreetName")
let streetTemplateValue = getValueFromInput("StreetTemplateValue")
let streetNumber = getValueFromInput("StreetNumber")
let streetZipCode = getValueFromInput("StreetZipCode")

With page 1 out of the way, I was ready to start altering the form query string.  I pulled the values out of the string and set them up like this:

let streetTemplateValue = "STRAT"
let street = "Strathorn"
let suffix = "Dr"
let city = "Cary"
let streetNumber = "904"
let streetName = street+"+"+suffix+"+"+city
let streetLookup = "StreetTemplateValue="+streetTemplateValue+"&StreetName="+streetName+"&StreetNumber="+streetNumber+"&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage"

let streetLookup' = "SelectAssignment%7C2014%7CCURRENT=2014-15&DefaultAction=SelectAssignment%7C2014%7CCURRENT&DefaultAction=SelectAssignment%7C2015%7CCURRENT&CatchmentCode="+catchmentCode+"&StreetName="+streetName+"&StreetTemplateValue="+streetTemplateValue+"&StreetNumber="+streetNumber+"&StreetZipCode="+streetZipCode

So now it was just a matter of creating a data structure to pass into the 1st query string:

type SearchCriteria = {streetTemplateValue:string; street:string; suffix:string; city:string; streetNumber:string}

let searchCriteria = {streetTemplateValue="STRAT"; street="Strathorn"; suffix="Dr"; city="Cary"; streetNumber="904"}
//Page1 query string
let streetName = searchCriteria.street+"+"+searchCriteria.suffix+"+"+searchCriteria.city
let streetLookup = "StreetTemplateValue="+searchCriteria.streetTemplateValue+"&StreetName="+streetName+"&StreetNumber="+searchCriteria.streetNumber+"&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage"

and we now have the basis for a series of functions to do the school lookup.  You can see the gist here.
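If you are curious how those pieces might snap together, here is a rough sketch (the composition and names are mine; the polished version is in the gist):

let getSchoolAssignment (criteria:SearchCriteria) =
    let webClient = new WebClient()
    let post (formData:string) =
        //the Content-Type header must be re-added per request; WebClient clears it
        webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded")
        webClient.UploadString(uri, "POST", formData)
    let streetName = criteria.street + "+" + criteria.suffix + "+" + criteria.city
    //page 1: post the address and harvest the hidden INPUT values
    let page1 = post ("StreetTemplateValue=" + criteria.streetTemplateValue +
                      "&StreetName=" + streetName +
                      "&StreetNumber=" + criteria.streetNumber +
                      "&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage")
    let inputs' = context.Parse(page1).Html.Body().Descendants("INPUT")
                  |> Seq.map (fun i -> i.AttributeValue("name"), i.AttributeValue("value"))
                  |> Seq.toList
    let valueOf name = inputs' |> Seq.filter (fun (n,_) -> n = name) |> Seq.map snd |> Seq.head
    //page 2: post the assignment lookup using the CatchmentCode from page 1
    let page2 = post ("SelectAssignment%7C2014%7CCURRENT=2014-15" +
                      "&DefaultAction=SelectAssignment%7C2014%7CCURRENT" +
                      "&DefaultAction=SelectAssignment%7C2015%7CCURRENT" +
                      "&CatchmentCode=" + valueOf "CatchmentCode" +
                      "&StreetName=" + streetName +
                      "&StreetTemplateValue=" + criteria.streetTemplateValue +
                      "&StreetNumber=" + criteria.streetNumber +
                      "&StreetZipCode=" + valueOf "StreetZipCode")
    //same cleanup as above: flatten the table cells and strip the noise
    let tables = context'.Parse(page2).Html.Body().Descendants("TABLE") |> Seq.toList
    tables.[0].Descendants("TR")
    |> Seq.collect (fun r -> r.Descendants("TD"))
    |> Seq.map (fun d -> d.InnerText().Trim())
    |> Seq.filter (fun s -> s <> System.String.Empty)
    |> Seq.map removeNonEssentialData
    |> Seq.filter (fun s -> not (containsUnimportantPhrase s))
    |> Seq.toList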

Parsing Wake County Tax Site With F#

Based on the response to my last post on Wake County school scores, I decided to look at each school’s revenue base.  Instead of looking at free and reduced lunch as a correlating factor for school scores, I wanted to look at the aggregate home valuations of each school’s population.

To do that, I thought of the Wake County Tax Department’s web site found here, where you can look up an address and see the tax value of the property.  Although they don’t have an API, their web site’s search result page has a predictable uri like this: http://services.wakegov.com/realestate/Account.asp?id=0000001 so by plugging in a 7-digit id, I could theoretically look at all of the tax records for the county.  Also, the HTML of the result page is standardized, so parsing it should be fairly straightforward.
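For example, building the uri for the nth record is a one-liner (accountUri is my name for it):

//sketch: build the uri for a given record number, zero-padded to 7 digits
let accountUri (id:int) =
    sprintf "http://services.wakegov.com/realestate/Account.asp?id=%07i" id

accountUri 1   //"http://services.wakegov.com/realestate/Account.asp?id=0000001"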

So I fired up Visual Studio and opened up the F# REPL. The first thing I did was to bring in the Html type provider and wire up a standard page for the type.

#r "../packages/FSharp.Data.2.1.1/lib/net40/FSharp.Data.dll"
open FSharp.Data
type context = HtmlProvider<"../data/RealEstateSample.html">

I could then bring down all of the DOM elements for the page and find all of the <TABLE> elements:

let uri = "http://services.wakegov.com/realestate/Account.asp?id=0000001"
let body = context.Load(uri).Html.Body()
let tables = body.Descendants("TABLE") |> Seq.toList
tables |> Seq.length

[screenshot: REPL output showing the table count]

So there are 14 tables on the page.  After some manual inspection, the table that holds the address information is table number 7:

let addressTable = tables.[7]

[screenshot: the address table]

My first thought was to parse the text to see if there are key words that I can search on:

let taxTable = tables.[11]   //the table with the valuation (identified below)
let baseText = taxTable.ToString()
let marker = baseText.IndexOf("Total Value Assessed")
let remainingText = baseText.Substring(marker)
let marker' = remainingText.IndexOf("$")
let remainingText' = remainingText.Substring(marker')
let marker'' = remainingText'.IndexOf("<")
let finalText = remainingText'.Substring(0,marker'')

I then thought, “Jamie, you are being stupid”.  Since the DOM is structured consistently, I can just use the type provider and search on tags:

let addressTable = tables.[7]
let fonts = addressTable.Descendants("font") |> Seq.toList
let addressOne = fonts.[1].InnerText()
let addressTwo = fonts.[2].InnerText()
let addressThree = fonts.[3].InnerText()

and sure enough

[screenshot: the address lines in the REPL]

And then going to table number 11, I can get the assessed value:

let taxTable = tables.[11]
let fonts' = taxTable.Descendants("font") |> Seq.toList
let assessedValue = fonts'.[3].InnerText()

and how cool is this?

[screenshot: the assessed value in the REPL]

So with the data elements in place, I needed a way of saving the data.  Fortunately, FSharp.Data also has JSON support (JsonValue), so I could do this:

let valuation = JsonValue.Record [|
    "addressOne", JsonValue.String addressOne
    "addressTwo", JsonValue.String addressTwo
    "addressThree", JsonValue.String addressThree
    "assessedValue", JsonValue.String assessedValue |]

open System.IO
File.AppendAllText(@"C:\Data\dataTest.json", valuation.ToString())

And in the file:

[screenshot: the json file contents]

So now I have the pieces to make requests to the Wake County site and put the values into a json file.  I decided to push the data to the file after each request so that if there is a fault partway through, I would not lose everything.  Here is the gist, and here are the results:

[screenshot: the results]
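The gist has the real code, but a doValuation function along these lines captures the idea (my reconstruction, using the table/font indexes from above):

open System.IO
open FSharp.Data

//fetch one account page, pull the fields, and append the json immediately
let doValuation (id:int) =
    let uri = sprintf "http://services.wakegov.com/realestate/Account.asp?id=%07i" id
    let tables = context.Load(uri).Html.Body().Descendants("TABLE") |> Seq.toList
    let innerTextAt (table:HtmlNode) i =
        let fonts = table.Descendants("font") |> Seq.toList
        fonts.[i].InnerText()
    let valuation = JsonValue.Record [|
        "addressOne", JsonValue.String (innerTextAt tables.[7] 1)
        "addressTwo", JsonValue.String (innerTextAt tables.[7] 2)
        "addressThree", JsonValue.String (innerTextAt tables.[7] 3)
        "assessedValue", JsonValue.String (innerTextAt tables.[11] 3) |]
    File.AppendAllText(@"C:\Data\dataTest.json", valuation.ToString())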

I then decided to see how long it would take to download the first 1,000 ids:

#time
[1..1000] |> Seq.iter(fun id -> doValuation id)

and with Fiddler running:

[screenshot: the requests in Fiddler]

It took about 5 minutes for 1,000 ids:

[screenshot: #time output]

so extrapolating to the max possible id (9,999,999), it would take roughly 833 hours.

[screenshot: the extrapolation]
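The back-of-the-envelope math, as a quick sanity check:

//extrapolating from the timing above: 5 minutes per 1,000 requests
let minutesPerRequest = 5.0 / 1000.0                     //0.005 minutes each
let totalHours = 9999999.0 * minutesPerRequest / 60.0    //≈ 833 hours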

Two thoughts come to mind for the next step:

1) Use MBrace with some VMs on Azure to do the requests in parallel

2) Do a binary search to find the actual upper id for Wake County (a rough sketch follows).
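Here is a rough sketch of idea 2.  It assumes the ids are contiguous up to some maximum, and isValidRecord is a hypothetical test for telling a real record page from an empty one:

//hypothetical check: a real record's page has the address and tax tables in it
let isValidRecord (id:int) =
    let uri = sprintf "http://services.wakegov.com/realestate/Account.asp?id=%07i" id
    (context.Load(uri).Html.Body().Descendants("TABLE") |> Seq.length) > 11

//classic binary search for the highest valid id
let rec findUpperBound lo hi =
    if lo >= hi then lo
    else
        let mid = (lo + hi + 1) / 2
        if isValidRecord mid then findUpperBound mid hi
        else findUpperBound lo (mid - 1)

//findUpperBound 1 9999999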

Tune in next week to see if that works.

Wake County School Report Cards Using R

Recently, the Wake County school system released a “school report card” that can be used to compare how well a school is doing relative to the other schools in the state.  As expected, it made front-page news in our local newspaper here.  The key theme was that schools with kids from poorer families have worse results than schools with kids from more affluent families.  Though this shouldn’t come as a surprise, the follow-up op-eds were equally predictable: more money for poorer schools, changing the rating scale, etc.

I thought it would be an interesting dataset to analyze, to see if the conclusion that the N&O came up with was, in fact, the only conclusion you can get out of the data.  Since they did a simple crosstab analysis, perhaps there was some other analysis that could be done?  I know that newspaper articles are written at a pretty low reading level, and perhaps they are written at a low analytical level too.  I went to the website to download the data here and quickly ran into two issues:

1) The dataset is very limited –> there are only 3 credible variables in the dataset (county, free and reduced lunch percent, and the school score).  It is almost as if the dataset was purposely limited to only support the conclusion.

2) The dataset is presented in a way that makes alternative analysis very hard.  You have to install Tableau if you want to look at the data yourself.  Parsing Tableau was a pain because, even with Fiddler, they don’t render the results as HTML with tags but as images.

Side Note –> My guess is that Tableau is trying to be the Flash of the analytics space.  I find it curious that companies/organizations think they are just “one tool away” from good analytics.  Even in the age of Watson, it is never the tooling – it is always the analyst that determines the usefulness of a dataset.  It would be much better if WCPSS embraced open data and had higher expectations of the people using the datasets.

In any event, with the 14-day trial of Tableau, I could download the data into Access.  I then exported the data into a .txt file (where it should have been in the 1st place) and pumped it into RStudio like so:

[screenshot: loading the data in RStudio]

I then created 2 variables from the FreeAndReducedLunch and SchoolScores vectors.  When I ran the correlation the 1st time, I got an NA, meaning that there is some malformed data.

[screenshot: the correlation returning NA]

I re-ran the correlation using only complete cases and, sure enough, there is a credible correlation –> the higher the percent of free and reduced lunch, the lower the score.  The N&O is right.

[screenshot: the correlation on complete cases]

I then added a filter to only look at Wake County, and there is an even stronger correlation in Wake County than in the state as a whole:

[screenshot: the Wake County correlation]
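Reconstructed from the screenshots, the R session looked roughly like this (the file name and the District values are my guesses):

# a rough reconstruction of the session above
schools <- read.table("SchoolScores.txt", header = TRUE, sep = "\t")
freeAndReducedLunch <- schools$FreeAndReducedLunch
schoolScores <- schools$SchoolScores

# the naive correlation returns NA because of malformed rows
cor(freeAndReducedLunch, schoolScores)

# restricting to complete cases gives a credible negative correlation
cor(freeAndReducedLunch, schoolScores, use = "complete.obs")

# filtering to Wake County makes the correlation even stronger
wake <- subset(schools, District == "Wake County Schools")
cor(wake$FreeAndReducedLunch, wake$SchoolScores, use = "complete.obs")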

As I mentioned earlier, the dataset was set up for a pre-decided conclusion by limiting the number of independent variables and by the choice of Tableau as the reporting mechanism.  I decided to augment the dataset with additional information.  My son plays in TYO, and I unsuccessfully tried to set up an orchestra at our local elementary school 8 years ago.  I also thought of this article, where some families tried to get more orchestras into Wake County schools.  Fortunately, the list of schools with an orchestra can be found here, and it did not take very long to add a “HasAnStringsProgram” field to the dataset.

[screenshot: the dataset with the strings-program field added]

Running a correlation for just the WCPSS schools shows that there is no relationship between a school having an orchestra and its performance grade.

[screenshot: the orchestra/score correlation]

So a statement by the parents in the N&O like this:

… that music students have higher graduation rates, grades and test scores …

might be true for music in general, but a specialized strings program does not seem to impact the school’s score –> at least with this data.

Record Types and Serialization -> F# to C#

I was working on an exercise where I have an F# component being consumed by an MVC web application written in C#.  As I already asked about here, CLIMutable is a great addition to the F# language spec.  I created a basic record type with the CLIMutable attribute like so:

[<CLIMutable>]
type Account = {Number:int; Holder:string; Amount:float}

I then created an instance of that type in a controller written in C#:

public class AccountController : ApiController
{
    [HttpGet]
    public Account Index()
    {
        var account = new Account();
        account.Amount = 100;
        account.Holder = "Homer";
        account.Number = 1;
        return account;
    }
}

When I run the site and call the method via Fiddler, I get a nasty “@” symbol added to the end of each field name:

[screenshot: the json with “@” appended to each field]

which is no good –> the serializer is picking up the record’s compiler-generated backing fields (which end in “@”) rather than the public properties.  I then thought of my question on SO and Mark Seemann’s blog post on this topic found here.  I added this to the WebApiConfig class:

public static void Register(HttpConfiguration config)
{
    // Web API configuration and services
    GlobalConfiguration.Configuration.Formatters.JsonFormatter.SerializerSettings.ContractResolver =
        new Newtonsoft.Json.Serialization.CamelCasePropertyNamesContractResolver();

    // Web API routes
    config.MapHttpAttributeRoutes();

    config.Routes.MapHttpRoute(
        name: "DefaultApi",
        routeTemplate: "api/{controller}/{id}",
        defaults: new { id = RouteParameter.Optional }
    );
}

And the “@” symbol goes away

[screenshot: the json without the “@”]

The next task was to add in option types.  I updated the record like so:

[<CLIMutable>]
type Account = {Number:int; Holder:string option; Amount:float}

I then added an F# class to do the assignment of the value:

type AccountRepository() =
    member this.GetAccount() =
        {Number=1; Holder=Some "Homer"; Amount=100.}

And then updated the controller:

[HttpGet]
public Account Index()
{
    AccountRepository repository = new AccountRepository();
    return repository.GetAccount();
}

And I get this:

[screenshot: the json with the option type serialized verbatim]

Which is not what my UI friends want.  They want the same json shape either way –> they want to ignore the existence of the option type.  Fair enough.  I hit up Stack Overflow here and Sven had a great answer pointing me to Isaac Abraham’s serializer found here.  I popped that puppy into the project and updated the WebApiConfig like so:

var formatter = GlobalConfiguration.Configuration.Formatters.JsonFormatter;
formatter.SerializerSettings.ContractResolver = new DefaultContractResolver();
formatter.SerializerSettings.Converters.Add(new IdiomaticDuConverter());

And boom goes the dynamite.

[screenshot: the flattened json output]