Creating a crosswalk table between WCPSS School assignment results and school report card school list

As part of my Wake County School Score analysis, I needed to build a cross-walk table between the Wake County School Site parsing that I did here and the school score result set.  The screen scraping put schools in this kind of format:

image

as an added wrinkle, there is some extra data for some of the schools:

 

while the score result set is in this format:

image

So I want to create a cross-walk table with this format:

image

Step one of this process is to get all of the distinct values from the school site data.  If the data was in a sql server database, getting that would as simple as

“Select distinct shoolName from reallyBigDenormalizedTable”

But the data is not in Sql Server, it is in a no-sql database and the json is structured where the names are in an array inside the data structure.  After messing around with the query syntax to traverse the nested array, I gave up and decided to sample the database.

Step one was to get a record out via the index

1 let getSchools (index:int) = 2 try 3 let endpointUrl = "https://chickensoftware.documents.azure.com:443/" 4 let client = new DocumentClient(new Uri(endpointUrl), authKey) 5 let database = client.CreateDatabaseQuery().Where(fun db -> db.Id = "wakecounty" ).ToArray().FirstOrDefault() 6 let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "houseassignment").ToArray().FirstOrDefault() 7 let documentLink = collection.SelfLink 8 let queryString = "SELECT * FROM houseassignment WHERE houseassignment.houseIndex = " + index.ToString() 9 let query = client.CreateDocumentQuery(documentLink,queryString) 10 match Seq.length query with 11 | 0 -> None 12 | _ -> 13 let firstValue = query |> Seq.head 14 let assignment = HouseAssignment.Parse(firstValue.ToString()) 15 Some assignment.Schools 16 with 17 | :? HttpRequestException as ex -> 18 None 19

The next step was to create an array of index numbers that have random values in them.  I found this really good extension method to System.Random to populate the array.  The next question was “how big does the sample size have to be to get most/all of the schools?”  I started seeding the array with different values and ran these functions:

1 let random = new System.Random(42) 2 let indexes = random.GetValues(1,350000) |> Seq.take(10000) |> Seq.toArray 3 let allSchools = indexes |> Seq.map(fun i -> getSchools(i)) |> Seq.toArray 4 let getNumberOfSchools (trial:int) = 5 let trialSchools = allSchools.[1..trial] 6 let allSchools' = trialSchools |> Seq.filter(fun s -> s.IsSome) 7 let allSchools'' = allSchools' |> Seq.collect(fun s -> s.Value) 8 let uniqueSchools = allSchools'' |> Seq.distinct 9 uniqueSchools |> Seq.length 10 11 let trialCount = [|1..9999|] 12 13 trialCount |> Seq.map(fun t -> t, getNumberOfSchools(t)) 14 |> Seq.iter(fun (t, c) -> printfn "%A %A" t c) 15

The sample above shows 10,000 records, which is pretty good.  If you graph it, you can see that you get the max values around 2,500.

image

Unfortunately, there were 11 schools on the report card that were not in the 10,000 set.  Confronted with this reality, I did what any reasonable research would do… I dropped them.  My guess is that these schools are not part of a base school pyramid, rather they are “application schools” like STEM or leadership academies.

In any event, with the list of schools down, I copied them into Excel and sorted them alphabetically.  I then put the school score list next to them and started matching.  Within 15 minutes, I had a creditable crosswalk table.

image

You can see the gist here

Aggregation of WCPSS Tax Records with School Assignment

So the next part of my WCPSS hit parade, I need a way of combing the screen scrape that I did from the Wake County Tax Records as described here and the screen scrape of the Wake County Public School Assignments as found here.  Getting data from the DocumentDb is straight foreword as long as you don’t ask too much from the query syntax.

I created two functions that pull the tax record and the school assignment via the index number:

1 let getAssignment (id:int) = 2 let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "houseassignment").ToArray().FirstOrDefault() 3 let documentLink = collection.SelfLink 4 let queryString = "SELECT * FROM houseassignment WHERE houseassignment.houseIndex = " + id.ToString() 5 let query = client.CreateDocumentQuery(documentLink,queryString) 6 match query |> Seq.length with 7 | 0 -> None 8 | _ -> 9 let assignmentValue = query |> Seq.head 10 let assignment = HouseAssignment.Parse(assignmentValue.ToString()) 11 Some assignment 12 13 let getValuation (id:int) = 14 let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "taxinformation").ToArray().FirstOrDefault() 15 let documentLink = collection.SelfLink 16 let queryString = "SELECT * FROM taxinformation WHERE taxinformation.index = 1" 17 let query = client.CreateDocumentQuery(documentLink,queryString) 18 match query |> Seq.length with 19 | 0 -> None 20 | _ -> 21 let valuationValue = query |> Seq.head 22 let valuation = HouseValuation.Parse(valuationValue.ToString()) 23 Some valuation

Note option types are being used because there any many index values where there is not a corresponding record.  Also, there might a situation where the assignment has a record but the valuation does not and vice-versa so I created a function to only put the records together where there both records:

1 let assignSchoolTaxBase (id:int) = 2 let assignment = getAssignment(id) 3 let valuation = getValuation(id) 4 match assignment.IsSome,valuation.IsSome with 5 | true, true -> assignment.Value.Schools 6 |> Seq.map(fun s -> s, valuation.Value.AssessedValue) 7 |> Some 8 | _ -> None

And running this on the first record, we are getting expected. 

image

Also, running it on an index where there there is not a record, we are also getting expected

image

With the matching working, we need a way of bring all of the school arrays together and then aggregating the tax valuation.  I decided to take a step by step approach to this, even though there might be a more terse way to write it. 

1 #time 2 indexes |> Seq.map(fun i -> assignSchoolTaxBase(i)) 3 |> Seq.filter(fun s -> s.IsSome) 4 |> Seq.collect(fun s -> s.Value) 5 |> Seq.groupBy(fun (s,av) -> s) 6 |> Seq.map(fun (s,ss) -> s,ss |> Seq.sumBy(fun (s,av)-> av)) 7 |> Seq.toArray

When I run it on the 1st 10 records, the values come back as expected

image

So the last step is to run it on all 350,000 indexes (let indexes = [|1..350000|]).  The problem is that after a long period of time, things were not returning.  So this is where the power of Azure comes in –> there is no problem so large I can’t thow more cores at it.  I went to management portal and increased the VM to 8 cores

Capture

I then went into the code base and added pseq for the database calls (which I assume was taking the longest time):

1 #time 2 let indexes = [|1..350000|] 3 let assignedValues = indexes |> PSeq.map(fun i -> assignSchoolTaxBase(i)) |> Seq.toArray 4 5 let filePath = @"C:\Git\WakeCountySchoolScores\SchoolValuation.csv" 6 7 assignedValues 8 |> Seq.filter(fun s -> s.IsSome) 9 |> Seq.collect(fun s -> s.Value) 10 |> Seq.groupBy(fun (s,av) -> s) 11 |> Seq.map(fun (s,ss) -> s,ss |> Seq.sumBy(fun (s,av)-> av)) 12 |> Seq.map(fun (s,v) -> s + "," + v.ToString() + Environment.NewLine) 13 |> Seq.iter(fun (s) -> File.AppendAllText(filePath, s))

and after 2 hours:

image

Using DocumentDB With F#

DocumentDB is Microsoft’s non-sql offering on Azure.  I have limited experience with non-sql databases in general so I thought it would be a good way to try out no-sql on a real project using F#.  The first thing I noticed is that you can’t get to DocumentDB from the “old” azure portal –> you have to spin it up in the new one:

 Capture

Once I created my DocumentDB instance, I went to the getting started guide and found the code samples to accomplish the basic tasks you would expect to see in  any database product.  The getting started guide does not make it an explicit step, but you need to spin up a new FSharp project in Visual Studio and then use NuGet to get the latest SDK.

Once the NuGet package is installed, I went to a script to add the references:

1 #r "../packages/Microsoft.Azure.Documents.Client.0.9.1-preview/lib/net40/Microsoft.Azure.Documents.Client.dll" 2 #r "../packages/Newtonsoft.Json.4.5.11/lib/net40/Newtonsoft.Json.dll" 3 4 open System 5 open Microsoft.Azure.Documents 6 open Microsoft.Azure.Documents.Client 7 open Microsoft.Azure.Documents.Linq 8

And I was good to go.  The 1st thing the walk through does is to create a database:

 

1 let client = new DocumentClient(new Uri(endpointUrl), authKey) 2 let database = new Database() 3 database.Id <- "FamilyRegistry" 4 let requestOptions = new RequestOptions() 5 let response = client.CreateDatabaseAsync(database,requestOptions).Result 6

image

Interestingly, that new database does not show up in the Azure portal until you do a post back

image  image

which really surprised me –> I figured the new portal would use SignalR.  In any event, with the database created, I went to create a collection, which seems roughly analogous to a table in a RDBMS world:

 

1 let documentCollection = new DocumentCollection() 2 documentCollection.Id <- "FamilyCollection" 3 client.CreateDocumentCollectionAsync(database.CollectionsLink,documentCollection,requestOptions) 4

Unfortunately, I got a oh-so-helpful null ref

System.NullReferenceException: Object reference not set to an instance of an object.

   at Microsoft.Azure.Documents.Database.get_CollectionsLink()

   at <StartupCode$FSI_0007>.$FSI_0007.main@() in C:\Users\Dixon\Desktop\ChickenSoftware.DocumentDb.Solution\

ChickenSoftware.DocumentDb\Script.fsx:line 23

Stopped due to error

So, the CollectionsLink has to be populated, which begs the question “what the hell is a collections link?”  My first thought was to assign it a value

image

But no dice.  I then starting dotting the class and I found that there is not a response.CollectionsLink but there is a response.Resource.CollectionsLink

And sure enough, this did it.  I deleted the database on the azure portal and re-ran the create database, this time capturing the collectionsLink and now I could create a collection

1 let documentCollection = new DocumentCollection() 2 documentCollection.Id <- "FamilyCollection" 3 client.CreateDocumentCollectionAsync(response.Resource.CollectionsLink,documentCollection) 4

image

So now it is time to insert some data.  I went back to the walk-through, created some data structures, and attempted to insert them into the database:

1 type Parent = {firstName:string} 2 type Pet = {givenName:string} 3 type Child = {firstName:string; gender:string; grade: int; pets:Pet list} 4 type Address = {state:string; county:string; city:string} 5 type family = {id:string; lastName:string; parents: Parent list; children: Child list; address: Address; isRegistered:bool} 6 7 let andersenFamily = {id="AndersenFamily"; lastName="Andersen"; 8 parents=[{firstName="Thomas"};{firstName="Mary Kay"}]; 9 children=[{firstName="Henriette Thaulow";gender="female"; 10 grade=5;pets=[{givenName="Fluffy"}]}]; 11 address={state = "WA"; county = "King"; city = "Seattle"}; 12 isRegistered = true} 13 14 client.CreateDocumentAsync(documentCollection'.Resource.DocumentsLink, andersenFamily) 15

And it worked fine.  Note I still needed the documentsLink

image

And finally pulling the data out required both some sql and the documents link:

1 let queryString = "SELECT * FROM Families f WHERE f.id = \"AndersenFamily\"" 2 3 let families = client.CreateDocumentQuery(documentCollection'.Resource.DocumentsLink,queryString) 4 families |> Seq.iter(fun f -> printfn "read %A from SQL" f) 5

Gives us what we want

image

And if I only want 1 part of the results I thought to use seq.Map and case the results

1 let families = client.CreateDocumentQuery(documentCollection'.Resource.DocumentsLink,queryString) 2 families |> Seq.map(fun f -> f :?> family) 3 |> Seq.iter(fun f -> printfn "read %A from SQL" f.lastName) 4

But I am getting an exception, so I need to think about this more

System.InvalidCastException: Unable to cast object of type ‘Microsoft.Azure.Documents.QueryResult’ to type ‘family’.

   at Microsoft.FSharp.Core.LanguagePrimitives.IntrinsicFunctions.UnboxGeneric[T](Object source)

   at Microsoft.FSharp.Collections.IEnumerator.map@107.DoMoveNext(b& )

   at Microsoft.FSharp.Collections.IEnumerator.MapEnumerator`1.System-Collections-IEnumerator-MoveNext()

   at Microsoft.FSharp.Collections.SeqModule.Iterate[T](FSharpFunc`2 action, IEnumerable`1 source)

   at <StartupCode$FSI_0010>.$FSI_0010.main@()

Stopped due to error

 

In any event, one thing profoundly vexed me: “if I don’t have a document link to an existing database, how do I get documents out of the database?”  I started Googling around a bit and found this helpful post on Stack Overflow.

It is makes some sense then to use queries to traverse data base and collections by using queries –> esp because they are using linq.  I fired up a new script, put the stack overflow code in,

1 let client = new DocumentClient(new Uri(endpointUrl), authKey) 2 let database = client.CreateDatabaseQuery().Where(fun db -> db.Id = "FamilyRegistry" ).ToArray().FirstOrDefault() 3 printfn "%s" database.SelfLink

and wammo blamo:

image

I then went back to stack overflow to see if there was a more idiomatic way to interact with the documents and  Panagiotis Kanavos was kind of enough to answer my question here.  Of the different possibilities offered, I settled on this style:

1 let database = client.CreateDatabaseQuery() |> Seq.filter(fun db -> db.Id = "FamilyRegistry") 2 |> Seq.head 3 printfn "%s" database.SelfLink

And it works like a champ.

You can find the gist here