Creating a crosswalk table between WCPSS School assignment results and school report card school list
March 31, 2015 1 Comment
As part of my Wake County School Score analysis, I needed to build a cross-walk table between the Wake County School Site parsing that I did here and the school score result set. The screen scraping put schools in this kind of format:
as an added wrinkle, there is some extra data for some of the schools:
while the score result set is in this format:
So I want to create a cross-walk table with this format:
Step one of this process is to get all of the distinct values from the school site data. If the data was in a sql server database, getting that would as simple as
“Select distinct shoolName from reallyBigDenormalizedTable”
But the data is not in Sql Server, it is in a no-sql database and the json is structured where the names are in an array inside the data structure. After messing around with the query syntax to traverse the nested array, I gave up and decided to sample the database.
Step one was to get a record out via the index
1 let getSchools (index:int) = 2 try 3 let endpointUrl = "https://chickensoftware.documents.azure.com:443/" 4 let client = new DocumentClient(new Uri(endpointUrl), authKey) 5 let database = client.CreateDatabaseQuery().Where(fun db -> db.Id = "wakecounty" ).ToArray().FirstOrDefault() 6 let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "houseassignment").ToArray().FirstOrDefault() 7 let documentLink = collection.SelfLink 8 let queryString = "SELECT * FROM houseassignment WHERE houseassignment.houseIndex = " + index.ToString() 9 let query = client.CreateDocumentQuery(documentLink,queryString) 10 match Seq.length query with 11 | 0 -> None 12 | _ -> 13 let firstValue = query |> Seq.head 14 let assignment = HouseAssignment.Parse(firstValue.ToString()) 15 Some assignment.Schools 16 with 17 | :? HttpRequestException as ex -> 18 None 19
The next step was to create an array of index numbers that have random values in them. I found this really good extension method to System.Random to populate the array. The next question was “how big does the sample size have to be to get most/all of the schools?” I started seeding the array with different values and ran these functions:
1 let random = new System.Random(42) 2 let indexes = random.GetValues(1,350000) |> Seq.take(10000) |> Seq.toArray 3 let allSchools = indexes |> Seq.map(fun i -> getSchools(i)) |> Seq.toArray 4 let getNumberOfSchools (trial:int) = 5 let trialSchools = allSchools.[1..trial] 6 let allSchools' = trialSchools |> Seq.filter(fun s -> s.IsSome) 7 let allSchools'' = allSchools' |> Seq.collect(fun s -> s.Value) 8 let uniqueSchools = allSchools'' |> Seq.distinct 9 uniqueSchools |> Seq.length 10 11 let trialCount = [|1..9999|] 12 13 trialCount |> Seq.map(fun t -> t, getNumberOfSchools(t)) 14 |> Seq.iter(fun (t, c) -> printfn "%A %A" t c) 15
The sample above shows 10,000 records, which is pretty good. If you graph it, you can see that you get the max values around 2,500.
Unfortunately, there were 11 schools on the report card that were not in the 10,000 set. Confronted with this reality, I did what any reasonable research would do… I dropped them. My guess is that these schools are not part of a base school pyramid, rather they are “application schools” like STEM or leadership academies.
In any event, with the list of schools down, I copied them into Excel and sorted them alphabetically. I then put the school score list next to them and started matching. Within 15 minutes, I had a creditable crosswalk table.
You can see the gist here
Pingback: F# Weekly #13, 2015 | Sergey Tihon's Blog