Aggregation of WCPSS Tax Records with School Assignment
March 17, 2015 Leave a comment
So the next part of my WCPSS hit parade, I need a way of combing the screen scrape that I did from the Wake County Tax Records as described here and the screen scrape of the Wake County Public School Assignments as found here. Getting data from the DocumentDb is straight foreword as long as you don’t ask too much from the query syntax.
I created two functions that pull the tax record and the school assignment via the index number:
1 let getAssignment (id:int) = 2 let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "houseassignment").ToArray().FirstOrDefault() 3 let documentLink = collection.SelfLink 4 let queryString = "SELECT * FROM houseassignment WHERE houseassignment.houseIndex = " + id.ToString() 5 let query = client.CreateDocumentQuery(documentLink,queryString) 6 match query |> Seq.length with 7 | 0 -> None 8 | _ -> 9 let assignmentValue = query |> Seq.head 10 let assignment = HouseAssignment.Parse(assignmentValue.ToString()) 11 Some assignment 12 13 let getValuation (id:int) = 14 let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "taxinformation").ToArray().FirstOrDefault() 15 let documentLink = collection.SelfLink 16 let queryString = "SELECT * FROM taxinformation WHERE taxinformation.index = 1" 17 let query = client.CreateDocumentQuery(documentLink,queryString) 18 match query |> Seq.length with 19 | 0 -> None 20 | _ -> 21 let valuationValue = query |> Seq.head 22 let valuation = HouseValuation.Parse(valuationValue.ToString()) 23 Some valuation
Note option types are being used because there any many index values where there is not a corresponding record. Also, there might a situation where the assignment has a record but the valuation does not and vice-versa so I created a function to only put the records together where there both records:
1 let assignSchoolTaxBase (id:int) = 2 let assignment = getAssignment(id) 3 let valuation = getValuation(id) 4 match assignment.IsSome,valuation.IsSome with 5 | true, true -> assignment.Value.Schools 6 |> Seq.map(fun s -> s, valuation.Value.AssessedValue) 7 |> Some 8 | _ -> None
And running this on the first record, we are getting expected.
Also, running it on an index where there there is not a record, we are also getting expected
With the matching working, we need a way of bring all of the school arrays together and then aggregating the tax valuation. I decided to take a step by step approach to this, even though there might be a more terse way to write it.
1 #time 2 indexes |> Seq.map(fun i -> assignSchoolTaxBase(i)) 3 |> Seq.filter(fun s -> s.IsSome) 4 |> Seq.collect(fun s -> s.Value) 5 |> Seq.groupBy(fun (s,av) -> s) 6 |> Seq.map(fun (s,ss) -> s,ss |> Seq.sumBy(fun (s,av)-> av)) 7 |> Seq.toArray
When I run it on the 1st 10 records, the values come back as expected
So the last step is to run it on all 350,000 indexes (let indexes = [|1..350000|]). The problem is that after a long period of time, things were not returning. So this is where the power of Azure comes in –> there is no problem so large I can’t thow more cores at it. I went to management portal and increased the VM to 8 cores
I then went into the code base and added pseq for the database calls (which I assume was taking the longest time):
1 #time 2 let indexes = [|1..350000|] 3 let assignedValues = indexes |> PSeq.map(fun i -> assignSchoolTaxBase(i)) |> Seq.toArray 4 5 let filePath = @"C:\Git\WakeCountySchoolScores\SchoolValuation.csv" 6 7 assignedValues 8 |> Seq.filter(fun s -> s.IsSome) 9 |> Seq.collect(fun s -> s.Value) 10 |> Seq.groupBy(fun (s,av) -> s) 11 |> Seq.map(fun (s,ss) -> s,ss |> Seq.sumBy(fun (s,av)-> av)) 12 |> Seq.map(fun (s,v) -> s + "," + v.ToString() + Environment.NewLine) 13 |> Seq.iter(fun (s) -> File.AppendAllText(filePath, s))
and after 2 hours: