Wake County Voter Analysis Using FSharp, AzureML, and R
August 25, 2015 1 Comment
One of the real strengths of FSharp its ability to plow through and transform data in a very intuitive way, I was recently looking at Wake Country Voter Data found here to do some basic voter analysis. My first thought was to download the data into R Studio. Easy? Not really. The data is available as a ginormous Excel spreadsheet of database of about 154 MB in size. I wanted to slim the dataset down and make it a .csv for easy import into R but using Excel to export the data as a .csv kept screwing up the formatting and importing it directly into R Studio from Excel resulting in out of memory crashes. Also, the results of the different election dates were not consistent –> sometimes null, sometimes not. I managed to get the data into R Studio without a crash and wrote a function of either voted “1” or not “0” for each election
1 #V = voted in-person on Election Day 2 #A = voted absentee by mail or early voting (through May 2006) 3 #M = voted absentee by mail (November 2006 - present) 4 5 #O = voted One-Stop early voting (November 2006 - present) 6 #T = voted at a transfer precinct on Election Day 7 #P = voted a provisional ballot 8 #L = Legacy data (prior to 2006) 9 #D = Did not show 10 11 votedIndicated <- function(votedCode) { 12 switch(votedCode, 13 "V" = 1, 14 "A" = 1, 15 "M" = 1, 16 "O" = 1, 17 "T" = 1, 18 "P" = 1, 19 "L" = 1, 20 "D" = 0) 21 } 22
However, every time I tried to run it, the IDE would crash with an out of memory issue.
Stepping back, I decided to transform the data in Visual Studio using FSharp. I created a sample from the ginormous excel spreadsheet and then imported the data using a type provider. No memory crashes!
1 #r "../packages/ExcelProvider.0.1.2/lib/net40/ExcelProvider.dll" 2 open FSharp.ExcelProvider 3 4 [<Literal>] 5 let samplePath = "../../Data/vrdb-Sample.xlsx" 6 7 open System.IO 8 let baseDirectory = __SOURCE_DIRECTORY__ 9 let baseDirectory' = Directory.GetParent(baseDirectory) 10 let baseDirectory'' = Directory.GetParent(baseDirectory'.FullName) 11 let inputFilePath = @"Data\vrdb.xlsx" 12 let fullInputPath = Path.Combine(baseDirectory''.FullName, inputFilePath) 13 14 type WakeCountyVoterContext = ExcelFile<samplePath> 15 let context = new WakeCountyVoterContext(fullInputPath) 16 let row = context.Data |> Seq.head
I then applied a similar function for voted or not and then exported the data as a .csv
1 let voted (voteCode:obj) = 2 match voteCode = null with 3 | true -> "0" 4 | false -> "1" 5 6 open System 7 let header = "Id,Race,Party,Gender,Age,20080506,20080624,20081104,20091006,20091103,20100504,20100622,20101102,20111011,20111108,20120508,20120717,20121106,20130312,20131008,20131105,20140506,20140715,20141104" 8 9 let createOutputRow (row:WakeCountyVoterContext.Row) = 10 String.Format("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23}", 11 row.voter_reg_num, 12 row.race_lbl, 13 row.party_lbl, 14 row.gender_lbl, 15 row.eoy_age, 16 voted(row.``05/06/2008``), 17 voted(row.``06/24/2008``), 18 voted(row.``11/04/2008``), 19 voted(row.``10/06/2009``), 20 voted(row.``11/03/2009``), 21 voted(row.``05/04/2010``), 22 voted(row.``06/22/2010``), 23 voted(row.``11/02/2010``), 24 voted(row.``10/11/2011``), 25 voted(row.``11/08/2011``), 26 voted(row.``05/08/2012``), 27 voted(row.``07/17/2012``), 28 voted(row.``11/06/2012``), 29 voted(row.``03/12/2013``), 30 voted(row.``10/08/2013``), 31 voted(row.``11/05/2013``), 32 voted(row.``05/06/2014``), 33 voted(row.``07/15/2014``), 34 voted(row.``11/04/2014``) 35 ) 36 37 let outputFilePath = @"Data\vrdb.csv" 38 39 let data = context.Data |> Seq.map(fun row -> createOutputRow(row)) 40 let fullOutputPath = Path.Combine(baseDirectory''.FullName, outputFilePath) 41 42 let file = new StreamWriter(fullOutputPath,true) 43 44 file.WriteLine(header) 45 context.Data |> Seq.map(fun row -> createOutputRow(row)) 46 |> Seq.iter(fun r -> file.WriteLine(r)) 47
The really great thing is that I could write and then dispose of each line so I could do it without any crashes. Once the data was into a a .csv (10% the size of Excel), I could then import it into R Studio without a problem. It is a common lesson but really shows that using the right tool for the job saves tons of headaches.
I knew from a previous analysis of voter data that the #1 determinate of a person from wake county voting in a off-cycle election was their age:
So then in R, I created a decision tree for just age to see what the split was:
1 library(rpart) 2 temp <- rpart(all.voters$X20131008 ~ all.voters$Age) 3 plot(temp) 4 text(temp)
Thanks to Placidia for answering my question on stats.stackoverflow
So basically politicians should be targeting people 50 years or older or perhaps emphasizing issues that appeal to the over 50 crowd.
Pingback: Dew Drop – August 26, 2015 (#2077) | Morning Dew