Consuming and Analyzing Census Data Using F#
August 19, 2014
As part of my Nerd Dinner refactoring, I wanted to add the ability to guess a person’s age and gender based on their name. I did a quick search on the internet and the only place that I found that has an API is here and it doesn’t have everything I am looking for. Fortunately, the US Census website has some flat files with the kind of data I am looking for here.
I grabbed the data and pumped it into Azure Blob Storage here. You can swap out the state code to get each dataset. I then loaded in a list of State Codes found here that match to the file names.
I then fired up Visual Studio and created a new F# project. I added FSharp.Data to use a Type Provider to access the data. I don't need to install the Azure Storage .dlls because the blobs are public and I just have to read the files.
Once NuGet was done with its magic, I opened up the script file, pointed to the newly-installed FSharp.Data, and added a reference to the datasets on blob storage:
#r "../packages/FSharp.Data.2.0.9/lib/portable-net40+sl5+wp8+win8/FSharp.Data.dll"

open FSharp.Data

type censusDataContext = CsvProvider<"https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/AK.TXT">
type stateCodeContext = CsvProvider<"https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/states.csv">
(Note that I am going to add F# as a language to my Live Writer code snippet add-in at a later date.)
In any event, I then printed out all of the codes to see what it looks like:
let stateCodes = stateCodeContext.Load("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/states.csv")

stateCodes.Rows
|> Seq.iter (fun r -> printfn "%A" r)
And by changing the lambda slightly like so,
stateCodes.Rows |> Seq.iter(fun r -> printfn "%A" r.Abbreviation)
I get just the state abbreviations.
I then tested the census data with the following code, and the results were as expected (note that AK is the code for Alaska):
let alaskaData = censusDataContext.Load("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/AK.TXT")

alaskaData.Rows
|> Seq.iter (fun r -> printfn "%A" r)
So then I wrote an expression that loads the census data for every state and gives me the total number of rows:
let stateCodes = stateCodeContext.Load("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/states.csv")

let usaData =
    stateCodes.Rows
    |> Seq.collect (fun r ->
        censusDataContext.Load(System.String.Format("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/{0}.TXT", r.Abbreviation)).Rows)
    |> Seq.length
Since this is an I/O-bound operation, it made sense to load the data asynchronously, which sped things up considerably. You can see my question over on Stack Overflow here, and the resulting code takes about 50% of the time on my dual-processor machine:
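let stopwatch = System.Diagnostics.Stopwatch()
stopwatch.Start()

let fetchStateDataAsync (stateCode: string) =
    async {
        let uri = System.String.Format("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/{0}.TXT", stateCode)
        let! stateData = censusDataContext.AsyncLoad(uri)
        return stateData.Rows
    }

// Keep the rows themselves (not just their count) so the analysis further down can reuse them
let usaData' =
    stateCodes.Rows
    |> Seq.map (fun r -> fetchStateDataAsync r.Abbreviation)
    |> Async.Parallel
    |> Async.RunSynchronously
    |> Seq.collect id
    |> Seq.toArray

stopwatch.Stop()
printfn "Rows: %d" usaData'.Length
// TotalSeconds reports the whole elapsed time; Seconds is only the 0-59 component
printfn "Parallel: %.1f s" stopwatch.Elapsed.TotalSeconds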
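To see why the parallel version helps without touching the network, here is a minimal, self-contained sketch. It assumes a made-up `fetchFake` function that simply waits 200 ms in place of a real download; the state list and the `time` helper are also hypothetical names of mine, not part of the code above.

```fsharp
open System.Diagnostics

// Hypothetical stand-in for a per-state download: waits 200 ms, then returns one fake row.
let fetchFake (stateCode: string) =
    async {
        do! Async.Sleep 200
        return [ sprintf "%s,F,1910,Mary,14" stateCode ]
    }

let codes = [ "AK"; "AL"; "AR"; "AZ" ]

// Small timing helper: runs the work, prints the elapsed time, returns the result.
let time label (work: unit -> 'a) =
    let sw = Stopwatch.StartNew()
    let result = work ()
    printfn "%s: %d ms" label sw.ElapsedMilliseconds
    result

// One request at a time: the waits add up.
let sequentialRows =
    time "Sequential" (fun () ->
        codes |> List.collect (fun c -> fetchFake c |> Async.RunSynchronously))

// All requests started together: the waits overlap.
let parallelRows =
    time "Parallel" (fun () ->
        codes
        |> List.map fetchFake
        |> Async.Parallel
        |> Async.RunSynchronously
        |> List.concat)
```

Both versions return the same rows in the same order, since Async.Parallel hands results back in input order; only the waiting overlaps.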
With the data in hand, it was time to analyze it to see what we can do with it. Since 23 seconds is a bit too long to wait for a page load, I will need to put the 5.5 million records into a format that can be easily searched. The questions I want to answer are:
Given a name, what is the gender?
Given a name, what is the age?
Given a name, what is their state of birth?
Since we also have their current location, we can input both the name and the location to answer those questions. If we make the assumption that their location is the same as their birth state, we can narrow down the list even further.
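One way to get a "format that can be easily searched" is to group the rows by name once and keep the result in a dictionary. The sketch below is only an illustration under my own assumptions: the `NameRow` record, its field names, and the sample counts are all made up, and a real version would be filled from the census rows.

```fsharp
// Hypothetical record shape; the field names are my own labels for the five census columns.
type NameRow = { State: string; Gender: string; Year: int; Name: string; Count: int }

// Made-up sample rows for illustration only; not real census figures.
let rows =
    [ { State = "AK"; Gender = "F"; Year = 1910; Name = "Mary"; Count = 14 }
      { State = "NC"; Gender = "F"; Year = 1910; Name = "Mary"; Count = 837 }
      { State = "NC"; Gender = "M"; Year = 1970; Name = "James"; Count = 1200 }
      { State = "AK"; Gender = "M"; Year = 1970; Name = "James"; Count = 40 } ]

// Group once up front; each lookup is then a dictionary hit instead of a full scan.
let byName = rows |> List.groupBy (fun r -> r.Name) |> dict

// Returns all rows for a name, optionally narrowed to a single state.
let lookup (name: string) (state: string option) =
    match byName.TryGetValue name with
    | true, found ->
        match state with
        | Some s -> found |> List.filter (fun r -> r.State = s)
        | None -> found
    | _ -> []

let allJames = lookup "James" None
let ncJames = lookup "James" (Some "NC")
```

Passing `Some "NC"` applies the "location equals birth state" assumption; passing `None` searches the whole country.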
In any event, I first added a GroupBy to the name:
let nameSum =
    usaData'
    |> Seq.groupBy (fun r -> r.Mary)
    |> Seq.toArray
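A note on the odd-looking `r.Mary`: the census files appear to have no header line, so the type provider seems to have taken the first data row (something like AK,F,1910,Mary,14) as the column names, which would also explain `r.F` and the ``14`` column used further down. If you would rather have readable names, a sketch like the following should work. It assumes a recent `dotnet fsi` that supports `#r "nuget: ..."`, and the column names State, Gender, Year, Name, and Count are my own labels, not anything taken from the files.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Assumption: each census row looks like "AK,F,1910,Mary,14" with no header line.
// The inline sample string stands in for the real blob URL.
type NamedCensus =
    CsvProvider<"AK,F,1910,Mary,14",
                HasHeaders = false,
                Schema = "State (string),Gender (string),Year (int),Name (string),Count (int)">

// Two made-up rows in the same shape, parsed from text instead of loaded from a URL.
let sample = NamedCensus.Parse("AK,F,1910,Mary,14\nAK,F,1910,Annie,12")

for row in sample.Rows do
    printfn "%s %s %d %s %d" row.State row.Gender row.Year row.Name row.Count
```

With `HasHeaders = false` the first row is also kept as data instead of being consumed as a header.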
And then I summed up the counts of the names:
let nameSum =
    usaData'
    |> Seq.groupBy (fun r -> r.Mary)
    |> Seq.map (fun (n, a) -> n, a |> Seq.sumBy (fun r -> r.``14``))
    |> Seq.toArray
And then the total in the set:
let totalNames = nameSum |> Seq.sumBy(fun (n,c) -> c)
And then I computed each name's share of the total and sorted the result in descending order:
let nameAverage =
    nameSum
    |> Seq.map (fun (n, c) -> n, c, float c / float totalNames)
    |> Seq.sortBy (fun (n, c, a) -> -a) // negate the share to sort descending
    |> Seq.toArray
So I feel really special that my parents gave me the most popular name in the US ever…
Focusing back on the task at hand, I want to determine the probability that a person is male or female based on their name:
let nameSearch =
    usaData'
    |> Seq.filter (fun r -> r.Mary = "James")
    |> Seq.groupBy (fun r -> r.F)
    |> Seq.map (fun (n, a) -> n, a |> Seq.sumBy (fun r -> r.``14``))
    |> Seq.toArray
So 18196 parents thought it would be a good idea to name their daughter 'James'. I created a quick function like so:
let nameSearch' name =
    let nameFilter =
        usaData'
        |> Seq.filter (fun r -> r.Mary = name)
        |> Seq.groupBy (fun r -> r.F)
        |> Seq.map (fun (n, a) -> n, a |> Seq.sumBy (fun r -> r.``14``))
    let nameSum = nameFilter |> Seq.sumBy (fun (n, c) -> c)
    nameFilter
    |> Seq.map (fun (n, c) -> n, c, float c / float nameSum)
    |> Seq.toArray

nameSearch' "James"
So if I see the name “James”, there is a 99% chance it is a male. This can lead to a whole host of questions like variance of names, names that are closest to gender neutral, etc…. Leaving those questions to another day, I now have something I can put into Nerd Dinner. Now, if there was only a way to handle nicknames and friendly names….
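As a rough idea of how the "closest to gender neutral" question could be approached, here is a small sketch. It is not something I have run against the census data: the `totals` list holds made-up counts, and ranking by distance from a 50/50 split is just one possible definition of "neutral".

```fsharp
// Made-up (name, gender, count) totals for illustration only; not real census figures.
let totals =
    [ "James", "M", 990
      "James", "F", 10
      "Jordan", "M", 600
      "Jordan", "F", 400
      "Mary", "F", 995
      "Mary", "M", 5 ]

// For each name, the fraction of all births with that name that were male.
let maleShare =
    totals
    |> List.groupBy (fun (name, _, _) -> name)
    |> List.map (fun (name, group) ->
        let total = group |> List.sumBy (fun (_, _, c) -> c)
        let male =
            group
            |> List.filter (fun (_, g, _) -> g = "M")
            |> List.sumBy (fun (_, _, c) -> c)
        name, float male / float total)

// Names whose male share is closest to 0.5 come first.
let mostNeutralFirst =
    maleShare |> List.sortBy (fun (_, share) -> abs (share - 0.5))

mostNeutralFirst |> List.iter (fun (name, share) -> printfn "%s: %.3f" name share)
```

With these invented numbers, Jordan ranks as the most neutral name, followed by James and then Mary.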
You can see the full code here.