Statistics | Jamie Dixon's Home

Consuming and Analyzing Census Data Using F#

August 19, 2014 3 Comments

As part of my Nerd Dinner refactoring, I wanted to add the ability to guess a person’s age and gender based on their name. I did a quick search on the internet and the only place that I found that has an API is here and it doesn’t have everything I am looking for. Fortunately, the US Census website has some flat files with the kind of data I am looking for here.

I grabbed the data and pumped it into Azure Blob Storage here. You can swap out the state code to get each dataset. I then loaded in a list of State Codes found here that match to the file names.

I then fired up Visual Studio and created a new FSharp project. I added FSharp.Data to use a Type Provider to access the data. I don’t need to install the Azure Storage .dlls because the blobs are public and I just have to read the files.

Once NuGet was done with its magic, I opened up the script file, pointed to the newly-installed FSharp.Data, and added a reference to the datasets on blob storage:

#r "../packages/FSharp.Data.2.0.9/lib/portable-net40+sl5+wp8+win8/FSharp.Data.dll"
open FSharp.Data


type censusDataContext = CsvProvider<"https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/AK.TXT">
type stateCodeContext = CsvProvider<"https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/states.csv">

(Note that I am going to add FSharp as a language to my Live Writer code snippet add-in at a later date)

In any event, I then printed out all of the codes to see what it looks like:

let stateCodes =  stateCodeContext.Load("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/states.csv");
stateCodes.Rows |> Seq.iter(fun r -> printfn "%A" r)

And by changing the lambda slightly like so,

stateCodes.Rows |> Seq.iter(fun r -> printfn "%A" r.Abbreviation)

I get all of the state codes

I then tested the census data with similar code, and the results were as expected:

// AK is the state code for Alaska (Arkansas is AR)
let alaskaData = censusDataContext.Load("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/AK.TXT")
alaskaData.Rows |> Seq.iter(fun r -> printfn "%A" r)

So then I created a function to load all of the state census data and give me the total number of rows:

let stateCodes =  stateCodeContext.Load("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/states.csv");
let usaData = stateCodes.Rows 
                |> Seq.collect(fun r -> censusDataContext.Load(System.String.Format("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/{0}.TXT",r.Abbreviation)).Rows)
                |> Seq.length

Since this is an I/O-bound operation, it made sense to load the data asynchronously, which sped things up considerably. You can see my question over on Stack Overflow here, and the resulting code takes about 50% of the time on my dual-processor machine:

let stopwatch = System.Diagnostics.Stopwatch()
stopwatch.Start()
let fetchStateDataAsync(stateCode:string)=
    async{
        let uri = System.String.Format("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/{0}.TXT",stateCode)
        let! stateData =  censusDataContext.AsyncLoad(uri)
        return stateData.Rows
    }


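let usaData' = stateCodes.Rows
                    |> Seq.map(fun r -> fetchStateDataAsync(r.Abbreviation))
                    |> Async.Parallel
                    |> Async.RunSynchronously
                    |> Seq.collect id
                    |> Seq.toArray   // keep the rows themselves; they are grouped and filtered below
stopwatch.Stop()
// TotalSeconds is the whole elapsed time; Seconds is only the 0-59 component
printfn "Parallel: %d rows in %A seconds" usaData'.Length stopwatch.Elapsed.TotalSeconds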
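
To see the pattern without touching blob storage, here is a minimal sketch. It assumes a made-up fakeFetchAsync that sleeps for 200 ms in place of the real download; the state codes and row values are placeholders, not census data. It runs the same four "downloads" one after another and then through Async.Parallel, so the two timings can be compared.

```fsharp
open System.Diagnostics

// Hypothetical stand-in for a download: it sleeps instead of hitting the network.
let fakeFetchAsync (stateCode: string) =
    async {
        do! Async.Sleep 200   // simulated I/O latency
        return [ stateCode + "-row1"; stateCode + "-row2" ]
    }

let codes = [ "AK"; "AL"; "AR"; "AZ" ]

// Sequential: each fetch waits for the previous one to finish.
let sequentialTimer = Stopwatch.StartNew()
let sequentialRows =
    codes
    |> List.collect (fun c -> fakeFetchAsync c |> Async.RunSynchronously)
sequentialTimer.Stop()

// Parallel: all fetches are started together and awaited as a group.
let parallelTimer = Stopwatch.StartNew()
let parallelRows =
    codes
    |> List.map fakeFetchAsync
    |> Async.Parallel
    |> Async.RunSynchronously
    |> Seq.collect id
    |> Seq.toList
parallelTimer.Stop()

printfn "Sequential: %.2fs" sequentialTimer.Elapsed.TotalSeconds
printfn "Parallel:   %.2fs" parallelTimer.Elapsed.TotalSeconds
```

Async.Parallel returns the results in the same order as the inputs, so both versions produce the same list of rows; only the waiting overlaps.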

With the data in hand, it was time to analyze it to see what we can do with it. Since 23 seconds is a bit too long to wait for a page load :-), I will need to put the 5.5 million records into a format that can be easily searched. The questions we want to answer are:

Given a name, what is the gender?

Given a name, what is the age?

Given a name, what is their state of birth?

Also, since we have their current location, we can also input the name and location and answer those questions. If we make the assumption that their location is the same as their birth state, we can narrow down the list even further.

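In any event, I first grouped the rows by name. (The files appear to have no header row, so the type provider names each column after the first data row. That is why the name column shows up as r.Mary, the gender column as r.F, and the count column as r.``14``.)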

let nameSum = usaData' 
                |> Seq.groupBy(fun r -> r.Mary)
                |> Seq.toArray

And then I summed up the counts of the names

let nameSum = usaData' 
                |> Seq.groupBy(fun r -> r.Mary)
                |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``)) 
                |> Seq.toArray

And then the total in the set:

let totalNames = nameSum |> Seq.sumBy(fun (n,c) -> c)

And then calculated each name’s share of the total and sorted it descending:

let nameAverage = nameSum 
                    |> Seq.map(fun (n,c) -> n,c,float c/ float totalNames)
                    |> Seq.sortBy(fun (n,c,a) -> -a)
                    |> Seq.toArray

So I feel really special that my parents gave me the most popular name in the US ever…

And focusing back on the task at hand, I want to determine the probability that a person is male or female based on their name:

let nameSearch = usaData'
                    |> Seq.filter(fun r -> r.Mary = "James")
                    |> Seq.groupBy(fun r -> r.F)
                    |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``)) 
                    |> Seq.toArray

So 18196 parents thought it would be a good idea to name their daughter ‘James’. I created a quick function like so:

let nameSearch' name = 
    let nameFilter = usaData'
                        |> Seq.filter(fun r -> r.Mary = name)
                        |> Seq.groupBy(fun r -> r.F)
                        |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``)) 

    let nameSum = nameFilter |> Seq.sumBy(fun (n,c) -> c)
    nameFilter 
        |> Seq.map(fun (n,c) -> n, c, float c/float nameSum) 
        |> Seq.toArray

nameSearch' "James"

So if I see the name “James”, there is a 99% chance it is a male. This can lead to a whole host of questions like variance of names, names that are closest to gender neutral, etc…. Leaving those questions to another day, I now have something I can put into Nerd Dinner. Now, if there was only a way to handle nicknames and friendly names….
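
Since nameSearch' rescans every row on each call, one way to get the "easily searched" format mentioned earlier is to aggregate once into a Map keyed by name. The following is only a sketch: the NameRow record, the sample counts, and the genderLookup and lookupGender names are made up for illustration and stand in for the type provider rows.

```fsharp
// Hypothetical record shape; the real rows come from the CsvProvider.
type NameRow = { Name: string; Gender: string; Count: int }

// A tiny made-up sample, for illustration only.
let sampleRows =
    [ { Name = "James"; Gender = "M"; Count = 990 }
      { Name = "James"; Gender = "F"; Count = 10 }
      { Name = "Mary"; Gender = "F"; Count = 500 } ]

// Aggregate once: name -> list of (gender, count, share of that name).
let genderLookup =
    sampleRows
    |> Seq.groupBy (fun r -> r.Name)
    |> Seq.map (fun (name, rows) ->
        let byGender =
            rows
            |> Seq.groupBy (fun r -> r.Gender)
            |> Seq.map (fun (g, rs) -> g, rs |> Seq.sumBy (fun r -> r.Count))
            |> Seq.toList
        let total = byGender |> List.sumBy snd
        name, byGender |> List.map (fun (g, c) -> g, c, float c / float total))
    |> Map.ofSeq

// Each query is now a map lookup rather than a scan of every row.
let lookupGender name = genderLookup |> Map.tryFind name

printfn "%A" (lookupGender "James")
```

The trade-off is memory for speed: the map holds one entry per distinct name, which should be far smaller than the raw rows.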

You can see the full code here.

Filed under F#, Statistics

Screen Scraping College Football Statistics

December 24, 2013 3 Comments

As a follow-up to my post on the correlation of Academic Ranking and Football Rankings in the Big Ten, I thought I would look at the relationship between two different kinds of Football Rankings: the recruiting ranking assigned by Rivals and the actual results on the field. To that end, I went to collect the data programmatically because I am doing a time-series analysis and I didn’t want to do data entry.

My first stop was to find a free service that exposes this data on the web. No luck – either the data was a service that cost money or the data was presented as a web page. Since I have never screen-scraped using F# (and I am cheap), I chose option #2.

My first data point was the recruiting ranking found here. When I inspected the source of the page, I caught a break – the data is actually stored as JSON on the page.

So firing up Visual Studio, I created a solution with 1 F# project and 2 C# projects:

I then wrote a unit test to check that something is being returned:

[TestMethod]
public void getRecrutRankings_RetunsExpected()
{
    var rankings = RankingProvider.getRecrutRankings("2012");
    Assert.AreNotEqual(0, rankings.Length);
}

I then went over to the F# project. I created the RankingProvider type and then added a function that pulls in the rankings for a given year:

static member getRecrutRankings(year) =
    let url = "http://sports.yahoo.com/footballrecruiting/football/recruiting/teamrank/"+year+"/BIG10/all";
    let request = WebRequest.Create(Uri(url)) 
    use response = request.GetResponse() 
    use stream = response.GetResponseStream() 
    use reader = new IO.StreamReader(stream) 
    let htmlString = reader.ReadToEnd()
    let startPosition = htmlString.IndexOf("var rankingsTableData =")
    let headerLength = 23
    let endPosition = htmlString.IndexOf(";",startPosition)
    let data = htmlString.Substring(startPosition+headerLength,endPosition-startPosition-headerLength).Trim()
    let results = JsonConvert.DeserializeObject(data)
    let castedResults = results :?> Newtonsoft.Json.Linq.JArray
                                            |> Seq.map(fun x -> (x.Value<string>("name"), Int32.Parse(x.Value<string>("rank"))))
                                            |> Seq.toList
    castedResults

A couple of things to note.

Lines 2 through 12 are language-agnostic. You would write the exact same code in C#/VB.NET with a slightly different syntax.
Line 13 is where things get interesting. I used the :?> operator to cast the JSON to a typed structure. :?> wins as the weirdest symbol I have ever used in computer programming. I guess I haven’t been programming long enough?
Lines 14 and 15 are where you can see why F# is better than C#. I created a function that takes the JSON and pushes it into a tuple. With no explicit iteration, the code is both easier to read and less likely to have bugs.

Hoping to press my luck, I went over to the other page (the one that holds the standings from the actual games) to see if they used JSON. No dice – so back to mid-2000s screen scraping. I created a function that loads the table into an XML document and then searches for a given school.

static member getConferenceStanding(year, school) =
    let url = "http://espn.go.com/college-football/conferences/standings/_/id/5/year/"+year+"/big-ten-conference";         
    let request = WebRequest.Create(Uri(url)) 
    use response = request.GetResponse() 
    use stream = response.GetResponseStream() 
    use reader = new IO.StreamReader(stream) 
    let htmlString = reader.ReadToEnd()
    let divMarkerStartPosition = htmlString.IndexOf("my-teams-table");
    let tableStartPosition = htmlString.IndexOf("<table",divMarkerStartPosition);
    let tableEndPosition = htmlString.IndexOf("</table",tableStartPosition);
    let data = htmlString.Substring(tableStartPosition, tableEndPosition- tableStartPosition+8)
    let xmlDocument = new XmlDocument();
    xmlDocument.LoadXml(data);
    let keyNode = xmlDocument.GetElementsByTagName("td")
                    |> Seq.cast<XmlNode>
                    |> Seq.find (fun node -> node.InnerText = school)
    let valueNode = keyNode.NextSibling
    (keyNode.InnerText, valueNode.InnerText)

A couple of things to note:

Lines 2-7 are identical to the prior function, so they should be combined into a single function that can be tested independently.
Lines 8-13 are language-agnostic. You would write the exact same code in C#/VB.NET with a slightly different syntax.
Lines 14-18 are where F# really shines. Like the prior function, by using functional programming techniques in F#, I saved myself time, avoided bugs, and made the code much more intuitive.
I am making a web call for each function call – this should be optimized so the call is made once and the xmlDocument is passed in. This would also make the function much more testable (even without a mocking framework).
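
The refactoring suggested in these notes might look something like the sketch below. It is not the code from the project: extractTable and findStanding are hypothetical names, and sampleHtml is a made-up fragment standing in for the downloaded page. The idea is that the download happens once elsewhere, and the parsing functions only take strings, so they can be tested without a web call.

```fsharp
open System.Xml

// Pure helper: cut the first <table>...</table> that follows a marker out of an HTML string.
let extractTable (marker: string) (html: string) =
    let markerPos = html.IndexOf(marker)
    let tableStart = html.IndexOf("<table", markerPos)
    let tableEnd = html.IndexOf("</table>", tableStart)
    html.Substring(tableStart, tableEnd - tableStart + "</table>".Length)

// Pure helper: find the cell holding the school and return it with the cell after it.
let findStanding (school: string) (tableXml: string) =
    let doc = XmlDocument()
    doc.LoadXml(tableXml)
    let keyNode =
        doc.GetElementsByTagName("td")
        |> Seq.cast<XmlNode>
        |> Seq.find (fun node -> node.InnerText = school)
    keyNode.InnerText, keyNode.NextSibling.InnerText

// Made-up page fragment standing in for the downloaded HTML.
let sampleHtml =
    """<div class="my-teams-table"><table><tr><td>Ohio State</td><td>8-0</td></tr><tr><td>Iowa</td><td>2-6</td></tr></table></div>"""

// Extract the table once, then look up as many schools as needed.
let table = extractTable "my-teams-table" sampleHtml
let standing = findStanding "Iowa" table
printfn "%A" standing
```

With this split, a unit test can feed in a saved copy of the page, and the per-school lookups reuse one extracted table instead of twelve downloads.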

Next up, I needed to call this function for each of the Big Ten Schools:

static member getConferenceStandings(year)=
    let schools =[|"Nebraska";"Michigan";"Northwestern";"Michigan State";"Iowa";
        "Minnesota";"Ohio State";"Penn State";"Wisconsin"; "Purdue"; "Indiana"; "Illinois"|]
    Seq.map(fun school -> RankingProvider.getConferenceStanding(year,school)) schools
        |> Seq.sortBy snd
        |> Seq.toList
        |> List.rev

This is purely F# and was a pure joy to write (and took the least amount of time). Note that the sort is on the second element of the tuple and that the list is reversed because the second element is the win-loss record, so F# is sorting ascending on the number of wins. Since Seq does not have a rev function, I turned it into a List, which does have the rev function.
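
One caveat with sorting on snd: the second element is text, so the comparison is character by character. That is fine while every team has a single-digit win count, but "10-2" would sort before "9-3". A hedged sketch of a numeric sort follows, assuming each record looks like "W-L"; the wins helper, the team names, and the records are all made up for illustration.

```fsharp
// Assumes each record looks like "W-L", e.g. "8-0"; that format is an assumption.
let wins (record: string) = record.Split('-').[0] |> int

// Made-up standings for illustration.
let sample = [ ("Team A", "9-3"); ("Team B", "10-2"); ("Team C", "4-8") ]

// Text sort then reverse, as in getConferenceStandings: "10-2" lands last.
let byText = sample |> List.sortBy snd |> List.rev

// Numeric sort on the parsed win count, most wins first.
let byWins = sample |> List.sortBy (fun (_, record) -> -(wins record))

printfn "By text: %A" (byText |> List.map fst)
printfn "By wins: %A" (byWins |> List.map fst)
```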

Some might ask “Why didn’t you use type-providers?” My answer is “I tried, but I couldn’t get them to work.” For example, here is the code that I used for the type provider when parsing the xmlDocument:

xmlDocument.LoadXml(data);
let document = XmlProvider<xmlDocument>

The problem is that the type provider expects a uri (and I can’t find an overload to pass in the document). It looks like type providers are more designed for providers that are ready to, well, provide (Web Services, Databases, etc..) versus jerry-rigged data (like screen scraping).

In any event, with these two functions ready, I went to the UI project and decided to see how the teams did in 2012 on the field compared to how the teams did in recruiting 2 years before:

static void Main(string[] args)
{
    Console.WriteLine("Start");
 
    Console.WriteLine("-------Rankings");
    var rankings = RankingProvider.getRecrutRankings("2010");
    foreach (var school in rankings)
    {
        Console.WriteLine(school.Item1 + ":" + school.Item2);
    }
 
    Console.WriteLine("-------Standings");
    var standings = RankingProvider.getConferenceStandings("2012");
    foreach (var school in standings)
    {
        Console.WriteLine(school.Item1 + ":" + school.Item2);
    }
 
    Console.WriteLine("End");
    Console.ReadKey();
}

And the results:

I have no idea if a 2-year lag between recruiting and rankings is the right number – perhaps an analysis of the correct lag will be done. After all, between red-shirt freshmen, transfer rules, and attrition, there are plenty of variables that determine when a recruiting class has the biggest impact. Also, the standings are a blend of recruiting classes and since I am not evaluating individual players, I can’t go to that level of detail. 2 years out seems reasonable, but as Bluto famously once said

static member getBlutoQuote() =
    "Seven years of college down the drain.";

the average might be different. In any event, I now have the data I want so the next step is to analyze it to see if there is any correlation. At first glance, there might be something – the top 4 schools for recruiting all finished in the top 4 in the standings – but the bottom 4 is more muddled with only Illinois doing poorly in both recruiting and the standings.

More to come…

Filed under F#, Statistics

F# > C# when doing math

December 10, 2013 5 Comments

My friend/coworker Rob Seder sent me this code project link and said it might be an interesting exercise to duplicate what he had done in F#. Interesting indeed! Challenge accepted!

I first created a solution like so:

I then copied the Variance calculation from the post to the C# implementation:

public class Calculations
{
    public static Double Variance(IEnumerable<Double> source)
    {
        int n = 0;
        double mean = 0;
        double M2 = 0;
 
        foreach (double x in source)
        {
            n = n + 1;
            double delta = x - mean;
            mean = mean + delta / n;
            M2 += delta * (x - mean);
        }
        return M2 / (n - 1);
    }
}

I then created a couple of unit tests for the method and made sure that the results ran green:

[TestClass]
public class CSharpCalculationsTests
{
    [TestMethod]
    public void VarianceOfSameNumberReturnsZero()
    {
        Collection<Double> source = new Collection<double>();
        source.Add(2.0);
        source.Add(2.0);
        source.Add(2.0);
 
        double expected = 0;
        double actual = Calculations.Variance(source);
        Assert.AreEqual(expected, actual);
    }
 
    [TestMethod]
    public void VarianceOfOneAwayNumbersReturnsOne()
    {
        Collection<Double> source = new Collection<double>();
        source.Add(1.0);
        source.Add(2.0);
        source.Add(3.0);
 
        double expected = 1;
        double actual = Calculations.Variance(source);
        Assert.AreEqual(expected, actual);
    }    
}

I then spun up the same unit tests to test the F# implementation and then went over to the F# project. My first attempt started along these lines:

namespace Tff.BasicStats.FSharp
 
open System
open System.Collections.Generic
 
type Calculations() = 
    static member Variance (source:IEnumerable<double>) =
        let mean = Seq.average(source)
        let deltas = Seq.map(fun x -> x-mean) source
        let deltasSum = Seq.sum deltas
        let deltasLength = Seq.length deltas
        deltasSum/(double)deltasLength

I then realized that I was writing procedural code in F# – I was not taking advantage of the expressiveness that the language provides. I also realized that looking at the C# code to understand how to calculate Variance was useless – I was getting lost in the loop and the poorly-named variables. I went over to Wikipedia’s definition to see if that could help me understand Variance better, but I got lost in all of the formulas. I then binged Variance on Google and one of the first links is MathIsFun with this explanation. This was more like it! Cool dog pictures and a stupid simple recipe for calculating Variance. The steps are:

Work out the mean (the simple average of the numbers).
For each number, subtract the mean and square the result.
Work out the average of those squared differences.

I hopped over to Visual Studio and wrote a one-for-one line of code to match the recipe:

namespace Tff.BasicStats.FSharp
 
open System
open System.Collections.Generic
 
type Calculations() = 
    static member Variance (source:IEnumerable<double>) =
        let mean = Seq.average source
        let deltas = Seq.map(fun x -> sqrt(x-mean)) source
        Seq.average deltas

I ran the unit tests but they were running red! I was getting a NaN.

Hearing my cursing, my 7th grade son came over and said – “Dad, that is wrong. You don’t use the square root on the (x-mean), you square it. Also, you can’t take the square root of a negative number, and any item in that list that is less than the average will give you that NaN.” Let me repeat that – a 7th grader with no coding experience but who knows about Variance from his math class just read the code and found the problem.

I then changed the code to square the value like so:

namespace Tff.BasicStats.FSharp
 
open System
open System.Collections.Generic
 
type Calculations() = 
    static member Variance (source:IEnumerable<double>) =
        let mean = Seq.average source
        let deltas = Seq.map(fun x -> pown(x-mean) 2) source
        Seq.average deltas

And now my unit test… runs…. Red!

Not understanding why, I turned to the REPL (F# Interactive Window). I first entered my test set:

I then entered the calculation from each line against the test set:
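
A session along these lines can be recreated in F# Interactive as follows. This is a sketch that recomputes the values from the second unit test's inputs rather than a copy of the original session.

```fsharp
// Test set from the second unit test.
let source = [ 1.0; 2.0; 3.0 ]

// Each line of the Variance member, evaluated on its own.
let mean = Seq.average source
let deltas = source |> Seq.map (fun x -> pown (x - mean) 2) |> Seq.toArray
let variance = Seq.average deltas

printfn "mean = %A" mean
printfn "deltas = %A" deltas
printfn "variance = %A" variance
```

The squared differences come out as 1, 0 and 1, so their average is two thirds rather than 1.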

Staring at the resulting array, it hit me that perhaps the original unit test’s expected value was wrong! I went over to TutorVista and entered in my array. Would you believe it?

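The calculation on the Code Project site does not give the same answer! It divides by n - 1, which is the sample variance, while the recipe I followed divides by n, the population variance. For the recipe’s definition, the correct way to do the unit test is: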

[TestMethod]
public void VarianceOfOneAwayNumbersReturnsOne()
{
    Collection<Double> source = new Collection<double>();
    source.Add(1.0);
    source.Add(2.0);
    source.Add(3.0);
 
    //double expected = 6666666667;
    // 2f / 3f is single precision and would not equal the double result exactly
    double expected = 2.0 / 3.0;
    double actual = Calculations.Variance(source);
    Assert.AreEqual(expected, actual, 1e-12);
}    

(Note that expected was the easiest way I could come up with .6 repeating without getting all crazy on the formatting). Now both my unit tests run green and one of the C# ones runs red.
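
For reference, here is the arithmetic for the test set 1, 2, 3 worked out under both common definitions of variance. The symbols are the standard textbook ones, not anything taken from either code base. Dividing by n gives the two thirds that the F# code returns, and dividing by n - 1 gives the 1 that the C# code returns.

```latex
\bar{x} = \frac{1 + 2 + 3}{3} = 2,
\qquad
\sum_{i=1}^{3} (x_i - \bar{x})^2 = (1-2)^2 + (2-2)^2 + (3-2)^2 = 2

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{2}{3}
\quad \text{(population variance)}

s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{2}{2} = 1
\quad \text{(sample variance)}
```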

I have no interest in trying to figure out how to fix that C# code – I care less about how to solve my problem and more about just solving the problem. The real power of F# really is on display here. The coolest parts of this exercise were:

One-for-one correspondence between the steps to solve a problem and the code
The code is much more readable to non developers
By concentrating on how to solve the problem in C#, the original developer lost sight of what he was trying to accomplish. F# focuses you on the result, not the code.
Unit tests can be wrong – if you let your code’s result drive the expected value and not an external source.

Filed under F#, Statistics
