Fun with Statistics and Charts
September 2, 2014
I am preparing my Raleigh Code Camp submission “Nerd Dinner With Brains” this weekend. If you are not familiar, Nerd Dinner is the canonical example of an MVC application and is very familiar to Web Devs who want to learn MVC the Microsoft way. You can see the walkthrough here. For everything that Nerd Dinner is, it is not … smart. There are no business rules outside of some basic input validation, which is pretty representative of many “Boring Line Of Business Applications” (BLOBAs, according to Scott Wlaschin). Not coincidentally, the lack of business logic is the biggest reason many BLOBAs don’t have many unit tests –> if all you are doing is wire framing a database, what business logic needs to be tested?
The talk is going to take the Nerd Dinner wireframe and inject some analytics into the application. To that end, I first considered the person who is attending the dinner. All we know about them is their name and possibly their location. So what can a name tell you? Turns out, plenty.
As I showed in this post, there is a great source of the number of names given by gender, yearOfBirth, and stateOfBirth from the US census. Picking up where that post left off, I loaded the entire data set into memory.
My first question was, “given a name, can I tell what gender the person is?” This is very straightforward to calculate.
    // r.Mary is the name column, r.F the gender column and r.``14`` the count column
    let genderSearch name =
        // Total the counts for this name per gender
        let nameFilter = usaData
                         |> Seq.filter(fun r -> r.Mary = name)
                         |> Seq.groupBy(fun r -> r.F)
                         |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``))

        let nameSum = nameFilter |> Seq.sumBy(fun (n,c) -> c)
        // Return (gender, count, share of all births with this name)
        nameFilter
        |> Seq.map(fun (n,c) -> n, c, float c/float nameSum)
        |> Seq.toArray

    genderSearch "James"
And the REPL shows me that it is very likely that “James” is a male.
I can then set a confidence threshold in the web.config file above which we treat a name as male or female; I am thinking 75%. Once we have that, the app can respond differently. Perhaps we have a product-placement advertisement that becomes male-focused if we are reasonably certain that the user is a male. Perhaps we can be more subtle and change the theme of the site, or the page navigation, to induce the person to do additional things on the site.
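To make the threshold idea concrete, here is a minimal sketch of how the output of genderSearch could be turned into a guess. The function name guessGender and the sample numbers are my own assumptions for illustration; the threshold would come from web.config rather than being hard-coded.

```fsharp
// Hypothetical helper, not part of the original code.
// 'shares' has the shape genderSearch returns: (gender, count, share).
// Returns Some gender when one gender's share reaches the threshold, else None.
let guessGender (threshold: float) (shares: seq<'g * int * float>) =
    shares
    |> Seq.tryPick (fun (gender, _, share) ->
        if share >= threshold then Some gender else None)

// Made-up numbers purely for illustration, not real counts.
let sample = [| "M", 90, 0.9; "F", 10, 0.1 |]

let confident = guessGender 0.75 sample   // threshold reached by "M"
let unsure = guessGender 0.95 sample      // no gender reaches the threshold
```

Returning an option keeps the "not sure" case explicit, so the site can fall back to its default theme when no gender clears the threshold.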
In any event, I then wanted to tackle age. I spun up some code to isolate a person’s age:
    // Same shape as genderSearch, but grouped on the year column (r.``1910``)
    let ageSearch name =
        let nameFilter = usaData
                         |> Seq.filter(fun r -> r.Mary = name)
                         |> Seq.groupBy(fun r -> r.``1910``)
                         |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``))
                         |> Seq.toArray
        let nameSum = nameFilter |> Seq.sumBy(fun (n,c) -> c)
        // Return (year, count, share of all births with this name)
        nameFilter
        |> Seq.map(fun (n,c) -> n, c, float c/float nameSum)
        |> Seq.toArray
I had no idea if names have a certain age connotation, so I decided to do some basic charting. Isaac Abraham pointed me to FSharp.Charting, which is a great way to do some basic charting for discovery.
    let chartData = ageSearch "James"
                    |> Seq.map(fun (y,c,p) -> y, c)
                    |> Seq.sortBy(fun (y,c) -> y)

    Chart.Line(chartData).ShowChart()
And sure enough, the name “James” has a real ebb and flow in its popularity.
So if the user has a name of “James”, you can make a reasonable assumption they are male and probably born before 1975. Cue up the Van Halen!
And yes, because I had to:
    let chartData = ageSearch "Britney"
                    |> Seq.map(fun (y,c,p) -> y, c)
                    |> Seq.sortBy(fun (y,c) -> y)
Kinda does match her career, no?
Anyway, back to the task at hand. In terms of analytics, I want to be a bit more precise than eyeballing a chart. I started with the following code:
    ageSearch "James"
    |> Seq.map(fun (y,c,p) -> float c)
    |> Seq.average

    ageSearch "James"
    |> Seq.map(fun (y,c,p) -> float c)
    |> Seq.min

    ageSearch "James"
    |> Seq.map(fun (y,c,p) -> float c)
    |> Seq.max
With these basic statistics out of the way, I then wanted to look at when the name was no longer popular. I decided to use 1 standard deviation away from the average to determine an outlier. First the standard deviation:
    // Population variance: the mean of the squared deviations from the mean
    let variance (source:float seq) =
        let mean = Seq.average source
        let deltas = Seq.map(fun x -> pown(x-mean) 2) source
        Seq.average deltas

    let standardDeviation(values:float seq) =
        sqrt(variance(values))

    ageSearch "James"
    |> Seq.map(fun (y,c,p) -> float c)
    |> standardDeviation

    let standardDeviation' = ageSearch "James"
                             |> Seq.map(fun (y,c,p) -> float c)
                             |> standardDeviation

    let average = ageSearch "James"
                  |> Seq.map(fun (y,c,p) -> float c)
                  |> Seq.average

    // Threshold: one standard deviation above the average
    let attachmentPoint = average+standardDeviation'
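For reference, the variance function divides by the number of observations, so what it computes is usually called the population standard deviation. Written out, with $c_y$ for the count in year $y$ and $N$ for the number of years (symbols I am introducing here, they are not in the code):

```latex
\bar{c} = \frac{1}{N}\sum_{y} c_y, \qquad
\sigma = \sqrt{\frac{1}{N}\sum_{y}\left(c_y - \bar{c}\right)^2}, \qquad
\text{attachmentPoint} = \bar{c} + \sigma
```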
And then I can get the last year in which the name was more than 1 standard deviation above the average (greater than 71,180 names given):
    let popularYears = ageSearch "James"
                       |> Seq.map(fun (y,c,p) -> y, float c)
                       |> Seq.filter(fun (y,c) -> c > attachmentPoint)
                       |> Seq.sortBy(fun (y,c) -> y)
                       |> Seq.last
So “James” is very likely a male and likely born before 1964. Cue up the Pink Floyd!
The last piece was the state of birth –> can I guess the state of birth for a user? I first looked at the states on a plot:
    let chartData' = stateSearch "James"
                     |> Seq.map(fun (s,c,p) -> s,c)

    Chart.Column(chartData').ShowChart()
Nothing really stands out to me –> states with the most births have the most names. I could do an academic exercise of seeing what states favor certain names, but that does not help me with Nerd Dinner in guessing the state of birth when given a name.
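The stateSearch function is not shown in this post. Assuming it mirrors ageSearch but groups on the state column, a minimal sketch could look like the following. The NameRow record and the sample rows are hypothetical stand-ins for the type-provider rows in usaData, and unlike the real function this version takes the data as a parameter so it can run on its own.

```fsharp
// Hypothetical row shape standing in for the type-provider rows in usaData.
type NameRow = { State: string; Gender: string; Year: int; Name: string; Count: int }

// Made-up rows for illustration only; these are not real counts.
let sampleData =
    [ { State = "CA"; Gender = "M"; Year = 1950; Name = "James"; Count = 30 }
      { State = "CA"; Gender = "M"; Year = 1951; Name = "James"; Count = 20 }
      { State = "TX"; Gender = "M"; Year = 1950; Name = "James"; Count = 50 }
      { State = "TX"; Gender = "M"; Year = 1950; Name = "Jose"; Count = 40 } ]

// Assumed to mirror ageSearch: total the counts per state,
// then attach each state's share of all births with that name.
let stateSearch (data: seq<NameRow>) name =
    let nameFilter =
        data
        |> Seq.filter (fun r -> r.Name = name)
        |> Seq.groupBy (fun r -> r.State)
        |> Seq.map (fun (s, rows) -> s, rows |> Seq.sumBy (fun r -> r.Count))
        |> Seq.toArray
    let nameSum = nameFilter |> Array.sumBy snd
    nameFilter
    |> Array.map (fun (s, c) -> s, c, float c / float nameSum)

// Yields (state, count, share) tuples, the shape the charting code expects.
let jamesByState = stateSearch sampleData "James"
```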
I pressed on to look at the top 10 states:
    // Negate the count so that sortBy orders the states from most to fewest births
    let topTenStates = stateSearch "James"
                       |> Seq.sortBy(fun (s,c,p) -> -c)
                       |> Seq.take 10

    let topTenTotal = topTenStates
                      |> Seq.sumBy(fun (s,c,p) -> c)
    let total = stateSearch "James"
                |> Seq.sumBy(fun (s,c,p) -> c)

    // Share of all births with this name that fall in the top ten states
    float topTenTotal/float total
So 50% of “James” were born in 10 states. Again, I am not sure there is any actionable information here. For example, if a majority of “James” were born in MI, I might have something (cue up the Bob Seger).
Interestingly, there are a certain number of names where the state of birth does matter. For example, consider “Jose”:
Unsurprisingly, the two states that stand out are CA and TX. Just using James and Jose as examples:
- James is a male born before 1964
- Jose is a male born before 2008 in either TX or CA
As an academic exercise, we could construct a random forest to find the names with the greatest state affinity. However, that won’t help us on Nerd Dinner so I am leaving that out for another day.
This analysis does not account for a host of factors (person not born in the USA, nicknames, etc.), but it is still better than the nothing that Nerd Dinner currently has. This analysis is not particularly sophisticated, but I often find that even the most basic statistics can be very powerful if used correctly. That will be the next part of the talk…