Age and Sex Analysis Of Microsoft USA MVPs

A couple of weeks ago, this came across my Twitter feed:

image

I participated in this hackathon (well, helped run the F# one).  My response was:

image

I was surprised that I got into this exchange with a Microsoft PM:

image

That last comment by me was inspired by a line often attributed to Mark Twain: “never wrestle with a pig.  You just get dirty and the pig likes it.”  But it did get me thinking about the composition of the US MVPs.  I did an analysis a couple of years ago of the photos of the Microsoft MVPs (found here and here), so it made sense to follow up on that code and see if I was wrong about my “middle age white guy” hypothesis.  I could get the photos from the MVP site and pass them into the Microsoft Cognitive Services API for facial analysis to get age/sex data.  Using F# made the analysis a snap.

A nice thing about the Microsoft MVP website is that it is public and has photos of the MVPs.  Here is one of the pages:

image

and when you look at the source of the page, each of those photos has a distinct uri:

image

I opened up Visual Studio and created a new F# project.  I went into the script file and brought in the libraries to do some HTTP requests.  I then created a couple of functions to pull down the HTML of each of the 19 pages and put it into one big string:

    open System
    open System.IO
    open System.Net

    // Download the HTML of one search-results page.
    let getPageContents (pageNumber: int) =
        let uri = Uri("http://mvp.microsoft.com/en-us/search-mvp.aspx?lo=United+States&sl=0&browse=False&sc=s&ps=36&pn=" + pageNumber.ToString())
        let request = WebRequest.Create(uri)
        request.Method <- "GET"
        use response = request.GetResponse()
        use stream = response.GetResponseStream()
        use reader = new StreamReader(stream)
        reader.ReadToEnd()

    // Fetch all 19 pages and concatenate them into one string.
    let contents =
        [| 1 .. 19 |]
        |> Array.map (fun i -> getPageContents i)
        |> Seq.reduce (fun x y -> x + y)

(OT: Since I did a map..reduce in the last two lines of that snippet, does that mean I am working with “Big Data”?)
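As a rough sanity check on the paging, here is a minimal sketch that builds the 19 page URLs and works out how many profiles they could hold. It assumes that `ps` in the query string is the page size and `pn` the page number; that is a guess from the URL, not something the MVP site documents. The names `pageSize`, `pageCount`, `pageUri`, and `maxProfiles` are made up for illustration.

```fsharp
// Assumption: `ps` is the page size and `pn` the page number in the search URL.
let pageSize = 36
let pageCount = 19

// Build the search URL for one page (same URL as in the post, with ps and pn filled in).
let pageUri (pageNumber: int) =
    sprintf "http://mvp.microsoft.com/en-us/search-mvp.aspx?lo=United+States&sl=0&browse=False&sc=s&ps=%d&pn=%d" pageSize pageNumber

let pageUris = [ 1 .. pageCount ] |> List.map pageUri

// Upper bound on the number of profiles if every page is full.
let maxProfiles = pageSize * pageCount

printfn "%d pages, at most %d profiles" pageUris.Length maxProfiles
```

If the assumption holds, 19 full pages of 36 would give 684 profiles, which lines up with the photo count later in the post.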

I then created a quick parser to find only the uris of the photos in all of the HTML.

    open System.Text.RegularExpressions

    // Pull every photo path out of the HTML and turn it into an absolute uri.
    let getUrisFromPageContents (pageContents: string) =
        // Verbatim string so that \d reaches the regex engine unchanged.
        let pattern = @"/PublicProfile/Photo/\d+"
        let matchCollection = Regex.Matches(pageContents, pattern)
        matchCollection
        |> Seq.cast<Match>
        |> Seq.map (fun m -> m.Value)
        |> Seq.map (fun v -> "https://mvp.microsoft.com/en-us" + v + "?language=en-us")
        |> Seq.toArray

    let uris = getUrisFromPageContents contents
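To see what the pattern picks up without hitting the site, here is a small sketch that runs the same regular expression over a made-up HTML fragment. The fragment and the photo ids in it are hypothetical; the real page markup may look different.

```fsharp
open System.Text.RegularExpressions

// Hypothetical fragment standing in for the real search-page HTML.
let sampleHtml =
    """<img src="/PublicProfile/Photo/12345" /><img src="/PublicProfile/Photo/67890" />"""

// Same pattern as the parser in the post, written as a verbatim string.
let photoPattern = @"/PublicProfile/Photo/\d+"

// Collect the matched paths in document order.
let samplePaths =
    Regex.Matches(sampleHtml, photoPattern)
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Value)
    |> Seq.toList

printfn "%A" samplePaths
```

The pattern only matches the path up to the last digit, so the surrounding quotes and attributes are left behind.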

Sure enough, I got 684 uris for MVP photos.  I then wrote another Web Request to pull down each of the photos and save them to disk:

    // Download one photo and save it under a random GUID file name.
    let saveImage (uri: string) =
        use client = new WebClient()
        let id = Guid.NewGuid()
        let path = @"F:\Git\ChickenSoftware.ParseMvpPages.Solution\ChickenSoftware.ParseMvpPages\photos\" + id.ToString() + ".jpg"
        client.DownloadFile(Uri(uri), path)

    uris
    |> Seq.iter saveImage

And I now have all 684 photos on disk.

image

I did not bring down the names of the MVPs; instead I used a GUID as a random file name for each photo, though a name analysis would also be interesting.  With the photos now local, I could upload them to the Microsoft Cognitive Services API to do facial analysis.  You can read about the details of the API here.  I created a third web request to pass each photo up and get the results back from the API:

    open System.Net.Http
    open System.Net.Http.Headers
    open System.Threading
    open System.Web

    // Post one photo to the face-detection endpoint: Some json on HTTP 200, otherwise None.
    let getOxfordResults path =
        let queryString = HttpUtility.ParseQueryString(String.Empty)
        queryString.Add("returnFaceId", "true")
        queryString.Add("returnFaceLandmarks", "false")
        queryString.Add("returnFaceAttributes", "age,gender")
        let uri = "https://api.projectoxford.ai/face/v1.0/detect?" + queryString.ToString()
        let bytes = File.ReadAllBytes(path)
        use client = new HttpClient()
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", "xxxxxxxxxxx")
        use content = new ByteArrayContent(bytes)
        content.Headers.ContentType <- MediaTypeHeaderValue("application/octet-stream")
        let result = client.PostAsync(uri, content).Result
        // Pause after every call to stay under the API's request throttle.
        Thread.Sleep(TimeSpan.FromSeconds(5.0))
        match result.StatusCode with
        | HttpStatusCode.OK -> Some (result.Content.ReadAsStringAsync().Result)
        | _ -> None

Notice that I put a 5 second sleep into the call.  This is because Microsoft throttles the requests to 20 per minute. Also, since not every call gives a usable result (some of the photos do not have a face), I used the F# option type. The results come back from the Microsoft Cognitive Services API as JSON. To parse the results, I used the FSharp.Data JSON type provider:

    open FSharp.Data

    type FaceInfo = JsonProvider<Sample="[{\"faceId\":\"83045097-daa1-4f1c-8669-ed012e9b5975\",\"faceRectangle\":{\"top\":187,\"left\":209,\"width\":214,\"height\":214},\"faceAttributes\":{\"gender\":\"male\",\"age\":42.8}}]">

    // Return (age, gender) for the first detected face, or None when there is no response or no face.
    let parseOxfordResuls results =
        match results with
        | Some r ->
            let face = FaceInfo.Parse(r)
            match Seq.length face with
            | 0 -> None
            | _ ->
                let header = face |> Seq.head
                Some (header.FaceAttributes.Age, header.FaceAttributes.Gender)
        | None -> None
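For readers who want to try the parsing step without installing the type provider, here is a minimal sketch of the same idea using `System.Text.Json` instead. This is a swapped-in technique, not what the post uses; `tryFirstFace` is a hypothetical name, and the sample JSON is the one from the type provider declaration above.

```fsharp
open System.Text.Json

// Sample response taken from the type provider declaration in the post.
let sampleJson =
    """[{"faceId":"83045097-daa1-4f1c-8669-ed012e9b5975","faceRectangle":{"top":187,"left":209,"width":214,"height":214},"faceAttributes":{"gender":"male","age":42.8}}]"""

// Hypothetical helper: (age, gender) of the first face, or None for an empty array.
let tryFirstFace (json: string) =
    use doc = JsonDocument.Parse json
    let faces = doc.RootElement
    if faces.GetArrayLength() = 0 then
        None
    else
        let attrs = faces.[0].GetProperty("faceAttributes")
        Some (attrs.GetProperty("age").GetDouble(), attrs.GetProperty("gender").GetString())

printfn "%A" (tryFirstFace sampleJson)
printfn "%A" (tryFirstFace "[]")
```

An empty array, which is what a photo without a face would presumably produce, maps to `None`, mirroring the `0 -> None` branch in the parser above.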

So now I can get estimated age and gender from the Microsoft Cognitive Services API.  I was disappointed that the API does not estimate race.  I assume they have the technology but, from a social-acceptance point of view, don’t make it publicly available.  In any event, a look through the photos shows that a majority are white people.  I went ahead and ran this and went out to work on my son’s stock car while the requests were spinning.

    #time
    let results =
        let path = @"F:\Git\ChickenSoftware.ParseMvpPages.Solution\ChickenSoftware.ParseMvpPages\photos"
        Directory.GetFiles(path)
        |> Array.map (fun f -> getOxfordResults f)
        |> Array.map (fun r -> parseOxfordResuls r)
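The pause after each request dominates the run time of that loop. As a back-of-the-envelope sketch, the minimum spacing for a 20-per-minute limit and the total time spent sleeping over 684 photos can be worked out as below. The helper names `minDelay` and `totalSleep` are made up for illustration, and the estimate ignores network latency.

```fsharp
open System

// Minimum spacing between calls that stays at or under a per-minute limit.
let minDelay (requestsPerMinute: int) =
    TimeSpan.FromSeconds(60.0 / float requestsPerMinute)

// Time spent sleeping when pausing `delay` after each of `calls` requests.
let totalSleep (calls: int) (delay: TimeSpan) =
    TimeSpan.FromSeconds(float calls * delay.TotalSeconds)

let spacing = minDelay 20                                  // 3 seconds
let sleepTime = totalSleep 684 (TimeSpan.FromSeconds 5.0)  // 57 minutes

printfn "min spacing: %O, total sleep: %O" spacing sleepTime
```

So a 5 second pause leaves some headroom over the 3 second minimum, at the cost of roughly an hour of wall-clock time.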

When I came back, I had a nice array of optional tuples containing ages and genders.

image

To analyze the data, I pulled in Math.NET.  First, I took a look at age:

    open MathNet.Numerics.Statistics

    Seq.length results // 684

    // Keep only the photos where a face was found and take the estimated age.
    let ages =
        results
        |> Seq.filter (fun r -> r.IsSome)
        |> Seq.map (fun o -> fst o.Value)
        |> Seq.map (fun a -> float a)

    let stats = new DescriptiveStatistics(ages)
    let count = stats.Count
    let largest = stats.Maximum
    let smallest = stats.Minimum
    let mean = stats.Mean
    let median = Statistics.Median(ages)
    let variance = stats.Variance
    let standardDeviation = stats.StandardDeviation
    let kurtosis = stats.Kurtosis
    let skewness = stats.Skewness
    let lowerQuartile = Statistics.LowerQuartile(ages)
    let upperQuartile = Statistics.UpperQuartile(ages)

Here are the results. 

image

I got 620 valid photos of the 684 MVPs, so a 91% hit rate, and I have enough observations for the summary statistics to be meaningful.  It looks like Cognitive Services made at least 1 mistake with an age of 4.9 years; perhaps someone was using a meme for their photo?  In any event, the mean is estimated at 41.95 and the median is 40.95, so a slight skew right, since the mean sits above the median. (Note I mislabeled it on the screen shot above)
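The hit rate and the mean-versus-median comparison are easy to reproduce on a toy data set. The sketch below uses a hypothetical four-element stand-in for `results` and plain F# instead of Math.NET; comparing mean and median is only a rule of thumb for skew direction, not a replacement for the skewness statistic.

```fsharp
// Hypothetical stand-in for `results`: (age, gender) options, None where no face was found.
let sampleResults : (float * string) option [] =
    [| Some (35.0, "male"); None; Some (40.0, "female"); Some (60.0, "male") |]

// Array.choose keeps the Some values and unwraps them in one step.
let sampleAges = sampleResults |> Array.choose (Option.map fst)

// Share of photos that produced a usable result.
let hitRate = float sampleAges.Length / float sampleResults.Length

let sampleMean = Array.average sampleAges

// Median of the sorted ages (average of the two middle values for an even count).
let sampleMedian =
    let sorted = Array.sort sampleAges
    let n = sorted.Length
    if n % 2 = 1 then sorted.[n / 2]
    else (sorted.[n / 2 - 1] + sorted.[n / 2]) / 2.0

// Rule of thumb: a mean above the median suggests a right (positive) skew.
let skewHint =
    if sampleMean > sampleMedian then "right"
    elif sampleMean < sampleMedian then "left"
    else "none"

printfn "hit rate %.2f, mean %.1f, median %.1f, skew %s" hitRate sampleMean sampleMedian skewHint
```

With the real numbers, the same hit-rate calculation is 620 / 684, which is about 0.906.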

I then wanted to see the distribution of the ages, so I brought in FSharp.Charting and ran a basic histogram:

    open FSharp.Charting

    let chart = Chart.Histogram(ages, Intervals=10.0)
    Chart.Show(chart)

image

So the ages look very Gaussian.

I then decided to look at gender:

    let gender =
        results
        |> Seq.filter (fun r -> r.IsSome)
        |> Seq.map (fun o -> snd o.Value)

    // Count per gender and its share of the valid photos.
    gender
    |> Seq.countBy (fun v -> v)
    |> Seq.map (fun (g, c) -> g, c, float c / float count)

With the results being:

image

So there are 12% females and 88% males.  With an average age of 42 years and 88% male, “middle age white guy” seems like an appropriate label and I stand by my original tweet. We certainly have work to do in 2017.

You can find the gist here.