The Social Transformation of Software Development

This post is part of the F# Advent Calendar in English 2017 project. Check out all the other great posts there! And special thanks to Sergey Tihon for organizing this. (Also, thanks to Scott W from whom I copy/pasted the prior sentences.)

Since we are in the holiday season, it is appropriate that I spend some time reflecting on how awesome the FSharp language is. Also during this season, as we turn into a new year, it makes sense to look a bit into the future. Full disclaimer – I have no idea what I am talking about.

When I was an undergraduate, my advisor recommended The Social Transformation of American Medicine by Paul Starr. The book centers on American physicians and how they used social factors to achieve the professional autonomy and financial rewards that they have today. Indeed, few other professions have such a powerful “moat” around them. I still remember a line that Starr used when describing how the AMA used the rule of law to crush non-certified physicians at the turn of the twentieth century: “Power abhors competition like nature abhors a vacuum”*

I recently re-read much of the book, spurred on by Uncle Bob Martin’s talk about professional autonomy that I saw at Skills Matter a couple of years ago (similar one here). I am wondering how our profession will shake out – computer programmers are indispensable; in fact, they are one of the few competitive advantages for companies**. Like it or not, all companies are software shops; the ones with good management get this and are adapting, and the bad ones… well, they can always merge. Yet the average software engineer does not have the kind of compensation and autonomy that the average physician has. To be sure, there are some “rock stars”, but the averages do not compare.

So why don’t software engineers have that status? One can make an argument that correct software can be just as life-saving as a physician’s diagnosis, and potentially at a larger scale. So if it is not societal importance, what is it? Starr’s book has some interesting observations that lend themselves to questions which, if acted on, would certainly increase the software developer’s status.

· Does the software industry need to have established practices that are enforced by the rule of law?

· Does anyone deploying software need to be licensed? There might be some minor logistical issues, but it is certainly doable (just tell the VCs it is a blockchain use case).

· Does anyone writing software need to join an overall unified professional organization?

· Do we socialize among politicians the idea that “kids learn to code” is just as dumb as “kids learn to do surgery”?

· Do we educate business leaders of non-tech shops that the problem is not too few programmers but too many bad programmers? If they want better results, pay more and implement the first three bullet points above. Does that then raise the possibility that their many management teams and associated skill sets are pretty much irrelevant in a 21st-century technology company?

· Does the industry tell computer programmers who can’t pass muster on the first three bullet points that they need to find a different profession?

Which leads us to FSharp. FSharp is an awesome language that has had a hard landing in the enterprise space. As our profession evolves, the market will have less influence. As far as I can see, the entire package of Windows, Visual Studio, .NET, C#, Desktop App, and Web Forms is coming to an end. Will newer languages like F# get a serious look? I will be interested to see if the same regulatory forces that will lead to the end of the Mort will also pick and choose winning languages. God help us if they decide on JavaScript.

So what can we do in the FSharp community? Starr has some interesting lessons here too:

· Stick to our guns and continue to push for high-quality teams and code. Like physicians who turned away patients that would not follow their orders, avoid shops that treat development as a cost of doing business rather than a strategic asset.

· If you do work at a place like that, be an outspoken advocate for change.

· Get involved with startups and non-enterprise technical communities.

· Get involved with government – volunteer on boards, get to know your local politicians, spend some time raising FSharp awareness to key stakeholders

· Code Code Code. Keep your coding chops sharp.

In any event, I don’t know the answers to the above questions.  I don’t even know if they are the right questions.  To be sure, I maintain our profession is evolving and we do have some control of where it will end up.  With that in mind, hopefully the FSharp community can continue to be a force of positive change in our industry.

Onward to 2018!

 

* I find this line very useful in a variety of settings -> esp. in explaining American politics to friends from overseas

** IT does matter. It always has. The article was wrong from the moment it was published.

Age and Sex Analysis Of Microsoft USA MVPs

A couple of weeks ago, this came across my Twitter

image

I participated in this hackathon (well, helped run the F# one).  My response was:

image

I was surprised that I got into this exchange with a Microsoft PM:

image

That last comment by me was inspired by Mark Twain: “never wrestle with a pig.  You just get dirty and the pig likes it.”  But it did get me to thinking about the composition of the US MVPs.  I did an analysis a couple of years ago of the photos of the Microsoft MVPs (found here and here), so it made sense to follow up on that code and see if I was wrong about my “middle-aged white guy” hypothesis.  I could get the photos from the MVP site and pass them into the Microsoft Cognitive Services API for facial analysis to get age/sex data.  Using F# made the analysis a snap.

A nice thing about the Microsoft MVP website is that it is public and has photos of the MVPs.  Here is one of the pages:

image

and when you look at the source of the page, each of those photos has a distinct uri:

image

I opened up Visual Studio and created a new F# project.  I went into the script file and brought in the libraries to do some http requests.  I then created a couple of functions to pull down the HTML of each of the 19 pages and put it into 1 big string:

1  let getPageContents(pageNumber:int) =
2      let uri = new Uri("http://mvp.microsoft.com/en-us/search-mvp.aspx?lo=United+States&sl=0&browse=False&sc=s&ps=36&pn=" + pageNumber.ToString())
3      let request = WebRequest.Create(uri)
4      request.Method <- "GET"
5      let response = request.GetResponse()
6      use stream = response.GetResponseStream()
7      use reader = new StreamReader(stream)
8      reader.ReadToEnd()
9
10 let contents =
11     [|1..19|]
12     |> Array.map(fun i -> getPageContents i)
13     |> Seq.reduce(fun x y -> x + y)

(OT: Since I did a map..reduce on lines 12 and 13, does that mean I am working with “Big Data”?)
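For reference, the script snippets in this post assume the relevant namespaces have already been opened at the top of the script (the screen shots omit them). A minimal set that covers the code below would look something like this (adjust to taste; FSharp.Data and Math.NET also need their #r references):

open System
open System.IO
open System.Net
open System.Net.Http
open System.Net.Http.Headers
open System.Text.RegularExpressions
open System.Threading
open System.Web
open FSharp.Data
open MathNet.Numerics.Statistics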

I then created a quick parser to find only the uris of the photos in all of the HTML.

let getUrisFromPageContents(pageContents:string) =
    let pattern = "/PublicProfile/Photo/\d+"
    let matchCollection = Regex.Matches(pageContents, pattern)
    matchCollection
    |> Seq.cast
    |> Seq.map(fun (m:Match) -> m.Value)
    |> Seq.map(fun v -> "https://mvp.microsoft.com/en-us" + v + "?language=en-us")
    |> Seq.toArray

let uris = getUrisFromPageContents contents

Sure enough, I got 684 uris for MVP photos.  I then wrote another Web Request to pull down each of the photos and save them to disk:

let saveImage uri =
    use client = new WebClient()
    let id = Guid.NewGuid()
    let path = @"F:\Git\ChickenSoftware.ParseMvpPages.Solution\ChickenSoftware.ParseMvpPages\photos\" + id.ToString() + ".jpg"
    client.DownloadFile(Uri(uri),path)

uris
|> Seq.iter saveImage

And I now have all 684 photos on disk.

image

I did not bring down the names of the MVPs – instead I used a GUID to randomize the photos – but a name analysis would also be interesting.  With the photos now local, I could then upload them to the Microsoft Cognitive Services API to do facial analysis.  You can read about the details of the API here.  I created a third web request to pass each photo up and get the results from the API:

let getOxfordResults path =
    let queryString = HttpUtility.ParseQueryString(String.Empty)
    queryString.Add("returnFaceId","true")
    queryString.Add("returnFaceLandmarks","false")
    queryString.Add("returnFaceAttributes","age,gender")
    let uri = "https://api.projectoxford.ai/face/v1.0/detect?" + queryString.ToString()
    let bytes = File.ReadAllBytes(path)
    let client = new HttpClient()
    client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key","xxxxxxxxxxx")
    let response = new HttpResponseMessage()
    let content = new ByteArrayContent(bytes)
    content.Headers.ContentType <- MediaTypeHeaderValue("application/octet-stream")
    let result = client.PostAsync(uri,content).Result
    Thread.Sleep(TimeSpan.FromSeconds(5.0))
    match result.StatusCode with
    | HttpStatusCode.OK -> Some (result.Content.ReadAsStringAsync().Result)
    | _ -> None

Notice that I put a 5-second sleep into the call.  This is because Microsoft throttles the requests to 20 per minute.  Also, since some of the photos do not have a face, I used the F# option type.  The results come back from the Microsoft Cognitive Services API as JSON.  To parse the results, I used the FSharp.Data JSON type provider:

type FaceInfo = JsonProvider<Sample="[{\"faceId\":\"83045097-daa1-4f1c-8669-ed012e9b5975\",\"faceRectangle\":{\"top\":187,\"left\":209,\"width\":214,\"height\":214},\"faceAttributes\":{\"gender\":\"male\",\"age\":42.8}}]">

let parseOxfordResuls results =
    match results with
    | Some r ->
        let face = FaceInfo.Parse(r)
        match Seq.length face with
        | 0 -> None
        | _ -> let header = face |> Seq.head
               Some(header.FaceAttributes.Age,header.FaceAttributes.Gender)
    | None -> None

So now I can get estimated age and gender from the Microsoft Cognitive Services API.  I was disappointed that the API does not estimate race.  I assume they have the technology but, from a social-acceptance point of view, they don’t make it publicly available.  In any event, a look through the photos shows that a majority are white people.  I went ahead and ran the analysis and went out to work on my son’s stock car while the requests were spinning.

#time
let results =
    let path = @"F:\Git\ChickenSoftware.ParseMvpPages.Solution\ChickenSoftware.ParseMvpPages\photos"
    Directory.GetFiles(path)
    |> Array.map(fun f -> getOxfordResults f)
    |> Array.map(fun r -> parseOxfordResuls r)

When I came back, I had a nice sequence of tuples containing ages and genders.

image
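The screen shot does not reproduce here, but the shape of results is a sequence of optional (age, gender) tuples, something like the following (values are purely illustrative, not actual output):

// Illustrative only, not actual output
// seq [Some (42.8M, "male"); None; Some (35.1M, "female"); Some (51.3M, "male"); ...]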

To analyze the data, I pulled in Math.NET.  First, I took a look at age:

Seq.length results //684

let ages =
    results
    |> Seq.filter(fun r -> r.IsSome)
    |> Seq.map(fun o -> fst o.Value)
    |> Seq.map(fun a -> float a)

let stats = new DescriptiveStatistics(ages)
let count = stats.Count
let largest = stats.Maximum
let smallest = stats.Minimum
let mean = stats.Mean
let median = Statistics.Median(ages)
let variance = stats.Variance
let standardDeviation = stats.StandardDeviation
let kurtosis = stats.Kurtosis
let skewness = stats.Skewness
let lowerQuartile = Statistics.LowerQuartile(ages)
let upperQuartile = Statistics.UpperQuartile(ages)

Here are the results. 

image

I got 620 valid photos of the 684 MVPs – a 91% hit rate – and I have enough observations to make the analysis statistically valid.  It looks like Cognitive Services made at least one mistake, with an age of 4.9 years –> perhaps someone was using a meme for their photo?  In any event, the mean is estimated at 41.95 and the median is 40.95, so the distribution has a slight skew to the right. (Note that I mislabeled it on the screen shot above.)

I then wanted to see the distribution of the ages, so I brought in FSharp.Charting and ran a basic histogram:

open FSharp.Charting

let chart = Chart.Histogram(ages,Intervals=10.0)
Chart.Show(chart)

image

So the ages look very Gaussian.

I then decided to look at gender:

let gender =
    results
    |> Seq.filter(fun r -> r.IsSome)
    |> Seq.map(fun o -> snd o.Value)

gender
|> Seq.countBy(fun v -> v)
|> Seq.map(fun (g,c) -> g, c, float c/float count)

With the results being:

image

So the split is 12% female and 88% male.  With an average age of 42 years old and 88% male, “middle-aged white guy” seems like an appropriate label, and I stand by my original tweet – we certainly have work to do in 2017.

You can find the gist here.

Creating Dynamic Uris For Visual Studio Web Tests

This post is part of the F# Advent Calendar in English 2015 project. Check out all the other great posts there! And special thanks to Sergey Tihon for organizing this. (Also, thanks to Scott W from whom I copy/pasted the prior sentences.)

One of the cooler features built into Visual Studio 2015 is the ability to create web tests and load tests. I had blogged about customizing them here and here, but those posts did not cover the scenario where I need to dynamically create a uri. For example, consider the following web test that is hitting a Web API 2 controller with some very RPC-style syntax:

image

Notice that the ContextParameters are setting the uri so I can move the test among environments.  Also, notice the dynamic part of the uri called {{friendlyName}}.
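(The screen shot does not reproduce here; the request uri in the test was roughly of the shape {{WebServerUrl}}/api/chicken/{{friendlyName}}, where the context parameter name and route are illustrative rather than the actual test.)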

One of the limitations of the out-of-the-box web test is that context parameters cannot be data bound but can be appended as part of the uri. Query string parameters, on the other hand, can be data bound but cannot be appended as part of the uri. So if we want to go to a database and get a series of friendly names for our chickens to pass into our test, we are stuck.  Grounded, really.

Enter web test plug-ins.  I added an F# project to the solution and added a .fs file called UriAdjuster like so:

image

I then added references to:

Microsoft.VisualStudio.QualityTools.WebTestFramework and FSharp.Data.TypeProviders

image

I then added the following code to the UriAdjuster file:

namespace ChickenSoftware.ChickenApi.LoadTests.PlugIns

open System
open System.Text
open Microsoft.FSharp.Data.TypeProviders
open Microsoft.VisualStudio.TestTools.WebTesting

type internal EntityConnection = SqlEntityConnection<"myConnectionString",Pluralize = true>

type public UriAdjuster() =
    inherit WebTestPlugin()
    let context = EntityConnection.GetDataContext()
    override this.PreRequestDataBinding(sender:Object, e:PreRequestDataBindingEventArgs) =
        let random = new Random()
        let index = random.Next((context.Chickens |> Seq.length) - 1)
        let chicken = context.Chickens |> Seq.nth(index)
        e.Request.Url <- e.Request.Url.Replace("{{friendlyName}}",chicken.FriendlyName)
        base.PreRequestDataBinding(sender,e)
        ()

So I am going to the database on every request (using the awesomeness of type providers), pulling out a random chicken, and updating the url with its friendlyName.

So that’s it.  We now have the ability to create valid uris that we can then dump into our load test.  Since our load test is running so fast, I guess we can say it is flying. So perhaps chickens can fly?

Happy holidays everyone.

EjectABed Version 2 – Now Using the Raspberry Pi (Part 1)

I recently entered a hackster.io competition that centered around using Windows 10 on the Raspberry Pi.  I entered the ejectabed and it was accepted to the semi-final round.  My thought was to take the existing ejectabed controller from a Netduino and move it to a Raspberry Pi.  While doing that, I could open the ejectabed from my local area network to the internet so anyone could eject Sloan.
My first step was to hook my Raspberry Pi up to my home network and deploy to it from Visual Studio.  Turns out, it was pretty straightforward.
I took an old Asus Portable Wireless Router and plugged it into my home workstation.  I then configured the router to act as an Access Point so that it would pass through all traffic from the router to which my developer workstation is attached.  I then attached the router to the PI and powered it through the PI’s USB port.  I then plugged the PI’s HDMI out into a spare monitor of mine.

20150822_112947

With all of the hardware plugged in, I headed over to Windows On Devices and followed the instructions on how to set up a Raspberry PI.  After installing the correct software on my developer workstation, flashing the SD card with win10, plugging the SD card into the PI, turning the PI on, and then remoting into the PI via powershell, I could see the PI on my local workstation via the Windows IoT Core Watcher and the PI showing its friendly welcome screen via HDMI.

Capture

20150822_101235

I then headed over to Visual Studio and copy/pasted the requisite “Hello IoT World” Blinky project to the Pi and watched the light go on and off.

20150822_104535

With that out of the way, I decided to look at controlling the light via Twitter and Azure.  The thought was to have the PI monitor a message queue on Azure and whenever there was a message, blink on or off (simulating the ejectabed being activated).  To that end, I went into Azure and created a basic storage account.  One of the nice things about Azure is that you get a queue out of the box when you create a storage account:

image

One of the not-so-nice things about Azure is that there is no way to control said queue via their UI.  You have to create, push, and pull from the queue in code.  I went back to Visual Studio and added in the Azure Storage NuGet package:

image

I then created a method to monitor the queue
internal async Task<Boolean> IsMessageOnQueue()
{
    var storageConnectionString = "DefaultEndpointsProtocol=https;AccountName=ejectabed;AccountKey=xxx";
    var storageAccount = CloudStorageAccount.Parse(storageConnectionString);
    var client = storageAccount.CreateCloudQueueClient();
    var queue = client.GetQueueReference("sloan");
    var queueExists = await queue.ExistsAsync();
    if (!queueExists)
    {
        GpioStatus.Text = "Queue does not exist or is unreachable.";
        return false;
    }
    var message = await queue.GetMessageAsync();
    if (message != null)
    {
        await queue.DeleteMessageAsync(message);
        return true;
    }
    GpioStatus.Text = "No message for the EjectABed.";
    return false;
}

Then if there is a message, the PI would run the ejection sequence (in this case blink the light)
internal void RunEjectionSequence()
{
    bedCommand.Eject();
    bedTimer = new DispatcherTimer();
    bedTimer.Interval = TimeSpan.FromSeconds(ejectionLength);
    bedTimer.Tick += LightTimer_Tick;
    bedTimer.Start();
}

 

I deployed the code to the PI without a problem.  I then created a basic console application to push messages to the queue that the PI could drain:
class Program
{
    static String storageConnectionString = "DefaultEndpointsProtocol=https;AccountName=ejectabed;AccountKey=xxx";

    static void Main(string[] args)
    {
        Console.WriteLine("Start");
        Console.WriteLine("Press The 'E' Key To Eject. Press 'Q' to quit...");

        var keyInfo = ConsoleKey.S;
        do
        {
            keyInfo = Console.ReadKey().Key;
            if (keyInfo == ConsoleKey.E)
            {
                CreateQueue();
                WriteToQueue();
                //ReadFromQueue();
            }

        } while (keyInfo != ConsoleKey.Q);

        Console.WriteLine("End");
        Console.ReadKey();
    }

    private static void CreateQueue()
    {
        var storageAccount = CloudStorageAccount.Parse(storageConnectionString);
        var client = storageAccount.CreateCloudQueueClient();
        var queue = client.GetQueueReference("sloan");
        queue.CreateIfNotExists();
        Console.WriteLine("Created Queue");
    }

    private static void WriteToQueue()
    {
        var storageAccount = CloudStorageAccount.Parse(storageConnectionString);
        var client = storageAccount.CreateCloudQueueClient();
        var queue = client.GetQueueReference("sloan");
        var message = new CloudQueueMessage("Eject!");
        queue.AddMessage(message);
        Console.WriteLine("Wrote To Queue");
    }

    private static void ReadFromQueue()
    {
        var storageAccount = CloudStorageAccount.Parse(storageConnectionString);
        var client = storageAccount.CreateCloudQueueClient();
        var queue = client.GetQueueReference("sloan");
        var queueExists = queue.Exists();
        if (!queueExists)
            Console.WriteLine("Queue does not exist");
        var message = queue.GetMessage();
        if (message != null)
        {
            queue.DeleteMessage(message);
            Console.WriteLine("Message Found and Deleted");
        }
        else
        {
            Console.WriteLine("No messages");
        }
    }
}

I could then write to the queue and the PI would read and react.  You can see it in action here:

image

With the queue up and running, I was ready to add in the ability for someone to Tweet to the queue.  I created a cloud service project and pointed to a new project that will monitor Twitter and then push to the queue:

image

image

The Twitter project uses the Tweetinvi NuGet package and is a worker project.  It makes a call to Twitter every 15 seconds, and if there is a tweet to “ejectabed” with a person’s name, it will write to the queue (right now, only Sloan’s name is available):
type TwitterWorker() =
    inherit RoleEntryPoint()

    let storageConnectionString = RoleEnvironment.GetConfigurationSettingValue("storageConnectionString")

    let createQueue(queueName) =
        let storageAccount = CloudStorageAccount.Parse(storageConnectionString)
        let client = storageAccount.CreateCloudQueueClient()
        let queue = client.GetQueueReference(queueName)
        queue.CreateIfNotExists() |> ignore

    let writeToQueue(queueName) =
        let storageAccount = CloudStorageAccount.Parse(storageConnectionString)
        let client = storageAccount.CreateCloudQueueClient()
        let queue = client.GetQueueReference(queueName)
        let message = new CloudQueueMessage("Eject!")
        queue.AddMessage(message) |> ignore

    let writeTweetToQueue(queueName) =
        createQueue(queueName)
        writeToQueue(queueName)

    let getKeywordFromTweet(tweet: ITweet) =
        let keyword = "sloan"
        let hasKeyword = tweet.Text.Contains(keyword)
        let isFavourited = tweet.FavouriteCount > 0
        match hasKeyword, isFavourited with
        | true,false -> Some (keyword,tweet)
        | _,_ -> None

    override this.Run() =
        while(true) do
            let consumerKey = RoleEnvironment.GetConfigurationSettingValue("consumerKey")
            let consumerSecret = RoleEnvironment.GetConfigurationSettingValue("consumerSecret")
            let accessToken = RoleEnvironment.GetConfigurationSettingValue("accessToken")
            let accessTokenSecret = RoleEnvironment.GetConfigurationSettingValue("accessTokenSecret")

            let creds = Credentials.TwitterCredentials(consumerKey, consumerSecret, accessToken, accessTokenSecret)
            Tweetinvi.Auth.SetCredentials(creds)
            let matchingTweets = Tweetinvi.Search.SearchTweets("@ejectabed")
            let matchingTweets' = matchingTweets |> Seq.map(fun t -> getKeywordFromTweet(t))
                                                 |> Seq.filter(fun t -> t.IsSome)
                                                 |> Seq.map (fun t -> t.Value)
            matchingTweets' |> Seq.iter(fun (k,t) -> writeTweetToQueue(k))
            matchingTweets' |> Seq.iter(fun (k,t) -> t.Favourite())

            Thread.Sleep(15000)

    override this.OnStart() =
        ServicePointManager.DefaultConnectionLimit <- 12
        base.OnStart()

Deploying to Azure was a snap
image
And now when I Tweet,
image
the PI reacts.  Since Twitter does not allow the same Tweet to be sent again, I deleted it every time I wanted to send a new message to the queue.

Facebook Api Using F#

A common requirement for modern user-facing applications is to interface with Facebook.  Unfortunately, Facebook does not make it easy on developers –> in fact, it is one of the harder APIs that I have seen.  However, there is a covering SDK that you can use, along with some hoop jumping, to get it working.  The problem is one of assumptions.  The .NET SDK assumes that you want to build a Windows Store or Phone app and that the connections are human-to-Facebook.  Once you get past those assumptions, you can do pretty well.

The first thing you need to do is set up a Facebook account.

image

image

Then register as a developer and create an application

image

In Visual Studio, NuGet in the Facebook SDK:

image

Then, in the REPL, add the following code to get the auth token:

1 #r "../packages/Facebook.7.0.6/lib/net45/Facebook.dll" 2 #r "../packages/Newtonsoft.Json.7.0.1/lib/net45/Newtonsoft.Json.dll" 3 4 open Facebook 5 open Newtonsoft.Json 6 7 type Credentials = {client_id:string; client_secret:string; grant_type:string;scope:string} 8 let credentials = {client_id="123456"; 9 client_secret="123456"; 10 grant_type="client_credentials"; 11 scope="manage_pages,publish_stream,read_stream,publish_checkins,offline_access"} 12 13 14 let client = FacebookClient() 15 let tokenJson = client.Get("oauth/access_token",credentials) 16 type Token = {access_token:string} 17 let token = JsonConvert.DeserializeObject<Token>(tokenJson.ToString());

Which gives

image

Once you get the token, you can make a request for the user and post to the page:

let client' = FacebookClient(token.access_token)
client'.Get("me")

let pageId = "me"
type FacebookPost = {title:string; message:string}
let post = {title="Test Title"; message = "Test Message"}
client'.Post(pageId + "/feed", post)

I was getting this message though

image

So then the fun part.  Apparently, you need to submit your application to the Facebook team to be approved before it can be used.  So now I have to submit icons and a description of how this application will be used before I can make a POST.  <sigh>

Thanks to Gene Belitski for his help on my question on Stack Overflow

Wake County Voter Analysis Using FSharp, AzureML, and R

One of the real strengths of FSharp is its ability to plow through and transform data in a very intuitive way.  I was recently looking at Wake County voter data, found here, to do some basic voter analysis.  My first thought was to download the data into R Studio.  Easy?  Not really.  The data is available as a ginormous Excel spreadsheet of about 154 MB in size.  I wanted to slim the dataset down and make it a .csv for easy import into R, but using Excel to export the data as a .csv kept screwing up the formatting, and importing it directly into R Studio from Excel resulted in out-of-memory crashes.  Also, the results for the different election dates were not consistent –> sometimes null, sometimes not.   I managed to get the data into R Studio without a crash and wrote a function that indicates either voted (“1”) or not (“0”) for each election:

#V = voted in-person on Election Day
#A = voted absentee by mail or early voting (through May 2006)
#M = voted absentee by mail (November 2006 - present)
#O = voted One-Stop early voting (November 2006 - present)
#T = voted at a transfer precinct on Election Day
#P = voted a provisional ballot
#L = Legacy data (prior to 2006)
#D = Did not show

votedIndicated <- function(votedCode) {
  switch(votedCode,
         "V" = 1,
         "A" = 1,
         "M" = 1,
         "O" = 1,
         "T" = 1,
         "P" = 1,
         "L" = 1,
         "D" = 0)
}

However, every time I tried to run it, the IDE would crash with an out of memory issue. 

Stepping back, I decided to transform the data in Visual Studio using FSharp. I created a sample from the ginormous Excel spreadsheet and then imported the data using a type provider.  No memory crashes!

1 #r "../packages/ExcelProvider.0.1.2/lib/net40/ExcelProvider.dll" 2 open FSharp.ExcelProvider 3 4 [<Literal>] 5 let samplePath = "../../Data/vrdb-Sample.xlsx" 6 7 open System.IO 8 let baseDirectory = __SOURCE_DIRECTORY__ 9 let baseDirectory' = Directory.GetParent(baseDirectory) 10 let baseDirectory'' = Directory.GetParent(baseDirectory'.FullName) 11 let inputFilePath = @"Data\vrdb.xlsx" 12 let fullInputPath = Path.Combine(baseDirectory''.FullName, inputFilePath) 13 14 type WakeCountyVoterContext = ExcelFile<samplePath> 15 let context = new WakeCountyVoterContext(fullInputPath) 16 let row = context.Data |> Seq.head

I then applied a similar function for voted or not and then exported the data as a .csv

let voted (voteCode:obj) =
    match voteCode = null with
    | true -> "0"
    | false -> "1"

open System
let header = "Id,Race,Party,Gender,Age,20080506,20080624,20081104,20091006,20091103,20100504,20100622,20101102,20111011,20111108,20120508,20120717,20121106,20130312,20131008,20131105,20140506,20140715,20141104"

let createOutputRow (row:WakeCountyVoterContext.Row) =
    String.Format("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{16},{17},{18},{19},{20},{21},{22},{23}",
        row.voter_reg_num,
        row.race_lbl,
        row.party_lbl,
        row.gender_lbl,
        row.eoy_age,
        voted(row.``05/06/2008``),
        voted(row.``06/24/2008``),
        voted(row.``11/04/2008``),
        voted(row.``10/06/2009``),
        voted(row.``11/03/2009``),
        voted(row.``05/04/2010``),
        voted(row.``06/22/2010``),
        voted(row.``11/02/2010``),
        voted(row.``10/11/2011``),
        voted(row.``11/08/2011``),
        voted(row.``05/08/2012``),
        voted(row.``07/17/2012``),
        voted(row.``11/06/2012``),
        voted(row.``03/12/2013``),
        voted(row.``10/08/2013``),
        voted(row.``11/05/2013``),
        voted(row.``05/06/2014``),
        voted(row.``07/15/2014``),
        voted(row.``11/04/2014``)
        )

let outputFilePath = @"Data\vrdb.csv"

let data = context.Data |> Seq.map(fun row -> createOutputRow(row))
let fullOutputPath = Path.Combine(baseDirectory''.FullName, outputFilePath)

let file = new StreamWriter(fullOutputPath,true)

file.WriteLine(header)
context.Data |> Seq.map(fun row -> createOutputRow(row))
             |> Seq.iter(fun r -> file.WriteLine(r))

The really great thing is that I could write and then dispose of each line, so I could do it without any crashes.  Once the data was in a .csv (10% the size of the Excel file), I could then import it into R Studio without a problem.  It is a common lesson, but it really shows that using the right tool for the job saves tons of headaches.

I knew from a previous analysis of voter data that the #1 determinant of a person from Wake County voting in an off-cycle election was their age:

image

image

image

So then in R, I created a decision tree for just age to see what the split was:

library(rpart)
temp <- rpart(all.voters$X20131008 ~ all.voters$Age)
plot(temp)
text(temp)

Thanks to Placidia for answering my question on stats.stackoverflow

image

So basically politicians should be targeting people 50 years or older or perhaps emphasizing issues that appeal to the over 50 crowd.


F#, REPL Driven Development, and Scrum

Last week, I did a book review of sorts on Scrum: The Art of Doing Twice the Work in Half the Time.  When I was reading the text, an interesting thought hit me several times.  As a pragmatic practitioner of Test Driven Development (TDD), which often goes hand in hand with Agile and Scrum ideas, I often wonder if I am doing something the best way.  I distinctly remember Robert C. Martin talking about imagining he is your new CTO, with the goal of all of the code working correctly all of the time: he doesn’t care if you use TDD, but he doesn’t know a better way.

image

I was thinking about how lately I have been practicing REPL-driven development (RDD) using F#.  If you are not familiar, REPL stands for “READ-EVALUATE-PRINT-LOOP”, and it has been the primary methodology of functional programmers and data scientists for years.  In RDD, I quickly prove out my idea in the REPL to see if it makes sense.  It is strictly happy-path programming.  Once I think I have a good idea or a solution to whatever problem I am working on, I lift the code into a compiled assembly.  The data elements I used in the REPL then get ported over into my unit tests.  I typically use C# unit tests so that I can confirm that my FSharp code will interop nicely with any VB.NET/C# projects in the solution.  I then layer on code coverage to make sure I have covered all happy paths and then throw some fail cases at the code.
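To make that workflow concrete, here is a minimal, hypothetical sketch (the function, values, and test name are invented for illustration; this is the shape of the flow, not code from a real project):

// Step 1: prove the idea out in a script (Script.fsx), happy path only
let parsePrice (raw:string) = raw.Replace("$","") |> float
parsePrice "$19.99"    // evaluate in FSI: val it : float = 19.99

// Step 2: once it looks right, lift the same function into a compiled .fs file...
module Pricing =
    let parsePrice (raw:string) = raw.Replace("$","") |> float

// Step 3: ...and the REPL data becomes the unit test (the post uses C# test
// projects for interop; an NUnit-style F# test would look like this)
// [<Test>]
// let ``parsePrice strips the dollar sign`` () =
//     Assert.AreEqual(19.99, Pricing.parsePrice "$19.99")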

Thinking of this methodology, I think it is closer to Scrum than traditional TDD for a couple of reasons:

Fail Fast and Fix Early.  You cannot prove out ideas any faster than in the REPL, except maybe on a dry-erase board.  Curly-brace, OO-centered languages like Java and C# are great for certain jobs, but they require much more ceremony and code for code’s sake.  As Sutherland points out, context switching is a killer.  The less you have to worry about code (classes, mocks, etc.), the faster and better you will be at solving your problem.

Working Too Hard Makes More Work. One of the most startling things about using F# on real projects is that there is just not very much code.  I finished and looked around to see what I had missed.  My unit tests were passing, code coverage was high, and there just wasn’t much code.  It was quite unsettling.  I now realize that lots of C#/Java code needs to be generated for real programming projects (exception handling, class hierarchies, design patterns, etc.).  But as the Dartmouth BASIC manual once said, “typing is not a substitute for thinking”; all of this code begets more code.  It is a cycle of work that creates more work, and F# does not have it.

Duplication/Boilerplate/Templates Are a Complete and Total Waste. This one is pretty self-explanatory.  Many people (myself included) think that Visual Studio needs better F# templates.  However, once you get good at writing F# code, you really don’t need them.  Maybe it is good that there aren’t many more?  In any event, you don’t use templates and boilerplate in the REPL…

 

The Counted: Initial Analysis Using FSharp and R

(Note: this is post one of three.  Next week is a deeper dive into the data and the following week is an analysis of law enforcement officers killed in the line of duty)

Andrew Oliver hit me up on Twitter with a new dataset that he stumbled across.  The dataset is called “The Counted”, and it is an attempt to count all of the deaths at the hands of police in America in 2015.  Apparently, this data is not collected systematically by the US government, which is kind of puzzling.  You can read about and download the data here.  A sample looks like:

image

John asked what we could do with the dataset –> especially when comparing it to other variables like socio-economic status.  Step #1 in my mind was to geo-locate the data.  Since this is a .csv, the very first thing was to remove extra commas and replace them with semicolons or blank spaces (for example, “US Marshals Service, Pennsylvania State Police, Allegheny County Sheriff’s Office” became “US Marshals Service; Pennsylvania State Police; Allegheny County Sheriff’s Office” and “Corrections Department, 1400 E 4th Ave” became “Corrections Department 1400 E 4th Ave”).
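That clean-up was just find-and-replace, but a scripted version might look something like the sketch below (a hypothetical helper that assumes the offending commas sit inside quoted fields; it is not the code actually used for this post):

open System.Text.RegularExpressions

// Swap commas for semicolons inside quoted fields so each row stays a simple
// comma-delimited record.
let cleanLine (line:string) =
    Regex.Replace(line, "\"[^\"]*\"", fun m -> m.Value.Replace(",", ";"))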

Adding Geolocations

Drawing on the code that I wrote using Texas A&M’s Geoservice, found here, I converted the JSON type provider script into a function that takes address info and returns a geolocation:

let getGeoCoordinates(streetAddress:string, city:string, state:string) =
    let apiKey = "xxxxx"
    let stringBuilder = new StringBuilder()
    stringBuilder.Append("https://geoservices.tamu.edu/Services/Geocode/WebService/GeocoderWebServiceHttpNonParsed_V04_01.aspx") |> ignore
    stringBuilder.Append("?streetAddress=") |> ignore
    stringBuilder.Append(streetAddress) |> ignore
    stringBuilder.Append("&city=") |> ignore
    stringBuilder.Append(city) |> ignore
    stringBuilder.Append("&state=") |> ignore
    stringBuilder.Append(state) |> ignore
    stringBuilder.Append("&apiKey=") |> ignore
    stringBuilder.Append(apiKey) |> ignore
    stringBuilder.Append("&version=4.01") |> ignore
    stringBuilder.Append("&format=json") |> ignore

    let searchUri = stringBuilder.ToString()
    let searchResult = GeoLocationServiceContext.Load(searchUri)

    let firstResult = searchResult.OutputGeocodes |> Seq.head
    firstResult.OutputGeocode.Latitude, firstResult.OutputGeocode.Longitude, firstResult.OutputGeocode.MatchScore
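A quick sanity check of the function might look like the following (the address is illustrative; the tuple comes back as latitude, longitude, and match score):

// Illustrative call only
let lat, lon, score = getGeoCoordinates("1600 Pennsylvania Ave NW", "Washington", "DC")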

I then loaded in the dataset via the .csv type provider:

[<Literal>]
let theCountedSample = "..\Data\TheCounted.csv"
type TheCountedContext = CsvProvider<theCountedSample>
let theCountedData = TheCountedContext.Load(theCountedSample)

I then mapped the geofunction to the imported dataset:

let theCountedGeoLocated = theCountedData.Rows
                           |> Seq.map(fun r -> r, getGeoCoordinates(r.Streetaddress, r.City, r.State))
                           |> Seq.toList
                           |> Seq.map(fun (r,(lat,lon,ms)) ->
                                String.Format("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15}",
                                    r.Name,r.Age,r.Gender,r.Raceethnicity,r.Month,r.Day,r.Year, r.Streetaddress, r.City,r.State,r.Cause,r.Lawenforcementagency,r.Armed,lat,lon,ms))

And then finally exported the data

let baseDirectory = __SOURCE_DIRECTORY__
let baseDirectory' = Directory.GetParent(baseDirectory)
let filePath = "Data\TheCountedWithGeo.csv"
let fullPath = Path.Combine(baseDirectory'.FullName, filePath)
File.WriteAllLines(fullPath,theCountedGeoLocated)

image

The gist is here.  Using the csv and JSON type providers made the analysis a snap –> the majority of the code is just building up the string for the service call.  +1 for simplicity.

Analyzing The Results

After adding geolocations to the dataset, I opened R studio and imported the dataset.

theCounted <- read.csv("./Data/TheCountedWithGeo.csv")
summary(theCounted)

 

image

So this is good news: we have good confidence in all of the observations, so we don’t have to drop any records (making the counted un-counted, as it were).

I then googled how to create a US map and put some data points on it and ran across this post.  I copied and pasted the code, changed the variable names, said “there is no way it is this easy” out loud, and hit CTRL+ENTER.

library(ggplot2)
library(maps)

all.states <- map_data("state")
plot <- ggplot()
plot <- plot + geom_polygon(data=all.states, aes(x=long, y=lat, group = group),
                            colour="grey", fill="white" )
plot <- plot + geom_point(data=theCounted, aes(x=lon, y=lat),
                          colour="#FF0040")
plot <- plot + guides(size=guide_legend(title="Homicides"))
plot

 

image

The gist is here.

Sandcastle Help File Builder and FSharp

If you are going to write and release a professional-grade .NET assembly, there are some things that need to be considered: logging, exception handling, and documentation.  For .NET components, Sandcastle Help File Builder is the go-to tool to generate documentation as either an old-school .chm file or as a web deploy.

Consider an assembly that contains a Customer record type, an interface for a Customer Repository, and two implementations (In-Memory and ADO.NET)

type Customer = {id:int; firstName:string; lastName:string}

type ICusomerRepository =
    abstract member GetCustomer : int -> Customer
    abstract member InsertCustomer: Customer -> int
    abstract member DeleteCustomer: int -> unit

type InMemoryCustomerRepository () =
    let customers = [
        {id=1; firstName = "First"; lastName = "Customer"}
        {id=2; firstName = "Second"; lastName = "Customer"}
        {id=3; firstName = "Third"; lastName = "Customer"}]
    let customers' = new List<Customer>(customers)

    interface ICusomerRepository with
        member this.GetCustomer(id:int) =
            customers' |> Seq.find(fun c -> c.id = id)
        member this.InsertCustomer(customer: Customer) =
            let nextId = customers'.Count
            let customer' = {customer with id=nextId}
            customers'.Add(customer')
            nextId
        member this.DeleteCustomer(id: int) =
            let customer = customers |> Seq.find(fun c -> c.id = id)
            customers'.Remove(customer) |> ignore

type SqlServerCustomerRepository (connectionString:string) =
    interface ICusomerRepository with
        member this.GetCustomer(id:int) =
            use connection = new SqlConnection(connectionString)
            let commandText = "Select * from customers where id = " + id.ToString()
            use command = new SqlCommand(commandText, connection)
            connection.Open()
            use reader = command.ExecuteReader()
            reader.Read() |> ignore
            {id=reader.[0] :?> int;
             firstName=reader.[1] :?> string;
             lastName =reader.[2] :?> string}

        member this.InsertCustomer(customer: Customer) =
            use connection = new SqlConnection(connectionString)
            let commandText = new StringBuilder()
            commandText.Append("Insert customers values") |> ignore
            commandText.Append(customer.firstName) |> ignore
            commandText.Append(",") |> ignore
            commandText.Append(customer.lastName) |> ignore
            use command = new SqlCommand(commandText.ToString(), connection)
            connection.Open()
            command.ExecuteNonQuery()

        member this.DeleteCustomer(id: int) =
            use connection = new SqlConnection(connectionString)
            let commandText = "Delete customers where id = " + id.ToString()
            use command = new SqlCommand(commandText, connection)
            connection.Open()
            command.ExecuteNonQuery() |> ignore

To auto-generate XML code comments, you need to mark “XML documentation file” on the Build page of project properties:

image
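Checking that box just sets a property in the project file; in an .fsproj it ends up as something roughly like this (the output path and file name are illustrative):

<PropertyGroup>
  <DocumentationFile>bin\Debug\ChickenSoftware.Customers.XML</DocumentationFile>
</PropertyGroup>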

With the .XML file created during the build, you can then fire up Sandcastle to point to the .XML file

image

With that, you can get some nice component documents based on your XML Code Comments.  Since I have not put any into my project yet, there is nothing in the docs.

image

So therein lies the rub.  I started entering XML comments (bare minimum) like so:

/// <summary>
/// Interface for Customer Repository implementations.
/// </summary>
type ICusomerRepository =
    /// <summary>
    /// Get a single validated customer.
    /// </summary>
    ///<param name="param0">The customer Id</param>
    ///<returns>A validated Customer.</returns>
    abstract member GetCustomer : int -> Customer
    /// <summary>
    /// Insert a single validated customer.
    /// </summary>
    ///<param name="param0">A validated customer.</param>
    ///<returns>The Id of the customer, generated by the respository.</returns>
    abstract member InsertCustomer: Customer -> int
    /// <summary>
    /// Deletes a single customer from the respository.
    /// </summary>
    ///<param name="param0">The customer Id</param>
    abstract member DeleteCustomer: int -> unit

image

And you can see what happens.  The code base goes from 5 lines of readable code to 21 lines of clutter to make the help file.

One of the tenets of good code is that it is clean –> so we use SOLID principles, run FxCop, and the like.  Another tenet of good code is that it is uncluttered –> so we use FSharp, use ROP instead of structured exception handling, and avoid boilerplate and templating.  The problem is that we still can’t get away from clutter if we want to have good documentation.  Option A is to just drop documentation, a laudable but unrealistic goal, especially in a corporate environment.  Option B I am not sure about.  I am wondering if I could create a separate file in the project just for the code comments.  That way the actual code stays uncluttered and you can work with it undistracted, and the XML still gets generated…
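One possibility for that separate file is an F# signature file (.fsi): when a signature file is present, the compiler takes the XML doc comments from it, so the implementation file can stay clean. A minimal, hypothetical sketch (the namespace is made up, and the types are shortened for illustration):

// CustomerRepository.fsi : the doc comments live here, not in the .fs file
namespace ChickenSoftware.Customers

/// <summary>Interface for Customer Repository implementations.</summary>
type ICusomerRepository =
    /// <summary>Get a single validated customer.</summary>
    abstract member GetCustomer : int -> Customer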

 

Parsing Wireshark Files Using F#

I went to the Research Triangle Analysts meetup on network security, where I was exposed to Wireshark for the first time.  One of the problems with analyzing packets is that the data comes in a variety of structures, depending on the nature of what is being captured and what level of communication is being analyzed.  I decided to learn a bit about network analysis using this book:

image

One of the examples was analyzing Twitter Direct Messages.  The interesting thing is that the contents of DMs are sent in plain text, so that is a good word to the wise.

image

I was thinking about how to best analyze the sample packets for the DM, and I immediately thought of using F# type providers.  You can export the data from Wireshark in a variety of formats; I chose XML for no particular reason:

image

After exporting the data to the file system and bringing the data in via the type provider, I then wrote a quick script to see how fast I could get to the message sent.  Turns out pretty quick:

open System.IO
open FSharp.Data

[<Literal>]
let uri = @"C:\Users\jamie\Desktop\ChickenSoftware.PacketAnalysis.Solution\Data\twitter_dm"

type Context = XmlProvider<uri>
let data = Context.Load(uri)

let protoes = data.Packets |> Seq.collect(fun p -> p.Protoes)
let fields = protoes |> Seq.collect(fun p -> p.Fields)
let content = fields |> Seq.filter(fun f -> f.Name = Some "urlencoded-form")
let values = content |> Seq.map(fun c -> c.Showname)
let values' = values |> Seq.filter(fun v -> v.Value.Contains("text"))
values'

image

So F# makes it very easy to consume the data and traverse it.  I am curious how easy it will be to start applying machine learning to these files using F#.  That is up next…