Refactoring McCaffrey’s Regression to F#
April 14, 2015 1 Comment
James McCaffrey’s most recent MSDN article is about multi-class regression article is a great starting place for folks interested in the ins and outs of creating a regression. You can find the article here. He wrote the code in C# in a very much imperative style so the FSharp in me immediately wanted to rewrite it in F#.
Interestingly, Mathias Brandewinder also had the same idea and did a better (and more complete) job than me. You can see his post here.
I decided to duck into McCaffrey’s code and see where I could rewrite part of the code. My first step was to move his C# code to a more manageable format.
I changed the project from a console app to a .dll and then split the two classes into their own file. I then added some unit tests so that I can verify that my reworking was correct:
1 [TestClass] 2 public class CSLogisticMultiTests 3 { 4 LogisticMulti _lc = null; 5 double[][] _trainData; 6 double[][] _testData; 7 8 public CSLogisticMultiTests() 9 { 10 int numFeatures = 4; 11 int numClasses = 3; 12 int numRows = 1000; 13 int seed = 42; 14 var data = LogisticMultiProgram.MakeDummyData(numFeatures, numClasses, numRows, seed); 15 LogisticMultiProgram.SplitTrainTest(data, 0.80, 7, out _trainData, out _testData); 16 _lc = new LogisticMulti(numFeatures, numClasses); 17 18 int maxEpochs = 100; 19 double learnRate = 0.01; 20 double decay = 0.10; 21 _lc.Train(_trainData, maxEpochs, learnRate, decay); 22 } 23 24 [TestMethod] 25 public void GetWeights_ReturnExpected() 26 { 27 double[][] bestWts = _lc.GetWeights(); 28 var expected = 13.939104508387803; 29 var actual = bestWts[0][0]; 30 Assert.AreEqual(expected, actual); 31 } 32 33 [TestMethod] 34 public void GetBiases_ReturnExpected() 35 { 36 double[] bestBiases = _lc.GetBiases(); 37 var expected = 11.795019237894717; 38 var actual = bestBiases[0]; 39 Assert.AreEqual(expected, actual); 40 } 41 42 [TestMethod] 43 public void GetTrainAccuracy_ReturnExpected() 44 { 45 var expected = 0.92125; 46 var actual = _lc.Accuracy(_trainData); 47 Assert.AreEqual(expected, actual); 48 } 49 50 [TestMethod] 51 public void GetTestAccuracy_ReturnExpected() 52 { 53 var expected = 0.895; 54 double actual = _lc.Accuracy(_testData); 55 Assert.AreEqual(expected, actual); 56 } 57 } 58
You will notice that this is the exact code that McCaffrey uses in his output for the Console app. In any event, they were running all green
I then went into the F# Project and fired up the REPL. I decided to start with the MakeDummyData method because it seemed beefy enough to demonstrate the language differences between the languages, it is fairly self-contained, and its data is already testable. Here is the first 9 lines of code.
1 Random rnd = new Random(seed); 2 double[][] wts = new double[numFeatures][]; 3 for (int i = 0; i < numFeatures; ++i) 4 wts[i] = new double[numClasses]; 5 double hi = 10.0; 6 double lo = -10.0; 7 for (int i = 0; i < numFeatures; ++i) 8 for (int j = 0; j < numClasses; ++j) 9 wts[i][j] = (hi - lo) * rnd.NextDouble() + lo;
And here is the F# equivalent
1 let rnd = new Random(seed) 2 let hi = 10.0 3 let lo = -10.0 4 let wts = Array.create numFeatures (Array.create numClasses 1.) 5 let wts' = wts |> Array.map(fun row -> row |> Array.map(fun col -> (hi - lo) * rnd.NextDouble() + lo)) 6
There is one obvious difference and 1 subtle difference. The obvious difference is that the F# code does not do any looping to create and populate the array of arrays data structure, rather it uses the high-order Array.Map function. This reduces the idiomatic line count from 9 to 5 – a 50% decrease (and a funny move from the 1980s). (Note that I use the words “idiomatic line count” because you can reduce both examples to a single line of code but that makes in unworkable by humans. Both examples show the typical way you would write code in the language.) So with the fewer lines of code, which is more readable? That is a subjective opinion. A C#/Java/Javascript/Curly-Brace dev would say the C#. Everyone else in the world would say F#.
The less obvious difference is that F# emphasizes immutability so that there are two variables (wts and wts’) and the C# has 1 variable that is mutated. The implication is lost in such a small example, but if the numFeatures was large, you would want to take advantage of mutli-core processors and the F# code is ready for parallelism. The C# code would have to be reworked to use an immutable collection.
The next lines create and populate the biases variable. The C# Code:
1 double[] biases = new double[numClasses]; 2 for (int i = 0; i < numClasses; ++i) 3 biases[i] = (hi - lo) * rnd.NextDouble() + lo; 4
And the F# Code
1 let biases = Array.create numClasses 1. 2 let biases' = biases |> Array.map(fun row -> (hi - lo) * rnd.NextDouble() + lo) 3
Same deal as before. No loops or mutation. Fewer lines of code and better readability.
The last set of code is a ball of string so it is very hard to separate out.
1 double[][] result = new double[numRows][]; // allocate result 2 for (int i = 0; i < numRows; ++i) 3 result[i] = new double[numFeatures + numClasses]; 4 5 for (int i = 0; i < numRows; ++i) // create one row at a time 6 { 7 double[] x = new double[numFeatures]; // generate random x-values 8 for (int j = 0; j < numFeatures; ++j) 9 x[j] = (hi - lo) * rnd.NextDouble() + lo; 10 11 double[] y = new double[numClasses]; // computed outputs storage 12 for (int j = 0; j < numClasses; ++j) // compute z-values 13 { 14 for (int f = 0; f < numFeatures; ++f) 15 y[j] += x[f] * wts[f][j]; 16 y[j] += biases[j]; 17 } 18 19 // determine loc. of max (no need for 1 / 1 + e^-z) 20 int maxIndex = 0; 21 double maxVal = y[0]; 22 for (int c = 0; c < numClasses; ++c) 23 { 24 if (y[c] > maxVal) 25 { 26 maxVal = y[c]; 27 maxIndex = c; 28 } 29 } 30 31 for (int c = 0; c < numClasses; ++c) // convert y to 0s or 1s 32 if (c == maxIndex) 33 y[c] = 1.0; 34 else 35 y[c] = 0.0; 36 37 int col = 0; // copy x and y into result 38 for (int f = 0; f < numFeatures; ++f) 39 result[i][col++] = x[f]; 40 for (int c = 0; c < numClasses; ++c) 41 result[i][col++] = y[c]; 42 } 43
Note the use of code comments, which is typically considered a code smell, even in demonstration code.
Here is the F# Code:
1 let x = Array.create numFeatures 1. 2 let x' = x |> Array.map(fun row -> (hi - lo) * rnd.NextDouble() + lo) 3 4 let xWts = Array.zip x' wts' 5 let xWts' = xWts |> Array.map(fun (x,wts) -> wts |> Array.sumBy(fun wt -> wt * x)) 6 7 let y = Array.create numClasses 1. 8 let yWts = Array.zip y xWts' 9 let y' = yWts |> Array.map(fun (y,xwt) -> y + xwt) 10 11 let yBias = Array.zip y' biases' 12 let y'' = yBias |> Array.map(fun (y,bias) -> y + bias) 13 14 let maxVal = y'' |> Array.max 15 16 let y''' = y'' |> Array.map(fun y -> if y = maxVal then 1. else 0.) 17 18 let xy = Array.append x' y''' 19 let result = Array.create numRows xy
This is pretty much the same as before,no loops, immutability, and a 50% reduction of code. Also, notice that by using a more functional style breaks apart the ball of string. Individual values are one their own line to be individual evaluated and manipulated. Also, the if..then statement goes to a single line.
So I had a lot of fun working through these examples. The major differences were
- Amount of Code and Code Readability
- Immutability and ready for parallelism
- I am not planning to refactor the rest of the project, but you can too as the project is found here. I am curious if using an array of arrays is the best way to represent the matric –> I guess it is standard for the curly-brace community? I would think using Deedle would be better, but I don’t know enough about it (yet).
Pingback: F# Weekly #16, 2015 | Sergey Tihon's Blog