Kaplan-Meier Survival Analysis Using F#
May 6, 2014
I was reading the most recent issue of MSDN Magazine a couple of days ago when I came across this article on doing a Kaplan-Meier survival analysis. I thought the article was great, and I am excited that MSDN is starting to publish articles on data analytics. However, I did notice that there wasn't any code in the article, which is odd, so I went to the on-line version and found that others had the same question:
I decided to implement a Kaplan-Meier survival (KMS) analysis using F#. After reading the article a couple of times, I was still a bit unclear on how the KMS is implemented, and there does not seem to be any pre-rolled implementation in the standard .NET stats libraries out there. I went on over to this site, where there was an excellent description of how the survival probability is calculated. I went ahead and built an Excel spreadsheet to match the NIH one and then compared it to what Topol is doing:
Notice that Topol censored the data for the article. If we only cared about the probability of crashes, then we would not censor the data for when the device was turned off.
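For reference (my notation, not the article's), the number the spreadsheet builds up is the standard Kaplan-Meier product-limit estimate of the survival probability at time t:

    $$ \hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right) $$

where d_i is the number of devices that crashed at time t_i and n_i is the number still running (and not yet censored) just before t_i. For example, at t = 5 in the App X data below, one of the nine remaining devices crashes, so the estimate drops from 1.0 to 1 * (1 - 1/9), or roughly 0.889.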
I was then ready to start coding, so I spun up a solution with an F# project for the analysis and a C# project for the testing.
I then loaded the datasets that Topol used into the unit test project:
    [TestMethod]
    public void EstimateForApplicationX_ReturnsExpected()
    {
        var appX = new CrashMetaData[]
        {
            new CrashMetaData(0,1,false),
            new CrashMetaData(1,5,true),
            new CrashMetaData(2,5,false),
            new CrashMetaData(3,8,false),
            new CrashMetaData(4,10,false),
            new CrashMetaData(5,12,true),
            new CrashMetaData(6,15,false),
            new CrashMetaData(7,18,true),
            new CrashMetaData(8,21,false),
            new CrashMetaData(9,22,true),
        };
    }
I could then wire up the unit tests to compare the output to the article and what I had come up with.
    [TestMethod]
    public void EstimateForApplicationX_ReturnsExpected()
    {
        var appX = new CrashMetaData[]
        {
            new CrashMetaData(0,1,false),
            new CrashMetaData(1,5,true),
            new CrashMetaData(2,5,false),
            new CrashMetaData(3,8,false),
            new CrashMetaData(4,10,false),
            new CrashMetaData(5,12,true),
            new CrashMetaData(6,15,false),
            new CrashMetaData(7,18,true),
            new CrashMetaData(8,21,false),
            new CrashMetaData(9,22,true),
        };

        var expected = new SurvivalProbabilityData[]
        {
            new SurvivalProbabilityData(0,1.000),
            new SurvivalProbabilityData(5,.889),
            new SurvivalProbabilityData(12,.711),
            new SurvivalProbabilityData(18,.474),
            new SurvivalProbabilityData(22,.000)
        };

        KaplanMeierEstimator estimator = new KaplanMeierEstimator();
        var actual = estimator.CalculateSurvivalProbability(appX);
        Assert.AreSame(expected, actual);
    }
However, one of the neat features of F# is the REPL, so I don't need to keep running unit tests to prove correctness while I am proving out a concept. I added equivalent test code at the beginning of the F# project so I could try out my ideas in the REPL:
    type CrashMetaData = {userId: int; crashTime: int; crashed: bool}

    type KaplanMeierAnalysis() =
        member this.GenerateXAppData () =
            [| {userId=0; crashTime=1; crashed=false};{userId=1; crashTime=5; crashed=true};
               {userId=2; crashTime=5; crashed=false};{userId=3; crashTime=8; crashed=false};
               {userId=4; crashTime=10; crashed=false};{userId=5; crashTime=12; crashed=true};
               {userId=6; crashTime=15; crashed=false};{userId=7; crashTime=18; crashed=true};
               {userId=8; crashTime=21; crashed=false};{userId=9; crashTime=22; crashed=true}|]
        member this.RunAnalysis(crashMetaData: array<CrashMetaData>) =
The first thing I did was duplicate the first three columns of the Excel spreadsheet:
    let crashSequence =
        crashMetaData
        |> Seq.map(fun crash ->
            crash.crashTime,
            (match crash.crashed with true -> 1 | false -> 0),
            (match crash.crashed with true -> 0 | false -> 1))
In the REPL:
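Assuming the App X data from GenerateXAppData is bound to crashMetaData, forcing the sequence should print the (crashTime, crashed, didNotCrash) triples, something like:

    crashSequence |> Seq.toList
    // [(1, 0, 1); (5, 1, 0); (5, 0, 1); (8, 0, 1); (10, 0, 1);
    //  (12, 1, 0); (15, 0, 1); (18, 1, 0); (21, 0, 1); (22, 1, 0)]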
The fourth column is tricky because it is a cumulative calculation. Instead of for..each-ing in an imperative style, I took advantage of functional language constructs to make the code much more readable. Once I calculated that column outside of the base sequence, I added it back in via Seq.zip.
    let cumulativeDevices = crashMetaData.Length
    let crashSequence =
        crashMetaData
        |> Seq.map(fun crash ->
            crash.crashTime,
            (match crash.crashed with true -> 1 | false -> 0),
            (match crash.crashed with true -> 0 | false -> 1))
    let availableDeviceSequence =
        Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1) cumulativeDevices crashSequence
    let crashSequence' =
        Seq.zip crashSequence availableDeviceSequence
        |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)
In the REPL:
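Again assuming the App X data, each triple should now carry the number of devices still available at the start of that row, something like:

    crashSequence' |> Seq.toList
    // [(1, 0, 1, 10); (5, 1, 0, 9); (5, 0, 1, 8); (8, 0, 1, 7); (10, 0, 1, 6);
    //  (12, 1, 0, 5); (15, 0, 1, 4); (18, 1, 0, 3); (21, 0, 1, 2); (22, 1, 0, 1)]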
The next two columns were a snap –> they were just calculations based on the existing values:
    let cumulativeDevices = crashMetaData.Length
    let crashSequence =
        crashMetaData
        |> Seq.map(fun crash ->
            crash.crashTime,
            (match crash.crashed with true -> 1 | false -> 0),
            (match crash.crashed with true -> 0 | false -> 1))
    let availableDeviceSequence =
        Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1) cumulativeDevices crashSequence
    let crashSequence' =
        Seq.zip crashSequence availableDeviceSequence
        |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)
    let crashSequence'' =
        crashSequence'
        |> Seq.map(fun (t,c,nc,cumld) -> t, c, nc, cumld, float c / float cumld, 1. - (float c / float cumld))
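To make the new columns concrete (again using the App X data), the row for the crash at time 5 should come out approximately as:

    crashSequence'' |> Seq.skip 1 |> Seq.head
    // (5, 1, 0, 9, 0.111, 0.889) -- one crash out of nine available devices,
    // so a death proportion of 1/9 and a survival proportion of 8/9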
The last column was another cumulative calculation, so I added another accumulator and used Seq.scan and Seq.zip.
    let cumulativeDevices = crashMetaData.Length
    let cumulativeSurvivalProbability = 1.
    let crashSequence =
        crashMetaData
        |> Seq.map(fun crash ->
            crash.crashTime,
            (match crash.crashed with true -> 1 | false -> 0),
            (match crash.crashed with true -> 0 | false -> 1))
    let availableDeviceSequence =
        Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1) cumulativeDevices crashSequence
    let crashSequence' =
        Seq.zip crashSequence availableDeviceSequence
        |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)
    let crashSequence'' =
        crashSequence'
        |> Seq.map(fun (t,c,nc,cumld) -> t, c, nc, cumld, float c / float cumld, 1. - (float c / float cumld))
    let survivalProbabilitySequence =
        Seq.scan(fun cumulativeSurvivalProbability (t,c,nc,cumld,dp,sp) -> cumulativeSurvivalProbability * sp) cumulativeSurvivalProbability crashSequence''
    let survivalProbabilitySequence' =
        survivalProbabilitySequence
        |> Seq.skip 1
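For the App X data, the skipped scan should give one cumulative survival probability per row, lining up with the spreadsheet (values rounded here):

    survivalProbabilitySequence' |> Seq.toList
    // roughly [1.0; 0.889; 0.889; 0.889; 0.889; 0.711; 0.711; 0.474; 0.474; 0.0]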
The last step was to map all of the columns and only output what was in the article. The final answer is:
    namespace ChickenSoftware.SurvivalAnalysis

    type CrashMetaData = {userId: int; crashTime: int; crashed: bool}
    type public SurvivalProbabilityData = {crashTime: int; survivalProbability: float}

    type KaplanMeierEstimator() =
        member this.CalculateSurvivalProbability(crashMetaData: array<CrashMetaData>) =
            let cumulativeDevices = crashMetaData.Length
            let cumulativeSurvivalProbability = 1.
            // columns 1-3: time, crashed, did not crash
            let crashSequence =
                crashMetaData
                |> Seq.map(fun crash ->
                    crash.crashTime,
                    (match crash.crashed with true -> 1 | false -> 0),
                    (match crash.crashed with true -> 0 | false -> 1))
            // column 4: devices still available at the start of each row
            let availableDeviceSequence =
                Seq.scan(fun cumulativeCrashes (time,crash,nonCrash) -> cumulativeCrashes - 1) cumulativeDevices crashSequence
            let crashSequence' =
                Seq.zip crashSequence availableDeviceSequence
                |> Seq.map(fun ((time,crash,nonCrash),cumldevices) -> time,crash,nonCrash,cumldevices)
            // columns 5-6: death proportion and survival proportion for the row
            let crashSequence'' =
                crashSequence'
                |> Seq.map(fun (t,c,nc,cumld) -> t, c, nc, cumld, float c / float cumld, 1. - (float c / float cumld))
            // column 7: cumulative survival probability (skip the scan's seed value)
            let survivalProbabilitySequence =
                Seq.scan(fun cumulativeSurvivalProbability (t,c,nc,cumld,dp,sp) -> cumulativeSurvivalProbability * sp) cumulativeSurvivalProbability crashSequence''
            let survivalProbabilitySequence' =
                survivalProbabilitySequence
                |> Seq.skip 1
            let crashSequence''' =
                Seq.zip crashSequence'' survivalProbabilitySequence'
                |> Seq.map(fun ((t,c,nc,cumld,dp,sp),cumlsp) -> t,c,nc,cumld,dp,sp,cumlsp)
            // only report the rows where a crash occurred, as the article does
            crashSequence'''
            |> Seq.filter(fun (t,c,nc,cumld,dp,sp,cumlsp) -> c = 1)
            |> Seq.map(fun (t,c,nc,cumld,dp,sp,cumlsp) -> t, System.Math.Round(cumlsp, 3))
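As a quick sanity check from FSI (this snippet is mine, not part of the project), feeding the App X data from the unit test into the estimator should reproduce the article's numbers:

    open ChickenSoftware.SurvivalAnalysis

    let appX =
        [| {userId=0; crashTime=1; crashed=false}; {userId=1; crashTime=5; crashed=true}
           {userId=2; crashTime=5; crashed=false}; {userId=3; crashTime=8; crashed=false}
           {userId=4; crashTime=10; crashed=false}; {userId=5; crashTime=12; crashed=true}
           {userId=6; crashTime=15; crashed=false}; {userId=7; crashTime=18; crashed=true}
           {userId=8; crashTime=21; crashed=false}; {userId=9; crashTime=22; crashed=true} |]

    let estimator = KaplanMeierEstimator()
    estimator.CalculateSurvivalProbability(appX) |> Seq.toList
    // expected: [(5, 0.889); (12, 0.711); (18, 0.474); (22, 0.0)]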
And this matches the article (almost exactly). The article also has a row for iteration zero, which I did not bake in. Instead of fixing my code, I changed the unit test and removed that first row. In any event, I ran the test and it ran red –> but the values are identical, so I assume it is a problem with the Assert.AreSame() function (which compares references rather than values). I would take the time to figure it out, but it is 75 degrees on a Sunday afternoon and I want to go play catch with my kids…
Note it also matches the other data set Topol has in the article:
In the end, this code reads pretty much the way I was thinking about the problem: each column of the Excel spreadsheet has a one-to-one correspondence with an F# code block. I did use explanatory variables liberally, which might offend the more advanced functional programmers, but taking each step in turn really helped me focus on getting each step correct before going to the next one. A few observations:
1) I had to offset the cumulativeSurvivalProbability by one because the calculation compares how many crashed on a day to how many were working at the start of the day. Seq.scan emits the accumulator for the next row of the sequence (its output starts with the seed value), and I need it for the current row, hence the Seq.skip 1. Perhaps there is an overload of Seq.scan for this? (See the sketch after this list.)
2) I adopted the functional convention of using ticks to denote different physical manifestations of the same logical concept (crashSequence “became” crashSequence', etc.). Since everything is immutable by default in F#, this kind of naming convention makes a lot of sense to me. However, I can see it quickly becoming unwieldy.
3) I could not figure out how to operate on the base tuple, so instead I used a couple of supporting sequences and then put everything together using Seq.zip. I assume there is a more efficient way to do that (again, see the sketch below).
4) One of the knocks against functional/scientific programming is that values are named poorly. To combat that, I used full names in my tuples to start. After a certain point, though, the names got too unwieldy, so I resorted to their initials. I am not sure what the right answer is here, or even if there is a right answer.
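For what it is worth, here is one way to address both 1) and 3) in a single pass. This is only a sketch of an alternative (the names and the shape of the accumulator are mine, not the project's): the Seq.scan accumulator carries the devices-at-risk count, the running survival probability, and the row to emit, so no supporting sequences, Seq.zip calls, or Seq.skip are needed.

    // a sketch only: thread all of the state through one Seq.scan accumulator
    let calculateSurvivalProbability' (crashMetaData: CrashMetaData[]) =
        // state = (devices still at risk, running survival probability, row to emit)
        let initialState = (crashMetaData.Length, 1.0, None)
        crashMetaData
        |> Seq.scan (fun (atRisk, cumulativeSp, _) crash ->
                        let sp =
                            if crash.crashed then cumulativeSp * (1. - (1. / float atRisk))
                            else cumulativeSp
                        atRisk - 1, sp, Some (crash.crashTime, crash.crashed, sp)) initialState
        |> Seq.choose (fun (_, _, row) -> row)         // the seed carries no row, so no Seq.skip 1
        |> Seq.filter (fun (_, crashed, _) -> crashed) // only the rows where a crash occurred
        |> Seq.map (fun (time, _, sp) -> time, System.Math.Round(sp, 3))

On the App X data this should also produce (5, 0.889), (12, 0.711), (18, 0.474) and (22, 0.0).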
Useful article, thanks. Unfortunately the code samples are hard to read because they are so wide. Perhaps use 3 space indents and wrap at around 60 chars? Cheers!
Yeah. I really have to get off LiveWriter and onto a VS Code add-in. Any suggestions?
I use markdown and Jekyll for my (static) blog. And Octopress or similar are good too.
I’m not sure why the indents are coming out so big for you though — somehow 8 or more spaces are being added — maybe the formatter thinks the indents are tabs?