Functional Bioinformatics Algorithms

I have been looking at bioinformatics much more seriously recently by taking Algorithms for DNA Sequencing by Ben Langmead on Cousera and working though Bioinformatics Algorithms by Compeau and Pevzner. 

I noticed in both cases that the code samples are very much imperative focused: lots of loops, lots of mutable variables, lots of mess. I endeavored to re-write the code in a more functional style using immutable variables and pipelining functions

Consider the code for Pattern Count, a function that counts how often a pattern appears in a larger string. For example, the pattern “AAA” appears in the larger string “AAATTTAAA” twice. If the larger string was “AAAA”, the pattern “AAA” is also twice since AAA appears in index 0-2 and index 1-3.

Here is the code that appears in the book:

let patternCount (text:string) (pattern:string) =
    let mutable count = 0
    for i = 0 to (text.Length - pattern.Length) do
        let subString = text.Substring(i,pattern.Length)
        if subString = pattern then
            count <- count + 1
        else
            ()
    count

Contrast that to a more functional style:

let patternCount (text:string) (pattern:string) =
    text 
    |> Seq.windowed pattern.Length 
    |> Seq.map(fun c -> new string(c))
    |> Seq.filter(fun s -> s = pattern)
    |> Seq.length

There are three main benefits of the functional style:

  1. The code is much more readable. Each step of the transformation is explicit as a single line of the pipeline. There is almost a one to one match between the text from the book and the code written. The only exception is the Seq.map because the windowed function outputs as a char array and we need to transform it back into a string.
  2. The code is auditable. The pipeline can be stopped at each step and output reviewed for correctness.
  3. The code is reproducible. Because each of the steps uses immutable values, pattern count will produce the same result for a given input regardless of processor, OS, or other external factors.

In practice, the pattern count can be used like this:

let text = "ACAACTCTGCATACTATCGGGAACTATCCT"
let pattern = "ACTAT"
let counts = patternCount text pattern

val counts : int = 2

In terms of performance, I added in a function to make a sequence of ten million nucleotides and then searched it:

let getRandomNuclotide () =
    let dictionary = ["A";"C";"G";"T"]
    let random = new Random()
    dictionary.[random.Next(4)]

let getRandomSequence (length:int) =
    let nuclotides = [ for i in 0 .. length -> getRandomNuclotide() ]
    String.Join("", nuclotides)

let largerText = getRandomSequence 10000000
#time
let counts = patternCount largerText pattern

Real: 00:00:00.814, CPU: 00:00:00.866, GC gen0: 173, gen1: 1, gen2: 0
val counts : int = 9816

It ran in about one second on my macbook using only 1 processor. If I wanted to make it faster and run this for the four billion nucleotides found in human DNA, I would use the Parallel Seq library, which is a single letter change to the code. That would be a post for another time…

The gist is here