Functional Bioinformatics Algorithms: Part 2

Pressing on with more bioinformatic algorithms implemented in a functional style, the next algorithm found in Bioinformatics Algorithms by Compeau and Pevzner is to find the most frequent pattern in a string of text.

I started writing in the imperative style from the book like so (the length of the substring to be found is called the “k-mer” so it gets the parameter name “k”)

type Count = {Text:string; PatternCount:int}
let frequentWords (text:string) (k:int) =
    let patternCount (text:string) (pattern:string) =
        text 
        |> Seq.windowed pattern.Length 
        |> Seq.map(fun c -> new string(c))
        |> Seq.filter(fun s -> s = pattern)
        |> Seq.length

    let counts = new List<Count>()
    for i = 0 to text.Length-k do
        let pattern = text.Substring(i,k)
        let patternCount = patternCount text pattern
        let count = {Text=text; PatternCount=patternCount}
        counts.add(count)
        counts |> Seq.orderByDesc(fun c -> c.PatternCount)
        let maxCount = counts|> Seq.head
    let frequentPatterns = new List<Count>()
    for i = 0 to counts.length
        if count.[i].PatternCount = maxCount then
            frequentPatterns.add(count.[i])
        else
            ()

But I gave up because, well, the code is ridiculous. I went back to the original pattern count algorithms written in F# and then added in a block to find the most frequent patterns:

let frequentWords (text:string) (k:int) =
    let patternCounts =
        text
        |> Seq.windowed k
        |> Seq.map(fun c -> new string(c))
        |> Seq.countBy(fun s -> s)
        |> Seq.sortByDescending(fun (s,c) -> c)
    let maxCount = patternCounts |> Seq.head |> snd
    patternCounts 
        |> Seq.filter(fun (s,c) -> c = maxCount)
        |> Seq.map(fun (s,c) -> s)

The VS Code linter was not happy with my Seq.countBy implementation… but it works. I think the code is explanatory:

  1. window the string for the length of k

2. do a countyBy on the resulting substrings

3. sort it by descending, find the top substring amount

4. filter the substring list by that top substring count.

The last map returns just the pattern and leaves off the frequency, which I think is a mistake but is how the book implements it. Here is an example of the frequentWords function in action:

let getRandomNuclotide () =
    let dictionary = ["A";"C";"G";"T"]
    let random = new Random()
    dictionary.[random.Next(4)]

let getRandomSequence (length:int) =
    let nuclotides = [ for i in 0 .. length -> getRandomNuclotide() ]
    String.Join("", nuclotides)

let largerText = getRandomSequence 1000000

let currentFrequentWords = frequentWords largerText 9
currentFrequentWords

I didn’t set the seed value for generating the largerText string so the results will be different each time.

Gist is here

Functional Bioinformatics Algorithms

I have been looking at bioinformatics much more seriously recently by taking Algorithms for DNA Sequencing by Ben Langmead on Cousera and working though Bioinformatics Algorithms by Compeau and Pevzner. 

I noticed in both cases that the code samples are very much imperative focused: lots of loops, lots of mutable variables, lots of mess. I endeavored to re-write the code in a more functional style using immutable variables and pipelining functions

Consider the code for Pattern Count, a function that counts how often a pattern appears in a larger string. For example, the pattern “AAA” appears in the larger string “AAATTTAAA” twice. If the larger string was “AAAA”, the pattern “AAA” is also twice since AAA appears in index 0-2 and index 1-3.

Here is the code that appears in the book:

let patternCount (text:string) (pattern:string) =
    let mutable count = 0
    for i = 0 to (text.Length - pattern.Length) do
        let subString = text.Substring(i,pattern.Length)
        if subString = pattern then
            count <- count + 1
        else
            ()
    count

Contrast that to a more functional style:

let patternCount (text:string) (pattern:string) =
    text 
    |> Seq.windowed pattern.Length 
    |> Seq.map(fun c -> new string(c))
    |> Seq.filter(fun s -> s = pattern)
    |> Seq.length

There are three main benefits of the functional style:

  1. The code is much more readable. Each step of the transformation is explicit as a single line of the pipeline. There is almost a one to one match between the text from the book and the code written. The only exception is the Seq.map because the windowed function outputs as a char array and we need to transform it back into a string.
  2. The code is auditable. The pipeline can be stopped at each step and output reviewed for correctness.
  3. The code is reproducible. Because each of the steps uses immutable values, pattern count will produce the same result for a given input regardless of processor, OS, or other external factors.

In practice, the pattern count can be used like this:

let text = "ACAACTCTGCATACTATCGGGAACTATCCT"
let pattern = "ACTAT"
let counts = patternCount text pattern

val counts : int = 2

In terms of performance, I added in a function to make a sequence of ten million nucleotides and then searched it:

let getRandomNuclotide () =
    let dictionary = ["A";"C";"G";"T"]
    let random = new Random()
    dictionary.[random.Next(4)]

let getRandomSequence (length:int) =
    let nuclotides = [ for i in 0 .. length -> getRandomNuclotide() ]
    String.Join("", nuclotides)

let largerText = getRandomSequence 10000000
#time
let counts = patternCount largerText pattern

Real: 00:00:00.814, CPU: 00:00:00.866, GC gen0: 173, gen1: 1, gen2: 0
val counts : int = 9816

It ran in about one second on my macbook using only 1 processor. If I wanted to make it faster and run this for the four billion nucleotides found in human DNA, I would use the Parallel Seq library, which is a single letter change to the code. That would be a post for another time…

The gist is here