Functional Bioinformatics Algorithms

I have been looking at bioinformatics much more seriously recently by taking Algorithms for DNA Sequencing by Ben Langmead on Cousera and working though Bioinformatics Algorithms by Compeau and Pevzner. 

I noticed in both cases that the code samples are very much imperative focused: lots of loops, lots of mutable variables, lots of mess. I endeavored to re-write the code in a more functional style using immutable variables and pipelining functions

Consider the code for Pattern Count, a function that counts how often a pattern appears in a larger string. For example, the pattern “AAA” appears in the larger string “AAATTTAAA” twice. If the larger string was “AAAA”, the pattern “AAA” is also twice since AAA appears in index 0-2 and index 1-3.

Here is the code that appears in the book:

let patternCount (text:string) (pattern:string) =
    let mutable count = 0
    for i = 0 to (text.Length - pattern.Length) do
        let subString = text.Substring(i,pattern.Length)
        if subString = pattern then
            count <- count + 1
        else
            ()
    count

Contrast that to a more functional style:

let patternCount (text:string) (pattern:string) =
    text 
    |> Seq.windowed pattern.Length 
    |> Seq.map(fun c -> new string(c))
    |> Seq.filter(fun s -> s = pattern)
    |> Seq.length

There are three main benefits of the functional style:

  1. The code is much more readable. Each step of the transformation is explicit as a single line of the pipeline. There is almost a one to one match between the text from the book and the code written. The only exception is the Seq.map because the windowed function outputs as a char array and we need to transform it back into a string.
  2. The code is auditable. The pipeline can be stopped at each step and output reviewed for correctness.
  3. The code is reproducible. Because each of the steps uses immutable values, pattern count will produce the same result for a given input regardless of processor, OS, or other external factors.

In practice, the pattern count can be used like this:

let text = "ACAACTCTGCATACTATCGGGAACTATCCT"
let pattern = "ACTAT"
let counts = patternCount text pattern

val counts : int = 2

In terms of performance, I added in a function to make a sequence of ten million nucleotides and then searched it:

let getRandomNuclotide () =
    let dictionary = ["A";"C";"G";"T"]
    let random = new Random()
    dictionary.[random.Next(4)]

let getRandomSequence (length:int) =
    let nuclotides = [ for i in 0 .. length -> getRandomNuclotide() ]
    String.Join("", nuclotides)

let largerText = getRandomSequence 10000000
#time
let counts = patternCount largerText pattern

Real: 00:00:00.814, CPU: 00:00:00.866, GC gen0: 173, gen1: 1, gen2: 0
val counts : int = 9816

It ran in about one second on my macbook using only 1 processor. If I wanted to make it faster and run this for the four billion nucleotides found in human DNA, I would use the Parallel Seq library, which is a single letter change to the code. That would be a post for another time…

The gist is here

Wake County Restaurant Inspection Data with Azure ML and F#

With Azure ML now available, I was thinking about some of the analysis I did last year and how I could do even more things with the same data set.  One such analysis that came to mind was the restaurant inspection data that I analyzed last year.  You can see the prior analysis here.

I uploaded the restaurant data into Azure and thought of a simple question –> can we predict inspection scores based on some easily available data?  This is an interesting dataset because there are some categorical data elements (zip code, restaurant type, etc…) and there are some continuous ones (priority foundation, etc…).

Here is the base dataset:

image

I created a new experiment and I used a boosted regression model and a neural network regression and used a 70/30 train/test split.

image

After running the models and inspecting the model evaluation, I don’t have a very good model

image

I then decided to go back and pull some of the X variables out of the dataset and concentrate on only a couple of variables.  I added a project column module and then selected Restaurant Type and Zip Code as the X variables and left the Inspection Score as the Y variable. 

image

With this done, I added a couple of more models (Bayesian Linear Regression and a Decision Forest Regression) and gave it a whirl

image

image

Interesting, adding these models did not give us any better of a prediction and dropping the variables to two made a less accurate model.  Without doing any more analysis, I picked the model with the lowest MAE )Boosted Decision Tree Regression) and published it at a web service:

image

I published it as a web service and now I can consume if from a client app.   I used the code that I used for voting analysis found here as a template and sure enough:

["27519","Restaurant","0","96.0897827148438"]

["27612","Restaurant","0","95.5728530883789"]

So restaurants in Cary,NC have a higher inspection score than the ones found in Northwest Raleigh.   However, before we start  alerting the the Cary Chamber of Commerce to create a marketing campaign (“Eat in Cary, we are safer”), the difference is within the MAE.

In any event, it would be easy to create a  phone app and you don’t know a restaurant score, you can punch in the establishment type and the zip code and have a good idea about the score of the restaurant. 

This is an academic exercise b/c the establishments have to show you their card and yelp has their score on them, but a fun exercise none the less.  Happy eating.

Tech Jam and Team Islington Green

IslingtonGreen

 

So a couple of friends from TRINUG – Ian Cillay and David Green – invited me to join them at a 36 hour continuous code fest called Tech Jam. Tech Jam was put on Met Life and the Department of Veterans Affairs on Nov 1 and 2 in RTP. This team of 3 developers was joined by Ian Henshaw who helped with some primary data provision and much of the user story development. David handled the MongoDB part, Ian took care of the UI (Bootstrap and Knockout), and I did the analytics (F# and R) and security piece.

The problem domain was that the Veterans Administration has this format of medical records called Blue Button. Blue Button is an unstructured format which make typical parsing and analysis very difficult. Here is a sample:

clip_image002

Also, the VA wanted some kind of mutli-platform solution that a vet can use that can make sense of this data and allow him/her to provide it to a care giver when needed.

Some random thoughts about the contest:

  • MongoDB makes a lot of sense for the data because of it being so unstructured. Note that the VA has now introduced BlueButton+, which is XML format – so that is a step in the right direction. I was impressed how easy Mongo was to use and how powerful it is to tackle unstructured data – but note that even MongoDb still needs some kind of structure – just not xNF relational…
  • Bootstrap was awesome – we used an out of the box template and it was great to knock out an easy design.
  • F# made analytics and predictive analysis a snap. 
  • There were 15 teams registered, 10 actually presented at the end. There were 2 community teams (us and another), 1 high school team (awesome), and a bunch of corporate teams (Deutsche Bank, IBM, Tata(X3!), etc…).
  • One of the teams (Infusion) was a vendor for Met Life already (they worked on the wall project and the infinity project). Unsurprisingly, they won the grand prize.  My only comment on that is that civic/community hackers already view events like this with a skeptical eye – and this did nothing to help MetLife in the eyes of those kind of people – in fact probably did the opposite.
  • I was amazed by how many teams did not actually address the primary problems that the VA needed fixed.  The problems were taming unstructured data and presenting it in a platform-agnostic way.  Most teams used HTML5/Phonegap for req #2 – which is the easier one.  I think only 3 teams actually addressed requirement #1?
  • I am proud to say that my team did address both requirements.  As I sometimes say “I listen to two things in life: my wife and the requirements.  It goes better for me if I do that.”
  • Another team was from a company where the boss showed up on Saturday to present the team’s work. I guess that is the difference with a corporate hack-a-thon.  Also, you can tell the corporate teams because they had more powerpoint and less code in their final presentation.  Not that there is anything wrong with that….
  • The best line this weekend was when I asked a D level person at Metlife why no one at Metlife was retweeting my tweets with their hashtags (#techjam) and he said "we have a guy for that." Sure enough, they had 1 person who was their tweet guy.
  • MetLife really know how to put on a contest. The MC for this – Gary Hoberman – was awesome – a V-Level techie that really could communicate with the coders. The food was good (they took into account different dietary needs), the working space was good and the swag was useable.  Also, they had dev mentors circulating around the room, though they seemed to spend their time with other teams – so we didn’t interact with them.  They also had reps from MongoDB and MSFT helping out – which is great because we leaned on them for specific problems.

The problem with a code contest is that you have little time to learn from your fellow devs – it is pretty much heads down and check-in. In any event, the best part was hanging out great developers like Ian and David and working on a worthwhile project for our country’s vets.  At the end, it was a fun time and my team delivered that the VA can use to springboard into a real application.