Building Amino Acid Lookup Dictionaries Using Python and F#

The heart of hypedsearch is a “seed and extend” algorithm. We take the results from the mass spectrometry and compare them to our known dictionary of peptides to see if any are hybrid. The problem is that the mass-spec observations can be 1 to n amino acids and there is considerable noise in the samples. As a way of increasing compute time performance, we decided to break the dictionary of known peptides into a catalog of fragments with each catalog based on the size of the kmer used to break the protein’s amino acid chains into parts. The idea is to pass a weight from a mass-spec observation and get a return of all possible peptides that have the same weight (measured in daltons)

I first implemented the dictionary is python since that is what hypedsearch is written in. The proteins are stored in a file format called “.fasta” which is the de-facto standard for storing genomic and proteomic data. There is actually a fasta module that makes reading and parsing the file a snap. After reading the file, I parsed the contents into a data structure that contains the name of the protein and a collection of amino acids – each with their individual weight

Amino_Acid = namedtuple('Amino_Acid', ['Id', 'Weight' ],defaults=['',0.0])
Protein = namedtuple('Protein',['Name', 'Amino_Acids'],defaults=['', []])

def extract_protein_name(description):
    return description.split('|')[-1].split()[0]

def generate_amino_acid(character):
    weights = {
    "A": 71.037114,"R": 156.101111,"N": 114.042927,"D": 115.026943,"C": 103.009185,
    "E": 129.042593,"Q": 128.058578,"G": 57.021464,"H": 137.058912,"I": 113.084064,
    "L": 113.084064,"K": 128.094963,"M": 131.040485,"F": 147.068414,"P": 97.052764,
    "S": 87.032028,"T": 101.047679,"U": 150.95363,"W": 186.079313,"Y": 163.06332,
    "V": 99.068414,"X": 0, "B": 113.084064, "Z": 0 }
    weight = weights.get(character)
    return Amino_Acid(character,weight)

def extract_amino_acids(sequence):
    amino_acids = []
    for character in sequence:
        amino_acid = generate_amino_acid(character)
        amino_acids.append(amino_acid)
    return amino_acids      

def generate_proteins(fasta_parser):
    proteins = []
    for entry in fasta_parser:
        protein_name = extract_protein_name(entry.description)
        amino_acids = extract_amino_acids(entry.sequence)
        protein = Protein(protein_name,amino_acids)
        proteins.append(protein)
    return proteins

The next step is to create a data structure that has a attribute of weight and the amino acids associated with that weight – with the index from the original protein of where that amino acid chain is located (note that I used amino acid chain and peptide interchangeably, apologies if some biologist out there just threw their Cheetos at the screen).

Protein_Match = namedtuple('Protein_Match',['Protein_Name', 'Weight', 'Start_Index', 'End_Index'], defaults=['',0,0,0])

def get_cumulative_weights(amino_acids, kmer_length):
    df_all = pd.DataFrame(amino_acids)
    df_weights = df_all.loc[:,'Weight']
    windows = df_weights.rolling(kmer_length).sum()
    no_nan_windows = windows.fillna(0)
    rounded_windows = no_nan_windows.apply(lambda x: round(x,2))
    return rounded_windows

def generate_protein_match(protein, data_tuple, kmer_length):
    protein_name = protein.Name
    (start_index, weight) = data_tuple
    end_index = start_index + kmer_length
    return Protein_Match(protein_name,weight, start_index,end_index)

def get_protein_matches(protein, kmer_length):
    protein_matches = []
    cumulative_weights = get_cumulative_weights(protein.Amino_Acids,kmer_length)
    indexes = cumulative_weights.index.tolist()
    values = cumulative_weights.values.tolist()
    data_tuples = list(zip(indexes,values))
    for data_tuple in data_tuples:
        protein_match = generate_protein_match(protein, data_tuple, kmer_length)
        protein_matches.append(protein_match)
    return protein_matches

def generate_proteins_matches(proteins,kmer_length):
    proteins_matches = []
    for protein in proteins:
        protein_matches = get_protein_matches(protein,kmer_length)
        proteins_matches = proteins_matches + protein_matches
    return proteins_matches

Once I had all of the proteinweights for a given kmer, I could then bundle them up into a data structure that had all of the records associated with a single weight.

Weight_Protein_Matches = namedtuple('Weight_Protein_Matches',['Weight','Protein_Match'],defaults=[0,[]])

def handle_weight_protein_matches(weight_protein_matches, all_weight_protein_matches):
    exists = False
    for item in all_weight_protein_matches:
        if item.Weight == weight_protein_matches.Weight:
            item.Protein_Match.append(weight_protein_matches)
            exists = True
            break
    if exists == False:
        all_weight_protein_matches.append(weight_protein_matches)

def generate_all_weight_protein_matches(protein_matches):
    all_weight_protein_matches = []
    for protein_match in protein_matches:
        weight = protein_match.Weight
        weight_protein_matches = Weight_Protein_Matches(weight, [protein_match])
        handle_weight_protein_matches(weight_protein_matches,all_weight_protein_matches)
    return all_weight_protein_matches

Nothing really exciting here (the way code should be) – just lots of loops. I did try to avoid mutable variables and I am not happy with that one early return in handle_weight_protein_matches. I then took the functions out for a spin.

file_path = '/Users/jamesdixon/Documents/Hypedsearch_All/hypedsearch/data/database/sample_database.fasta'
fasta_parser = fasta.read(file_path) #n = 279
proteins = generate_proteins(fasta_parser)
proteins_matches = generate_proteins_matches(proteins,2) #n=103163 for kmer=2
all_weight_protein_matches = generate_all_weight_protein_matches(proteins_matches) #n=176 for kmer=2 and round to 2 decimal places
print(all_weight_protein_matches)

And it ran like a champ

So then I thought “I am not a fan of all of those loops and named tuples for data structures leaves a lot to be desired. I wonder if I can implement this in F#? Also I was inspired by Don teaching Guido F# last week might have entered my thinking. Turns out, it is super easy in F# thanks to the high-order functions in the language.

Step one was to read the data from the file. Unlike python, there is a not a .fasta type provider AFAIK so I wrote some code to parse the contents (thank you to Fyodor Soikin for answering my Stack Overflow question on the chunk function)

type FastaEntry = {Description:String; Sequence:String}

let chunk lines =
    let step (blocks, currentBlock) s =
        match s with
        | "" -> (List.rev currentBlock :: blocks), []
        | _ -> blocks, s :: currentBlock
    let (blocks, lastBlock) = Array.fold step ([], []) lines
    List.rev (lastBlock :: blocks)

let generateFastaEntry (chunk:String List) =
    match chunk |> Seq.length with
    | 0 -> None
    | _ ->
        let description = chunk |> Seq.head
        let sequence = chunk |> Seq.tail |> Seq.reduce (fun acc x -> acc + x)
        Some {Description=description; Sequence=sequence}

let parseFastaFileContents fileContents = 
    chunk fileContents
    |> Seq.map(fun c -> generateFastaEntry c)
    |> Seq.filter(fun fe -> fe.IsSome)
    |> Seq.map(fun fe -> fe.Value)

Generating the amino acids and proteins was roughly equivalent to the python code – though I have to admit that the extra characters when setting up the amino acid record types was annoying compared to the name tuple syntax. On the flip side, no for..each – just high order .skip, .head, .map to do the work and .toList to keep the data aligned.

type AminoAcid = {Id:String; Weight:float}
type Protein = {Name:string; AminoAcids: AminoAcid List}

let extractProteinName (description:string) =
    description.Split('|')
    |> Seq.skip 1
    |> Seq.head

let generateAminoAcid (character:char) =
    match character with
    | 'A' -> {Id="A"; Weight=71.037114}| 'R' -> {Id="R"; Weight=156.101111} 
    | 'N' -> {Id="N"; Weight=114.042927} | 'D' -> {Id="D"; Weight=115.026943} 
    | 'C' -> {Id="C"; Weight=103.009185} | 'E' -> {Id="E"; Weight=129.042593} 
    | 'Q' -> {Id="Q"; Weight=128.058578} | 'G' -> {Id="G"; Weight=57.021464} 
    | 'H' -> {Id="H"; Weight=137.058912} | 'I' -> {Id=":I"; Weight=113.084064} 
    | 'L' -> {Id="L"; Weight=113.084064} | 'K' -> {Id="K"; Weight=128.094963} 
    | 'M' -> {Id="M"; Weight=131.040485} | 'F' -> {Id="F"; Weight=147.068414} 
    | 'P' -> {Id="P"; Weight=97.052764} | 'S' -> {Id="S"; Weight=87.032028} 
    | 'T' -> {Id="T"; Weight=101.047679}| 'U' -> {Id="U"; Weight=150.95363} 
    | 'W' -> {Id="W"; Weight=186.079313} | 'Y' -> {Id="Y"; Weight=163.06332} 
    | 'V' -> {Id="V"; Weight=99.068414} | 'X' -> {Id="X"; Weight=0.0} 
    | 'B' -> {Id="B"; Weight=113.084064} | 'Z' -> {Id="Z"; Weight=0.0}
    | _ -> {Id="Z"; Weight=0.0}

let extractAminoAcids(sequence:string) =
    sequence.ToUpper().ToCharArray()
    |> Seq.map(fun c -> generateAminoAcid c)
    |> Seq.toList

let generateProtein(fastaEntry:FastaEntry)=
    let name = extractProteinName fastaEntry.Description
    let aminoAcids = extractAminoAcids fastaEntry.Sequence
    {Name=name;AminoAcids=aminoAcids}

let generateProteins parsedFileContents =
    parsedFileContents
    |> Seq.map(fun fc -> generateProtein fc)
    |> Seq.toList

On to the ProteinMatch data structure

type ProteinMatch = {ProteinName:string; Weight:float; StartIndex:int; EndIndex:int}

let generateProteinMatch (protein: Protein) (index:int) (aminoAcids:AminoAcid array) (kmerLength:int) =
    let name = protein.Name
    let weight = 
        aminoAcids 
        |> Array.map(fun aa -> aa.Weight)
        |> Array.reduce(fun acc x -> acc + x)
    let startIndex = index * kmerLength 
    let endIndex = index * kmerLength + kmerLength      
    {ProteinName = name; Weight = weight; StartIndex = startIndex; EndIndex = endIndex}

let generateProteinMatches (protein: Protein) (kmerLength:int) =
    protein.AminoAcids
    |> Seq.windowed(kmerLength)
    |> Seq.mapi(fun idx aa -> generateProteinMatch protein idx aa kmerLength)

let generateAllProteinMatches (proteins: Protein list) (kmerLength:int) =
    proteins 
    |> Seq.map(fun p -> generateProteinMatches p kmerLength)
    |> Seq.reduce(fun acc pm -> Seq.append acc pm)

So I love the windowed function for creating the slices of kmers. Compared to the python, I find the code is much more readable and much more testable – plus the bonus of parallelism by adding “PSeq”.

To the last data structure

type WeightProteinMatches = {Weight:float; ProteinMatchs:ProteinMatch list}

let generateWeightProteinMatches (proteinMatches: ProteinMatch list)=
    proteinMatches
    |> Seq.map(fun pm -> {ProteinName=pm.ProteinName;StartIndex=pm.StartIndex;EndIndex=pm.EndIndex;Weight=System.Math.Round(pm.Weight,2)})
    |> Seq.groupBy(fun pm -> pm.Weight)
    |> Seq.map(fun (w,pm) -> {Weight= w; ProteinMatchs = pm |> Seq.toList })

More love: the groupBy function for the summation is perfect. The groupBy function in python does not return a similar data structure of the index and the associated collections – the way F# does it makes building up that data structure a snap. Less Code, fewer errors, more fun.

Once the functions were in place, I took them out for a spin

let filePath = @"/Users/jamesdixon/Documents/Hypedsearch_All/hypedsearch/data/database/sample_database.fasta"
let fileContents = File.ReadLines(filePath) |> Seq.cast |> Seq.toArray
let parsedFileContents = parseFastaFileContents fileContents
let proteins = generateProteins parsedFileContents
let proteinMatches = generateAllProteinMatches proteins 2
let weightProteinMatches = generateWeightProteinMatches (proteinMatches |> Seq.toList)

And it ran like a champ

Since I am obviously a F# fan-boy, I much preferred writing the F#. I am curious what my python-centric colleges will think when I present this…

Leave a comment