Azure Functions Using F#

When Azure Functions first came out, F# had pretty good support – templates, the ability to run a .fsx file, cool examples written by Don… Fast forward to 2021 and we have bupkis. I recently wrote a dictionary of amino acid weights that would be perfect as a service: pass in a kmer length and a weight, get back all of the amino acid chains that match.

I first attempted to create a function app, select a .fsx template, and write the code in my browser. Alas, only .csx scripting is available.

I attempted to remove the .csx file and put a .fsx file in via Kudu, but it looks like support for F# scripting is not installed. I created an issue on Azure UserVoice, but Jordan Marr beat me to it by about 2 years.

I then attempted to open VS Code and upload a function to Azure. Alas, only C# templates are available in VS Code, and swapping .fs and .fsproj files into the out-of-the-box project fails to compile.

I then installed Visual Studio (remember that?) hoping that it might have some templates. Nope.

So I ranted on Twitter and got some helpful advice from Jon Canning and Jiri Landsman: both have projects on GitHub that create an Azure Function: here and here. I downloaded them – Jon’s took some work because I didn’t have Paket installed. I am amazed that a simple “Hello World” response to an HTTP request takes 11 files, and of the two files that contain F# code, there are 31 lines of code. That is about one file and two lines of code per character.

I got both projects building locally, but I was at a loss about how to actually deploy them to Azure since there is no built-in facility in VS Code or Visual Studio. Jon pointed me to this article, from which I figured out the deployment.

Here are the steps to get an F# Azure Function working (a minimal sketch of the kind of function you end up deploying follows the list):

  1. Go into Azure and create an Azure Function App
  2. Download either Jon’s or Jiri’s template
  3. Build the template locally
  4. Go to the bin folder and compress the contents. Do not compress the folder itself.
  5. Go to the Kudu zip deploy screen at https://xxx.scm.azurewebsites.net/ZipDeployUI
  6. Drag/drop the .zip file you made in step 4 onto the screen. Wait for Kudu to finish
  7. Go to your function app and hopefully see your new function
  8. Navigate to that function to see it working
  9. Lament the fact that you have to jump through all of these hoops when all you wanted to do was return “hello world” to an HTTP request
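
Because the templates do the heavy lifting, it helps to know roughly what you are deploying. Here is a minimal sketch of the kind of HTTP-triggered function they contain, written against the in-process model (Microsoft.Azure.WebJobs attributes); the names and namespace are my own illustration, not Jon’s or Jiri’s actual code:

namespace HelloFunctions

open Microsoft.AspNetCore.Http
open Microsoft.AspNetCore.Mvc
open Microsoft.Azure.WebJobs
open Microsoft.Azure.WebJobs.Extensions.Http

module HelloWorld =

    // Respond to GET or POST on /api/HelloWorld with a plain "hello world"
    [<FunctionName("HelloWorld")>]
    let run ([<HttpTrigger(AuthorizationLevel.Anonymous, "get", "post")>] req: HttpRequest) =
        OkObjectResult("hello world") :> IActionResult

Most of the other files in those templates are project and host plumbing (.fsproj, host.json, local.settings.json, and so on), which is where the file count comes from.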

I have not tried deploying to Azure from GitHub yet. If past experience is any indication, I will need to carve out a Saturday to do it. What really annoys me is that I have limited time to work on this project, and instead of actually writing code that will help cure diabetes, I am spending it on this garbage. No wonder academics just do everything on their own machines. Hopefully Microsoft will recognize this and devote more help to people trying to solve domain problems so we can spend less time solving Microsoft infrastructure puzzles.

Building Amino Acid Lookup Dictionaries Using Python and F#

The heart of hypedsearch is a “seed and extend” algorithm. We take the results from the mass spectrometry and compare them to our known dictionary of peptides to see if any are hybrid. The problem is that the mass-spec observations can be 1 to n amino acids long and there is considerable noise in the samples. As a way of improving runtime performance, we decided to break the dictionary of known peptides into catalogs of fragments, with each catalog based on the kmer length used to break the proteins’ amino acid chains into parts. The idea is to pass in a weight from a mass-spec observation and get back all possible peptides that have the same weight (measured in daltons). For example, at a kmer length of 2 the fragment “GA” weighs 57.02 + 71.04 = 128.06 daltons, so a query for 128.06 should return every protein position where a pair with that combined weight occurs.

I first implemented the dictionary in Python since that is what hypedsearch is written in. The proteins are stored in a file format called “.fasta”, which is the de-facto standard for storing genomic and proteomic data. There is actually a fasta module that makes reading and parsing the file a snap. After reading the file, I parsed the contents into a data structure that contains the name of the protein and a collection of amino acids – each with its individual weight.

from collections import namedtuple

Amino_Acid = namedtuple('Amino_Acid', ['Id', 'Weight' ],defaults=['',0.0])
Protein = namedtuple('Protein',['Name', 'Amino_Acids'],defaults=['', []])

def extract_protein_name(description):
    return description.split('|')[-1].split()[0]

def generate_amino_acid(character):
    weights = {
    "A": 71.037114,"R": 156.101111,"N": 114.042927,"D": 115.026943,"C": 103.009185,
    "E": 129.042593,"Q": 128.058578,"G": 57.021464,"H": 137.058912,"I": 113.084064,
    "L": 113.084064,"K": 128.094963,"M": 131.040485,"F": 147.068414,"P": 97.052764,
    "S": 87.032028,"T": 101.047679,"U": 150.95363,"W": 186.079313,"Y": 163.06332,
    "V": 99.068414,"X": 0, "B": 113.084064, "Z": 0 }
    weight = weights.get(character)
    return Amino_Acid(character,weight)

def extract_amino_acids(sequence):
    amino_acids = []
    for character in sequence:
        amino_acid = generate_amino_acid(character)
        amino_acids.append(amino_acid)
    return amino_acids      

def generate_proteins(fasta_parser):
    proteins = []
    for entry in fasta_parser:
        protein_name = extract_protein_name(entry.description)
        amino_acids = extract_amino_acids(entry.sequence)
        protein = Protein(protein_name,amino_acids)
        proteins.append(protein)
    return proteins

The next step is to create a data structure that has a attribute of weight and the amino acids associated with that weight – with the index from the original protein of where that amino acid chain is located (note that I used amino acid chain and peptide interchangeably, apologies if some biologist out there just threw their Cheetos at the screen).

import pandas as pd

Protein_Match = namedtuple('Protein_Match',['Protein_Name', 'Weight', 'Start_Index', 'End_Index'], defaults=['',0,0,0])

def get_cumulative_weights(amino_acids, kmer_length):
    df_all = pd.DataFrame(amino_acids)
    df_weights = df_all.loc[:,'Weight']
    windows = df_weights.rolling(kmer_length).sum()
    no_nan_windows = windows.fillna(0)
    rounded_windows = no_nan_windows.apply(lambda x: round(x,2))
    return rounded_windows

def generate_protein_match(protein, data_tuple, kmer_length):
    protein_name = protein.Name
    (start_index, weight) = data_tuple
    end_index = start_index + kmer_length
    return Protein_Match(protein_name,weight, start_index,end_index)

def get_protein_matches(protein, kmer_length):
    protein_matches = []
    cumulative_weights = get_cumulative_weights(protein.Amino_Acids,kmer_length)
    indexes = cumulative_weights.index.tolist()
    values = cumulative_weights.values.tolist()
    data_tuples = list(zip(indexes,values))
    for data_tuple in data_tuples:
        protein_match = generate_protein_match(protein, data_tuple, kmer_length)
        protein_matches.append(protein_match)
    return protein_matches

def generate_proteins_matches(proteins,kmer_length):
    proteins_matches = []
    for protein in proteins:
        protein_matches = get_protein_matches(protein,kmer_length)
        proteins_matches = proteins_matches + protein_matches
    return proteins_matches

Once I had all of the protein weights for a given kmer length, I could then bundle them up into a data structure that had all of the records associated with a single weight.

Weight_Protein_Matches = namedtuple('Weight_Protein_Matches',['Weight','Protein_Match'],defaults=[0,[]])

def handle_weight_protein_matches(weight_protein_matches, all_weight_protein_matches):
    exists = False
    for item in all_weight_protein_matches:
        if item.Weight == weight_protein_matches.Weight:
            # merge the new matches into the existing entry for this weight
            item.Protein_Match.extend(weight_protein_matches.Protein_Match)
            exists = True
            break
    if exists == False:
        all_weight_protein_matches.append(weight_protein_matches)

def generate_all_weight_protein_matches(protein_matches):
    all_weight_protein_matches = []
    for protein_match in protein_matches:
        weight = protein_match.Weight
        weight_protein_matches = Weight_Protein_Matches(weight, [protein_match])
        handle_weight_protein_matches(weight_protein_matches,all_weight_protein_matches)
    return all_weight_protein_matches

Nothing really exciting here (the way code should be) – just lots of loops. I did try to avoid mutable variables, and I am not happy with the mutable flag and early break in handle_weight_protein_matches. I then took the functions out for a spin.

import fasta  # the fasta module mentioned above

file_path = '/Users/jamesdixon/Documents/Hypedsearch_All/hypedsearch/data/database/sample_database.fasta'
fasta_parser = fasta.read(file_path) #n = 279
proteins = generate_proteins(fasta_parser)
proteins_matches = generate_proteins_matches(proteins,2) #n=103163 for kmer=2
all_weight_protein_matches = generate_all_weight_protein_matches(proteins_matches) #n=176 for kmer=2 and round to 2 decimal places
print(all_weight_protein_matches)

And it ran like a champ

So then I thought, “I am not a fan of all of those loops, and named tuples as data structures leave a lot to be desired. I wonder if I can implement this in F#?” Also, Don teaching Guido F# last week might have entered my thinking. Turns out, it is super easy in F# thanks to the higher-order functions in the language.

Step one was to read the data from the file. Unlike Python, there is not a .fasta type provider AFAIK, so I wrote some code to parse the contents (thank you to Fyodor Soikin for answering my Stack Overflow question about the chunk function).

open System

type FastaEntry = {Description:String; Sequence:String}

let chunk lines =
    let step (blocks, currentBlock) s =
        match s with
        | "" -> (List.rev currentBlock :: blocks), []
        | _ -> blocks, s :: currentBlock
    let (blocks, lastBlock) = Array.fold step ([], []) lines
    List.rev (lastBlock :: blocks)

let generateFastaEntry (chunk:String List) =
    match chunk |> Seq.length with
    | 0 -> None
    | _ ->
        let description = chunk |> Seq.head
        let sequence = chunk |> Seq.tail |> Seq.reduce (fun acc x -> acc + x)
        Some {Description=description; Sequence=sequence}

let parseFastaFileContents fileContents = 
    chunk fileContents
    |> Seq.map(fun c -> generateFastaEntry c)
    |> Seq.filter(fun fe -> fe.IsSome)
    |> Seq.map(fun fe -> fe.Value)

Generating the amino acids and proteins was roughly equivalent to the Python code – though I have to admit that the extra characters when setting up the amino acid record type were annoying compared to the named tuple syntax. On the flip side, no for..each – just higher-order .skip, .head, and .map to do the work and .toList to keep the data aligned.

type AminoAcid = {Id:String; Weight:float}
type Protein = {Name:string; AminoAcids: AminoAcid List}

let extractProteinName (description:string) =
    description.Split('|')
    |> Seq.skip 1
    |> Seq.head

let generateAminoAcid (character:char) =
    match character with
    | 'A' -> {Id="A"; Weight=71.037114}| 'R' -> {Id="R"; Weight=156.101111} 
    | 'N' -> {Id="N"; Weight=114.042927} | 'D' -> {Id="D"; Weight=115.026943} 
    | 'C' -> {Id="C"; Weight=103.009185} | 'E' -> {Id="E"; Weight=129.042593} 
    | 'Q' -> {Id="Q"; Weight=128.058578} | 'G' -> {Id="G"; Weight=57.021464} 
    | 'H' -> {Id="H"; Weight=137.058912} | 'I' -> {Id="I"; Weight=113.084064} 
    | 'L' -> {Id="L"; Weight=113.084064} | 'K' -> {Id="K"; Weight=128.094963} 
    | 'M' -> {Id="M"; Weight=131.040485} | 'F' -> {Id="F"; Weight=147.068414} 
    | 'P' -> {Id="P"; Weight=97.052764} | 'S' -> {Id="S"; Weight=87.032028} 
    | 'T' -> {Id="T"; Weight=101.047679}| 'U' -> {Id="U"; Weight=150.95363} 
    | 'W' -> {Id="W"; Weight=186.079313} | 'Y' -> {Id="Y"; Weight=163.06332} 
    | 'V' -> {Id="V"; Weight=99.068414} | 'X' -> {Id="X"; Weight=0.0} 
    | 'B' -> {Id="B"; Weight=113.084064} | 'Z' -> {Id="Z"; Weight=0.0}
    | _ -> {Id="Z"; Weight=0.0}

let extractAminoAcids(sequence:string) =
    sequence.ToUpper().ToCharArray()
    |> Seq.map(fun c -> generateAminoAcid c)
    |> Seq.toList

let generateProtein(fastaEntry:FastaEntry)=
    let name = extractProteinName fastaEntry.Description
    let aminoAcids = extractAminoAcids fastaEntry.Sequence
    {Name=name;AminoAcids=aminoAcids}

let generateProteins parsedFileContents =
    parsedFileContents
    |> Seq.map(fun fc -> generateProtein fc)
    |> Seq.toList

On to the ProteinMatch data structure

type ProteinMatch = {ProteinName:string; Weight:float; StartIndex:int; EndIndex:int}

let generateProteinMatch (protein: Protein) (index:int) (aminoAcids:AminoAcid array) (kmerLength:int) =
    let name = protein.Name
    let weight = 
        aminoAcids 
        |> Array.map(fun aa -> aa.Weight)
        |> Array.reduce(fun acc x -> acc + x)
    // Seq.windowed slides one residue at a time, so the window index is the start position
    let startIndex = index
    let endIndex = index + kmerLength
    {ProteinName = name; Weight = weight; StartIndex = startIndex; EndIndex = endIndex}

let generateProteinMatches (protein: Protein) (kmerLength:int) =
    protein.AminoAcids
    |> Seq.windowed(kmerLength)
    |> Seq.mapi(fun idx aa -> generateProteinMatch protein idx aa kmerLength)

let generateAllProteinMatches (proteins: Protein list) (kmerLength:int) =
    proteins 
    |> Seq.map(fun p -> generateProteinMatches p kmerLength)
    |> Seq.reduce(fun acc pm -> Seq.append acc pm)

So I love the windowed function for creating the slices of kmers. Compared to the Python, I find the code much more readable and much more testable – plus the bonus of parallelism by adding “PSeq” (a sketch of that follows).
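
As a rough sketch of that last point (untested, and assuming the FSharp.Collections.ParallelSeq package), a parallel version of generateAllProteinMatches might look like this:

open FSharp.Collections.ParallelSeq

// Hypothetical parallel variant: fan the proteins out across cores,
// then collect all of the ProteinMatch records back into a single list
let generateAllProteinMatchesParallel (proteins: Protein list) (kmerLength: int) =
    proteins
    |> PSeq.collect (fun p -> generateProteinMatches p kmerLength)
    |> PSeq.toList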

To the last data structure

type WeightProteinMatches = {Weight:float; ProteinMatchs:ProteinMatch list}

let generateWeightProteinMatches (proteinMatches: ProteinMatch list)=
    proteinMatches
    |> Seq.map(fun pm -> {ProteinName=pm.ProteinName;StartIndex=pm.StartIndex;EndIndex=pm.EndIndex;Weight=System.Math.Round(pm.Weight,2)})
    |> Seq.groupBy(fun pm -> pm.Weight)
    |> Seq.map(fun (w,pm) -> {Weight= w; ProteinMatchs = pm |> Seq.toList })

More love: the groupBy function is perfect for rolling the matches up by weight. Python’s groupby does not return a similar data structure of the key and its associated collection – the way F# does it makes building up that data structure a snap. Less code, fewer errors, more fun.

Once the functions were in place, I took them out for a spin

open System.IO

let filePath = @"/Users/jamesdixon/Documents/Hypedsearch_All/hypedsearch/data/database/sample_database.fasta"
let fileContents = File.ReadLines(filePath) |> Seq.toArray // ReadLines already yields a seq<string>
let parsedFileContents = parseFastaFileContents fileContents
let proteins = generateProteins parsedFileContents
let proteinMatches = generateAllProteinMatches proteins 2
let weightProteinMatches = generateWeightProteinMatches (proteinMatches |> Seq.toList)

And it ran like a champ
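
To close the loop on the service idea from the first section, here is a hypothetical lookup over that structure – pass in a target weight and a tolerance, get back the matching peptides. The function name and the tolerance handling are my own sketch, not part of hypedsearch:

// Return every ProteinMatch whose rounded kmer weight is within the given
// tolerance (in daltons) of the weight being queried
let findMatchesByWeight (catalog: WeightProteinMatches seq) (targetWeight: float) (tolerance: float) =
    catalog
    |> Seq.filter (fun wpm -> abs (wpm.Weight - targetWeight) <= tolerance)
    |> Seq.collect (fun wpm -> wpm.ProteinMatchs)
    |> Seq.toList

// e.g. all kmer-2 fragments weighing roughly 128.06 daltons
let hits = findMatchesByWeight weightProteinMatches 128.06 0.01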

Since I am obviously an F# fan-boy, I much preferred writing the F#. I am curious what my Python-centric colleagues will think when I present this…