In Russia, The Gene Domain Models You

When I first started getting interested in Computational Biology, I had a very simplistic mental model of genes based on the “central dogma of biology”: DNA makes RNA and RNA makes proteins.  I thought that the human genome was just computer code – instead of binary ones and zeros, it was base-four “A”, “T”, “C”, “G”.*  Therefore, to understand it, I would just start mapping the genome like the Human Genome Project did and then look at the areas that I am interested in: the roughly three billion nucleotides are the hard drive, with some of the genome being the operating system and some being installed software.  When the computer is powered on, the ones and zeros are read off of the hard drive and placed into memory – the equivalent would be the DNA being read out of the chromosomes and transcribed into RNA.

Man, was I wrong.

DNA is not just your operating system and installed programs – it is the complete instructions to make the computer, turn it on, write, compile, and then run the most sophisticated software in the world.  Oh, it also has the instructions to replicate itself and correct errors that might be introduced during replication.  And we can barely figure out how to write Objective-C for a phone app…

So, carrying on the analogy between the human genome and a computer: even if we could determine which bytes on the hard drive are for building a transistor, for example, we have another problem.  Unlike a computer’s hard drive where each byte is allocated to only one purpose, each nucleotide can be used by none, one, or many genes.  It is a fricken madhouse – a single “C-G” pair might be used in creating a val amino acid in one protein for one gene, frameshifted to creating an ala amino acid in a different protein for another gene, and then be used in the regulation region of yet another gene being read in the reverse direction.

The implication is that it does not appear possible to “build up” a domain model of the genome.  You can’t take a series of nucleotides of DNA like “AAAGGTTCGTAGATGCTAG” and know anything about what it is doing.  Rather, it looks like the domain model has to work in reverse:  Given a gene like TP53, allocate different sections of DNA to the different regions: Promoter, Introns, Exons, UTRs, etc.

From an F# modeling point of view, DNA is certainly an “OR” type:

type DNA = A | T | C | G

With the caveat that we sometimes don’t know whether a location is really an A, only that it is not a G or T. Sweeping aside that problem, individual nucleotides are an “AND” type:

type Nucleotide = {Index: int; DNA: DNA; etc…}

let nucleotide0 = {Index=0; DNA=C; etc…}

and then gene would look something like this:

let tP53 = {Gene= TP53; Promoter = [nucleotide0; nucleotide1]; etc…}
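
A minimal sketch of how the snippets above might look as compilable F#, assuming only the fields named in this post. The Gene record, its Name, Exons and Introns fields, and the nucleotidesFromString helper are hypothetical names chosen for illustration, and the elided fields are simply left out:

```fsharp
// Hypothetical sketch of the model described above; anything not named in the post is an assumption.
type DNA = A | T | C | G

type Nucleotide = { Index: int; DNA: DNA }

// Only a few of the regions mentioned in the post are modelled here.
type Gene =
    { Name: string
      Promoter: Nucleotide list
      Exons: Nucleotide list list
      Introns: Nucleotide list list }

// Turn a string such as "AAAGG" into indexed nucleotides, failing on any other letter.
let nucleotidesFromString (offset: int) (text: string) : Nucleotide list =
    text
    |> Seq.mapi (fun i c ->
        let dna =
            match c with
            | 'A' -> A
            | 'T' -> T
            | 'C' -> C
            | 'G' -> G
            | other -> failwithf "Unexpected base %c" other
        { Index = offset + i; DNA = dna })
    |> Seq.toList

// The same nucleotides can be allocated to regions of more than one gene.
let shared = nucleotidesFromString 0 "AAAGGTTCG"
let geneA = { Name = "geneA"; Promoter = List.take 3 shared; Exons = [ List.skip 3 shared ]; Introns = [] }
let geneB = { Name = "geneB"; Promoter = List.skip 6 shared; Exons = []; Introns = [ List.take 6 shared ] }
```

Working from the gene down to the nucleotides, as here, lets geneA and geneB both claim the nucleotides at index 3 to 5 without the nucleotide type needing to know about either gene.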

Note that I am only 1/3 of the way through my genetics class right now, so I might change my understanding next week.  For now, this is my Mental Model 0.2 of the genome.

*side note: some April 4th, I want to sneak into all of the computational biologists’ offices and steal the A, C, T, G, and U keys from their keyboards.  Pandemonium would reign.

Looking at SARS-CoV-2 Genome with F#

(This post is part of the FSharp Advent Calendar Series. Thanks Sergey Tihon for organizing it)

I have just finished up a Cellular and Molecular Biology class at University of Colorado – Boulder (“CU Boulder”, not “UC Boulder” for the unaware) and I was smitten by a couple of the lectures about the COVID virus.  Briefly, viruses exist to propagate and since they are obligate intracellular parasites, they need to find a host cell and co-opt the host cell’s proteins for that end.  That means they enter, hijack some cell proteins, replicate, and then leave to find more cells.

The COVID genome doesn’t really set out to kill humans, that is a side effect of the virus entering the cells (mostly in the lungs) via a common binding site called ACE2.  Since the virus takes up the binding site, some regular bodily functions are inhibited so lots of people get sick and some people die.  Just like farmers in Brazil view destroying the Amazon rain forest as an unfortunate side effect of them creating a farm to support their families (and then have offspring), the virus doesn’t mean to kill all of these people – it’s just business.

From a computational standpoint, the COVID genome is 29,674 base pairs long – which is shockingly small.

You can download the genome from NCBI here.  This is a reference genome, which means it represents several samples of the virus for analysis – it is not one particular sample.  The download is in FASTA format – which is a pretty common way to represent the nucleotides of a genome.

Firing up Visual Studio Code, I imported the COVID genome and parsed the file to only have the nucleotides:

open System
open System.IO

let path = "/Users/jamesdixon/Downloads/sequence.fasta"
let file = File.ReadAllText(path)
let totalLength = file.Length
let prefixLength = 97
let suffixLength = 2
let stringSequence = file.Substring(prefixLength,totalLength-prefixLength-suffixLength)
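
The fixed prefix and suffix lengths work for this one download. A hedged alternative, assuming the usual FASTA layout in which header lines start with ">" and everything else is sequence, is to drop the header lines instead of counting characters. The sequenceFromFasta name and the sample text are made up for illustration:

```fsharp
open System

// Hypothetical helper: keeps only the sequence lines of FASTA text.
// Header lines start with '>' and empty lines are dropped by the split.
let sequenceFromFasta (fastaText: string) : string =
    fastaText.Split([| '\r'; '\n' |], StringSplitOptions.RemoveEmptyEntries)
    |> Array.filter (fun line -> not (line.StartsWith(">")))
    |> Array.map (fun line -> line.Trim())
    |> String.concat ""

// Made-up two-line record, not the real SARS-CoV-2 data.
let sample = ">demo sequence\nATTAAAGG\nTTTATACC\n"
let demoSequence = sequenceFromFasta sample
```

This version does not depend on how long the header happens to be, so the same function should work on other FASTA downloads with a single record.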

I then applied some F# magic to get rid of the magic strings and replace them with a typed representation.

type Base = A | C | G | T

let baseFromString (s:String) =
    match s.ToLowerInvariant() with
    | "a" -> Some A
    | "c" -> Some C
    | "t" -> Some T
    | "g" -> Some G
    | _ -> None

let arraySequence = stringSequence.ToCharArray()

let bases =
    arraySequence
    |> Seq.map(fun c -> c.ToString())
    |> Seq.map(fun s -> baseFromString s)
    |> Seq.filter(fun b -> b.IsSome)
    |> Seq.map(fun b -> b.Value)

I then went back to the original page and started looking at highlights of this genome.  For example, the first marker is this:

UTR stands for “Untranslated Region” – basically ignore the first 265 characters.

The most famous gene in this virus is the spike protein

Where “CDS” stands for coding sequences.  Using F#, you can find it as

let spikeLength = 25384 - 21563 + 1
let spike = bases |> Seq.skip 21562 |> Seq.take spikeLength
spike

Note that I subtracted one from the CDS start value because the annotation is one-based: to begin at position 21563, Seq.skip has to pass over the 21,562 bases before it.  I also added one to the length because the CDS range includes both endpoints.

Going back to the first gene listed, you can see in the notes “translated by -1 ribosomal frameshift”

One of the cool things about viruses is that because their genome is so small, they can generate multiple genes out of the same base sequence.  They do this with a technique called “frameshifting” where the bases are read in groups of three (a codon) from different start locations – effectively using the same base in either the first, second, or third position of a codon.  I believe that the ORF1ab gene is read with a frameshift of -1, so the F# would be:

let orf1abLength = 21555 - 266 + 1
let orf1ab = bases |> Seq.skip 264 |> Seq.take orf1abLength
orf1ab

I subtracted two from the CDS start value – one for the one-based annotation and one for the frameshift.  I am not 100% sure that this is the correct interpretation, but it is a good place to start.
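
Turning a one-based, inclusive annotation range into Seq.skip and Seq.take arguments is easy to get wrong by one. A small sketch of a helper that does the arithmetic in one place, where sliceOneBased is a hypothetical name and the letters are toy data rather than the genome:

```fsharp
// Hypothetical helper: slice a 1-based, inclusive annotation range out of a sequence.
let sliceOneBased (startPos: int) (endPos: int) (items: seq<'a>) : seq<'a> =
    items
    |> Seq.skip (startPos - 1)             // the positions before startPos are passed over
    |> Seq.take (endPos - startPos + 1)    // both endpoints are included

// Toy data: position 1 is 'A', position 2 is 'B', and so on.
let letters = [ 'A'; 'B'; 'C'; 'D'; 'E'; 'F' ]
let middle = letters |> sliceOneBased 2 4 |> Seq.toList
```

With a helper like this, a range from the annotation can be passed through as written, and any frameshift adjustment stays a separate, visible step.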

I then wanted to look at how the codons mapped to amino acids.  I am sure this mapping is done thousands of times in different programs/languages – the F# type system makes the mapping a bit less error prone.  Consider the following snippet:

type AminoAcid = Phe | Leu | Ile | Met | Val | Ser | Pro | Thr | Ala | Tyr | His | Gln | Asn | Lys | Asp | Glu | Cys | Trp | Arg | Gly

type Codon = Stop | AminoAcid of AminoAcid

I am sure other implementations put the Stop codon in with the amino acids even though it is not an amino acid.  Keeping them separate is more correct – and can prevent bugs later on.

I then started creating a function to map three bases into the correct codon.  I initially did something like this:

let AminoAcidFromBases bases =
    match bases with
    | TTT -> Phe | TTC -> Phe
    | TTA -> Leu | TTG -> Leu | CTT -> Leu | CTC -> Leu | CTA -> Leu | CTG -> Leu
    | ATT -> Ile | ATC -> Ile | ATA -> Ile

Even though the compiler barfs on this syntax, it is the most intuitive and matches the domain best.  I then started coding for the code, rather than the domain, by using a sequence like this:

let AminoAcidFromBases bases =
    match bases with
    | [TTT] -> Phe | [TTC] -> Phe
    | [TTA] -> Leu | [TTG] -> Leu | [CTT] -> Leu | [CTC] -> Leu | [CTA] -> Leu | [CTG] -> Leu

I also had the problem of incomplete cases – I need the “else” condition like this:

    | _ -> None

Which means I then have to go back and put Some in front of all of the results:

    | [TTT] -> Some Phe | [TTC] -> Some Phe

But then my code is super cluttered.

Also, if I want the function to be “CodonFromBases” versus “AminoAcidFromBases”, I would have to add this:

    | [TTT] -> Some (AminoAcid Phe) | [TTC] -> Some (AminoAcid Phe)

And ugh, super-duper clutter.

I am still thinking through the best way to represent this part of the domain.  Any suggestions are welcome.
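
One possible answer, offered as a sketch rather than the definitive representation: match on a tuple of three Base values. A three-tuple of a four-case union has exactly 64 values, so the compiler can check that the match is complete and no “else” case or Some wrapper is needed. The table below is the standard genetic code; the codonFromBases and translate names are hypothetical:

```fsharp
type Base = A | C | G | T

type AminoAcid =
    | Phe | Leu | Ile | Met | Val | Ser | Pro | Thr | Ala | Tyr
    | His | Gln | Asn | Lys | Asp | Glu | Cys | Trp | Arg | Gly

type Codon = Stop | AminoAcid of AminoAcid

// Standard genetic code, written as patterns over (first, second, third) base.
let codonFromBases (b1: Base, b2: Base, b3: Base) : Codon =
    match b1, b2, b3 with
    | T, T, (T | C) -> AminoAcid Phe
    | T, T, (A | G) | C, T, _ -> AminoAcid Leu
    | A, T, (T | C | A) -> AminoAcid Ile
    | A, T, G -> AminoAcid Met
    | G, T, _ -> AminoAcid Val
    | T, C, _ | A, G, (T | C) -> AminoAcid Ser
    | C, C, _ -> AminoAcid Pro
    | A, C, _ -> AminoAcid Thr
    | G, C, _ -> AminoAcid Ala
    | T, A, (T | C) -> AminoAcid Tyr
    | T, A, (A | G) | T, G, A -> Stop
    | C, A, (T | C) -> AminoAcid His
    | C, A, (A | G) -> AminoAcid Gln
    | A, A, (T | C) -> AminoAcid Asn
    | A, A, (A | G) -> AminoAcid Lys
    | G, A, (T | C) -> AminoAcid Asp
    | G, A, (A | G) -> AminoAcid Glu
    | T, G, (T | C) -> AminoAcid Cys
    | T, G, G -> AminoAcid Trp
    | C, G, _ | A, G, (A | G) -> AminoAcid Arg
    | G, G, _ -> AminoAcid Gly

// Read bases three at a time; a trailing partial codon is dropped.
let translate (bases: Base list) : Codon list =
    bases
    |> List.chunkBySize 3
    |> List.filter (fun chunk -> List.length chunk = 3)
    |> List.map (fun chunk -> codonFromBases (chunk.[0], chunk.[1], chunk.[2]))

let demo = translate [ A; T; G; T; T; T; T; A; A; G ]
```

The wildcard in the third position mirrors the redundancy in the code itself, so each amino acid takes one line instead of up to six.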

Hopefully this post will inspire some F# people to start looking at computational biology – there are tons of great data out there and lots of good projects with which to get involved.

Gist is here

Functional Bioinformatics Algorithms: Part 2

Pressing on with more bioinformatic algorithms implemented in a functional style, the next algorithm found in Bioinformatics Algorithms by Compeau and Pevzner is to find the most frequent pattern in a string of text.

I started writing in the imperative style from the book like so (a substring of length k is called a “k-mer”, so the length gets the parameter name “k”):

open System.Collections.Generic

type Count = {Text:string; PatternCount:int}
let frequentWords (text:string) (k:int) =
    let patternCount (text:string) (pattern:string) =
        text 
        |> Seq.windowed pattern.Length 
        |> Seq.map(fun c -> new string(c))
        |> Seq.filter(fun s -> s = pattern)
        |> Seq.length

    // Count every k-mer in the text, one full scan per starting position
    let counts = new List<Count>()
    for i = 0 to text.Length-k do
        let pattern = text.Substring(i,k)
        let count = {Text=pattern; PatternCount=patternCount text pattern}
        counts.Add(count)
    let maxCount = counts |> Seq.map(fun c -> c.PatternCount) |> Seq.max
    // Keep each pattern that reaches the maximum count, once
    let frequentPatterns = new List<string>()
    for i = 0 to counts.Count-1 do
        if counts.[i].PatternCount = maxCount && not (frequentPatterns.Contains(counts.[i].Text)) then
            frequentPatterns.Add(counts.[i].Text)
    frequentPatterns

But I gave up because, well, the code is ridiculous. I went back to the original pattern count algorithms written in F# and then added in a block to find the most frequent patterns:

let frequentWords (text:string) (k:int) =
    let patternCounts =
        text
        |> Seq.windowed k
        |> Seq.map(fun c -> new string(c))
        |> Seq.countBy(fun s -> s)
        |> Seq.sortByDescending(fun (s,c) -> c)
    let maxCount = patternCounts |> Seq.head |> snd
    patternCounts 
        |> Seq.filter(fun (s,c) -> c = maxCount)
        |> Seq.map(fun (s,c) -> s)

The VS Code linter was not happy with my Seq.countBy implementation… but it works. I think the code is self-explanatory:

  1. window the string for the length of k
  2. do a countBy on the resulting substrings
  3. sort it by descending, find the top substring count
  4. filter the substring list by that top substring count.

The last map returns just the pattern and leaves off the frequency, which I think is a mistake but is how the book implements it. Here is an example of the frequentWords function in action:

// One shared generator, so repeated calls do not restart from the same seed
let random = new Random()

let getRandomNucleotide () =
    let dictionary = ["A";"C";"G";"T"]
    dictionary.[random.Next(4)]

let getRandomSequence (length:int) =
    let nucleotides = [ for i in 1 .. length -> getRandomNucleotide() ]
    String.Join("", nucleotides)

let largerText = getRandomSequence 1000000

let currentFrequentWords = frequentWords largerText 9
currentFrequentWords

I didn’t set the seed value for generating the largerText string so the results will be different each time.
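
If the frequency is worth keeping, as suggested above, a variant along these lines might do it. This is a sketch, not the book's algorithm: frequentWordsWithCounts is a hypothetical name, and it returns an empty list when the text is shorter than k:

```fsharp
// Sketch: the same windowed/countBy pipeline, but the counts are kept in the result.
let frequentWordsWithCounts (text: string) (k: int) : (string * int) list =
    let patternCounts =
        text
        |> Seq.windowed k
        |> Seq.map (fun chars -> System.String(chars))
        |> Seq.countBy id
        |> Seq.toList
    match patternCounts with
    | [] -> []                                   // text shorter than k
    | _ ->
        let maxCount = patternCounts |> List.map snd |> List.max
        patternCounts |> List.filter (fun (_, c) -> c = maxCount)

let result = frequentWordsWithCounts "ACGTTGCATGTCGCATGATGCATGAGAGCT" 4
```

For this short input the two most frequent 4-mers, GCAT and CATG, each appear three times, and the caller gets both the patterns and that count back.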

Gist is here

Functional Bioinformatics Algorithms

I have been looking at bioinformatics much more seriously recently by taking Algorithms for DNA Sequencing by Ben Langmead on Coursera and working through Bioinformatics Algorithms by Compeau and Pevzner.

I noticed in both cases that the code samples are very much imperative focused: lots of loops, lots of mutable variables, lots of mess. I endeavored to re-write the code in a more functional style using immutable variables and pipelining functions.

Consider the code for Pattern Count, a function that counts how often a pattern appears in a larger string. For example, the pattern “AAA” appears in the larger string “AAATTTAAA” twice. If the larger string was “AAAA”, the pattern “AAA” also appears twice, since AAA appears at index 0-2 and index 1-3.

Here is the code that appears in the book:

let patternCount (text:string) (pattern:string) =
    let mutable count = 0
    for i = 0 to (text.Length - pattern.Length) do
        let subString = text.Substring(i,pattern.Length)
        if subString = pattern then
            count <- count + 1
        else
            ()
    count

Contrast that to a more functional style:

let patternCount (text:string) (pattern:string) =
    text 
    |> Seq.windowed pattern.Length 
    |> Seq.map(fun c -> new string(c))
    |> Seq.filter(fun s -> s = pattern)
    |> Seq.length

There are three main benefits of the functional style:

  1. The code is much more readable. Each step of the transformation is explicit as a single line of the pipeline. There is almost a one to one match between the text from the book and the code written. The only exception is the Seq.map because the windowed function outputs as a char array and we need to transform it back into a string.
  2. The code is auditable. The pipeline can be stopped at each step and output reviewed for correctness.
  3. The code is reproducible. Because each of the steps uses immutable values, pattern count will produce the same result for a given input regardless of processor, OS, or other external factors.
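
To illustrate the auditability point, here is a small sketch that stops the same pipeline after each step for the “AAAA” example. The intermediate names (auditWindows and so on) are made up for the illustration:

```fsharp
// Stop the patternCount pipeline after each step and keep what comes out.
let auditText = "AAAA"
let auditPattern = "AAA"

// Step 1: sliding windows of the pattern's length, as char arrays
let auditWindows = auditText |> Seq.windowed auditPattern.Length |> Seq.toList

// Step 2: the windows turned back into strings
let auditSubstrings = auditWindows |> List.map (fun chars -> System.String(chars))

// Step 3: only the substrings that equal the pattern
let auditMatches = auditSubstrings |> List.filter (fun s -> s = auditPattern)

// Step 4: the count
let auditCount = List.length auditMatches
```

Each binding can be sent to the REPL on its own, so a wrong count can be traced to the step that produced it.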

In practice, the pattern count can be used like this:

let text = "ACAACTCTGCATACTATCGGGAACTATCCT"
let pattern = "ACTAT"
let counts = patternCount text pattern

val counts : int = 2

In terms of performance, I added in a function to make a sequence of ten million nucleotides and then searched it:

// One shared generator, so repeated calls do not restart from the same seed
let random = new Random()

let getRandomNucleotide () =
    let dictionary = ["A";"C";"G";"T"]
    dictionary.[random.Next(4)]

let getRandomSequence (length:int) =
    let nucleotides = [ for i in 1 .. length -> getRandomNucleotide() ]
    String.Join("", nucleotides)

let largerText = getRandomSequence 10000000
#time
let counts = patternCount largerText pattern

Real: 00:00:00.814, CPU: 00:00:00.866, GC gen0: 173, gen1: 1, gen2: 0
val counts : int = 9816

It ran in about one second on my MacBook using only 1 processor. If I wanted to make it faster and run this for the roughly three billion nucleotides found in human DNA, I would use the Parallel Seq library, which is a single letter change to the code. That would be a post for another time…
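
As a rough sketch of what a parallel version could look like without any extra packages, the following uses Array.Parallel from FSharp.Core rather than the Parallel Seq library mentioned above. The patternCountParallel name is hypothetical, and the index array it allocates would need rethinking for genome-sized inputs:

```fsharp
open System

// Sketch using Array.Parallel (FSharp.Core), not the Parallel Seq library.
let patternCountParallel (text: string) (pattern: string) : int =
    if pattern.Length = 0 || pattern.Length > text.Length then 0
    else
        [| 0 .. text.Length - pattern.Length |]
        |> Array.Parallel.map (fun i ->
            // Ordinal comparison of the window starting at i against the pattern
            if String.CompareOrdinal(text, i, pattern, 0, pattern.Length) = 0 then 1 else 0)
        |> Array.sum

let parallelCounts = patternCountParallel "ACAACTCTGCATACTATCGGGAACTATCCT" "ACTAT"
```

Because each starting position is checked independently and nothing is mutated, the work can be split across cores without changing the result.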

The gist is here

Web Crawling Using F#

(This post is part of the FSharp Advent Calendar Series. Thanks Sergey Tihon for organizing it)

Recently, I had the need to get articles from some United States government websites. You would think in 2019 that these sites might have APIs and you would think wrong. In each case, I needed to crawl the site’s HTML and then extract the information. I started doing this with Python and its Beautiful Soup library but I ran into the fundamental problem that getting the HTML was much harder than parsing it. To illustrate, consider this website

I need to go through all 8 pages of the grid and download the .pdfs that are associated with the “View Report” link. The challenge in this particular site is that they didn’t do any url parameters so there is no way to go through the grid via the uri. Looking at the page source, they are using ASP.NET and in typical enterprise-derpy manner, named their table “GridView1”

The way to get to the next page is to press on the “Next” link defined like this:

They over-achieved in the bloated View State for a simple page category though.

post04

#Sigh

So as bad as this site is, F# made getting the data a snap. I fired up Visual Studio and created a new .NET Core F# project. I added a script file. I realized that the button-press to get to the next page was going to be a pain to program, so I decided to use the .NET framework WebBrowser class. It’s nice because it has all of the apis I needed for the traversal and I didn’t have to make the control visible.

My first function was to get the uris from the grid – easy enough using the HtmlDocument and HtmlElement classes:

open System
open System.Net
open System.Collections.Generic
open System.Windows.Forms

let getPdfUris (document:HtmlDocument) =
    let collection = document.GetElementsByTagName("a")
    collection
    |> Seq.cast<HtmlElement>
    |> Seq.filter(fun e -> e.OuterText = "View Report")
    |> Seq.map(fun e -> e.GetAttribute("href"))

Note the key word I used to filter was “View Report”, so at least the web designer stayed consistent there.

Next, I used basically the same logic to find the Next button in the DOM. Note that I am using the TryFind function so if the button is not found, a None is returned:

let getNextButton (document:HtmlDocument) =
    let collection = document.GetElementsByTagName("a")
    collection
    |> Seq.cast<HtmlElement>
    |> Seq.tryFind(fun e -> e.InnerText = "Next")

So the next function was my giggle moment for this project. To “press” that button to invoke the javascript to go to the next page of the grid, I used the InvokeMember method of the HtmlElement class:

let invokeNextButton (nextButton: HtmlElement) =
    nextButton.InvokeMember("Click") |> ignore
    printfn "Next Button Invoked"

Yup, that works! I was worried that I was going to have to screw around with the javascript or, worse, that beast called View State. Fortunately, that InvokeMember method worked fine. Another reason why I love the .NET framework.

So with these three functions set up, I created a method to be called each time the document is refreshed

let handlePage (browser:WebBrowser) (totalUris:List<string>) =
    let document = browser.Document
    let uris = getPdfUris document
    totalUris.AddRange(uris)
    let nextButton = getNextButton document
    match nextButton with
    | Some b ->
        invokeNextButton b
    | None -> ()

My C#-only friends spend waaaay too much time worrying about the last page and having no button and how to code it. I used the option type – is there a Next button? Press it and do work. No Button? Do nothing.
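
The same idea can be seen away from the browser. The following is a sketch with a hypothetical in-memory Page type standing in for the grid, not the WebBrowser API: the option on Next decides whether there is another page to visit.

```fsharp
// Hypothetical in-memory stand-in for a paged grid; not the WebBrowser API.
type Page = { Links: string list; Next: Page option }

// Collect links page by page; the option decides whether there is more work.
let rec collectLinks (page: Page) : string list =
    match page.Next with
    | Some next -> page.Links @ collectLinks next
    | None -> page.Links

let lastPage = { Links = [ "c.pdf" ]; Next = None }
let firstPage = { Links = [ "a.pdf"; "b.pdf" ]; Next = Some lastPage }
let allLinks = collectLinks firstPage
```

There is no special case for the last page: it is simply the page whose Next is None.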

I put in a function to save the .pdf to my local file system

let downloadPdf (uri:string) =
    let client = new WebClient()
    let targetFilePath = @"C:\Temp\" + Guid.NewGuid().ToString() + ".pdf"
    client.DownloadFile(uri,targetFilePath)

And now I can write this all together:

let browser = new WebBrowser()
let uris = new List<string>()
browser.DocumentCompleted.Add(fun _ -> handlePage browser uris)
let uri = "https://www.catalog.state.ct.us/cid/portalApps/examinations.aspx"
browser.Navigate(uri)
printf "Links Done"
uris |> Seq.iter(fun uri -> downloadPdf uri)
printf "Downloads Done"

So I new up the browser, send handlePage to the DocumentCompleted event handler. Every time the Next button is pressed, the document loads and the DocumentCompleted event fires, and the .pdfs are downloaded and the next button is pressed. Until the last page, when there is no button to press.

And it worked like a champ:

post06

Gist is found here 

The Counted Part 3: Law Enforcement Officers Killed In Line Of Duty

As a follow up to this post, I decided to look on the other side of the gun: police officers killed in the line of duty.  Fortunately, the FBI collects this data here.  It looks like the FBI is a bit behind on their summary reports:

Capture

So taking the 2013 data as the closest data point to The Counted 2015 data, it took a couple of minutes to download the excel spreadsheet and format it as a useable .csv:

image  to image

After importing in the data in R studio, I did a quick summary on the data frame.  The most striking thing out of the gate is how few Officers are killed.  There were 27 in 2013, compared to over 500 people killed by police officers in the 1st half of 2015:

officers.killed <- read.csv("./Data/table_1_leos_fk_region_geographic_division_and_state_2013.csv")
sum(officers.killed$OfficersKilled)

I then added in the state population to do a similar ratio and map:

officers.killed.2 <- merge(x=officers.killed,
                           y=state.population.3,
                           by.x="StateName",
                           by.y="NAME")

officers.killed.2$AdjustedPopulation <- officers.killed.2$POPESTIMATE2014/10000
officers.killed.2$KilledRatio <- officers.killed.2$OfficersKilled/officers.killed.2$AdjustedPopulation
officers.killed.2$AdjKilledRatio <- officers.killed.2$KilledRatio * 10
officers.killed.2$StateName <- tolower(officers.killed.2$StateName)

choropleth.3 <- merge(x=all.states,
                      y=officers.killed.2,
                      sort = FALSE,
                      by.x = "region",
                      by.y = "StateName",
                      all.x=TRUE)
choropleth.3 <- choropleth.3[order(choropleth.3$order), ]
summary(choropleth.3)

qplot(long, lat, data = choropleth.3, group = group, fill = AdjKilledRatio,
      geom = "polygon")

image

So Louisiana and West Virginia seem to have the highest number of officers killed per capita.  I am not surprised, being that I had no expectations about states that would have higher and lower numbers.  It seems likely a case of “gee-whiz” data.

Since there are so few instances, I decided to forgo any more analysis on police killed and instead combined this data with the people who were killed by police:

the.counted.state.5 <- merge(x=the.counted.state.4,
                             y=officers.killed.2,
                             by.x="StateName",
                             by.y="StateName")

names(the.counted.state.5)[names(the.counted.state.5)=="AdjKilledRatio.x"] <- "NonPoliceKillRatio"
names(the.counted.state.5)[names(the.counted.state.5)=="AdjKilledRatio.y"] <- "PoliceKillRatio"

the.counted.state.6 <- data.frame(the.counted.state.5$NonPoliceKillRatio,
                                  the.counted.state.5$PoliceKillRatio,
                                  log(the.counted.state.5$NonPoliceKillRatio),
                                  log(the.counted.state.5$PoliceKillRatio))

colnames(the.counted.state.6) <- c("NonPoliceKilledRatio","PoliceKilledRatio","LoggedNonPoliceKilledRatio","LoggedPoliceKilledRatio")

plot(the.counted.state.6)

and certainly the log helps out and there seems to be a relationship between states that have police killed and people being killed by police (my hand-drawn red lines added):

Capture3

With that in mind, I created a couple of  linear models

non.police <- the.counted.state.6$LoggedNonPoliceKilledRatio
police <- the.counted.state.6$LoggedPoliceKilledRatio
police[police==-Inf] <- NA

model <- lm( non.police ~ police )
summary(model)

model.2 <- lm( police ~ non.police)
summary(model.2)

Since there are only 2 variables, the adjusted R square is the same for x~y and y~x.
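
The reason is that with a single predictor, R squared is just the squared Pearson correlation, and the correlation does not care which variable is called x. A small F# sketch of that arithmetic, where the numbers are made up and are not the data from this post:

```fsharp
// Pearson correlation; for a one-predictor linear model, R squared equals r squared.
let pearson (xs: float list) (ys: float list) : float =
    let meanX = List.average xs
    let meanY = List.average ys
    let covariance = List.map2 (fun x y -> (x - meanX) * (y - meanY)) xs ys |> List.sum
    let spreadX = xs |> List.sumBy (fun x -> (x - meanX) ** 2.0)
    let spreadY = ys |> List.sumBy (fun y -> (y - meanY) ** 2.0)
    covariance / sqrt (spreadX * spreadY)

// Made-up numbers for illustration only.
let xs = [ 1.0; 2.0; 3.0; 4.0; 5.0 ]
let ys = [ 2.1; 3.9; 6.2; 8.1; 9.8 ]
let rSquaredXY = pearson xs ys ** 2.0
let rSquaredYX = pearson ys xs ** 2.0
```

Swapping the arguments only swaps the factors inside each product, so the two values agree, and the adjustment for sample size and number of predictors is the same in both directions.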

image

image

The interesting thing is that the model has to account for the fact that many states had 0 police fatalities but had at least 1 person killed by the police.  The next interesting thing is the value of the coefficient: in states where there was at least 1 police fatality and 1 person killed by the police, every police fatality increases the number of people killed by police by .96, and this .96 is the log of the ratio of population.  So it shows that the police are better at killing than getting killed, which makes sense.

The full gist is found here.

Analytics in the Microsoft Stack

Disclaimer:  I really don’t know what I am talking about

I received an email from a coworker/friend yesterday with this in the body:

So, I have a friend who works for a major supermarket chain. In IT, they are straight out of the year 2000. They have tons and tons of data in SQL Server and I think Oracle. The industrial engineers (who do all of the planning) ask the IT group to run queries throughout the day, which takes hours to run. They use Excel for most of their processing. On the weekends, they run reporting queries which take hours and hours to run – all to get just basic information.

This got my wheels spinning about how I would approach the problem with the analytics toolset that I know is available.  The supermarket chain has a couple of problems

  • Lots of data that takes too long to munge through
  • The planners are dependent on IT group for processing the data

I would expect the official Microsoft answer is that they should implement Sql Server Analytics with Power BI.  I would assume if the group threw enough resources at this solution, it would work.  I then thought of a couple of alternative paths:

The first thing that comes to mind is using HDInsight (Microsoft’s Hadoop product)  on Azure.  That way the queries can run in a distributed manner and they can provision machines as they need them -> and when they are not running their queries, they can de-allocate the machines.

The second thought is using AzureML to do their model generation.  However, depending on the size of the datasets, AzureML may not be able to scale.  I have only used Azure ML on smaller datasets.

The third thought was using R?  I don’t think R is the best answer here.  Everything I know about R is that it is designed for data exploration and analysis of datasets that comfortably fit into the local machine’s memory.  Performance on R is horrible and scaling R is a real challenge. 

What about F#?  So this might be a good answer.  If you use the Hive Type Provider, you can get the benefits of HDInsight to do the processing and then have the goodness of the language syntax and REPL for data exploration.  Also, the group could look at MBrace for some kick-butt distributed processing that can scale on Azure. Finally, if they do come up with some kind of insight that lends itself to building analytics or models into an app, you can take the code out of the script file and stick it into a compilable assembly all within Visual Studio.

What about Python?  No idea, I don’t know enough about it.

What about Matlab, SAS, etc..  No idea.  I stopped using those tools when R showed up.

What about Watson?  No idea.  I think I will have a better idea once I go to this.

F# Record Types with Entity Framework Code-Last

So based on the experience with code-first, I decided to look at using EF code-last (OK, database first).   I considered three different possibilities

  1. Use AutoMapper
  2. Use Reflection
  3. Hand-Roll everything

AutoMapper

If you are not familiar, AutoMapper is a library that allows you to, well, map types. The first thing I did was to create a database schema like this:

use FamilyDomain

CREATE TABLE Family
(
    Id int NOT NULL IDENTITY(1,1) PRIMARY KEY,
    LastName varchar(255) NOT NULL
)

CREATE TABLE Parent
(
    Id int NOT NULL IDENTITY(1,1) PRIMARY KEY,
    FamilyId int NOT NULL,
    FirstName varchar(255) NOT NULL
)

CREATE TABLE Child
(
    Id int NOT NULL IDENTITY(1,1) PRIMARY KEY,
    FamilyId int NOT NULL,
    FirstName varchar(255) NOT NULL,
    Gender varchar(10) NOT NULL,
    Grade int NOT NULL
)

CREATE TABLE Pet
(
    Id int NOT NULL IDENTITY(1,1) PRIMARY KEY,
    ChildId int NOT NULL,
    GivenName varchar(255) NOT NULL
)

CREATE TABLE HomeAddress
(
    Id int NOT NULL IDENTITY(1,1) PRIMARY KEY,
    FamilyId int NOT NULL,
    StateCode varchar(2) NOT NULL,
    County varchar(255) NOT NULL,
    City varchar(255) NOT NULL
)

ALTER TABLE Parent
ADD CONSTRAINT fk_Parent_Family
FOREIGN KEY (FamilyId)
REFERENCES Family(Id)

ALTER TABLE HomeAddress
ADD CONSTRAINT fk_HomeAddress_Family
FOREIGN KEY (FamilyId)
REFERENCES Family(Id)

ALTER TABLE Child
ADD CONSTRAINT fk_Child_Family
FOREIGN KEY (FamilyId)
REFERENCES Family(Id)

ALTER TABLE Pet
ADD CONSTRAINT fk_Pet_Child
FOREIGN KEY (ChildId)
REFERENCES Child(Id)

INSERT Family VALUES
('Andersen')

INSERT Parent VALUES
(1,'Thomas'),
(1,'Mary Kay')

INSERT Child VALUES
(1,'Henriette Thaulow','Female',5)

INSERT Pet VALUES
(1,'Fluffy')

INSERT HomeAddress VALUES
(1,'WA','King','Seattle')

I then installed AutoMapper and the Entity Framework type provider into an FSharp project.

#r @"../packages/AutoMapper.3.3.0/lib/net40/AutoMapper.dll"
#r "FSharp.Data.TypeProviders.dll"
#r "System.Data.Entity.dll"

open Microsoft.FSharp.Data.TypeProviders
open System.Data.Entity
open AutoMapper

//Entity Framework Types via Type Provider
let connectionString = @"Server=.;Initial Catalog=FamilyDomain;Integrated Security=SSPI;MultipleActiveResultSets=true"
type EntityConnection = SqlEntityConnection<ConnectionString="Server=.;Initial Catalog=FamilyDomain;Integrated Security=SSPI;MultipleActiveResultSets=true",Pluralize=true>

I then created some local FSharp record types that reflect the domain:

type Pet = {Id:int; GivenName:string}
type Child = {Id:int; FirstName:string; Gender:string; Grade:int; Pets: Pet list}
type Address = {Id:int; State:string; County:string; City:string}
type Parent = {Id:int; FirstName:string}
type Family = {Id:int; Parents:Parent list; Children: Child list; Address:Address}

So then I was ready to start mapping.  I started with a basic GET to a single type:

//AutoMapper setup
Mapper.CreateMap<EntityConnection.ServiceTypes.HomeAddress, Address>()

//Get one from the database
let context = EntityConnection.GetDataContext()
let addressQuery = query {for address in context.HomeAddresses do select address}
let address = Seq.head addressQuery

//map database to record type
let address' = Mapper.Map<Address>(address)

And I got a fail:

Source value:

SqlEntityConnection1.HomeAddress —> System.ArgumentException: Type needs to have a constructor with 0 args or only optional args

Parameter name: type

So I added [<CLIMutable>] to the record types like so

[<CLIMutable>]
type Pet = {Id:int; GivenName:string}
[<CLIMutable>]
type Child = {Id:int; FirstName:string; Gender:string; Grade:int; Pets: Pet list}
[<CLIMutable>]
type Address = {Id:int; State:string; County:string; City:string}
[<CLIMutable>]
type Parent = {Id:int; FirstName:string}
[<CLIMutable>]
type Family = {Id:int; Parents:Parent list; Children: Child list; Address:Address}

And I get the expected results

image

With one thing kinda interesting.  The State is null because it is defined as “StateCode” on the server and “State” in the domain.  AutoMapper is customizable to allow field name differences so that was a small issue.  Feeling confident, I went ahead and created maps to all of the domain types and pulled down a complex type from the database:

//AutoMapper setup
Mapper.CreateMap<EntityConnection.ServiceTypes.Pet, Pet>()
Mapper.CreateMap<EntityConnection.ServiceTypes.Child, Child>()
Mapper.CreateMap<EntityConnection.ServiceTypes.HomeAddress, Address>()
Mapper.CreateMap<EntityConnection.ServiceTypes.Parent, Parent>()
Mapper.CreateMap<EntityConnection.ServiceTypes.Family, Family>()

//Get Family from the database
let context = EntityConnection.GetDataContext()
let familyQuery = query {for family in context.Families do select family}
let family = Seq.head familyQuery

//map database to record type
let family' = Mapper.Map<Family>(family)

When I attempted to map it, I got a pretty ugly exception

Source value:

System.Data.Objects.DataClasses.EntityCollection`1[SqlEntityConnection1.Parent]

   at AutoMapper.MappingEngine.AutoMapper.IMappingEngineRunner.Map(ResolutionContext context)

So the problem is that AutoMapper is not picking up on the foreign keys, which means I have to write the associations by hand.  Ugh!  I then tried to auto map to F# discriminated unions like this:

type Gender = Male | Female

No dice.

Reflection

I quickly spun up another project that uses System.Reflection to map the types.

    #r "System.Data.Entity.dll"
    #r "FSharp.Data.TypeProviders.dll"

    open System.Reflection
    open System.Data.Entity
    open Microsoft.FSharp.Data.TypeProviders

    let connectionString = "Server=.;Database=FamilyDomain;Trusted_Connection=True;"

    type entityConnection = SqlEntityConnection<ConnectionString = "Server=.;Database=FamilyDomain;Trusted_Connection=True;">

    let context = entityConnection.GetDataContext()

    //Local Idiomatic Types
    [<CLIMutable>]
    type Pet = {Id:int; ChildId:int; GivenName:string}
    [<CLIMutable>]
    type Child = {Id:int; FirstName:string; Gender:string; Grade:int; Pets: Pet list}
    [<CLIMutable>]
    type Address = {Id:int; State:string; County:string; City:string}
    [<CLIMutable>]
    type Parent = {Id:int; FirstName:string}
    [<CLIMutable>]
    type Family = {Id:int; LastName:string; Parents:Parent list; Children: Child list; Address:Address}

    //Reflection: copy each public source property onto the same-named target property
    let AssignMatchingPropertyValues sourceObject targetObject =
        let sourceType = sourceObject.GetType()
        let targetType = targetObject.GetType()
        let sourcePropertyInfos = sourceType.GetProperties(BindingFlags.Public ||| BindingFlags.Instance)
        sourcePropertyInfos
        |> Seq.map(fun spi -> spi, targetObject.GetType().GetProperty(spi.Name))
        |> Seq.iter(fun (spi,tpi) -> tpi.SetValue(targetObject, spi.GetValue(sourceObject,null),null))
        targetObject


    let newEfPet = entityConnection.ServiceTypes.Pet()
    let newPet = {Id=0;ChildId=1;GivenName="Duke"}

    AssignMatchingPropertyValues newPet newEfPet

    context.DataContext.AddObject("Pet",newEfPet)
    context.DataContext.SaveChanges()

Sure enough, reflection does what it is supposed to do:

image

The problem quickly becomes that by using reflection, I have to hand roll all of the relations.  I might as well use AutoMapper (though apparently reflection is much faster than AutoMapper, even on a per-call basis).

Another problem with using reflection is that the field names in the database need to match the domain naming exactly.  Finally, like AutoMapper, there is no out-of-the-box way to map choice types.
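The exact-name requirement is easy to see in a small sketch. The version below is a variation on the reflection copy, not the script above: it skips any source property that has no writable, type-compatible counterpart on the target and reports the skipped names instead of throwing a null reference. PetRecord, PetEntity, and copyMatchingProperties are hypothetical stand-ins so the example runs without a database.

```fsharp
open System.Reflection

// Hypothetical stand-ins for a domain record and a generated entity class.
type PetRecord = { Id: int; ChildId: int; GivenName: string; Nickname: string }

type PetEntity() =
    member val Id = 0 with get, set
    member val ChildId = 0 with get, set
    member val GivenName = "" with get, set

// Copies every readable source property onto the target property with the
// same name and a compatible type; returns the names that were skipped.
let copyMatchingProperties (source: obj) (target: obj) =
    let flags = BindingFlags.Public ||| BindingFlags.Instance
    let targetType = target.GetType()
    source.GetType().GetProperties(flags)
    |> Array.filter (fun spi -> spi.CanRead)
    |> Array.choose (fun spi ->
        match targetType.GetProperty(spi.Name, flags) with
        | null -> Some spi.Name
        | tpi when tpi.CanWrite && tpi.PropertyType.IsAssignableFrom(spi.PropertyType) ->
            tpi.SetValue(target, spi.GetValue(source))
            None
        | _ -> Some spi.Name)
    |> Array.toList

let entity = PetEntity()
let record = { Id = 0; ChildId = 1; GivenName = "Duke"; Nickname = "D" }
let skipped = copyMatchingProperties (box record) (box entity)
```

Here Nickname has no match on the entity, so it ends up in the skipped list. A renamed column would land there too, which is the limitation: reflection by name can detect a mismatch but cannot resolve it.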

Hand Roll

On my last stop of the Entity Framework code-last hit parade, I looked at what it would take to roll my own mappings.  This has the greatest amount of yak shaving because I would have to spin up mappings both from the database types to the domain and from the domain back to the database types.  The nice thing is that with that kind of detail, naming mismatches can be handled and the nested hierarchy and choice types are accounted for.  I first started with a basic script that handled the getting and setting as well as nested types:

    #r "System.Data.Entity.dll"
    #r "FSharp.Data.TypeProviders.dll"

    open System.Linq
    open System.Data.Entity
    open Microsoft.FSharp.Data.TypeProviders

    let connectionString = "Server=.;Database=FamilyDomain;Trusted_Connection=True;"
    type entity = SqlEntityConnection<ConnectionString = "Server=.;Database=FamilyDomain;Trusted_Connection=True;">
    let context = entity.GetDataContext()

    type Pet = {Id:int; ChildId: int; GivenName:string}
    type Child = {Id:int; FirstName:string; Gender:string; Grade:int; Pets: Pet list}
    type Address = {Id:int; State:string; County:string; City:string}
    type Parent = {Id:int; FirstName:string}
    type Family = {Id:int; LastName:string; Parents:Parent list; Children: Child list; Address:Address}

    let MapPet(efPet: entity.ServiceTypes.Pet) =
        {Id=efPet.Id; ChildId=efPet.ChildId; GivenName=efPet.GivenName}

    let MapChild(efChild: entity.ServiceTypes.Child) =
        let pets =
            efChild.Pet
            |> Seq.map(fun p -> MapPet(p))
            |> Seq.toList
        {Id=efChild.Id; FirstName=efChild.FirstName;
         Gender=efChild.Gender;Grade=efChild.Grade;Pets=pets}

    let GetPet(id: int)=
        let efPet = context.Pet.FirstOrDefault(fun p -> p.Id = id)
        MapPet(efPet)

    let GetChild(id: int)=
        let efChild = context.Child.FirstOrDefault(fun c -> c.Id = id)
        MapChild(efChild)

    let myPet = GetPet(1)

    let myChild = GetChild(1)

Of all of the implementations, the hand-rolled one actually made the most sense to me.  It was clean and, most importantly, it worked.

image

I then swapped in a choice type for gender (it was a string):

    type Gender = Male | Female
    type Pet = {Id:int; ChildId: int; GivenName:string}
    type Child = {Id:int; FirstName:string; Gender:Gender; Grade:int; Pets: Pet list}
    type Address = {Id:int; State:string; County:string; City:string}
    type Parent = {Id:int; FirstName:string}
    type Family = {Id:int; LastName:string; Parents:Parent list; Children: Child list; Address:Address}

I then added the choice type mapping and updated the child mapping to use it:

    let MapGender(efGender) =
        match efGender with
        | "Male" -> Male
        | _ -> Female

    let MapChild(efChild: entity.ServiceTypes.Child) =
        let pets =
            efChild.Pet
            |> Seq.map(fun p -> MapPet(p))
            |> Seq.toList
        {Id=efChild.Id; FirstName=efChild.FirstName;
         Gender=MapGender(efChild.Gender);
         Grade=efChild.Grade;Pets=pets}

Sure enough, it worked like a champ:

image
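One thing to keep in mind with a wildcard match like the one in MapGender is that every string other than "Male", including an empty or misspelled value, comes back as Female. A stricter variant might look like the sketch below; tryMapGender and genderToString are hypothetical names, and returning an option is just one way to surface bad data.

```fsharp
type Gender = Male | Female

// Returns None for values the domain does not recognize instead of
// silently treating them as Female.
let tryMapGender (efGender: string) =
    match efGender with
    | "Male" -> Some Male
    | "Female" -> Some Female
    | _ -> None

// Reverse direction, for writing the domain value back to a string column.
let genderToString gender =
    match gender with
    | Male -> "Male"
    | Female -> "Female"
```

The caller then decides what an unrecognized value means (skip the row, raise an error, log it) rather than having the mapper decide quietly.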

And finally, I tested the add on both the happy path and an expected exception.

    let SavePet(pet: Pet)=
        let efPet = entity.ServiceTypes.Pet()
        efPet.ChildId <- pet.ChildId
        efPet.GivenName <- pet.GivenName
        context.DataContext.AddObject("Pet",efPet)
        context.DataContext.SaveChanges()

    let newPet = {Id=0;ChildId=1;GivenName="Lucky Sue"}
    SavePet(newPet)

    let failurePet = {Id=0;ChildId=0;GivenName="Should Fail"}
    SavePet(failurePet)

Both worked as expected.  Here is the exception case where there is not a child to be associated to a pet:

System.Data.UpdateException: An error occurred while updating the entries. See the inner exception for details. —> System.Data.SqlClient.SqlException: The INSERT statement conflicted with the FOREIGN KEY constraint "fk_Pet_Child". The conflict occurred in database "FamilyDomain", table "dbo.Child", column ‘Id’.

The statement has been terminated.

   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)

So of all three ways, hand-rolling worked the best for me.
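The naming mismatch that tripped up the other two approaches (StateCode on the server, State in the domain) is a one-line affair when the mapping is written by hand. Here is a minimal sketch, using a hypothetical HomeAddressEntity class in place of the generated entity type so it runs without a database; the sample values are illustrative only.

```fsharp
// Hypothetical stand-in for the generated HomeAddress entity, which
// stores the state in a property named StateCode.
type HomeAddressEntity() =
    member val Id = 0 with get, set
    member val StateCode = "" with get, set
    member val County = "" with get, set
    member val City = "" with get, set

type Address = { Id: int; State: string; County: string; City: string }

// Entity -> domain: the rename happens in exactly one place.
let mapAddress (ef: HomeAddressEntity) =
    { Id = ef.Id; State = ef.StateCode; County = ef.County; City = ef.City }

// Domain -> entity: the same rename in the other direction.
let toEntity (address: Address) =
    let ef = HomeAddressEntity()
    ef.Id <- address.Id
    ef.StateCode <- address.State
    ef.County <- address.County
    ef.City <- address.City
    ef

let original = { Id = 1; State = "NC"; County = "Wake"; City = "Cary" }
let roundTripped = original |> toEntity |> mapAddress
```

Because records compare structurally, checking that roundTripped equals original is a cheap way to confirm the two directions stay in sync.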

F# Record Types With Entity Framework Code-First

I was spinning up a data layer in a new F# project and I thought I would take EF Code-First out for a test drive.  I have used EF Code-First in a couple of C# projects, so I am familiar with the premise (and the promise) of code-first.  The F# project uses record types, nested record types, and choice types exclusively, so I thought of tackling each of these types with code-first in turn.  The first article that I ran across was this one, which seemed like a good start.  I went ahead and created a family record type like so, matching the example verbatim except that I swapped out the class implementation for a record type:

 

    #r "../packages/EntityFramework.6.1.2/lib/net45/EntityFramework.dll"

    open System.Collections.Generic
    open System.ComponentModel.DataAnnotations
    open System.Data.Entity

    type Family = {Id:int; LastName:string; IsRegistered:bool}

    type CLFamily() =
        inherit DbContext()
        [<DefaultValue>]
        val mutable m_families: DbSet<Family>
        member public this.Families
            with get() = this.m_families
            and set v = this.m_families <- v

    let db = new CLFamily()
    let family = {Id=0;LastName="New Family"; IsRegistered=true}
    db.Families.Add(family) |> ignore
    db.SaveChanges() |> ignore

But I ran into this:

 image

image

So I added the Key attribute to the Record type

image

So I hit up Stack Overflow with this question and, sure enough, I had forgotten to add a reference to that assembly.  Once I added it, the script compiled.  I then ran the script and I got the following error message:

    <add name="CLFamily"
         connectionString="Server=.;Database=FamilyDomain;Trusted_Connection=True;"
         providerName="System.Data.SqlClient"/>

 

image

Ugh!  It was still hitting the default connection string.  I went ahead and adjusted my script to account for the connection string and I swapped out the backing values with CLIMutable:

    #r "../packages/EntityFramework.6.1.2/lib/net45/EntityFramework.dll"
    #r "C:/Program Files (x86)/Reference Assemblies/Microsoft/Framework/.NETFramework/v4.5.1/System.ComponentModel.DataAnnotations.dll"

    open System.Collections.Generic
    open System.ComponentModel.DataAnnotations
    open System.Data.Entity

    [<CLIMutable>]
    type Family = {[<Key>]Id:int; LastName:string; IsRegistered:bool;}


    type FamilyContext() =
        inherit DbContext()
        [<DefaultValue>] val mutable families: DbSet<Family>
        member this.Families with get() = this.families and set f = this.families <- f

    let context = new FamilyContext()
    let connectionString = "Server=.;Database=FamilyDomain;Trusted_Connection=True;"
    context.Database.Connection.ConnectionString <- connectionString
    let family = {Id=0; LastName="Test"; IsRegistered=true}
    context.Families.Add(family) |> ignore
    context.SaveChanges() |> ignore

And sure enough, the table is created in the database and the record is persisted:

image image

And the cool thing is that even though this is a record type, the Id does adjust to the identity value given by the database.

With that out of the way, I went to tackle nested types.  I added a Child record type and a list of children to the Family record type:

    [<CLIMutable>]
    type Child = {[<Key>]Id:int; FamilyId: int; FirstName:string; Gender:string; Grade:int}

    [<CLIMutable>]
    type Family = {[<Key>]Id:int; LastName:string; IsRegistered:bool; Children:Child list}

    type FamilyContext() =
        inherit DbContext()
        [<DefaultValue>] val mutable families: DbSet<Family>
        member this.Families with get() = this.families and set f = this.families <- f
        [<DefaultValue>] val mutable children: DbSet<Child>
        member this.Children with get() = this.children and set c = this.children <- c

    let context = new FamilyContext()
    let connectionString = "Server=.;Database=FamilyDomain;Trusted_Connection=True;"
    context.Database.Connection.ConnectionString <- connectionString
    let children = [{Id=0; FamilyId=0; FirstName="Test"; Gender="Male"; Grade=5}]
    let family = {Id=0; LastName="Test"; IsRegistered=true; Children=children }
    context.Families.Add(family) |> ignore
    context.SaveChanges() |> ignore

Everything compiled and ran, but the Children table was not added to the database, even though the new family record was added.

image image image

Going back to Stack Overflow, it looks like EF Code-First will not auto-update the schema unless you add some more glue code.  Ugh.  At that point, I might as well give up on code-first if all it brings is not having to write SQL scripts…