Screen Scraping The Department Of Health
January 21, 2014
As part of TRINUG’s Analytics SIG, some members were interested in the health inspection data found here. I created a Public Records Request (PRR) using the intake form on their website:
And in the comments, I said this:
I would like to make a Public Records Request on this data. I would like the following fields for all inspections 1/1/2013 to 12/31/2013:
InspectionId
InspectorId
EstablishmentId
EstablishmentName
EstablishmentAddress
EstablishmentCity
EstablishmentZip
EstablishmentLat
EstablishmentLong
EstablishmentTypeId
EstablishmentTypeDesc
InspectionScore
NumberOfNonCriticalViolations
NumberOfCriticalViolations
InspectionDate
After a week, I had not heard back (their response is supposed to arrive within 48 hours), so I emailed the director:
Not wanting to wait any longer for the data, I decided to do some screen scraping. To that end, I ran an on-line report and got something like this:
with the pages of the report down here:
I then went into source and checked out the pagination. Fortunately, it was uri-based so there are 144 different uris like this one:
I pulled down all of the uris and put them into Excel. I then trimmed off the <a and the text after the word “class”:
Fortunately, there was no restaurant with the word “class” in its name. I now have 144 uris ready to go. I then saved the uris into a .csv.
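That same trimming could also be done in code rather than Excel. Here is a minimal sketch; the example anchor line is hypothetical, since the report’s actual markup may differ:

```csharp
using System;

class UriTrimmer
{
    // Trims the leading "<a " and everything from the word "class" onward,
    // leaving just the href portion of the anchor tag.
    public static string TrimAnchor(string line)
    {
        const string prefix = "<a ";
        if (line.StartsWith(prefix))
            line = line.Substring(prefix.Length);

        // Cut at the first occurrence of "class" -- safe here only because
        // no establishment name contained the word "class".
        var classIndex = line.IndexOf("class", StringComparison.Ordinal);
        if (classIndex >= 0)
            line = line.Substring(0, classIndex).Trim();

        return line;
    }

    static void Main()
    {
        // Hypothetical example line; the real report markup may differ.
        var raw = "<a href=\"inspectionreport.cfm?pageNumber=2\" class=\"pagelink\">";
        Console.WriteLine(TrimAnchor(raw));
        // prints: href="inspectionreport.cfm?pageNumber=2"
    }
}
```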
My next step is to suck these uris into a database. I have learned the hard way that screen scraping is fraught with malformed data and unstable connections. Therefore, I will make a request, pull down a page, and store it while the getting is good. I will then parse each page separately. Since the data is historical, I am less concerned about it changing after I pull it local.
When I spun up an instance of Sql Server and tried to import the data, I kept getting errors like this:
So it is pretty obvious that Sql Server doesn’t make it easy to import text like uris (and I can imagine HTML would be worse). I decided to spin up an instance of MongoDb instead. Also, because the F# MongoDb driver is not in NuGet, I decided to go with C#.
I then fired up a C# project and read all of the uris into a List:
- static List<String> GetUrisFromFileSystem()
- {
- var path = @"C:\HealthDepartment\Inspections.csv";
- // ReadAllLines avoids leaving stray '\r' characters
- // the way splitting ReadAllText on '\n' does
- var contents = File.ReadAllLines(path);
- var contentList = contents.ToList();
- // Drop the header row
- contentList.RemoveAt(0);
- return contentList;
- }
I then wrote the list into MongoDb.
- static void LoadDataIntoMongo(List<String> uris)
- {
- var connectionString = "mongodb://localhost";
- var client = new MongoClient(connectionString);
- var server = client.GetServer();
- var database = server.GetDatabase("HealthDepartment");
- // The collection holds UriEntity documents, not raw strings
- var collection = database.GetCollection<UriEntity>("UriEntities");
- foreach (String uri in uris)
- {
- var entity = new UriEntity { TargetUri = uri };
- collection.Insert(entity);
- // Mongo assigns the Id during the insert
- Console.WriteLine(entity.Id);
- }
- }
The one gotcha is that I made my UriEntity class have a string as the Id. This is not idiomatic to Mongo and I got this:
It needs to be a type of ObjectId. Once I made that switch, I got this:
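Putting that together, the entity class ends up looking something like this (a minimal sketch; `TargetUri` is the property name the later code assumes):

```csharp
using System;
using MongoDB.Bson;

public class UriEntity
{
    // Mongo expects an ObjectId-typed Id, not a String
    public ObjectId Id { get; set; }
    public String TargetUri { get; set; }
}
```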
The fact that MongoDb makes things so much easier than SqlServer is really impressive.
With the uris in Mongo, I then wanted to make a request to the individual pages. I created a method that gets the contents of a page:
- static String GetHtmlForAUri(String uri)
- {
- var fullyQualifiedUri = "http://wake.digitalhealthdepartment.com/" + uri;
- var request = WebRequest.Create(fullyQualifiedUri);
- // Dispose the response as well as the stream and reader
- using (var response = request.GetResponse())
- using (var stream = response.GetResponseStream())
- using (var reader = new StreamReader(stream))
- {
- return reader.ReadToEnd();
- }
- }
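Since unstable connections were one of my worries, it might be worth wrapping that call in a simple retry. A hedged sketch; the attempt count and back-off delay here are arbitrary choices, not anything from the health department’s site:

```csharp
using System;
using System.Threading;

class RetryHelper
{
    // Retries an operation a few times before giving up -- useful when
    // scraping over a flaky connection.
    public static T WithRetry<T>(Func<T> operation, int maxAttempts = 3)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Back off briefly before the next attempt
                Thread.Sleep(1000 * attempt);
            }
        }
    }
}
```

Usage would then be something like `var html = RetryHelper.WithRetry(() => GetHtmlForAUri(uri));`.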
I then created a class to hold the contents associated with each uri:
- public class PageContentEntity
- {
- public ObjectId Id { get; set; }
- public ObjectId UriId { get; set; }
- public String PageContent { get; set; }
- }
And created a method to persist the contents:
- static void LoadPageContentIntoMongo(PageContentEntity entity)
- {
- var connectionString = "mongodb://localhost";
- var client = new MongoClient(connectionString);
- var server = client.GetServer();
- var database = server.GetDatabase("HealthDepartment");
- var collection = database.GetCollection<PageContentEntity>("PageContentEntities");
- collection.Insert(entity);
- Console.WriteLine(entity.Id);
- }
And then a method to put everything together:
- static void LoadAllPageContentIntoMongo()
- {
- var connectionString = "mongodb://localhost";
- var client = new MongoClient(connectionString);
- var server = client.GetServer();
- var database = server.GetDatabase("HealthDepartment");
- var collection = database.GetCollection<UriEntity>("UriEntities");
- foreach(var uriEntity in collection.FindAllAs<UriEntity>())
- {
- String pageContent = GetHtmlForAUri(uriEntity.TargetUri);
- var pageEntity = new PageContentEntity()
- {
- UriId = uriEntity.Id,
- PageContent = pageContent
- };
- LoadPageContentIntoMongo(pageEntity);
- }
- }
So sure enough, I now have all of the pages local. Doing something with them – that is the next trick…
Note that as soon as I finished up this piece, I got a note from the director of the department saying that they are looking at my request and will get back to me soon.