Screen Scraping The Department Of Health

As part of TRINUG’s Analytics SIG, some of the members were interested in the health inspection data found here.  I created a Public Records Request (PRR) using their website’s intake form:

[screenshot: the on-line PRR form]

And in the comments, I said this:

I would like to make a Public Records Request on this data.  I would like the following fields for all inspections from 1/1/2013 to 12/31/2013:

InspectionId
InspectorId
EstablishmentId
EstablishmentName
EstablishmentAddress
EstablishmentCity
EstablishmentZip
EstablishmentLat
EstablishmentLong
EstablishmentTypeId
EstablishmentTypeDesc
InspectionScore
NumberOfNonCriticalViolations
NumberOfCriticalViolations
InspectionDate

 

After a week, I had not heard back (their response is supposed to come within 48 hours), so I emailed the director:

[screenshot: the email to the director]

Not wanting to wait any longer for the data, I decided to do some screen scraping.  To that end, I ran an on-line report and got something like this:

[screenshot: the on-line report]

with the pages of the report down here:

[screenshot: the report’s pagination links]

I then went into the page source and checked out the pagination.  Fortunately, it was URI-based, so there are 144 different URIs like this one:

[screenshot: one of the pagination URIs]

I pulled down all of the URIs and put them into Excel.  I then trimmed off the leading <a and the text after the word “class”:

[screenshot: trimming the URIs in Excel]

Fortunately, there was no restaurant with the word “class” in its name.  I now have 144 URIs ready to go.  I then saved the URIs into a .csv.
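As an aside, the same cleanup can be scripted instead of done by hand in Excel.  Here is a minimal C# sketch, assuming each raw line looks roughly like <a href="…" class="…"> (the exact markup and the TrimAnchorMarkup helper are my own illustration, not from the report):

static String TrimAnchorMarkup(String rawLine)
{
    // Drop everything up to and including the leading "<a "
    var start = rawLine.IndexOf("<a ") + "<a ".Length;
    var trimmed = rawLine.Substring(start);
    // Drop everything from the word "class" onward
    // (safe here, since no restaurant has "class" in its name)
    var end = trimmed.IndexOf("class");
    return trimmed.Substring(0, end).Trim();
}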

My next step is to suck these URIs into a database.  I have learned the hard way that screen scraping is fraught with malformed data and unstable connections.  Therefore, I will make a request, pull down a page, and store it while the getting is good.  I will then parse each page separately.  Since the data is all in the past, I am less concerned about it changing after I pull it local.

When I spun up an instance of SQL Server and tried to import the data, I kept getting things like this:

[screenshot: the SQL Server import error]

 

So it is pretty obvious that SQL Server doesn’t make it easy to import text like URIs (and I can imagine HTML is worse).  I decided to spin up an instance of MongoDB instead.  Also, because the F# MongoDB driver is not in NuGet, I decided to go with C#.

[screenshot]

I then fired up a C# console application and read all of the URIs into a List:

static List<String> GetUrisFromFileSystem()
{
    // Requires: using System.IO; using System.Linq;
    var path = @"C:\HealthDepartment\Inspections.csv";
    var contents = File.ReadAllText(path);
    // Split on line breaks (the extra '\r' handles a Windows-saved .csv)
    var splitContents = contents.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    var contentList = splitContents.ToList();
    // Drop the header row
    contentList.RemoveAt(0);
    return contentList;
}

I then wrote the list into MongoDB.

static void LoadDataIntoMongo(List<String> uris)
{
    // Requires: using MongoDB.Driver; (the legacy 1.x driver API)
    var connectionString = "mongodb://localhost";
    var client = new MongoClient(connectionString);
    var server = client.GetServer();
    var database = server.GetDatabase("HealthDepartment");
    // Typed to UriEntity; the original GetCollection<String> did not match
    // the UriEntity documents being inserted
    var collection = database.GetCollection<UriEntity>("UriEntities");
    foreach (String uri in uris)
    {
        // TargetUri matches the property read back later in the post
        var entity = new UriEntity { TargetUri = uri };
        collection.Insert(entity);
        // The driver populates Id during the insert
        Console.WriteLine(entity.Id);
    }
}
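Wiring the two methods together is just a matter of calling one after the other from the console app’s entry point.  A minimal sketch (the Main method below is my assumption; it is not shown in the original code):

static void Main(string[] args)
{
    // Read the 144 URIs off disk, then persist them to MongoDB
    var uris = GetUrisFromFileSystem();
    LoadDataIntoMongo(uris);
    Console.ReadLine();
}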

 

The one gotcha is that I made my UriEntity class have a String as the Id.  This is not idiomatic to Mongo, and I got this:

[screenshot: the MongoDB error]

It needs to be of type ObjectId.  Once I made that switch, I got this:

[screenshot: the successful inserts]
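For reference, the corrected entity class looks like this (a reconstruction on my part; the TargetUri property name is taken from the lookup code later in the post):

public class UriEntity
{
    // ObjectId, not String: this is what the driver populates on insert
    public ObjectId Id { get; set; }
    public String TargetUri { get; set; }
}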

The fact that MongoDB makes things so much easier than SQL Server is really impressive.

With the URIs in Mongo, I then wanted to make a request to each individual page.  I created a method that gets the contents of a page:

static String GetHtmlForAUri(String uri)
{
    // Requires: using System.Net; using System.IO;
    var fullyQualifiedUri = "http://wake.digitalhealthdepartment.com/" + uri;
    var request = WebRequest.Create(fullyQualifiedUri);
    using (var response = request.GetResponse())
    using (var stream = response.GetResponseStream())
    using (var reader = new StreamReader(stream))
    {
        // Return the raw HTML for the page
        return reader.ReadToEnd();
    }
}

I then created a class to hold the contents associated with each URI:

public class PageContentEntity
{
    public ObjectId Id { get; set; }
    public ObjectId UriId { get; set; }
    public String PageContent { get; set; }
}

And created a method to persist the contents:

static void LoadPageContentIntoMongo(PageContentEntity entity)
{
    var connectionString = "mongodb://localhost";
    var client = new MongoClient(connectionString);
    var server = client.GetServer();
    var database = server.GetDatabase("HealthDepartment");
    var collection = database.GetCollection<PageContentEntity>("PageContentEntities");
    // Insert the raw page; the driver populates Id
    collection.Insert(entity);
    Console.WriteLine(entity.Id);
}

And then a method to put everything together:

static void LoadAllPageContentIntoMongo()
{
    var connectionString = "mongodb://localhost";
    var client = new MongoClient(connectionString);
    var server = client.GetServer();
    var database = server.GetDatabase("HealthDepartment");
    var collection = database.GetCollection<UriEntity>("UriEntities");
    // Walk every stored URI, fetch its page, and persist the raw HTML
    foreach (var uriEntity in collection.FindAllAs<UriEntity>())
    {
        String pageContent = GetHtmlForAUri(uriEntity.TargetUri);
        var pageEntity = new PageContentEntity()
        {
            UriId = uriEntity.Id,
            PageContent = pageContent
        };
        LoadPageContentIntoMongo(pageEntity);
    }
}

So sure enough, I now have all of the pages local.  Doing something with them – that is the next trick…
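When that next trick comes around, pulling the raw HTML back out of Mongo will be the easy part.  A minimal sketch using the same legacy driver calls as above (the ReadPagesFromMongo helper is my own illustration):

static void ReadPagesFromMongo()
{
    var client = new MongoClient("mongodb://localhost");
    var database = client.GetServer().GetDatabase("HealthDepartment");
    var collection = database.GetCollection<PageContentEntity>("PageContentEntities");
    foreach (var page in collection.FindAllAs<PageContentEntity>())
    {
        // page.PageContent holds the raw HTML, ready to hand to a parser
        Console.WriteLine("{0}: {1} characters", page.UriId, page.PageContent.Length);
    }
}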

Note that as soon as I finished up this piece, I got a note from the director of the department saying that they are looking at my request and will get back to me soon.