Screen Scraping The Department Of Health
January 21, 2014
As part of TRINUG’s Analytics SIG, some members were interested in the health inspection data found here. I created a Public Records Request (PRR) using the intake form on their website:
And in the comments, I said this:
I would like to make a Public Records Request on this data. I would like the following fields for all inspections 1/1/2013 to 12/31/2013:
InspectionId
InspectorId
EstablishmentId
EstablishmentName
EstablishmentAddress
EstablishmentCity
EstablishmentZip
EstablishmentLat
EstablishmentLong
EstablishmentTypeId
EstablishmentTypeDesc
InspectionScore
NumberOfNonCriticalViolations
NumberOfCriticalViolations
InspectionDate
After a week, I had not heard back (their response is supposed to arrive within 48 hours), so I emailed the director:
Not wanting to wait any longer for the data, I decided to do some screen scraping. To that end, I ran an on-line report and got something like this:
with the pages of the report down here:
I then went into source and checked out the pagination. Fortunately, it was uri-based so there are 144 different uris like this one:
I pulled down all of the uris and put them into Excel. I then trimmed off the <a and the text after the word “class”:
Fortunately, there was no restaurant with the word “class” in its name. I now have 144 uris ready to go. I then saved the uris into a .csv.
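That same trimming could also be done in code rather than Excel. Here is a minimal sketch; the example anchor line is hypothetical, since the report’s actual markup may differ:

```csharp
using System;

class UriTrimmer
{
    // Trims the leading "<a " and everything from the word "class" onward,
    // leaving just the href portion of the anchor tag.
    public static string TrimAnchor(string line)
    {
        const string prefix = "<a ";
        if (line.StartsWith(prefix))
            line = line.Substring(prefix.Length);

        // Cut at the first occurrence of "class" -- safe here only because
        // no establishment name contained the word "class".
        var classIndex = line.IndexOf("class", StringComparison.Ordinal);
        if (classIndex >= 0)
            line = line.Substring(0, classIndex).Trim();

        return line;
    }

    static void Main()
    {
        // Hypothetical example line; the real report markup may differ.
        var raw = "<a href=\"inspectionreport.cfm?pageNumber=2\" class=\"pagelink\">";
        Console.WriteLine(TrimAnchor(raw));
        // prints: href="inspectionreport.cfm?pageNumber=2"
    }
}
```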
My next step is to suck these uris into a database. I have learned the hard way that screen scraping is fraught with malformed data and unstable connections. Therefore, I will make a request, pull down a page, and store it while the getting is good. I will then parse each page separately. Since the data is historical, I am less concerned about it changing after I pull it local.
When I spun up an instance of Sql Server and tried to import the data, I kept getting errors like this:
So it is pretty obvious that Sql Server doesn’t make it easy to import text like uris (and I can imagine HTML would be worse). I decided to spin up an instance of MongoDb instead. Also, because the F# MongoDb driver is not in NuGet, I decided to go with C#.
I then fired up a C# project and read all of the uris into a List:
- static List<String> GetUrisFromFileSystem()
- {
- var path = @"C:\HealthDepartment\Inspections.csv";
- // ReadAllLines avoids leaving stray '\r' characters
- // the way splitting ReadAllText on '\n' does
- var contents = File.ReadAllLines(path);
- var contentList = contents.ToList();
- // Drop the header row
- contentList.RemoveAt(0);
- return contentList;
- }
I then wrote the list into MongoDb.
- static void LoadDataIntoMongo(List<String> uris)
- {
- var connectionString = "mongodb://localhost";
- var client = new MongoClient(connectionString);
- var server = client.GetServer();
- var database = server.GetDatabase("HealthDepartment");
- // The collection holds UriEntity documents, not raw strings
- var collection = database.GetCollection<UriEntity>("UriEntities");
- foreach (String uri in uris)
- {
- var entity = new UriEntity { TargetUri = uri };
- collection.Insert(entity);
- // Mongo assigns the Id during the insert
- Console.WriteLine(entity.Id);
- }
- }
The one gotcha is that I made my UriEntity class have a string as the Id. This is not idiomatic to Mongo and I got this:
It needs to be a type of ObjectId. Once I made that switch, I got this:
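Putting that together, the entity class ends up looking something like this (a minimal sketch; `TargetUri` is the property name the later code assumes):

```csharp
using System;
using MongoDB.Bson;

public class UriEntity
{
    // Mongo expects an ObjectId-typed Id, not a String
    public ObjectId Id { get; set; }
    public String TargetUri { get; set; }
}
```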
The fact that MongoDb makes things so much easier than SqlServer is really impressive.
With the uris in Mongo, I then wanted to make a request to the individual pages. I created a method that gets the contents of a page:
- static String GetHtmlForAUri(String uri)
- {
- var fullyQualifiedUri = "http://wake.digitalhealthdepartment.com/" + uri;
- var request = WebRequest.Create(fullyQualifiedUri);
- // Dispose the response as well as the stream and reader
- using (var response = request.GetResponse())
- using (var stream = response.GetResponseStream())
- using (var reader = new StreamReader(stream))
- {
- return reader.ReadToEnd();
- }
- }
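Since unstable connections were one of my worries, it might be worth wrapping that call in a simple retry. A hedged sketch; the attempt count and back-off delay here are arbitrary choices, not anything from the health department’s site:

```csharp
using System;
using System.Threading;

class RetryHelper
{
    // Retries an operation a few times before giving up -- useful when
    // scraping over a flaky connection.
    public static T WithRetry<T>(Func<T> operation, int maxAttempts = 3)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Back off briefly before the next attempt
                Thread.Sleep(1000 * attempt);
            }
        }
    }
}
```

Usage would then be something like `var html = RetryHelper.WithRetry(() => GetHtmlForAUri(uri));`.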
I then created a class to hold the contents associated with each uri:
- public class PageContentEntity
- {
- public ObjectId Id { get; set; }
- public ObjectId UriId { get; set; }
- public String PageContent { get; set; }
- }
And created a method to persist the contents:
- static void LoadPageContentIntoMongo(PageContentEntity entity)
- {
- var connectionString = "mongodb://localhost";
- var client = new MongoClient(connectionString);
- var server = client.GetServer();
- var database = server.GetDatabase("HealthDepartment");
- var collection = database.GetCollection<PageContentEntity>("PageContentEntities");
- collection.Insert(entity);
- Console.WriteLine(entity.Id);
- }
And then a method to put everything together:
- static void LoadAllPageContentIntoMongo()
- {
- var connectionString = "mongodb://localhost";
- var client = new MongoClient(connectionString);
- var server = client.GetServer();
- var database = server.GetDatabase("HealthDepartment");
- var collection = database.GetCollection<UriEntity>("UriEntities");
- foreach(var uriEntity in collection.FindAllAs<UriEntity>())
- {
- String pageContent = GetHtmlForAUri(uriEntity.TargetUri);
- var pageEntity = new PageContentEntity()
- {
- UriId = uriEntity.Id,
- PageContent = pageContent
- };
- LoadPageContentIntoMongo(pageEntity);
- }
- }
So sure enough, I now have all of the pages local. Doing something with them – that is the next trick…
Note that as soon as I finished up this piece, I got a note from the director of the department saying that they are looking at my request and will get back to me soon.