Restaurant Classifier: Async For Faster Performance?

Going back to my restaurant classifier using F# from last week, I decided to speed things up.  Each request to the Yellow Pages API takes 1 second, so with the 5,682 records, I am looking at a little over 1.5 hours to pull down the data when running serially.

I first thought about making my methods async, so I changed the API call method to async and used the Http.AsyncRequest method like so (line 10 below):

  1. member this.GetCatagoriesAsync(restaurantName: string, restaurantAddress: string) =
  2.          async{
  3.              if(String.IsNullOrEmpty(restaurantName)) then
  4.                  failwith("restaurantName cannot be null or empty.")
  5.              if(String.IsNullOrEmpty(restaurantAddress)) then
  6.                  failwith("restaurantAddress cannot be null or empty.")
  7.              let cleanedName = this.CleanName(restaurantName)
  8.              let cleanedAddress = this.CleanAddress(restaurantAddress);
  9.              let uri = "http://pubapi.atti.com/search-api/search/devapi/search?term="+cleanedName+"&searchloc="+cleanedAddress+"&format=json&key=qj5l8pphj5"
  10.              let! response = FSharp.Net.Http.AsyncRequest(uri, headers=["user-agent", "None"])
  11.              let ypResult = ypProvider.Parse(response)
  12.              try
  13.                  return ypResult.SearchResult.SearchListings.SearchListing.[0].Categories
  14.              with
  15.                  | ex -> return String.Empty
  16.          }

I then made the covering function async as well (line 11 below):

  1. member this.IsRestaurantInCatagoryAsync(restaurantName: string, restaurantAddress: string, restaurantCatagory: string) =
  2.     async {
  3.         if(String.IsNullOrEmpty(restaurantName)) then
  4.             failwith("restaurantName cannot be null or empty.")
  5.         if(String.IsNullOrEmpty(restaurantAddress)) then
  6.             failwith("restaurantAddress cannot be null or empty.")
  7.         if(String.IsNullOrEmpty(restaurantCatagory)) then
  8.             failwith("restaurantCatagory cannot be null or empty.")
  9.  
  10.         do! Async.Sleep(1000)
  11.         let! catagories = this.GetCatagoriesAsync(restaurantName, restaurantAddress)
  12.         if(String.IsNullOrEmpty(catagories)) then return false
  13.         else return this.IsCatagoryInCatagories(catagories,restaurantCatagory)
  14.     }

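The problem is that invoking the covering function from an anonymous function did not work easily: Seq.filter expects a predicate that returns a bool, and the async version returns an Async<bool>.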

image

After screwing around with the syntax a bit, I went over to Stack Overflow, where I found out two things:

  • There is not an easy way to do it (I was hoping for a Seq.FilterAsync method)
  • Tomas Petricek is above my pay-grade. 
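
For reference, here is a minimal sketch of the kind of helper I was hoping to find. The name filterAsync is hypothetical (it is not part of FSharp.Core), and isChineseAsync is a made-up stand-in for IsRestaurantInCatagoryAsync so that the snippet runs without touching the Yellow Pages API. It leans on Async.Parallel, which does exist in FSharp.Core and returns its results in input order:

```fsharp
// Hypothetical helper: FSharp.Core has no Seq.filterAsync, so this builds one.
let filterAsync (predicate: 'T -> Async<bool>) (items: seq<'T>) : Async<'T list> =
    async {
        let source = Seq.toArray items
        // Start every predicate concurrently; results come back in input order.
        let! keep = source |> Array.map predicate |> Async.Parallel
        return
            Array.zip source keep
            |> Array.filter snd
            |> Array.map fst
            |> Array.toList
    }

// Stand-in for the real async API call (assumption: any name containing "Wang" counts).
let isChineseAsync (name: string, _address: string) =
    async {
        do! Async.Sleep 10
        return name.Contains("Wang")
    }

let sample = [ ("Wang's Kitchen", "1 Main St"); ("Joe's Diner", "2 Oak Ave") ]
let chinese = sample |> filterAsync isChineseAsync |> Async.RunSynchronously
printfn "%A" chinese
```

Note that this starts all of the requests at once, which is probably not what a rate-limited API wants for 5,682 records, so some form of batching or throttling would still be needed.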

In any event, I decided to drop the async and just look at parallelism.   It turns out that there is a parallel sequence module called PSeq; it is just not in the FSharp core library yet.   I created a PSeq file in my project, moved it to the top and dropped the code in.   I then changed the method call to use PSeq to invoke the serial methods:

  member public this.GetChineseRestaurants () =
      let catagoryRepository = new RestaurantCatagoryRepository()
      let catagory = "Chinese"
      this.GetRestaurants()
      // PSeq.filter runs the (blocking) category lookup on several threads at once
      |> PSeq.filter (fun (name, address) -> catagoryRepository.IsRestaurantInCatagory(name, address, catagory))
      |> Seq.toList
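
If adding the PSeq source file to the project is not an option, the Array.Parallel module that ships in FSharp.Core can give a similar effect. This is a swapped-in technique rather than what I actually ran: a minimal sketch, assuming a stand-in isChinese predicate (hypothetical, it replaces the real Yellow Pages lookup) and made-up sample data:

```fsharp
// Stand-in for the blocking IsRestaurantInCatagory call (assumption for illustration).
let isChinese (name: string, _address: string) =
    System.Threading.Thread.Sleep 10      // simulate the slow web request
    name.Contains("Wang")

let restaurants =
    [| ("Wang's Kitchen", "1 Main St")
       ("Joe's Diner", "2 Oak Ave")
       ("Wang Palace", "3 Elm St") |]

// Array.Parallel.choose evaluates the function on multiple threads
// and keeps only the elements mapped to Some.
let chineseRestaurants =
    restaurants
    |> Array.Parallel.choose (fun r -> if isChinese r then Some r else None)
    |> Array.toList

printfn "%A" chineseRestaurants
```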

When I first invoked it and looked at Fiddler (OT: did anyone notice that Fiddler’s new logo looks a lot like the F# one?  Probably just a coincidence), it was clear that things were running in parallel and that performance would improve.  I have two cores on this workstation, so my time should be cut roughly in half. 

image

With the parallel method in my back pocket, I decided to see the ultimate result of the restaurant classification.  I created a quick console app:

  using System;
  using System.Diagnostics;
  using System.Linq;

  class Program
  {
      static void Main(string[] args)
      {
          Console.WriteLine("Start");

          // Time the whole run, including the calls out to the Yellow Pages API
          Stopwatch stopwatch = new Stopwatch();
          stopwatch.Start();
          RestaurantBuilder builder = new RestaurantBuilder();
          var restaurants = builder.GetChineseRestaurants();

          // Each restaurant is a (name, address) tuple
          foreach (var restaurant in restaurants)
          {
              Console.WriteLine(restaurant.Item1 + ":" + restaurant.Item2);
          }

          stopwatch.Stop();
          Console.WriteLine("Number of Chinese Restaurants: " + restaurants.Count());
          Console.WriteLine(stopwatch.Elapsed.ToString());
          Console.WriteLine("End");
          Console.ReadKey();
      }
  }

I then ran the search on YP.com using my 4-core laptop and got the following results:

image

Compared to my original classifier based on name:

image

So the results make sense.  The YP serial search would take at least 94.7 minutes, the YP parallel search took 41 minutes, and the in-memory name search took 3 seconds.  The YP search found restaurants that the name search did not (Wang’s Kitchen, Crazy Fire Mongolian Grill, etc.): 275 to 221, or 24% more restaurants.
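
The arithmetic behind those numbers, as a quick sketch using only the figures quoted in this post (the one-second-per-request figure is the estimate from the top of the post, not a measurement):

```fsharp
let records = 5682
let secondsPerRequest = 1.0

// Serial estimate: one request per record, one second each
let serialMinutes = float records * secondsPerRequest / 60.0
let parallelMinutes = 41.0
let speedup = serialMinutes / parallelMinutes

// Restaurants found by each approach
let ypCount, nameCount = 275, 221
let percentMore = float (ypCount - nameCount) / float nameCount * 100.0

printfn "Serial estimate: %.1f minutes" serialMinutes     // 94.7
printfn "Speedup from parallel run: %.1fx" speedup        // 2.3x
printfn "Extra restaurants from YP: %.0f%%" percentMore   // 24%
```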

I think that the next step is to look at the classifier and see how many restaurants are in both datasets and, for the ones that are not in the YP dataset, why they are missing (did they even pay to be in the Yellow Pages?).  Perhaps there is another YP category that can be considered.  Also, it would be interesting to see which restaurants are in the name search and in the Yellow Pages but were not classified as Chinese by YP, which gives the false positive rate of the name search.  Finally, I did see some 500s in Fiddler that had “read time out”, so there is room for improvement to account for the transient faults…
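
One way to handle those transient faults would be a small retry wrapper with a growing delay. This is only a sketch under assumptions: retryAsync and flakyCall are hypothetical names, flakyCall fakes two time-outs instead of calling the real API, and the attempt count and delays are arbitrary:

```fsharp
// Hypothetical helper: retry an async operation, doubling the delay after each failure.
let rec retryAsync (attemptsLeft: int) (delayMs: int) (work: unit -> Async<'T>) : Async<'T> =
    async {
        try
            return! work ()
        with _ when attemptsLeft > 1 ->
            // Wait, then try again; once attempts run out the exception propagates.
            do! Async.Sleep delayMs
            return! retryAsync (attemptsLeft - 1) (delayMs * 2) work
    }

// Fake API call that times out twice before succeeding (assumption for illustration).
let calls = ref 0
let flakyCall () =
    async {
        calls.Value <- calls.Value + 1
        if calls.Value < 3 then failwith "read time out"
        return "Chinese Restaurants|Restaurants"
    }

let categories = retryAsync 5 10 flakyCall |> Async.RunSynchronously
printfn "%s after %d calls" categories calls.Value
```

In the real classifier, the work function would wrap the Yellow Pages request, and it would probably make sense to retry only on time-outs and 5xx responses rather than on every exception.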