Polymorphism In Action

The swim team that I help out with is purchasing a timing system. The timing system’s data needs to interface with the existing datamart, which is stored in a SQL Server database. The database serves a variety of functions: season registration, meet registration, team records, and more.

The system that the board chose is the industry leader, and it has its own proprietary data storage mechanism. Their system does not allow any kind of RPC from an RDBMS. The only way to get data into and out of the system is via text files. These text files are structured using USA Swimming’s Standard Data Interchange Format (found here, with a simplified version found here). I cracked open the RFC and was amazed: they are not using XML! Rather, it is a flat file format, with each row of data conforming to fixed field lengths. It is like going back to FORTRAN programming, without the sexiness of the green screen. Here is an example of a row of data:

D01 Whitaker, Taylor 122691TAY*WH 1226199118FF 1003 UNOV 1:25.00Y

And here is how the RFC tells you to parse the payload:

Start/Length  Mandatory  Type  Description
1/2 M1* CONST "D0"
3/1 M2* CODE ORG Code 001, table checked
4/8 future use
12/28 M1 NAME swimmer name
40/12 M2 ALPHA USS#
52/1 CODE ATTACH Code 016, table checked
53/3 CODE CITIZEN Code 009, table checked
56/8 M2 DATE swimmer birth date
64/2 ALPHA swimmer age or class (such as Jr or Sr)
66/1 M1 CODE SEX Code 010, table checked
67/1 M1# CODE EVENT SEX Code 011, table checked
68/4 M1# INT event distance
72/1 M1# CODE STROKE Code 012, table checked
73/4 ALPHA Event Number
77/4 M1# CODE EVENT AGE Code 025, table checked
81/8 M2 DATE date of swim
89/8 TIME seed time
97/1 * CODE COURSE Code 013, table checked
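
Before building an object model, it is worth seeing how little code the raw slicing takes. Here is a minimal sketch (my addition, not from the RFC) of pulling one field out of a fixed-width row using the Start/Length columns above; the 160-character record width comes up again below:

    public static class SdifFields
    {
        // SDIF positions are 1-based; Substring is 0-based.
        // Pad first so a short row doesn't throw.
        public static string GetField(string row, int start, int length)
        {
            return row.PadRight(160).Substring(start - 1, length).Trim();
        }
    }

For example, SdifFields.GetField(row, 12, 28) pulls the swimmer name out of a D0 row like the one above.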

Therefore, to interface with our SQL Server database, I need a way of translating the data that is in the database into the structured format (and back again).

There is a series of record types (A0, B1, etc.). I was thinking of making a class that is a concrete implementation of the RFC for each row of data. I then realized that I could use the power of an object-oriented language to make my solution less brittle and to try out some cool polymorphic techniques.

My first step was to create an interface that handles each chunk of data. Just because USA Swimming isn’t going to use XML doesn’t mean I can’t borrow XML’s vocabulary of elements and attributes; heck, it is all structured data.

I started with the attribute:

public interface IAttribute
{
    int AttributeId { get; set; }
    int Start { get; set; }
    int Length { get; set; }
    bool Mandatory { get; set; }
    AttributeType AttributeType { get; set; }
    string Description { get; set; }
    string AttributeValue { get; set; }
    string PaddedValue { get; }
}

The AttributeValue is the string representation of the property’s value. The PaddedValue is the same value with the padding and left/right justification applied.
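
The concrete Attribute class never appears in this post, so here is a minimal sketch of how PaddedValue might be derived (the justification rule per type is my assumption; the real rules live in the RFC):

    public enum AttributeType { CONST, CODE, ALPHA, NAME, DATE, INT, TIME }

    public class Attribute : IAttribute
    {
        public int AttributeId { get; set; }
        public int Start { get; set; }
        public int Length { get; set; }
        public bool Mandatory { get; set; }
        public AttributeType AttributeType { get; set; }
        public string Description { get; set; }
        public string AttributeValue { get; set; }

        // Sketch: truncate to the field width, then right-justify numbers
        // and left-justify everything else.
        public string PaddedValue
        {
            get
            {
                string value = AttributeValue ?? string.Empty;
                if (value.Length > Length)
                {
                    value = value.Substring(0, Length);
                }
                return AttributeType == AttributeType.INT
                    ? value.PadLeft(Length)
                    : value.PadRight(Length);
            }
        }
    }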

I then thought about how I wanted to implement the 15 or so different elements (called “records”, which have nothing to do with swim meet records; someone didn’t understand the importance of domain-specific language). Since the elements are fairly stable, I went with a Strategy Pattern.

Each element is a collection of Attributes. Therefore, I created one to see if I was barking up the right tree:

public class FileDescription : Collection<IAttribute>
{
    public string ElementId
    {
        get { return "A0"; }
    }

    public string ElementDescription
    {
        get { return "File Description Record"; }
    }

    public FileDescription()
    {
        this.Add(new Attribute { AttributeId = 1, Start = 1, Length = 2, Mandatory = true, AttributeType = AttributeType.CONST, Description = "A0" });
        this.Add(new Attribute { AttributeId = 2, Start = 3, Length = 1, Mandatory = false, AttributeType = AttributeType.CODE, Description = "ORG Code 001, table checked" });
        this.Add(new Attribute { AttributeId = 3, Start = 4, Length = 8, Mandatory = false, AttributeType = AttributeType.ALPHA, Description = "SDIF version number (same format as the version number from the title page)" });
        // etc…
    }
}

I then realized that there is no easy way to get the correct Attribute from the collection except via its index number. Holding my nose, I created a new unit test to see if I could create a fake for this class. The first test passed. I then wanted to add a ToString() override that spits out the FileDescription’s values, as if I were writing it out to the flat file. Here is what I came up with:

public override string ToString()
{
    StringBuilder stringBuilder = new StringBuilder();
    stringBuilder.Append(this[0].AttributeValue);
    stringBuilder.Append(this[1].AttributeValue);
    // …
    stringBuilder.Append(this[12].AttributeValue);
    return stringBuilder.ToString();
}

I then realized that I would have to duplicate this code in every class that derives from Collection<IAttribute>, so I created an abstract class to handle some of this work for me:

public abstract class Record : Collection<IAttribute>
{
    public const int MaxRecordLength = 160;

    public string RecordId
    {
        get { return this[0].AttributeValue; }
        set { this[0].AttributeValue = value; }
    }

    public string OrganizationCode
    {
        get { return this[1].AttributeValue; }
        set { this[1].AttributeValue = value; }
    }

    public override string ToString()
    {
        StringBuilder stringBuilder = new StringBuilder();
        foreach (IAttribute attribute in this)
        {
            stringBuilder.Append(attribute.AttributeValue);
        }
        return stringBuilder.ToString();
    }

    public string ToPaddedString()
    {
        StringBuilder stringBuilder = new StringBuilder();
        stringBuilder.Append(ToString());
        // Pad with spaces out to the fixed record length.
        int padLength = MaxRecordLength - stringBuilder.Length;
        if (padLength > 0)
        {
            stringBuilder.Append(' ', padLength);
        }
        return stringBuilder.ToString();
    }
}

I then changed the concrete classes to inherit from the Record class. I finished my Fake for the unit tests like this:

file[0].AttributeValue = "A0";
file[1].AttributeValue = "1";
file[2].AttributeValue = "V3";
file[3].AttributeValue = "20";
file[5].AttributeValue = "Hy-Tek, Ltd";
file[6].AttributeValue = "6.0X";
file[7].AttributeValue = "Hy-Tek, Ltd -USS";
file[8].AttributeValue = "252-633-5177";
file[9].AttributeValue = "02172011";

I then realized that I hate working with indexers, so I added properties that give a friendly name to each index:

public string SoftwareName
{
    get { return this[5].AttributeValue; }
    set { this[5].AttributeValue = value; }
}

I then rewrote my Fake and still got green on the unit tests.

file.RecordId = "A0";
file.OrganizationCode = "1";
file.SDIFVersionNumber = "V3";
file.FileCode = "20";
file.SoftwareName = "Hy-Tek, Ltd";
file.SoftwareVersion = "6.0X";
file.ContactName = "Hy-Tek, Ltd -USS";
file.ContactPhone = "252-633-5177";
file.FileCreation = "02172011";

Tests were still green, so I felt good.

I then wanted to tackle the actual values that will be assigned to each attribute. There are two kinds of values: the values that come from the database (three if you want to count the .NET types versus the SQL Server ones) and the values that come from and go into the output file. The output file is a structured string file, so perhaps I can just store the value as a string and translate it to and from the native types. I would need a translation factory that takes the native types and pushes them into the string correctly, which seems like the right thing to do.

Before creating the Translation class, I created the classes that represent the data in the SQL Server database. I chose to use EF.

I then added the Translation class that takes all of the data from the EF classes, translates it, and sticks it into the SDIF format. An example looks like this:

public List<IndividualEvent> CreateIndividualEventRecords(int meetId)
{
    List<IndividualEvent> individualEvents = new List<IndividualEvent>();
    IndividualEvent individualEvent = null;

    using (HurricaneEntities context = new HurricaneEntities())
    {
        var q = from mea in context.tblMeetEventAssignments
                    .Include("tblMeetEvent")
                    .Include("tblMeetEvent.tblRaceStroke")
                    .Include("tblMeetEvent.tblAgeGroup")
                    .Include("tblMeetSwimmerCheckIn")
                    .Include("tblMeetSwimmerCheckIn.tblSwimmerSeason")
                    .Include("tblMeetSwimmerCheckIn.tblSwimmerSeason.tblSwimmer")
                where mea.tblMeetSwimmerCheckIn.MeetID == meetId
                select new
                {
                    FirstName = mea.tblMeetSwimmerCheckIn.tblSwimmerSeason.tblSwimmer.FirstName,
                    LastName = mea.tblMeetSwimmerCheckIn.tblSwimmerSeason.tblSwimmer.LastName,
                    DateOfBirth = mea.tblMeetSwimmerCheckIn.tblSwimmerSeason.tblSwimmer.DateOfBirth,
                    GenderId = mea.tblMeetSwimmerCheckIn.tblSwimmerSeason.tblSwimmer.GenderID,
                    RaceStrokeId = mea.tblMeetEvent.RaceStrokeID,
                    AgeGroupId = mea.tblMeetEvent.AgeGroupID,
                    AgeDesc = mea.tblMeetEvent.tblAgeGroup.AgeDesc,
                    RaceLengthId = mea.tblMeetEvent.tblAgeGroup.RaceLengthID
                };

        foreach (var databaseRecord in q)
        {
            individualEvent = new IndividualEvent();
            individualEvent.RecordId = "D0";
            individualEvent.OrganizationCode = "1";
            individualEvent.SwimmerName = databaseRecord.FirstName + " " + databaseRecord.LastName;
            individualEvent.USSwimmingNumber = "??????";
            individualEvent.SwimmerBirthDate = CreateFormattedDate(databaseRecord.DateOfBirth);
            individualEvent.SwimmerAgeOrClass = "??";
            individualEvent.SwimmerSex = CreateFormattedGender(databaseRecord.GenderId);
            individualEvent.EventSex = CreateFormattedGender(databaseRecord.GenderId);
            individualEvent.EventDistance = CreateFormattedEventDistance(databaseRecord.RaceLengthId);
            individualEvent.EventStroke = CreateFormattedEventStroke(databaseRecord.RaceStrokeId);
            individualEvent.EventAge = databaseRecord.AgeDesc;
            individualEvent.SeedTime = "99.99";
            individualEvent.EventCourseCode = "Y";
            individualEvents.Add(individualEvent);
        }
    }

    return individualEvents;
}

Here is an example of the individual helper functions that do the actual translation:

public string CreateFormattedEventDistance(int raceLengthId)
{
    switch (raceLengthId)
    {
        case 1: return "15";
        case 2: return "25";
        case 3: return "50";
        case 4: return "100";
        default: return "0";
    }
}
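
The other helpers (CreateFormattedDate, CreateFormattedGender, and so on) aren’t shown; given that the dates in these files read as MMDDYYYY (02212011, for example), a plausible sketch of the date helper is:

    // Sketch, assuming SDIF dates in this file are MMDDYYYY (e.g., 02212011).
    public string CreateFormattedDate(DateTime date)
    {
        return date.ToString("MMddyyyy");
    }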

After hooking up all of my translations, I was ready to create an output file. All I had to do was write this:

static void WriteToFile()
{
    string fileName = @"C:\Users\Public\HHTest01.SD3";
    MeetSetupFactory factory = new MeetSetupFactory();
    Collection<string> collection = factory.CreateMeetSetUp(72);
    System.IO.File.WriteAllLines(fileName, collection);
}

and because of polymorphism, the output came out perfectly:


A01V3      01                              Tff LLC.            1.0X      James Dixon         9193884228  02212011                                              
B11        Olive Chapel                  Highcroft Pool                                                                  0720201007202010            Y         
C11              Highcroft Hurricanes          HHST            100 Highcroft Drive                         Cary                NC27519     USA                 
D01        Sloan Dixon                 040420Dix*Sl    040420020708MM25  2    7-8         99.99   Y                                                              
D01        Sloan Dixon                 040420Dix*Sl    040420020708MM25  3    7-8         99.99   Y                                                             
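
The MeetSetupFactory internals don’t appear in this post, but the polymorphic payoff is easy to sketch: every record type derives from Record, so one loop writes them all through the shared ToPaddedString(). Something like this, where CreateFileDescriptionRecord is a hypothetical helper:

    public Collection<string> CreateMeetSetUp(int meetId)
    {
        // A heterogeneous list: FileDescription, IndividualEvent, and the
        // rest all fit because each derives from Record.
        List<Record> records = new List<Record>();
        records.Add(CreateFileDescriptionRecord());              // the A0 row
        records.AddRange(CreateIndividualEventRecords(meetId));  // the D0 rows

        Collection<string> lines = new Collection<string>();
        foreach (Record record in records)
        {
            lines.Add(record.ToPaddedString());
        }
        return lines;
    }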

Guest Lecture At The University of Michigan

I had the honor of presenting at the University Of Michigan’s School Of Public Health/Information HMP605 Health Information Technology class yesterday.

[photo: the lecture hall]

Contrary to the picture, there were actually students in the class: about 30, divided between the School Of Information and the School Of Public Health.

I talked about topics that I wish I had known on my first day on the job as a software developer. I also talked about what potential project managers and IT decision makers should know about software development to maximize their company’s (and their own) success. I also fielded some questions from the students. I found the students bright, articulate, and very eager. Hopefully, they will hit the ground running come May when they start their new jobs.


No code this week – I was too busy preparing for the presentation.   See you next Tuesday.

VS2010 Parallel Debugger

I just worked through the Parallel Debugger Lab from Microsoft. Holy cow, it is awesome. You can set a breakpoint and immediately see all of the threads running with their call stacks:

[screenshot: every running thread with its call stack]

You can then drill into each thread (or set of threads you are interested in) and see the associated tasks:

[screenshot: the tasks associated with each thread]

And coolest of all, you can see each thread that is deadlocked and where the deadlock is coming from:

[screenshot: deadlocked threads and the source of the deadlock]

My only gripe has nothing to do with the parallel debugger; it has to do with the poor quality of the code in the lab. Take a look at this code:

[screenshot: code from the lab]

Horrible naming, unchecked arguments, multiple commands on a single line, weak typing. Ugh!

Improving performance using in-memory caching

I was thinking about how to make my random name generator more performant. I thought about a caching strategy where the EF dataset is loaded into memory at startup and the queries then run against the in-memory set. I would use a Singleton pattern to make sure only one set was created. After searching around a bit using Bing, I realized that EF does not lend itself to this pattern: every call could potentially go back to the database, and EF itself is rather heavy. I decided to go with a lightweight POCO using a Collection base.

My first stop was to create a Console application to see what kind of performance boost I could reasonably expect to obtain. I created the POCO and the collection like so:

public class LastName
{
    public string Name { get; set; }
    public int Rank { get; set; }
    public double Frequency { get; set; }
    public double CumlFrequency { get; set; }
}

public class LastNameCollection : Collection<LastName>
{
}

I then made an EF class:

[screenshot: the EF model]

And then did a quick test.

class Program
{
    public static LastNameCollection lastNameCollection = new LastNameCollection();

    static void Main(string[] args)
    {
        Console.WriteLine("---Start---");
        LoadAllData();
        Stopwatch stopWatch = Stopwatch.StartNew();
        Console.WriteLine("{0} was found remotely in {1} seconds", SearchForNameRemotely(), stopWatch.Elapsed.TotalSeconds);
        stopWatch.Restart();
        Console.WriteLine("{0} was found locally in {1} seconds", SearchForNameLocally(), stopWatch.Elapsed.TotalSeconds);
        stopWatch.Stop();
        Console.WriteLine("----End----");
        Console.ReadKey();
    }

    static void LoadAllData()
    {
        using (Tff.EntityFrameworkLoad.DB_9203_tffEntities entities = new DB_9203_tffEntities())
        {
            var lastNameQuery = from lastName in entities.Census_LastName
                                select lastName;
            foreach (Census_LastName censusLastName in lastNameQuery)
            {
                lastNameCollection.Add(new LastName
                {
                    Name = censusLastName.LastName,
                    Rank = censusLastName.Rank,
                    Frequency = censusLastName.Frequency,
                    CumlFrequency = censusLastName.CumlFrequency
                });
            }
        }
    }

    static string SearchForNameRemotely()
    {
        using (Tff.EntityFrameworkLoad.DB_9203_tffEntities entities = new DB_9203_tffEntities())
        {
            Census_LastName lastNameFound = (from lastName in entities.Census_LastName
                                             where lastName.LastName == "Frankenstein" // not case-sensitive
                                             select lastName).First();
            return lastNameFound.LastName;
        }
    }

    static string SearchForNameLocally()
    {
        LastName lastNameFound = (from lastName in lastNameCollection
                                  where lastName.Name.ToLower() == "frankenstein" // NOTE case-sensitivity
                                  select lastName).First();
        return lastNameFound.Name;
    }
}

Sure enough, the in-memory gain was approximately 300%:

[screenshot: console output of the timing comparison]

I then layered on PLINQ to see if the additional processors would increase the speed. Last time, I showed that PLINQ did NOT have a positive impact on performance with the LINQ to EF queries.

I changed the functions to add the AsParallel() extension method like this:

using (Tff.EntityFrameworkLoad.DB_9203_tffEntities entities = new DB_9203_tffEntities())
{
    Census_LastName lastNameFound = (from lastName in entities.Census_LastName.AsParallel()
                                     where lastName.LastName == "Frankenstein" // not case-sensitive
                                     select lastName).First();
    return lastNameFound.LastName;
}

Interestingly, the EF LINQ then threw an exception (my guess: AsParallel() takes the query out of the EF provider’s hands, and the underlying ObjectContext is not thread-safe):

[screenshot: the exception]

When I just ran AsParallel() on the in-memory copy, the performance degraded (which makes sense: for a lookup this cheap, the overhead of partitioning the collection and coordinating threads swamps the work itself):

[screenshot: console output showing the slower parallel search]

I started to implement this pattern in the Random project to see if I could get the performance boost without the parallelism. I assumed that with a web service you would create the collections in the Application_Start event handler and then query them during each method call. I put it into Application_Start; however, performance did not increase in my unit tests. It looks like Application_Start was getting called on every test, and I am not sure how to hold the application in memory without requests; my hosting provider recycles the application frequently.

I then put the collection load into a lazily initialized singleton property on my Global class, and my tests flew:


private static LastNameCollection _lastNameCollection = null;

public static LastNameCollection LastNameCollection
{
    get
    {
        if (_lastNameCollection == null)
        {
            _lastNameCollection = new LastNameCollection();
            PopulateLastNameCollectionFromDatabase();
        }
        return _lastNameCollection;
    }
}
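
One caveat worth noting (my note, not from the original post): that null check is not thread-safe, so two simultaneous first requests could each trigger a load. Since this is .NET 4, Lazy<T> gives the same one-time lazy load with thread safety built in. A sketch, assuming a Populate overload that fills a passed-in collection:

    private static readonly Lazy<LastNameCollection> _lastNames =
        new Lazy<LastNameCollection>(() =>
        {
            // Runs once, on first access, even under concurrent requests.
            LastNameCollection collection = new LastNameCollection();
            PopulateLastNameCollectionFromDatabase(collection); // hypothetical overload
            return collection;
        });

    public static LastNameCollection LastNameCollection
    {
        get { return _lastNames.Value; }
    }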

I then pushed it out to production and sure enough, performance increased when I did multiple calls.

Improvements Parallelism Can Make On The Random Service

Now that I have the Random Service set up and have gone through the parallel extensions lab, I thought I could apply what I learned about parallelism to the random generator.  I first needed a way of measuring the time the functions take, and then to break the functions down into parts that can benefit from parallelism and parts that can’t.  Also, I am curious to see how my local machine compares to my provider in terms of the benefits of parallelism.  My big assumption is that the number of records created by the random factory is fairly small: samples of 100, 200, and the like.

To that end, I created a quick performance test harness.  I started with the phone number generation because it was very straightforward:

        static void GeneratePhoneNumbersPerformanceTestLocalMachine()
        {
            int numberOfRecords = 1000;
            RandomFactory randomFactory = new RandomFactory();
            Stopwatch stopwatch = Stopwatch.StartNew();
            randomFactory.GetPhoneNumbers(numberOfRecords);
            stopwatch.Stop();
            Trace.WriteLine(String.Format("{0} phone numbers were generated in {1} seconds on the local machine", numberOfRecords, stopwatch.Elapsed.TotalSeconds));
        }


The problem with my 1st attempt is pretty clear:

[screenshot: console output]

Parallelism might help – but I am really not concerned about improving on thousandths of a second. 

I then added the GetDates() and GetUniqueIds() functions, because these methods do not hit the database or walk a large dataset:

 Starting performance test at 1/31/2011 8:04:32 AM

1000 phone numbers were generated in 0.0022865 seconds on the local machine

1000 dates were generated in 0.0008908 seconds on the local machine

1000 unique ids were generated in 0.000591 seconds on the local machine

Ending performance test at 1/31/2011 8:04:32 AM

I then decided to test GetLastNames() using 25%, 50%, 75%, and 100% prevalence thresholds (a lower prevalence means fewer records to fetch and walk):

Starting performance test at 1/31/2011 8:01:36 AM

1000 phone numbers were generated in 0.0023858 seconds on the local machine

1000 dates were generated in 0.0006502 seconds on the local machine

1000 unique ids were generated in 0.0006484 seconds on the local machine

1000 last names (25% prevalence) were generated in 1.2553884 seconds on the local machine

1000 last names (50% prevalence) were generated in 0.3628737 seconds on the local machine

1000 last names (75% prevalence) were generated in 1.3719554 seconds on the local machine

1000 last names (100% prevalence) were generated in 8.9350157 seconds on the local machine

Ending performance test at 1/31/2011 8:01:48 AM

Interestingly, it looks like the connection is being re-used, so the 50% run is faster than the 25% run.  Note the spike at 100%, though; perhaps parallelism might help there.  I finished my testing suite with GetFirstNames(), GetAddresses(), GetPeople(), and GetEmployees().  Here are the final results:

Starting performance test at 1/31/2011 9:28:36 AM

1000 phone numbers were generated in 0.0022125 seconds on the local machine

1000 dates were generated in 0.000647 seconds on the local machine

1000 unique ids were generated in 0.0006895 seconds on the local machine

1000 last names (25% prevalence) were generated in 1.4208552 seconds on the local machine

1000 last names (50% prevalence) were generated in 0.3804186 seconds on the local machine

1000 last names (75% prevalence) were generated in 1.4271377 seconds on the local machine

1000 last names (100% prevalence) were generated in 11.5619451 seconds on the local machine

1000 male first names (25% prevalence) were generated in 0.126765 seconds on the local machine

1000 male first names (50% prevalence) were generated in 0.0956216 seconds on the local machine

1000 male first names (75% prevalence) were generated in 0.1013383 seconds on the local machine

1000 male first names (100% prevalence) were generated in 0.2053033 seconds on the local machine

1000 female first names (25% prevalence) were generated in 0.1130885 seconds on the local machine

1000 female first names (50% prevalence) were generated in 0.0998854 seconds on the local machine

1000 female first names (75% prevalence) were generated in 0.1070964 seconds on the local machine

1000 female first names (100% prevalence) were generated in 0.9740046 seconds on the local machine

1000 both first names (25% prevalence) were generated in 0.1893091 seconds on the local machine

1000 both first names (50% prevalence) were generated in 0.2349195 seconds on the local machine

1000 both first names (75% prevalence) were generated in 0.2078913 seconds on the local machine

1000 both first names (100% prevalence) were generated in 1.1015048 seconds on the local machine

1000 street addresses were generated in 12.6074157 seconds on the local machine

1000 people (100% prevalence, both genders) were generated in 28.0779342 seconds on the local machine

1000 employees (100% prevalence, both genders) were generated in 29.1355036 seconds on the local machine

Ending performance test at 1/31/2011 9:30:05 AM


So, now it’s time to parallelize.  Taking the path of biggest bang for the CPU cycle, I decided to look at last names and street addresses.  Diving into the code, I changed this:

            for (int i = 0; i < numberOfNames; i++)
            {
                randomIndex = random.Next(0, lastNameQuery.Count);
                selectedLastName = lastNameQuery[randomIndex];
                lastNames.Add(selectedLastName.LastName);
            }

To this:

            Parallel.For(0, numberOfNames, (index) =>
            {
                randomIndex = random.Next(0, lastNameQuery.Count);
                selectedLastName = lastNameQuery[randomIndex];
                lastNames.Add(selectedLastName.LastName);
            });

And the output was:

1000 last names (100% prevalence) were generated in 11.8207654 seconds on the local machine

It went up!  OK, so the performance hit is not on the for loop – it is the fetching of the records from the database. 
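
Incidentally, there is a correctness problem hiding in that Parallel.For version as well (my note, not from the lab): System.Random and List<T>.Add are not thread-safe. A safer sketch, using the same locals as above, guards the shared state with a lock, which also serializes the loop body and shows why parallelizing this particular loop cannot pay off:

    object sync = new object();
    Parallel.For(0, numberOfNames, index =>
    {
        lock (sync)
        {
            // Both the draw and the add touch shared, non-thread-safe state.
            int randomIndex = random.Next(0, lastNameQuery.Count);
            lastNames.Add(lastNameQuery[randomIndex].LastName);
        }
    });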

I changed this code

            List<string> lastNames = new List<string>();
            var context = new Tff.Random.tffEntities();
            List<Census_LastName> lastNameQuery = (from lastName in context.Census_LastName
                                                   where lastName.CumlFrequency < prevalence
                                                   select lastName).ToList<Census_LastName>();

To this:

            List<string> lastNames = new List<string>();
            var context = new Tff.Random.tffEntities();
            List<Census_LastName> lastNameQuery =

              (from lastName in context.Census_LastName
              .AsParallel()
              where lastName.CumlFrequency < pervalence
              select lastName).ToList<Census_LastName>();

And the output was:

1000 last names (100% prevalence) were generated in 17.2297868 seconds on the local machine

Wow, I wonder if each thread is creating its own database connection?  I fired up SQL Performance Monitor against a local instance of my database (my provider does not let me have sysadmin on the database; ah, some rain in those clouds).  No surprise that when I moved the database to my local machine, performance improved dramatically:

1000 last names (100% prevalence) were generated in 2.1497559 seconds on the local machine

In any event, I slapped on SQL Performance Monitor.  It looks like there is only 1 call being made (I expected 4 for my quad processor):

[screenshot: trace showing a single database call]

It looks like if I want to speed things up, I need to speed up the database call, and PLINQ can’t help there.  The best way would be to take the entire datasets and cache them in memory when the application starts up.

I will confirm these ideas using some of the lab tools with VS2010 in the coming weeks…


Parallelism Labs From Microsoft


I did the PLINQ Lab this morning.  The lab itself is fairly short and gives a great overview of both the power of parallelism and the ease of using it in C#.  In addition, the last exercise shows how to use extension methods on your IEnumerable sources to further manipulate the data.  My only gripe is that the VM screen real estate is very small:

[screenshot: the lab VM]

And you can’t change the resolution on the VM desktop to see more of the code.  The other gripe I have is that the performance on the VM stinks: you literally wait 1-2 seconds after typing a character for the IntelliSense to come up.  That kind of delay makes it harder to retain the information in the lab.

I then started the Introducing .NET 4 Parallel Extensions lab.  The screen delays were even worse, so I took matters into my own hands: I took some screen shots of the lab and created a local project based on the starting solution.  One of the 1st tasks was to create a set of 20 random Employees.  Instead of hard-coding values into the list, and limiting the list to only 20 employees, I decided to create a random employee generator as a WCF Service.  That is the subject of this blog post.

I had fun recreating the lab.  I then went through each exercise.  It did a good job explaining each aspect of the parallelism syntax.  I have 1 note and 1 gripe.  The note is that in the PLINQ exercise, you can see how the TaskManager split the dataset in two: process 1 took the 1st 50% and process 2 took the last 50%.  Presumably, if I had a quad machine, it would be divided into four:

[screenshot: the dataset split between two processes]

My 1 gripe has to do with the overuse of the ‘var’ keyword and the use of unreadable code in a public project.  Take a swing through this syntax:

            var q = employeeList.AsParallel()
                .Where(x => x.Id % 2 == 0)
                .OrderBy(x => x.Id)
                .Select(x => PayrollServices.GetEmployeeInfo(x))
                .ToList();
            foreach (var e in q)
            {
                Console.WriteLine(e);
            }


foreach (var e in q)??? Ugh!  A little more thought about variable names would go a long way: q should be employeeListQuery, x should be employee, and e should also be employee.  Oh well, the struggle continues…

Big (Random) Generator

I needed to create a random value generator for working on the Parallel Extension Labs that I blogged about here.  The class that the lab has is pretty straightforward:

    public class Employee
    {
        public int Id { get; set; }
        public string FirstName { get; set; }
        public string LastName { get; set; }
        public string Address { get; set; }
        public DateTime HireDate { get; set; }
 
        public override string ToString()
        {
            return string.Format("Employee{0}: {1} {2} Hired:{3}", Id, FirstName, LastName, HireDate.ToShortDateString());
        }
    }


(I added the ToString() as a convenience.)  I decided that I would create a WCF Service to provide the data, primarily because I haven’t worked with WCF in 4.0 at all.

So, I created a WCF Service Application, added it to source control with my typical 3-branch strategy, and then published it to my provider.  Everything deployed correctly, so I dug into the actual service.

[screenshot: the deployed service]

The Service returns native .NET types (Strings, DateTimes, and Guids) as well as Person and Employee classes:

[screenshot: the service contract]

Each of the values needs to be random yet close enough to reality to be plausible.  I started with phone numbers:

        public List<string> GetPhoneNumbers(int numberOfPhoneNumbers)
        {
            List<string> phoneNumbers = new List<string>();
            System.Random random = new System.Random();
            int  areaCodeNumber = 0;
            int prefixNumber = 0;
            int suffixNumber = 0;
 
            for (int i = 0; i < numberOfPhoneNumbers; i++)
            {
                areaCodeNumber = random.Next(100,999);
                prefixNumber =  random.Next(100,999);
                suffixNumber = random.Next(1000,9999);
                phoneNumbers.Add(String.Format("{0}-{1}-{2}",
                    areaCodeNumber, prefixNumber, suffixNumber));
            }
 
            return phoneNumbers;
        }

And for the singular:

        public string GetPhoneNumber()
        {
            return GetPhoneNumbers(1).First();
        }

I used this Collection/Singular pattern throughout the service.  In addition, I implemented the singular consistently: create the plural and then take the first.

I then added some Unit Tests for each of my methods:

        [TestMethod()]
        public void GetPhoneNumberTest()
        {
            string notExpected = string.Empty;
            string actual = randomFactory.GetPhoneNumber();
            Assert.AreNotEqual(notExpected, actual);
        }
 
        [TestMethod()]
        public void GetPhoneNumbersTest()
        {
            int expected = 3;
            int actual = randomFactory.GetPhoneNumbers(3).Count;
            Assert.AreEqual(expected, actual);
        }


This pattern of testing was also applied consistently across all of the methods.

Once I had the easy methods done (GetPhoneNumbers, GetDates, etc.), I tackled the methods that required external data.  To generate random names, I started with the US Census, where I downloaded the first and last names into an MS Access database.  I then turned around and put the data into a SQL Server database on WinHost.  (BTW: I ran into this problem; it took me 30 minutes to figure out.)  Once the data was in the database, I could fire up EF:

[screenshot: the EF model]

The data is composed of actual names, the frequency with which each appears in America, the cumulative frequency, and the popularity rank:

[screenshot: sample rows from the census table]

(OT: my daughter wrote this:

[photo: my daughter’s note]

Who knew?)

Anyway, I then created a method that pulls the records from the database below a given prevalence and then returns a requested number of those records at random:

        public List<string> GetLastNames(int numberOfNames, int prevalence)
        {
            if (prevalence > 100 || prevalence < 0)
                throw new ArgumentOutOfRangeException("'Prevalence' needs to be between 0 and 100.");

            List<string> lastNames = new List<string>();
            var context = new Tff.Random.tffEntities();
            List<Census_LastName> lastNameQuery = (from lastName in context.Census_LastName
                                                   where lastName.CumlFrequency < prevalence
                                                   select lastName).ToList<Census_LastName>();
            System.Random random = new System.Random();
            int randomIndex = 0;
            Census_LastName selectedLastName = null;

            for (int i = 0; i < numberOfNames; i++)
            {
                // Next's upper bound is exclusive, so this covers every index.
                randomIndex = random.Next(0, lastNameQuery.Count);
                selectedLastName = lastNameQuery[randomIndex];
                lastNames.Add(selectedLastName.LastName);
            }

            return lastNames;
        }

In any event, the data came back and my unit tests passed, and I then implemented a similar method for the male and female first names.  Still, I am not happy with this implementation; I will add parallelism later to speed up the processing, and I might implement a .Random() extension method for LINQ.
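
For what it’s worth, that .Random() extension method might look something like this (a sketch of my own, not part of the service):

    public static class RandomExtensions
    {
        private static readonly System.Random random = new System.Random();

        // Returns one element of the list, chosen uniformly at random.
        public static T Random<T>(this IList<T> source)
        {
            if (source == null || source.Count == 0)
            {
                throw new ArgumentException("source must contain at least one element.");
            }
            return source[random.Next(0, source.Count)];
        }
    }

With that in place, the body of the loop in GetLastNames collapses to lastNames.Add(lastNameQuery.Random().LastName);.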

With the names out of the way, I needed to figure out street addresses.  I first thought about using Google’s reverse geocoding API and throwing in random GPS coordinates like this:

                string uri = @"http://maps.googleapis.com/maps/api/geocode/xml?latlng=40.714224,-73.961452&sensor=false";
                WebRequest request = WebRequest.Create(uri);
                HttpWebResponse response = (HttpWebResponse)request.GetResponse();
                Stream dataStream = response.GetResponseStream();
                StreamReader reader = new StreamReader(dataStream);
                string responseFromServer = reader.ReadToEnd();
                Console.WriteLine(responseFromServer);
                XmlDocument xmlDocument = new XmlDocument();
                xmlDocument.LoadXml(responseFromServer);
                XmlNodeList xmlNodeList = xmlDocument.GetElementsByTagName("formatted_address");
                string address = xmlNodeList[0].InnerText;
                reader.Close();
                dataStream.Close();
                response.Close();


The problem is that I don’t know exact coordinates, so I would have to keep generating random ones until I got a hit, which means I would have to limit my search to a major metro area (doing this in a low-density state would mean many, many requests to find an actual street address).  Also, I would run the risk of generating a real address.  Finally, Google limits the number of requests per day, so I would be throttled, especially with a shotgun approach.

Instead, I went back to the census and found a data table with many (though not all) zip codes, cities, and states.  I then realized all I had to do was create a fake street number (easy enough), a fake street name using the last name table, and a random zip code.  Voila: a plausible yet random address.

Here is the EF Class:

[screenshot: the EF model]

And here is the code (split across 3 functions):

        public List<string> GetStreetAddresses(int numberOfAddresses)
        {
            List<string> streetAddresses = new List<string>();
            List<string> streetNames = GetLastNames(numberOfAddresses, 100);
            List<string> streetSuffixes = GetRandomStreetSuffixes(numberOfAddresses);
            List<string> zipCodes = GetZipCodes(numberOfAddresses);

            string streetNumber = string.Empty;

            System.Random random = new System.Random();

            for (int i = 0; i < numberOfAddresses; i++)
            {
                streetNumber = random.Next(10, 999).ToString();
                streetAddresses.Add(String.Format("{0} {1} {2} {3}", streetNumber, streetNames[i], streetSuffixes[i], zipCodes[i]));
            }

            return streetAddresses;
        }

And:

        private List<string> GetZipCodes(int numberOfZipCodes)
        {
            List<string> zipCodes = new List<string>();
            var context = new Tff.Random.tffEntities();
            List<Census_ZipCode> zipCodeQuery = (from zipCode in context.Census_ZipCode
                                                 select zipCode).ToList<Census_ZipCode>();
            System.Random random = new System.Random();
            int randomIndex = 0;
            Census_ZipCode selectZipCode = null;

            for (int i = 0; i < numberOfZipCodes; i++)
            {
                randomIndex = random.Next(0, zipCodeQuery.Count);
                selectZipCode = zipCodeQuery[randomIndex];
                zipCodes.Add(String.Format("{0}, {1} {2}", selectZipCode.City, selectZipCode.StateAbbreviation, selectZipCode.ZipCode));
            }

            return zipCodes;
        }

Finally:

        private List<string> GetRandomStreetSuffixes(int numberOfSuffixes)
        {
            List<string> suffixes = new List<string>();
            List<string> returnValue = new List<string>();
            suffixes.Add("STREET");
            suffixes.Add("ROAD");
            suffixes.Add("DRIVE");
            suffixes.Add("WAY");
            suffixes.Add("CIRCLE");

            System.Random random = new System.Random();
            int randomIndex = 0;
            for (int i = 0; i < numberOfSuffixes; i++)
            {
                // Next's upper bound is exclusive, so this can return every
                // suffix, including "CIRCLE".
                randomIndex = random.Next(0, suffixes.Count);
                returnValue.Add(suffixes[randomIndex]);
            }

            return returnValue;
        }


Now, when you hit the service, you can get a plausible yet totally fake dataset of people and employees:

            Random.RandomFactoryClient client = new Random.RandomFactoryClient();
            List<Random.Employee> employees = client.GetEmployees(20, 50, Random.Gender.Both, 10);

            for (int i = 0; i < employees.Count; i++)
            {
                Add(new Employee
                {
                    Id = i,
                    FirstName = employees[i].FirstName,
                    LastName = employees[i].LastName,
                    HireDate = employees[i].HireDate,
                    Address = employees[i].StreetAddress
                });
            }

Spit out to the Console:

[screenshot: console output]

In case you want to use the service, you can find it here.

VERY IMPORTANT: I set the return values to be of type List<T>.  I know this is considered bad practice from an interoperability standpoint.  If you are using VS2010 and you want to consume the service, make sure you do this when you attach to the reference:

[screenshot: the service reference configuration dialog]
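
(If you generate the proxy with svcutil from the command line instead, its /collectionType switch should accomplish the same thing, if memory serves.)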

Results may vary.