//Build Word Count!

I started working with HadoopFs last week to get a better understanding of how to write FSharp mappers. Since everyone uses a word count as the “Hello World” of Hadoop, I thought I would too.


I decided to compare Satya’s //Build keynotes from 2014 and 2015 to see if there was any shift in his focus between last year and this. Isaac Abraham managed to reduce the 20+ lines of catastrophic C# code in the Azure HDInsight tutorial into 2 lines of F# code:

C#

    static void Main(string[] args)
    {
        if (args.Length > 0)
        {
            Console.SetIn(new StreamReader(args[0]));
        }

        string line;
        string[] words;

        while ((line = Console.ReadLine()) != null)
        {
            words = line.Split(' ');

            foreach (string word in words)
                Console.WriteLine(word.ToLower());
        }
    }

F#

    let result = testString.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries) |> Seq.countBy id
    result

I added the data files to my solution and then added a way to locate those files via a relative path.

    open System
    open System.IO

    let baseDirectory = __SOURCE_DIRECTORY__
    let baseDirectory' = Directory.GetParent(baseDirectory)
    // Verbatim string so the backslash in the path is taken literally
    let filePath = @"Data\Build_Keynote2014.txt"
    let fullPath = Path.Combine(baseDirectory'.FullName, filePath)
    let buildKeynote = File.ReadAllText(fullPath)

I then ran the mapper that Isaac created and got what I expected:

    buildKeynote.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
    |> Seq.countBy id
    |> Seq.sortBy (fun (w, c) -> c)
    |> Seq.toList
    |> List.rev


Interestingly, the first word that really jumps out is “Windows”, with 26 mentions.
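Splitting on spaces alone leaves casing and punctuation in the tokens, so “Windows” and “Windows,” would be counted separately, and common words crowd the top of the list. Here is a minimal sketch of one way to tidy that up. The names `stopWords`, `normalize`, and `countWords` are hypothetical helpers of mine, not part of HadoopFs, and the stop-word list is only a placeholder.

```fsharp
open System

// Hypothetical stop-word list; extend it to suit the text being analysed
let stopWords = set [ "the"; "a"; "and"; "to"; "of"; "is"; "in"; "that" ]

// Lower-case a token and trim leading and trailing punctuation
let normalize (token: string) =
    token.ToLowerInvariant().Trim([| '.'; ','; ';'; ':'; '!'; '?'; '"'; '('; ')' |])

// Count words, most frequent first, ignoring stop words and empty tokens
let countWords (text: string) =
    text.Split([| ' '; '\n'; '\r'; '\t' |], StringSplitOptions.RemoveEmptyEntries)
    |> Seq.map normalize
    |> Seq.filter (fun w -> w <> "" && not (stopWords.Contains w))
    |> Seq.countBy id
    |> Seq.sortByDescending snd
    |> Seq.toList

// Made-up sample text, not the keynote transcript
countWords "Windows is the platform. The platform is Windows, and Windows is open."
|> List.iter (fun (w, c) -> printfn "%s: %d" w c)
```

Because everything is lower-cased, any later lookup would need to use "windows" rather than "Windows". The trim list leaves `#` alone so that tokens such as "f#" and "c#" survive.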

I then loaded in the 2015 Build keynote and ran the same function:

    let filePath' = @"Data\Build_Keynote2015.txt"
    let fullPath' = Path.Combine(baseDirectory'.FullName, filePath')
    let buildKeynote' = File.ReadAllText(fullPath')

    buildKeynote'.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
    |> Seq.countBy id
    |> Seq.sortBy (fun (w, c) -> c)
    |> Seq.toList
    |> List.rev


This time the first interesting word is “Platform”, with 9 mentions. “Windows” fell to 2 mentions.

    result |> Seq.filter (fun (w, c) -> w = "Windows")


And just because I couldn’t resist:

    result |> Seq.filter (fun (w, c) -> w = "F#")
    result |> Seq.filter (fun (w, c) -> w = "C#")
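Filtering one result at a time works, but the two years can also be put side by side. The following is a rough sketch of that idea. `wordCounts`, `mentions`, and `compareMentions` are hypothetical names, and the two input strings are made-up samples standing in for the transcripts.

```fsharp
open System

// Count each space-separated word in a text and return a lookup table
let wordCounts (text: string) =
    text.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
    |> Seq.countBy id
    |> Map.ofSeq

// Number of times a word appears, or 0 when it is absent
let mentions (word: string) (counts: Map<string, int>) =
    counts |> Map.tryFind word |> Option.defaultValue 0

// For each word of interest, return (word, mentions before, mentions after)
let compareMentions (words: string list) (before: Map<string, int>) (after: Map<string, int>) =
    words |> List.map (fun w -> w, mentions w before, mentions w after)

// Made-up sample strings, not the keynote transcripts
let counts2014 = wordCounts "Windows Windows Windows cloud"
let counts2015 = wordCounts "Platform Platform Windows cloud"

compareMentions [ "Windows"; "Platform"; "F#" ] counts2014 counts2015
|> List.iter (fun (w, a, b) -> printfn "%s: %d -> %d" w a b)
```

With the transcripts loaded, `wordCounts buildKeynote` and `wordCounts buildKeynote'` would take the place of the two samples.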


So I am feeling pretty good about HadoopFs and will now start trying to implement it on my instance of Azure this weekend.

 

“Word Counts”: Using FSharp and HDInsight

 

I decided to learn a bit more about HDInsight, Microsoft’s implementation of Hadoop on Azure. I was surprised by the dearth of tutorials online (not even on Pluralsight), with only this one seemingly having what I wanted. I started down the tutorial path and rewrote the map and reduce programs in F#.

Here is the original mapper code (in C#):

    static void Main(string[] args)
    {
        if (args.Length > 0)
        {
            Console.SetIn(new StreamReader(args[0]));
        }

        string line;
        string[] words;

        while ((line = Console.ReadLine()) != null)
        {
            words = line.Split(' ');

            foreach (string word in words)
                Console.WriteLine(word.ToLower());
        }
    }

And here it is in F#:

    open System
    open System.IO

    [<EntryPoint>]
    let main argv =
        // Optionally read from a file passed as the first argument instead of stdin
        if argv.Length > 0 then
            let inputString = argv.[0]
            Console.SetIn(new StreamReader(inputString))
        let mutable continueLooping = true
        while continueLooping do
            // ReadLine returns null only at end of input, so a blank line
            // does not stop the loop (this matches the C# null check)
            match Console.ReadLine() with
            | null ->
                continueLooping <- false
            | line ->
                let words = line.Split(' ')
                words |> Seq.iter (fun w -> Console.WriteLine(w.ToLower()))
        0

 

And here is the original reducer in C#:

    static void Main(string[] args)
    {
        string word, lastWord = null;
        int count = 0;

        if (args.Length > 0)
        {
            Console.SetIn(new StreamReader(args[0]));
        }

        while ((word = Console.ReadLine()) != null)
        {
            if (word != lastWord)
            {
                if(lastWord != null)
                    Console.WriteLine("{0}[{1}]", lastWord, count);

                count = 1;
                lastWord = word;
            }
            else
            {
                count += 1;
            }
        }
        Console.WriteLine(count);
    }

And here it is in F#:

    open System
    open System.IO

    [<EntryPoint>]
    let main argv =
        // Optionally read from a file passed as the first argument instead of stdin
        if argv.Length > 0 then
            let inputString = argv.[0]
            Console.SetIn(new StreamReader(inputString))
        let mutable continueLooping = true
        let mutable lastWord = String.Empty
        let mutable count = 0
        while continueLooping do
            let word = Console.ReadLine()
            // Stop only at end of input (null), as the C# version does
            match isNull word, word = lastWord, String.IsNullOrEmpty(lastWord) with
            | true, _, _ ->
                continueLooping <- false
            | false, true, _ ->
                count <- count + 1
            | false, false, true ->
                count <- 1
                lastWord <- word
            | false, false, false ->
                // A new word has started: emit the previous one, then reset the counter
                Console.WriteLine("{0}[{1}]", lastWord, count)
                count <- 1
                lastWord <- word
        // As in the C# original, only the final count is written after the loop
        Console.WriteLine(count)
        0

 

The biggest difference is that the conditional if..thens of the imperative-style C# are replaced by pattern matching, which I feel makes the logic much more understandable. The use of the mutable keyword is a smell, but I am not sure how to loop over user input in a Console app without it.
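One possible way around the mutable flag is to treat the input as a lazy sequence that ends when `ReadLine` returns null. This is only a sketch of the idea, not something taken from the tutorial or from HadoopFs; `readLines` and `mapWords` are hypothetical names, and a `StringReader` stands in for the console so the example runs on its own.

```fsharp
open System
open System.IO

// Lazily yield lines from a reader until ReadLine returns null (end of input)
let readLines (reader: TextReader) =
    Seq.unfold (fun () ->
        match reader.ReadLine() with
        | null -> None
        | line -> Some(line, ())) ()

// Mapper logic as a pure function: one lower-cased word per output item
let mapWords (lines: seq<string>) =
    lines
    |> Seq.collect (fun line -> line.Split(' '))
    |> Seq.map (fun (w: string) -> w.ToLower())

// Example: read from an in-memory reader instead of the console
let reader = new StringReader("Hello World\nhello again")
readLines reader |> mapWords |> Seq.iter (printfn "%s")
```

In the console app itself, `readLines Console.In` would supply the lines. Keeping the word splitting in a separate function also means it can be checked without any console input at all.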

In any event, with the programs complete and pushed out to the Hadoop file system, I ran them via Azure PowerShell.


 


And looking at the output, nothing is coming down.


Drat. I then tried to run the C# program and nothing came down either. I wonder if it is a problem with the original code or perhaps the data I am using? The tutorial does not include a link to a dataset that works with the programs, so I am a bit out of luck. More investigation needed, as it were.
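One way to start narrowing this down might be to run the same map, sort, reduce sequence locally, with no cluster involved, to see whether the logic itself produces output. The sketch below assumes that the reducer receives its input sorted, and uses `Seq.sort` as a stand-in for that step. `mapStep` and `reduceStep` are hypothetical names and the input lines are made up.

```fsharp
open System

// Map step: emit one lower-cased word per input word, as the mapper does
let mapStep (lines: seq<string>) =
    lines
    |> Seq.collect (fun line -> line.Split(' '))
    |> Seq.map (fun (w: string) -> w.ToLower())

// Reduce step: walk the sorted words, counting runs of identical neighbours
let reduceStep (sortedWords: seq<string>) =
    // Push the run in progress (if any) onto the accumulated results
    let flush (acc: (string * int) list, current: (string * int) option) =
        match current with
        | Some pair -> pair :: acc
        | None -> acc
    // Extend the current run, or close it and start a new one
    let step (acc, current) (word: string) =
        match current with
        | Some (w, n) when w = word -> acc, Some (w, n + 1)
        | _ -> flush (acc, current), Some (word, 1)
    sortedWords
    |> Seq.fold step ([], None)
    |> flush
    |> List.rev

// Made-up input; Seq.sort stands in for the sort between the two steps
[ "the cat"; "The dog" ]
|> mapStep
|> Seq.sort
|> reduceStep
|> List.iter (fun (w, n) -> printfn "%s[%d]" w n)
```

If a local run like this produces counts while the job on the cluster does not, that would point toward the job setup or the data rather than the map and reduce logic.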