//Build Word Count!
June 3, 2015
I started working with HadoopFs last week to see if I could get a better understanding of how to write F# mappers. Since everyone uses a word count as the “Hello World” of Hadoop, I thought I would too.
I decided to compare Satya’s //Build keynotes from 2014 and 2015 to see if there was any shift in his focus between last year and this. Isaac Abraham managed to reduce the 20+ lines of catastrophic C# code in the Azure HDInsight tutorial to 2 lines of F# code:
C#
    static void Main(string[] args)
    {
        if (args.Length > 0)
        {
            Console.SetIn(new StreamReader(args[0]));
        }

        string line;
        string[] words;

        while ((line = Console.ReadLine()) != null)
        {
            words = line.Split(' ');

            foreach (string word in words)
                Console.WriteLine(word.ToLower());
        }
    }
F#
    let result = testString.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries) |> Seq.countBy id
    result
I added the data files to my solution and then added a way to locate those files via a relative path.
    open System
    open System.IO

    // Folder containing this script, then its parent folder
    let baseDirectory = __SOURCE_DIRECTORY__
    let baseDirectory' = Directory.GetParent(baseDirectory)
    // Verbatim string so the backslash is never read as an escape
    let filePath = @"Data\Build_Keynote2014.txt"
    let fullPath = Path.Combine(baseDirectory'.FullName, filePath)
    let buildKeynote = File.ReadAllText(fullPath)
I then ran the mapper that Isaac created and got what I expected
    buildKeynote.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
    |> Seq.countBy id
    |> Seq.sortBy(fun (w,c) -> c)
    |> Seq.toList
    |> List.rev
Interestingly, the first word that really jumps out is “Windows”, at 26 mentions.
I then loaded in the 2015 Build keynote and ran the same function
    // Verbatim string so the backslash is never read as an escape
    let filePath' = @"Data\Build_Keynote2015.txt"
    let fullPath' = Path.Combine(baseDirectory'.FullName, filePath')
    let buildKeynote' = File.ReadAllText(fullPath')

    buildKeynote'.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
    |> Seq.countBy id
    |> Seq.sortBy(fun (w,c) -> c)
    |> Seq.toList
    |> List.rev
And the 1st interesting word is “Platform” at 9 mentions. “Windows” fell to 2 mentions.
    result |> Seq.filter(fun (w,c) -> w = "Windows")
And just because I couldn’t resist
    result |> Seq.filter(fun (w,c) -> w = "F#")
    result |> Seq.filter(fun (w,c) -> w = "C#")
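Since the same pipeline runs against both keynotes and is then filtered for single words, it could be wrapped in a pair of small helpers. This is only a sketch: `wordCounts` and `countOf` are hypothetical names, not part of HadoopFs, and the sample sentence is made up. Note that splitting on spaces alone is case-sensitive and leaves punctuation attached, so “Windows” and “Windows,” would be counted separately.

```fsharp
open System

// Hypothetical helper: word frequencies for a text, most frequent first.
let wordCounts (text: string) =
    text.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
    |> Seq.countBy id
    |> Seq.sortBy (fun (_, c) -> -c)
    |> Seq.toList

// Hypothetical helper: number of times one word appears (0 if it is absent).
let countOf (word: string) (counts: (string * int) list) =
    match counts |> List.tryFind (fun (w, _) -> w = word) with
    | Some (_, c) -> c
    | None -> 0

// Made-up sample text, not taken from either keynote.
let sample = wordCounts "Windows platform Windows cloud"
let windowsCount = countOf "Windows" sample
let fsharpCount = countOf "F#" sample
```

With the real files, `wordCounts buildKeynote` and `wordCounts buildKeynote'` would give the two lists to compare.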
So I am feeling pretty good about HadoopFs and will now start trying to implement it on my instance of Azure this weekend.
You could simplify even more by sorting by `-c`; that way you don’t even need `List.rev`.
It shortens the code at some cost to readability (until you get used to it).
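A minimal sketch of that suggestion, using a made-up list of counts rather than the keynote data. Negating the count makes the ascending sort come out largest first, so the final reversal is no longer needed.

```fsharp
// Made-up sample counts, for illustration only.
let counts = [ ("a", 1); ("b", 3); ("c", 2) ]

// Original approach: sort ascending, then reverse.
let byReverse = counts |> List.sortBy (fun (_, c) -> c) |> List.rev

// Suggested approach: sort by the negated count.
let byNegation = counts |> List.sortBy (fun (_, c) -> -c)
```

Both give the same order here. Words with equal counts may come out in a different relative order between the two versions. With FSharp.Core 4.0 or later, `List.sortByDescending` (and `Seq.sortByDescending`) expresses the same intent by name.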
I also wonder why in C# you use the overload that takes only an array of char (as a params array), while in F# you use the overload that also takes a StringSplitOptions?
`buildKeynote.Split [| ' ' |]` would have been a more “direct” translation.
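For what it’s worth, the two overloads only differ when the text has consecutive, leading, or trailing spaces. A small sketch with a made-up string:

```fsharp
open System

// Made-up input: two spaces between the words and one trailing space.
let text = "hello  world "

// Direct translation of the C# call: empty entries are kept.
let direct = text.Split [| ' ' |]

// Overload used in the post: empty entries are dropped.
let trimmed = text.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
```

Here `direct` has four entries, two of them empty strings, while `trimmed` has only the two words. In a word count, the direct version would report the empty string as a “word”.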