//Build Word Count!

I started working with HadoopFs last week to get a better understanding of how to write F# mappers. Since everyone uses a word count as the “Hello World” of Hadoop, I thought I would too.


I decided to compare Satya Nadella’s //Build keynotes from 2014 and 2015 to see if there was any shift in his focus between last year and this. Isaac Abraham managed to reduce the 20+ lines of catastrophic C# code in the Azure HDInsight tutorial to 2 lines of F# code:

C#

    static void Main(string[] args)
    {
        if (args.Length > 0)
        {
            Console.SetIn(new StreamReader(args[0]));
        }

        string line;
        string[] words;

        while ((line = Console.ReadLine()) != null)
        {
            words = line.Split(' ');

            foreach (string word in words)
                Console.WriteLine(word.ToLower());
        }
    }

F#

    let result = testString.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries) |> Seq.countBy id
    result
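To make the two-liner concrete, here is a minimal sketch that runs it on a made-up sentence. The sample text is my own assumption (the original `testString` is not shown), and the double space is there on purpose to show what `RemoveEmptyEntries` is for.

```fsharp
open System

// Hypothetical sample text; note the deliberate double space after "quick".
let testString = "the quick  fox jumps over the lazy fox"

let result =
    testString.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
    |> Seq.countBy id   // group identical words and count each group
    |> Seq.toList

// For comparison: without RemoveEmptyEntries the double space
// produces an empty string that would be counted as a "word".
let rawTokens = testString.Split([| ' ' |])
```

With `RemoveEmptyEntries`, "the" and "fox" are each counted twice and no empty entry appears. The split is on spaces only, so casing and punctuation are kept as they appear in the text.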

I added the data files to my solution and then added a way to locate those files via a relative path.

    open System
    open System.IO

    let baseDirectory = __SOURCE_DIRECTORY__
    let baseDirectory' = Directory.GetParent(baseDirectory)
    // Verbatim string (@) so the backslash is never treated as an escape
    let filePath = @"Data\Build_Keynote2014.txt"
    let fullPath = Path.Combine(baseDirectory'.FullName, filePath)
    let buildKeynote = File.ReadAllText(fullPath)

I then ran the mapper that Isaac created and got what I expected:

    buildKeynote.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
    |> Seq.countBy id
    |> Seq.sortBy(fun (w,c) -> c)
    |> Seq.toList
    |> List.rev


Interestingly, the first word that really jumps out is “Windows”, at 26 mentions.

I then loaded the 2015 Build keynote and ran the same function:

    // Verbatim string (@) so the backslash is never treated as an escape
    let filePath' = @"Data\Build_Keynote2015.txt"
    let fullPath' = Path.Combine(baseDirectory'.FullName, filePath')
    let buildKeynote' = File.ReadAllText(fullPath')

    buildKeynote'.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
    |> Seq.countBy id
    |> Seq.sortBy(fun (w,c) -> c)
    |> Seq.toList
    |> List.rev
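Since the same pipeline runs against both transcripts, it could be wrapped in a small function instead of being pasted twice. This is only a sketch: `wordCounts` is a name I made up (it is not part of HadoopFs), the two strings are toy stand-ins for the keynote files, and `Seq.sortByDescending` assumes F# 4.0 or later.

```fsharp
open System

// Hypothetical helper: word counts for a text, most frequent first.
let wordCounts (text: string) =
    text.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
    |> Seq.countBy id
    |> Seq.sortByDescending snd   // sort on the count, largest first
    |> Seq.toList

// Toy stand-ins; in the post these would be buildKeynote and buildKeynote'.
let sampleA = "a a a b"
let sampleB = "b b a"

let countsA = wordCounts sampleA
let countsB = wordCounts sampleB
```

Sorting descending directly also removes the need for the trailing `List.rev`.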


This time the first interesting word is “Platform”, at 9 mentions. “Windows” fell to 2 mentions:

    result |> Seq.filter(fun (w,c) -> w = "Windows")


And just because I couldn’t resist:

    result |> Seq.filter(fun (w,c) -> w = "F#")
    result |> Seq.filter(fun (w,c) -> w = "C#")
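These filters return an empty sequence when a word never appears. If a plain number is handier, a lookup along the following lines might do. `countOf` and the sample counts are hypothetical, and `Option.defaultValue` assumes F# 4.1 or later.

```fsharp
// Hypothetical helper: the count for one word, or 0 if it never appears.
let countOf (word: string) (counts: seq<string * int>) =
    counts
    |> Seq.tryFind (fun (w, _) -> w = word)   // exact, case-sensitive match
    |> Option.map snd
    |> Option.defaultValue 0

// Made-up counts standing in for the real `result`.
let sampleCounts = [ ("F#", 1); ("platform", 4) ]

let fsharpMentions = countOf "F#" sampleCounts
let csharpMentions = countOf "C#" sampleCounts
```

Because the text is split on spaces only, the match is exact: a token such as `F#,` with trailing punctuation would be counted as a different word.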


So I am feeling pretty good about HadoopFs and plan to try running it on my Azure instance this weekend.

 

2 Responses to //Build Word Count!

  1. Pingback: F# Weekly #23, 2015 | Sergey Tihon's Blog

  2. sehnsucht47 says:

    You could simplify even more by sorting by “-c”; that way you don’t even need to do `List.rev`:

    // ...
    |> Seq.countBy id
    |> Seq.sortBy (fun _, c -> -c) // wildcard for w because not used
    |> Seq.toList
    
    // same thing using composition
    // ...
    |> Seq.countBy id
    |> Seq.sortBy (snd >> (~-)) // compose snd with negate function
    |> Seq.toList
    
    // composition can also be used for filter
    result |> Seq.filter (fst >> (=) "F#") // compose fst with partial application of equal function

    It can shorten the code at some cost in readability (until you get used to it).

    I also wonder why in C# you use the overload which only takes an array of char (as a paramarray) and in F# you use the overload which also takes a StringSplitOptions?
    `buildKeynote.Split [| ' ' |]` would have been a more “direct” translation.
