Old English NLP

I started working through the SharpNLP library as referenced in my last blog post. To recap, SharpNLP depends on some binaries from the SharpEntropy library, which in turn depend on some data files that have to be placed in a specific location on your file system.

I cracked open SharpNLP with Reflector and took a spin through the more commonly used methods. It looks like you deal with strings and a couple of classes called “chunk” and “word”, which have a couple of supporting enumerations.

With the library, you can split paragraphs into sentences, sentences into words, sentences into phrases (what they call chunks), and then identify each word and/or phrase with its part of speech. These functions seem perfect for my Old English translator.

Here is an example of a basic console application with the parser output.

    using System;
    using OpenNLP.Tools.Parser;

    public class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("--Start--");
            string testSentence = "A quick brown dog jumped over a lazy cat.";
            Parse parse = ParseSentence(testSentence);
            PrintParse(parse);
            Console.WriteLine("---End---");
            Console.ReadLine();
        }

        private static Parse ParseSentence(string sentence)
        {
            // Folder that holds the model data files the parser loads.
            string path = @"C:\ProgramData\OpenNLP\Models\";
            EnglishTreebankParser parser = new EnglishTreebankParser(path, true, false);
            return parser.DoParse(sentence);
        }

        private static void PrintParse(Parse parse)
        {
            // The first child of the returned parse is the sentence node.
            Parse rootNode = parse.GetChildren()[0];
            Parse[] parses = rootNode.GetChildren();
            foreach (Parse currentParse in parses)
            {
                Console.WriteLine(String.Format("{0} : {1}", currentParse.ToString(), currentParse.Type));
            }
        }
    }

[image: console output of the parser]

I then wanted to change the order from S-V-O to S-O-V or O-S-V. In this example, the output should be “A quick brown dog over a lazy cat jumped.” or even “Over a lazy cat a quick brown dog jumped.”

Looking at the output, the VP needs to be broken down even further. I changed PrintParse to handle the children (using recursion):

    private static void PrintParse(Parse parse)
    {
        Parse[] parses = parse.GetChildren();
        foreach (Parse currentParse in parses)
        {
            Console.WriteLine(String.Format("{0} : {1}", currentParse.ToString(), currentParse.Type));
            PrintParse(currentParse);
        }
    }

The output is now:

[image: console output including the child nodes]

Things look better, except I can’t figure out what the “TK” is. It duplicates each word and it is not a part of speech (at least according to the WordType enum)…
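A minimal sketch of filtering those nodes out, assuming “TK” marks the token-level leaf that sits under each part-of-speech node (which would explain the duplication). The `Node` class and the "ROOT" label are hypothetical stand-ins so the example runs on its own; they are not the SharpNLP `Parse` class.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a parse tree node; not the SharpNLP Parse class.
public class Node
{
    public string Text { get; }
    public string Type { get; }
    public List<Node> Children { get; } = new List<Node>();

    public Node(string text, string type, params Node[] children)
    {
        Text = text;
        Type = type;
        Children.AddRange(children);
    }
}

public static class TokenFilterDemo
{
    // Assumption: "TK" is the type of the token-level leaf nodes.
    private const string TokenType = "TK";

    // Walks the tree depth-first and collects "text : type" lines,
    // skipping every node whose type is TK.
    public static List<string> Describe(Node node)
    {
        var lines = new List<string>();
        foreach (Node child in node.Children)
        {
            if (child.Type == TokenType) continue;
            lines.Add(child.Text + " : " + child.Type);
            lines.AddRange(Describe(child));
        }
        return lines;
    }

    public static void Main()
    {
        var a = new Node("A", "DT", new Node("A", "TK"));
        var dog = new Node("dog", "NN", new Node("dog", "TK"));
        var np = new Node("A dog", "NP", a, dog);
        var root = new Node("A dog", "ROOT", np);

        foreach (string line in Describe(root))
        {
            Console.WriteLine(line);
        }
    }
}
```

With the real parser, the same check would go at the top of the loop in PrintParse, comparing `currentParse.Type` against "TK".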

Removing the TK, I get this (I dropped the Console window and went to the Output window):

    A quick brown dog jumped over a lazy cat . : S
    A quick brown dog : NP
    A : DT
    quick : JJ
    brown : JJ
    dog : NN
    jumped over a lazy cat : VP
    jumped : VBD
    over a lazy cat : PP
    over : IN
    a lazy cat : NP
    a : DT
    lazy : JJ
    cat : NN
    . : .

So to take a sentence from S-V-O to O-S-V, I need to map the parser’s types to an S, V, or O. In this example:

  • NP = S
  • VBD = V
  • PP = O

I can then rearrange the words into Old English order. The challenge is that the word types depend on context. The first NP in the sentence (“A quick brown dog”) is the subject, but the second NP (“a lazy cat”) sits inside the PP (the prepositional phrase), so the PP as a whole is the object and the NP inside it is ignored. If the second NP stands alone (“A quick brown dog jumped a lazy cat”), then it becomes the object. I can build a table with the different patterns, at least the common ones, and leave it at that. Here is an example of the function (yes, I know this is procedural code at its worst; I’ll refactor once I have all of my thoughts down):

    private static void AssignOldEnglishOrder(List<Phrase> phrases)
    {
        int nounCount = 0;
        int verbCount = 0;
        int objectCount = 0;

        foreach (Phrase phrase in phrases)
        {
            switch (phrase.ParseType)
            {
                case "NP":
                    // First NP is the subject, second is the object, the rest are dropped ("Z").
                    if (nounCount == 0) { phrase.SentenceSection = "S"; }
                    else if (nounCount == 1) { phrase.SentenceSection = "O"; }
                    else { phrase.SentenceSection = "Z"; }
                    nounCount++;
                    break;
                case "VBD":
                    // Only the first past-tense verb is kept.
                    if (verbCount == 0) { phrase.SentenceSection = "V"; }
                    else { phrase.SentenceSection = "Z"; }
                    verbCount++;
                    break;
                case "PP":
                    // Only the first prepositional phrase is treated as the object.
                    if (objectCount == 0) { phrase.SentenceSection = "O"; }
                    else { phrase.SentenceSection = "Z"; }
                    objectCount++;
                    break;
                default:
                    break;
            }
        }
    }
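The “NP inside a PP is ignored” rule is easier to express against the tree than against a flat list. Here is a rough sketch of that idea, not the code used in this post: `TreeNode` and `RoleAssigner` are hypothetical names, the tree is built by hand rather than by SharpNLP, and the three rules in the comment are just the patterns described above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a parse tree node; not the SharpNLP Parse class.
public class TreeNode
{
    public string Text;
    public string Type;
    public List<TreeNode> Children = new List<TreeNode>();

    public TreeNode(string text, string type, params TreeNode[] children)
    {
        Text = text;
        Type = type;
        Children.AddRange(children);
    }
}

public static class RoleAssigner
{
    // Returns (role, text) pairs for a sentence node. Rules assumed here:
    //   NP directly under S         -> subject ("S")
    //   verb tag (VB*) under VP     -> verb ("V")
    //   NP or PP directly under VP  -> object ("O")
    // An NP nested inside a PP is never visited, so it is ignored.
    public static List<KeyValuePair<string, string>> Assign(TreeNode sentence)
    {
        var roles = new List<KeyValuePair<string, string>>();
        foreach (TreeNode child in sentence.Children)
        {
            if (child.Type == "NP")
            {
                roles.Add(new KeyValuePair<string, string>("S", child.Text));
            }
            else if (child.Type == "VP")
            {
                foreach (TreeNode part in child.Children)
                {
                    if (part.Type.StartsWith("VB", StringComparison.Ordinal))
                        roles.Add(new KeyValuePair<string, string>("V", part.Text));
                    else if (part.Type == "NP" || part.Type == "PP")
                        roles.Add(new KeyValuePair<string, string>("O", part.Text));
                }
            }
        }
        return roles;
    }

    public static void Main()
    {
        var sentence = new TreeNode("A quick brown dog jumped over a lazy cat .", "S",
            new TreeNode("A quick brown dog", "NP"),
            new TreeNode("jumped over a lazy cat", "VP",
                new TreeNode("jumped", "VBD"),
                new TreeNode("over a lazy cat", "PP",
                    new TreeNode("over", "IN"),
                    new TreeNode("a lazy cat", "NP"))),
            new TreeNode(".", "."));

        foreach (var role in Assign(sentence))
        {
            Console.WriteLine(role.Key + " : " + role.Value);
        }
    }
}
```

For the test sentence this yields the subject “A quick brown dog”, the verb “jumped”, and the object “over a lazy cat”; the inner “a lazy cat” never gets a role because the walk stops at the PP.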

I then whipped up a function that spits out the phrases in Old English order:

    private static void PrintPhrases(List<Phrase> phrases)
    {
        StringBuilder stringBuilder = new StringBuilder();
        // Number of characters of object text currently at the front of the builder.
        int objectLength = 0;

        foreach (Phrase phrase in phrases)
        {
            switch (phrase.SentenceSection)
            {
                case "O":
                    // Objects go to the very front.
                    stringBuilder.Insert(0, phrase.Text);
                    objectLength += phrase.Text.Length;
                    break;
                case "S":
                    // The subject goes right after any object text, before everything else.
                    stringBuilder.Insert(objectLength, phrase.Text);
                    break;
                case "V":
                    stringBuilder.Append(phrase.Text);
                    break;
                case "Z":
                    // Dropped phrases are skipped.
                    break;
                default:
                    stringBuilder.Append(phrase.Text);
                    break;
            }
        }
        Debug.WriteLine(stringBuilder.ToString());
    }

And the output looks like this:

over a lazy catA quick brown dogjumped

My next steps are:

  • Spacing between words
  • Sentences formatted correctly (all lower case until final output)
  • Find and replace key words for Old English
  • More patterns to match O-S-V from a variety of input sentences
  • Handle the apostrophe (Don’t comes down as [Don] ‘[t] from the parser)
  • Refactor, backed by unit tests, toward better patterns (Strategy seems appropriate here)
  • Stick into a WCF service
  • Probably a bunch more stuff that I haven’t thought of
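The first two items are small enough to sketch now. This is only a rough take, assuming the reordered phrases arrive as plain strings; `SentenceFormatter` is a hypothetical helper, not something from SharpNLP.

```csharp
using System;
using System.Collections.Generic;

public static class SentenceFormatter
{
    // Joins phrases with single spaces, lower-cases everything,
    // then capitalizes the first letter and appends a period.
    public static string Format(IEnumerable<string> phrases)
    {
        string joined = string.Join(" ", phrases).ToLowerInvariant();
        if (joined.Length == 0) return joined;
        return char.ToUpperInvariant(joined[0]) + joined.Substring(1) + ".";
    }

    public static void Main()
    {
        var osv = new List<string> { "over a lazy cat", "A quick brown dog", "jumped" };
        Console.WriteLine(Format(osv));
        // prints "Over a lazy cat a quick brown dog jumped."
    }
}
```

Lower-casing everything is too blunt for proper nouns, so a real version would need to know which words to leave alone.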

That’s it. I’ll pick it up more later.
