Natural Language Processing in C# with OpenNLP

Dec 2023

OpenNLP is an open-source .NET library for natural language processing in C# applications.

It is a C# port of the popular Java library of the same name, and it provides tools and models for common NLP tasks:

Sentence splitter
Tokenizer
Part-of-speech tagger
Chunker
Coreference
Named entity recognition
Parse trees

It is designed to be used by developers and data scientists in a wide range of applications, from sentiment analysis to chatbots and recommendation systems.

The repository includes trained models for the English language. If you need another language, you will have to find OpenNLP models for that language (and, in most cases, convert them).

It is also possible to train your own models. The repository explains how, but be prepared: training requires a lot of text and a lot of time.

How to use OpenNLP

We can easily add the library to a .NET project through the corresponding NuGet package.

Install-Package OpenNLP

Here are some examples of how to use OpenNLP, taken from the library documentation.

Sentence splitter

A sentence splitter divides a paragraph into sentences.

var paragraph = "Mr. & Mrs. Smith is a 2005 American romantic comedy action film. The film stars Brad Pitt and Angelina Jolie as a bored upper-middle class married couple. They are surprised to learn that they are both assassins hired by competing agencies to kill each other.";

var modelPath = "path/to/EnglishSD.nbin";
var sentenceDetector = new EnglishMaximumEntropySentenceDetector(modelPath);
var sentences = sentenceDetector.SentenceDetect(paragraph);
/* 
 * sentences = ["Mr. & Mrs. Smith is a 2005 American romantic comedy action film.", 
 * "The film stars Brad Pitt and Angelina Jolie as a bored upper-middle class married couple.", 
 * "They are surprised to learn that they are both assassins hired by competing agencies to kill each other."]
 */

Technically, the sentence detector calculates the probability that a given character ('.', '?', or '!' in the case of English) marks the end of a sentence.
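To make the idea of "candidate" characters concrete, here is a minimal sketch that only lists the positions a detector would have to score. FindCandidates is a hypothetical helper written for this article, not part of OpenNLP, and it does no scoring at all.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not an OpenNLP API): collect the index of every
// character that could end an English sentence.
static List<int> FindCandidates(string text)
{
    var candidates = new List<int>();
    for (var i = 0; i < text.Length; i++)
    {
        if (text[i] == '.' || text[i] == '?' || text[i] == '!')
        {
            candidates.Add(i);
        }
    }
    return candidates;
}

var text = "Mr. Smith left. Did he return?";
foreach (var index in FindCandidates(text))
{
    // A real detector would compute a probability for each of these positions.
    Console.WriteLine($"{index}: '{text[index]}'");
}
```

The sample text yields three candidates (positions 2, 14 and 29), but only the last two end a sentence. The period after "Mr" is exactly the kind of case the trained model is there to reject.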

Tokenizer

A tokenizer divides a text into words, symbols, or other meaningful elements.

// Regular tokenizer
var modelPath = "path/to/EnglishTok.nbin";
var sentence = "- Sorry Mrs. Hudson, I'll skip the tea.";
var tokenizer = new EnglishMaximumEntropyTokenizer(modelPath);
var tokens = tokenizer.Tokenize(sentence);
// tokens = ["-", "Sorry", "Mrs.", "Hudson", ",", "I", "'ll", "skip", "the", "tea", "."]

For English, the library also provides a rule-based tokenizer (built on regular expressions) that has better precision. This tokenizer does not need a model.

// English tokenizer
var tokenizer = new EnglishRuleBasedTokenizer();
var sentence = "- Sorry Mrs. Hudson, I'll skip the tea.";
var tokens = tokenizer.Tokenize(sentence);
// tokens = ["-", "Sorry", "Mrs.", "Hudson", ",", "I", "'ll", "skip", "the", "tea", "."]
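To give a feel for what "rule-based" means here, the following is a deliberately tiny regex tokenizer. It is a sketch under our own assumptions, not the rules used by EnglishRuleBasedTokenizer: SimpleTokenize and its pattern are hypothetical, and it handles far fewer cases (it would split "Mrs." into two tokens, for example).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not an OpenNLP API). The pattern tries, in order:
//   '\w+      a clitic such as 'll or 's
//   \w+       a run of word characters
//   [^\w\s]   any single punctuation character
static string[] SimpleTokenize(string text) =>
    Regex.Matches(text, @"'\w+|\w+|[^\w\s]")
         .Cast<Match>()
         .Select(match => match.Value)
         .ToArray();

var tokens = SimpleTokenize("I'll skip the tea.");
Console.WriteLine(string.Join(" | ", tokens));
// prints: I | 'll | skip | the | tea | .
```

Even this small pattern shows the trade-off: rules are fast and need no model file, but every abbreviation or contraction they should handle has to be written down explicitly.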

Part-of-speech tagger

A part-of-speech tagger assigns a part of speech (noun, verb, etc.) to each token in a sentence.

var modelPath = "path/to/EnglishPOS.nbin";
var tagDictDir = "path/to/tagdict/directory";
var posTagger = new EnglishMaximumEntropyPosTagger(modelPath, tagDictDir);
var tokens = new[] { "-", "Sorry", "Mrs.", "Hudson", ",", "I", "'ll", "skip", "the", "tea", "." };
var pos = posTagger.Tag(tokens);
// pos = [":", "NNP", "NNP", "NNP", ".", "PRP", "MD", "VB", "DT", "NN", "."]
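The tagger returns one tag per token, in the same order, so the two arrays are usually read side by side. A small sketch of pairing them, using plain LINQ and the arrays from the example above (no OpenNLP call is involved):

```csharp
using System;
using System.Linq;

var tokens = new[] { "-", "Sorry", "Mrs.", "Hudson", ",", "I", "'ll", "skip", "the", "tea", "." };
var pos = new[] { ":", "NNP", "NNP", "NNP", ".", "PRP", "MD", "VB", "DT", "NN", "." };

// Zip silently stops at the shorter sequence, so check the lengths first.
if (tokens.Length != pos.Length)
{
    throw new InvalidOperationException("Expected one tag per token.");
}

// Pair each token with its tag in the common token/TAG notation.
var tagged = tokens.Zip(pos, (token, tag) => $"{token}/{tag}").ToArray();
Console.WriteLine(string.Join(" ", tagged));
```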

Chunker

A chunker is an alternative to a full parser that provides the partial syntactic structure of a sentence (e.g., noun/verb groups).

var modelPath = "path/to/EnglishChunk.nbin";
var chunker = new EnglishTreebankChunker(modelPath);
var tokens = new[] { "-", "Sorry", "Mrs.", "Hudson", ",", "I", "'ll", "skip", "the", "tea", "." };
var pos = new[] { ":", "NNP", "NNP", "NNP", ".", "PRP", "MD", "VB", "DT", "NN", "." };
var chunks = chunker.GetChunks(tokens, pos);
// chunks = [["NP", "- Sorry Mrs. Hudson"], [",", ","], ["NP", "I"], ["VP", "'ll skip"], ["NP", "the tea"], [".", "."]]

Coreference

Coreference resolution detects all expressions that refer to the same entities in a text.

var modelPath = "path/to/coref/dir";
var coreferenceFinder = new TreebankLinker(modelPath);
var sentences = new[] { "Mr. & Mrs. Smith is a 2005 American romantic comedy action film.", 
"The film stars Brad Pitt and Angelina Jolie as a bored upper-middle class married couple.", 
"They are surprised to learn that they are both assassins hired by competing agencies to kill each other." };
var coref = coreferenceFinder.GetCoreferenceParse(sentences);
// coref =

Named entity recognition

Named entity recognition identifies specific entities in sentences. With the current models, you can detect people, organizations, dates, places, money, percentages, and time.

var modelPath = "path/to/namefind/dir";
var nameFinder = new EnglishNameFinder(modelPath);
var sentence = "Mr. & Mrs. Smith is a 2005 American romantic comedy action film.";
// specify which types of entities you want to detect
var models = new[] { "date", "location", "money", "organization", "percentage", "person", "time" };
var ner = nameFinder.GetNames(models, sentence);
// ner = Mr. & Mrs. <person>Smith</person> is a <date>2005</date> American romantic comedy action film.
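If the result is the inline-tagged string shown in the comment above (an assumption based on that comment), the entities can be pulled out with a regular expression. ExtractEntities is a hypothetical helper, not part of OpenNLP, and it does not handle nested tags.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not an OpenNLP API): read <type>text</type> spans.
// The backreference \1 requires the closing tag to match the opening one.
static List<(string Type, string Text)> ExtractEntities(string tagged)
{
    var entities = new List<(string Type, string Text)>();
    foreach (Match match in Regex.Matches(tagged, @"<(\w+)>(.*?)</\1>"))
    {
        entities.Add((match.Groups[1].Value, match.Groups[2].Value));
    }
    return entities;
}

var ner = "Mr. & Mrs. <person>Smith</person> is a <date>2005</date> American romantic comedy action film.";
foreach (var (type, text) in ExtractEntities(ner))
{
    Console.WriteLine($"{type}: {text}");
}
// prints:
// person: Smith
// date: 2005
```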

Parse tree

A parser provides the complete syntactic structure of a sentence.

var modelPath = "path/to/models/dir";
var sentence = "- Sorry Mrs. Hudson, I'll skip the tea.";
var parser = new EnglishTreebankParser(modelPath);
var parse = parser.DoParse(sentence);
// parse = (TOP (S (NP (: -) (NNP Sorry) (NNP Mrs.) (NNP Hudson)) (, ,) (NP (PRP I)) (VP (MD 'll) (VP (VB skip) (NP (DT the) (NN tea)))) (. .)))
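The bracketed notation in the comment is easy to post-process. As a sketch, assuming you have the tree as a string in exactly that format, the helper below recovers the (tag, word) pairs at the leaves. ExtractLeaves is hypothetical and not part of OpenNLP, and it would miss words that themselves contain parentheses.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not an OpenNLP API): a leaf looks like "(TAG word)",
// that is, two parenthesis-free items separated by one space.
static List<(string Tag, string Word)> ExtractLeaves(string tree)
{
    var leaves = new List<(string Tag, string Word)>();
    foreach (Match match in Regex.Matches(tree, @"\(([^\s()]+) ([^\s()]+)\)"))
    {
        leaves.Add((match.Groups[1].Value, match.Groups[2].Value));
    }
    return leaves;
}

var tree = "(TOP (S (NP (: -) (NNP Sorry) (NNP Mrs.) (NNP Hudson)) (, ,) (NP (PRP I)) "
         + "(VP (MD 'll) (VP (VB skip) (NP (DT the) (NN tea)))) (. .)))";
foreach (var (tag, word) in ExtractLeaves(tree))
{
    Console.WriteLine($"{word}\t{tag}");
}
```

For the example tree this yields 11 leaves, one per token, in sentence order.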

OpenNLP is open source, and all the code and documentation are available in the project repository on GitHub: AlexPoint/OpenNlp