Natural Language Parser Pipeline

Towards structure and meaning, one step at a time.

For anyone interested in data, the world today is filled with a massive amount of it. However, the greater part of this data is unstructured, mostly in natural language. For machines to read and decipher this data for analysis, it needs to be processed and converted into forms they can easily understand.

So how do NLP engines work? How are documents containing large amounts of unstructured data converted into useful forms? To answer these questions, let us first get accustomed to some basics of the NLP pipeline.

Corpus And Tokens

In terms of linguistics, a Corpus or Text Corpus is a large, structured collection of text documents, usually stored electronically. A corpus acts as a source of data for linguists and is broken down and processed for further evaluation. A corpus is broken down into documents, which are in turn broken down into sentences. Sentences are, in turn, broken down into tokens. Tokens are words or phrases that help in understanding the context or interpreting the meaning of a sentence.

NLP Preprocessing Pipeline

The NLP Preprocessing Pipeline comprises a series of small steps, during which data is broken down into its constituent forms, allowing it to be converted into useful and meaningful information for machines to process. The key steps in the preprocessing pipeline are as follows.

Segmentation

Segmentation is the process of splitting a corpus into a group of semantically meaningful contiguous portions (sentences or sub-sentences). Splitting is usually done at punctuation marks so as not to lose the essence of each sentence. For example, consider the following paragraph.

"Situated in the southern tip of India is the beautiful state of Kerala. Formed on Nov 1, 1956, Kerala is one of the most prominent tourist destinations in India. Blessed with beautiful beaches, backwaters, and hill stations, Kerala was named as one of the 10 paradises in the 1999 October edition of National Geographic Traveler Magazine."

Sentence segmentation would divide the above data as follows

  • Situated in the southern tip of India is the beautiful state of Kerala.

  • Formed on Nov 1, 1956, Kerala is one of the most prominent tourist destinations in India.

  • Blessed with beautiful beaches, backwaters, and hill stations, Kerala was named as one of the 10 paradises in the 1999 October edition of National Geographic Traveler Magazine.
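As an illustration, here is a minimal Python sketch of sentence segmentation. Using NLTK and its Punkt model is an assumption made for this example; any sentence splitter would do, and recent NLTK releases may ask for "punkt_tab" instead of "punkt".

    # Minimal sketch: sentence segmentation with NLTK's Punkt model.
    # Assumes nltk is installed; newer NLTK versions may need "punkt_tab" instead.
    import nltk
    nltk.download("punkt", quiet=True)

    from nltk.tokenize import sent_tokenize

    paragraph = (
        "Situated in the southern tip of India is the beautiful state of Kerala. "
        "Formed on Nov 1, 1956, Kerala is one of the most prominent tourist "
        "destinations in India."
    )

    # Print each detected sentence on its own line
    for sentence in sent_tokenize(paragraph):
        print(sentence)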

Tokenization

Tokenization can be thought of as a special form of segmentation that focuses on identifying boundaries between semantic units, or in other words, separating text into smaller units called tokens. By breaking sentences into tokens, we can apply smaller and less complex rules to derive meaning from them. This helps in understanding the syntactic and semantic information in each sentence.

The most common form of tokenization is whitespace tokenization (also known as unigram tokenization), which splits a sentence on whitespace, reducing it to a group of words. Tokenization can also be used to remove punctuation and special characters from sentences.

For example,

"This place is so beautiful".

The above sentence, when applied with whitespace tokenization would be reduced to

"This" "place" "is" "so" beautiful"

Stop Words Removal

Stop Words Removal is an optional step in the pipeline, which aims to remove common words in a natural language that add little value while parsing the tokens for analysis. For example, consider the following sentence

There is a book in my bag.

Some of the words in the above sentence ("is", "a", "in") add very little value, while other words ("there", "book", "my", "bag") contribute much more to understanding the meaning of the sentence. These stop words can be removed. Some of the most common stop words include "is", "in", "on", "the", "a", and many more.

While Stop Words Removal is not mandatory, it does help in reducing the size of the dataset, and hence the time required for training, by leaving a smaller (and more meaningful) token list.
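As a sketch, stop word removal can be done by filtering tokens against a predefined list; NLTK's English stop word list is used here as an assumption, and note that it is broader than the handful of words above, so tokens such as "there" and "my" are dropped as well.

    # Minimal sketch: stop word removal with NLTK's English stop word list.
    # Assumes nltk is installed and the "stopwords" corpus has been downloaded.
    import nltk
    nltk.download("stopwords", quiet=True)

    from nltk.corpus import stopwords

    stop_words = set(stopwords.words("english"))
    tokens = ["there", "is", "a", "book", "in", "my", "bag"]

    # Keep only the tokens that are not in the stop word list
    filtered = [t for t in tokens if t not in stop_words]
    print(filtered)   # ['book', 'bag'] -- NLTK's list also covers "there" and "my"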

Stemming

Normalization is the process of converting tokens to their base form by removing redundant information from words. Linguists consider a word to be composed of a morpheme (the base form of the word) and inflectional forms, which are usually prefixes or suffixes attached to the word.

Stemming is a rule-based normalization technique that removes inflectional forms from words. For example,

"jumping", "jumps", and "jumped" are all inflected forms of the same stem, "jump". But as one would expect, a rule-based normalization technique is not a solution for all scenarios, and hence stemming falls short with certain words.

Lemmatization

Lemmatization, on the other hand, is a systematic normalization process, in which the identification of root words is based on vocabulary, word structure, and other grammar rules. This provides much better identification of root words compared to stemming.

For example,

Running, ran, runs -> run

Eating, eats, eaten -> eat
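The sketch below uses NLTK's WordNet lemmatizer as one possible implementation (an assumption; spaCy and other libraries provide lemmatizers too). Passing pos="v" tells it to treat each token as a verb.

    # Minimal sketch: lemmatization with NLTK's WordNet lemmatizer.
    # Assumes nltk is installed and the "wordnet" corpus has been downloaded.
    import nltk
    nltk.download("wordnet", quiet=True)

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    for word in ["running", "ran", "runs", "eaten"]:
        # pos="v" asks for the verb lemma of each token
        print(word, "->", lemmatizer.lemmatize(word, pos="v"))
    # running -> run, ran -> run, runs -> run, eaten -> eat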

POS Tagging

Tagging refers to the process of attaching descriptors to each token. POS Tagging, or Part of Speech Tagging, refers to the process of assigning one of the parts of speech to each word in a corpus. This helps in capturing syntactic relations between words. The common parts of speech tags include "noun", "verb", "adjective", "adverb", "pronoun", "conjunction", etc.

Let us consider an example set of tokens from the following sentence.

"She" "sells" "seashells" "on" "the" "seashore".

Here, POS tagging would give us the following result

"She" : Pronoun "Sells" : Verb "Seashells" : Noun "on" : Preposition "the" : determiner "seashore" : Noun

Of course, words such as "on" and "the" could be removed as part of Stop Words Removal, but it made more sense to keep them in this example for the sake of completeness.
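For illustration, the sketch below uses NLTK's default tagger (an assumption). It emits Penn Treebank tags (PRP, VBZ, NN, and so on) rather than the plain labels above, but the mapping is straightforward.

    # Minimal sketch: POS tagging with NLTK's averaged perceptron tagger.
    # Assumes nltk is installed and the tagger model has been downloaded.
    import nltk
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = ["She", "sells", "seashells", "on", "the", "seashore"]
    print(nltk.pos_tag(tokens))
    # e.g. [('She', 'PRP'), ('sells', 'VBZ'), ('seashells', 'NNS'),
    #       ('on', 'IN'), ('the', 'DT'), ('seashore', 'NN')]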

Named Entity Recognition

Named Entity Recognition or NER is the process of identifying entities in a text and classifying them into well-known categories. This helps in identifying key elements in the text. Some of the common categories used include

  • People

  • Location

  • Organization

  • Quantity

  • Time

  • Money

  • Work of Art

  • Percent

  • Event

  • Product

Let us consider an example sentence.

"Sachin Tendulkar was born in Mumbai in 1973."

The named entities recognized in the sentence are

  • "Sachin Tendulkar" - Person

  • "Mumbai" - Location

  • "1973" - Time
