Natural Language Parser Pipeline

Towards structure and meaning, one step at a time.

Last updated 3 years ago

The world today is filled with a massive amount of data. However, a large part of this data is unstructured, mostly in natural languages. For machines to read and analyze this data, it must first be processed and converted into forms they can easily understand.

So how do NLP engines work? How are documents containing large amounts of unstructured data converted into useful forms? To answer these questions, let us first get accustomed to some basics of the NLP pipeline.

Corpus And Tokens

In linguistics, a Corpus (or Text Corpus) is a large collection of structured text documents, usually stored electronically. A corpus acts as a source of data for linguists, and is broken down and processed for further evaluation. A corpus is divided into documents, each document into sentences, and each sentence into tokens. Tokens are the words or phrases that help in understanding the context or interpreting the meaning of a sentence.
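This hierarchy can be sketched in a few lines of Python. The two-document corpus below is purely illustrative, and the naive `split(".")` stands in for the real segmentation and tokenization steps described later.

```python
# Illustrative corpus hierarchy: corpus -> documents -> sentences -> tokens.
corpus = [
    "Kerala is in India. It is a tourist destination.",
    "NLP converts unstructured text into structured data.",
]

def break_down(corpus):
    """Return each document as a list of sentences, each a list of tokens."""
    docs = []
    for document in corpus:
        # Naive sentence split on '.', dropping empty trailing pieces.
        sentences = [s.strip() for s in document.split(".") if s.strip()]
        # Naive whitespace tokenization of each sentence.
        docs.append([sentence.split() for sentence in sentences])
    return docs

for document in break_down(corpus):
    print(document)
```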

NLP Preprocessing Pipeline

The NLP Preprocessing Pipeline comprises a series of micro-steps, during which data is broken down into its constituent forms, allowing it to be converted into useful and meaningful information for machines to process. The key steps in the preprocessing pipeline are as follows.

Segmentation

Segmentation is the process of splitting a corpus into a group of semantically meaningful contiguous portions (sentences or sub-sentences). The split is usually made at punctuation boundaries so as not to lose the essence of each sentence. For example, consider the following paragraph.

"Situated in the southern tip of India is the beautiful state of Kerala. Formed on Nov 1, 1956, Kerala is one of the most prominent tourist destinations in India. Blessed with beautiful beaches, backwaters, and hill stations, Kerala was named as one of the 10 paradises in the 1999 October edition of National Geographic Traveler Magazine."

Sentence segmentation would divide the above data as follows:

  • Situated in the southern tip of India is the beautiful state of Kerala.

  • Formed on Nov 1, 1956, Kerala is one of the most prominent tourist destinations in India.

  • Blessed with beautiful beaches, backwaters, and hill stations, Kerala was named as one of the 10 paradises in the 1999 October edition of National Geographic Traveler Magazine.
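A minimal sentence segmenter can be sketched with a regular expression that splits after terminal punctuation. This is only an illustration: real segmenters must also handle abbreviations (such as "Nov."), decimal numbers, and quotations.

```python
import re

def segment(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

paragraph = (
    "Situated in the southern tip of India is the beautiful state of Kerala. "
    "Formed on Nov 1, 1956, Kerala is one of the most prominent tourist "
    "destinations in India."
)
for sentence in segment(paragraph):
    print(sentence)
```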

Tokenization

Tokenization can be thought of as a special form of segmentation that identifies boundaries between semantic units, separating a sentence into smaller units called tokens. By breaking sentences into tokens, we can apply smaller and less complex rules to derive meaning from them. This helps in understanding the syntactic and semantic information in each sentence.

The most common form of tokenization is whitespace tokenization (also known as unigram tokenization), which splits a sentence on whitespace, reducing it to a group of words. Tokenization can also be used to remove punctuation and special characters from the sentences.

For example,

"This place is so beautiful".

The above sentence, when whitespace tokenization is applied, would be reduced to

"This" "place" "is" "so" "beautiful"
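The whitespace tokenizer described above can be sketched as follows. The punctuation stripping here is a simple illustration; production tokenizers also handle contractions, hyphens, and multilingual scripts.

```python
def tokenize(sentence):
    """Whitespace (unigram) tokenization with surrounding punctuation stripped."""
    tokens = []
    for word in sentence.split():
        token = word.strip('.,!?;:"\'')  # drop punctuation around the word
        if token:
            tokens.append(token)
    return tokens

print(tokenize("This place is so beautiful."))
# ['This', 'place', 'is', 'so', 'beautiful']
```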

Stop Words Removal

Stop Words Removal is an optional step in the pipeline, which aims to remove common words in natural languages that add little value when parsing the tokens for analysis. For example, consider the following sentence.

There is a book in my bag.

Some of the words in the above sentence ("is", "a", "in") add very little value, while other words ("there", "book", "my", "bag") contribute much more toward understanding the meaning of the sentence. These stop words can be removed. Some of the most common stop words include "is", "in", "on", "the", "a", and many more.

While stop words removal is not mandatory, it does help in reducing the size of the dataset, and the shorter (and more meaningful) token list reduces the time required for training.
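Stop word filtering reduces to a set lookup. The stop-word set below is a tiny illustration; real lists (such as those shipped with common NLP toolkits) contain hundreds of entries.

```python
# A tiny illustrative stop-word set; production lists are much longer.
STOP_WORDS = {"is", "a", "an", "in", "on", "the"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["There", "is", "a", "book", "in", "my", "bag"]
print(remove_stop_words(tokens))
# ['There', 'book', 'my', 'bag']
```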

Stemming

Normalization is the process of converting tokens to their base form by removing redundant information from the words. Linguists consider words to be composed of a morpheme (the base form of the word) and inflectional forms, which are usually prefixes or suffixes attached to the word.

Stemming is a rule-based normalization technique that removes inflectional forms from words. For example,

"jumping", "jumps", and "jumped" are all inflected forms of the same stem word "jump". But as one would expect, a rule-based normalization technique is not a solution for all scenarios, and stemming falls short with certain words.
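A toy suffix-stripping stemmer, loosely in the spirit of Porter's algorithm, makes the rule-based nature (and its limits) concrete: it handles regular inflections but fails on irregular words such as "ran".

```python
def stem(word):
    """Strip a known suffix if the remaining stem is at least 3 letters long."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["jumping", "jumps", "jumped", "ran"]])
# ['jump', 'jump', 'jump', 'ran']  -- "ran" is left untouched
```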

Lemmatization

Lemmatization, on the other hand, is a systematic normalization process, in which the identification of root words is based on vocabulary, word structure, and other grammar rules. This provides much better identification of root words compared to stemming.

For example,

Running, ran, runs -> run

Eating, eats, eaten -> eat
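Where stemming applies blind suffix rules, lemmatization consults a vocabulary. The lookup table below is a minimal sketch; real lemmatizers combine a full dictionary with morphological and grammatical rules.

```python
# A small illustrative lemma table; real lemmatizers use a full vocabulary.
LEMMA_TABLE = {
    "running": "run", "ran": "run", "runs": "run",
    "eating": "eat", "eats": "eat", "eaten": "eat",
}

def lemmatize(word):
    """Look up the lemma; fall back to the word itself if unknown."""
    return LEMMA_TABLE.get(word.lower(), word)

print([lemmatize(w) for w in ["Running", "ran", "runs", "eaten"]])
# ['run', 'run', 'run', 'eat']
```

Note that the vocabulary lookup handles the irregular form "ran", which the suffix-stripping stemmer above could not.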

POS Tagging

Tagging refers to the process of attaching descriptors to each token. POS Tagging, or Part-of-Speech Tagging, refers to the process of assigning one of the parts of speech to each word in a corpus. This helps in capturing the syntactic relations between words. The common part-of-speech tags include "noun", "verb", "adjective", "adverb", "pronoun", "conjunction", etc.

Let us consider an example set of tokens from the following sentence.

"She" "sells" "seashells" "on" "the" "seashore".

Here, the POS tagging would give us the following result,

  • "She" : Pronoun

  • "sells" : Verb

  • "seashells" : Noun

  • "on" : Preposition

  • "the" : Determiner

  • "seashore" : Noun

Of course, words such as "on" and "the" could be removed as part of Stop Words Removal, but it made more sense to keep them in this example for completeness.
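A toy dictionary-based tagger reproduces the example above. This is only a sketch: real taggers use context (hidden Markov models or neural sequence models) to disambiguate words like "sells" that can take different tags in different sentences.

```python
# A toy lookup tagger; real taggers disambiguate using sentence context.
TAG_TABLE = {
    "she": "Pronoun", "sells": "Verb", "seashells": "Noun",
    "on": "Preposition", "the": "Determiner", "seashore": "Noun",
}

def pos_tag(tokens):
    """Pair each token with its tag, or 'Unknown' if not in the table."""
    return [(t, TAG_TABLE.get(t.lower(), "Unknown")) for t in tokens]

for token, tag in pos_tag(["She", "sells", "seashells", "on", "the", "seashore"]):
    print(f"{token}: {tag}")
```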

Named Entity Recognition

Named Entity Recognition, or NER, is the process of identifying entities in a text and classifying them into well-known categories. This helps in identifying the key elements in the text. Some of the common categories used include:

  • People

  • Location

  • Organization

  • Quantity

  • Time

  • Money

  • Work of Art

  • Percent

  • Event

  • Product

Let us consider an example sentence:

"Sachin Tendulkar was born in Mumbai in 1973."

The Named Entities recognized in the sentence are

  • "Sachin Tendulkar" : Person

  • "Mumbai" : Location

  • "1973" : Time
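A simple gazetteer- and pattern-based recognizer can reproduce this example. The name and place lists here are hypothetical one-entry gazetteers for illustration; production NER systems use statistical models trained on annotated corpora rather than hand-built lists.

```python
import re

# Hypothetical one-entry gazetteers, just for this example.
PEOPLE = {"Sachin Tendulkar"}
LOCATIONS = {"Mumbai"}

def recognize_entities(text):
    """Return (entity, category) pairs found via lookup and a year pattern."""
    entities = []
    for name in PEOPLE:
        if name in text:
            entities.append((name, "Person"))
    for place in LOCATIONS:
        if place in text:
            entities.append((place, "Location"))
    # Match four-digit years from 1800-2099 as a crude Time pattern.
    for year in re.findall(r"\b(1[89]\d{2}|20\d{2})\b", text):
        entities.append((year, "Time"))
    return entities

print(recognize_entities("Sachin Tendulkar was born in Mumbai in 1973."))
# [('Sachin Tendulkar', 'Person'), ('Mumbai', 'Location'), ('1973', 'Time')]
```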