
Word Embeddings

Meanings are multi-dimensional, son.



In the previous sections, we have seen how the Preprocessing Pipeline enables us to normalize a large corpus and attach meta information to individual words. But what does this collection of words mean? How does the machine understand the semantic information associated with the words and sentences?

Raw words or text carry little meaning for machine learning applications. The first step towards understanding the semantic information associated with text is to convert it into a numerical representation, preferably vectors.

One-Hot Encoding

One of the easiest ways to convert text to numbers is One-Hot Encoding. One-Hot encoding is a vector representation of the words in the corpus: every word appearing in the corpus is represented as a vector consisting of a single 1 and 0s elsewhere. This gives every unique word in the vocabulary its own distinct representation.

The size of the vector depends on the size of the vocabulary: if the vocabulary contains n words, then each word is represented by an n-sized vector. To understand it better, let us consider an example sentence.

"the cat sat on the mat"

The unique words in the text constitute our vocabulary. These can be encoded into vectors of size 5 (the number of unique words in the vocabulary).

| word | the | cat | sat | on | mat |
| --- | --- | --- | --- | --- | --- |
| the | 1 | 0 | 0 | 0 | 0 |
| cat | 0 | 1 | 0 | 0 | 0 |
| sat | 0 | 0 | 1 | 0 | 0 |
| on | 0 | 0 | 0 | 1 | 0 |
| mat | 0 | 0 | 0 | 0 | 1 |
Once we define the unique words in the vocabulary, we can convert the sentence into the following:

[10000] 
[01000] 
[00100] 
[00010] 
[10000] 
[00001]
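As a rough sketch, the encoding above can be produced with a few lines of Python (the helper below is ours, purely for illustration):

```python
# Minimal sketch of one-hot encoding for the example sentence.
sentence = "the cat sat on the mat".split()

# The vocabulary is the set of unique words, in order of first appearance.
vocab = list(dict.fromkeys(sentence))

def one_hot(word):
    """Return an n-sized vector with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

for word in sentence:
    print(word, one_hot(word))
```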

While this does convert the text into a numerical representation, it comes with a few issues.

  • As One-Hot encoding uses a sparse vector whose size matches the size of the vocabulary, computation costs are high when processing a large corpus.

  • It fails to capture contextual and semantic information of the words.

Other alternatives include Bag of Words and Tf-Idf, but these too are quite limited. A better option is to use Word Embeddings.

Word Embeddings

Word Embeddings are one of the most popular approaches to representing text as numbers when processing a large corpus. They are capable of preserving the syntactic and semantic information of a word along with its context. A word embedding is a learned vector representation of text in which words with the same or similar meaning have similar representations and are placed close together in a vector space. The key goals of word embeddings can be outlined as:

  • Reduce dimensionality

  • Capture context information

  • Ensure similar words have similar representation.

Word embeddings are useful for:

  • Computing similar words

  • Feature Generation

  • Document Clustering

  • Text Classification

  • Natural Language processing

Before we delve into common word embedding techniques, let us take a moment to understand the importance of context. In natural language, the meaning of a word can, more often than not, be inferred from the surrounding words. This makes it possible to predict the target word from its surrounding words, or vice versa.

Word2Vec

Word2Vec is one of the ways to create efficient word embeddings from a large corpus using a neural network. Word2Vec takes a large corpus as input and outputs feature vectors that efficiently represent the words in the corpus. The model is effective at detecting synonymous words and completing partial sentences due to its ability to capture word associations.

The level of semantic similarity between words can be measured using the cosine similarity of their vectors.
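For instance, a minimal cosine-similarity helper might look like the sketch below; the three-dimensional vectors are made up purely for illustration and are not real Word2Vec output.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: values near 1 mean similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, for illustration only.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
horse = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))  # close to 1: semantically similar
print(cosine_similarity(king, horse))  # noticeably lower
```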

Word2Vec relies on a two-layer neural network to parse and process the large corpus and produce vector representations. It is a self-supervised neural network that trains itself on a large corpus. If, during the training phase, a prediction turns out to be incorrect, the error is backpropagated to adjust the associated weights.

Word2Vec is not a single algorithm; rather, it is a group of related models used to produce word embeddings, first published by Tomas Mikolov's team at Google in 2013. The core idea behind the Word2Vec approach is to extract the features of a word.

Feature Vector

A feature is a measurable property that can describe a particular characteristic of an object efficiently. The feature vector is an n-dimensional numerical vector that describes the features of a word. To understand it better, let us consider an analogy.

Given two houses, how do you compare them? One way would be to compare the features of the houses to one another: the size of the house, the number of bedrooms, the size of the kitchen, and so on. This is exactly what a feature vector is intended to capture. Each word in the vocabulary is represented by a vector that assigns the word a score for each feature in a collection of features.

Let us consider the following collection of words

[king, horse, queen, man, woman]

The feature matrix could look like this:

| feature | king | horse | queen | man | woman |
| --- | --- | --- | --- | --- | --- |
| royalty | 1 | 0 | 1 | 0 | 0 |
| authority | 1 | 0 | 0.9 | 0.2 | 0.2 |
| gender | -1 | 1 | 1 | -1 | 1 |
| has_4_legs | 0 | 1 | 0 | 0 | 0 |
| is_rich | 1 | 0 | 1 | 0.3 | 0.2 |

With this set of features, we can easily see that king and queen are closely related words, and so are man and woman. We can go one step further and use some words to derive others. For example:

king - man + woman ~ queen

| feature | king | man | woman | king - man + woman | queen |
| --- | --- | --- | --- | --- | --- |
| royalty | 1 | 0 | 0 | 1 | 1 |
| authority | 1 | 0.2 | 0.2 | 1 | 0.9 |
| gender | -1 | -1 | 1 | 1 | 1 |
| has_4_legs | 0 | 0 | 0 | 0 | 0 |
| is_rich | 1 | 0.3 | 0.2 | 0.9 | 1 |
As you can observe, performing these arithmetic operations on the vector representations of king, man and woman gives a vector that is almost identical to that of queen. This provides tremendous capability for understanding and inferring relationships between words.
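A quick sketch of that arithmetic, using the hand-coded feature vectors from the table above (dimensions ordered royalty, authority, gender, has_4_legs, is_rich):

```python
# Hand-coded feature vectors taken from the feature matrix above.
king  = [1, 1.0, -1, 0, 1.0]
man   = [0, 0.2, -1, 0, 0.3]
woman = [0, 0.2,  1, 0, 0.2]
queen = [1, 0.9,  1, 0, 1.0]

# king - man + woman, computed element by element.
result = [round(k - m + w, 2) for k, m, w in zip(king, man, woman)]
print(result)  # [1, 1.0, 1, 0, 0.9] -- very close to queen's vector
```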

In real life, the list of features is not hand-coded and is unknown to the developer. The neural network takes responsibility for it.

The two common techniques used in word2vec are

  • CBOW

  • Skip-gram

In both approaches, training relies on creating a fake problem for the neural network to solve. The fake problem in both cases is to use the given word(s) to predict the missing word in the training window.

CBOW (Continuous Bag of Words)

The Continuous Bag of Words model iterates over the sentences (sliding a window of size n) and uses the context words (in their one-hot representation) to predict the target word. In other words, the fake problem in this case is to use the context words to predict the target.

Consider the following example.

"jia brought a box of sweets for her sister"

In this example, with a window size of 3, the surrounding words in each window are used as input to predict the target word. Running this fake problem over the sentence and training the neural network would produce a training set similar to the following.

| context | target |
| --- | --- |
| a, box | brought |
| brought, a | jia |
| .., .. | .. |
| .., .. | .. |

With a larger corpus, the training set could cover a larger span of vocabulary and hence would be able to predict a lot more words.
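A minimal sketch of how such (context, target) pairs could be generated is shown below; it follows the common convention of predicting the centre word of each window, so the exact pairs may differ slightly from the table above.

```python
def cbow_pairs(sentence, window=3):
    """Slide a window over the sentence and yield (context, target) pairs,
    where the centre word of each window is the target."""
    words = sentence.split()
    half = window // 2
    for i in range(half, len(words) - half):
        context = words[i - half:i] + words[i + 1:i + half + 1]
        yield context, words[i]

for context, target in cbow_pairs("jia brought a box of sweets for her sister"):
    print(context, "->", target)
# e.g. ['jia', 'a'] -> brought, ['brought', 'box'] -> a, ...
```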

Skip-gram

The skip-gram approach uses the distributed representation of an input word to predict its context. The fake problem in this case is to use the target word to predict the context words.

For example, consider the following sentence.

"naina thanked her sister for box of sweets"

In this example, similar to CBOW, we use a window of words; again, let us assume a window size of 3. In the case of skip-gram, we use the target word to predict the context words. This produces a training set like the following.

| target | context |
| --- | --- |
| naina | thanked, her |
| thanked | her, sister |
| .. | .., .. |
| .. | .., .. |

Dhee.AI uses skip-gram Word2Vec, fastText and BERT vectors for word embeddings in its different modules. Dhee uses 108-dimensional vectors for optimal performance and accuracy in its semantic analysis routines.
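As a point of reference only (this is not Dhee.AI's internal implementation), skip-gram embeddings can be trained with the open-source gensim library roughly like this; the toy corpus reuses the example sentences from this page, and vector_size=108 simply mirrors the dimension mentioned above.

```python
from gensim.models import Word2Vec

# Toy corpus; in practice this would be a large collection of tokenized sentences.
corpus = [
    "jia brought a box of sweets for her sister".split(),
    "naina thanked her sister for box of sweets".split(),
]

# sg=1 selects the skip-gram architecture; vector_size is the embedding dimension.
model = Word2Vec(sentences=corpus, vector_size=108, window=3, min_count=1, sg=1)

print(model.wv["sister"])               # the learned 108-dimensional vector
print(model.wv.most_similar("sweets"))  # nearest words by cosine similarity
```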