Word Embeddings
Meanings are multi-dimensional, son.
In the previous sections, we have seen how the Preprocessing Pipeline enables us to normalize a large corpus and attach meta information to individual words. But what does this group of words mean? How does the machine understand the semantic information associated with the words and sentences?
Words or text on their own have little meaning to machine learning applications. The first step towards understanding the semantic information associated with text is to convert it into a numerical representation, preferably vectors.
One of the easiest ways to convert text to numbers is One-Hot Encoding. One-hot encoding is a vector representation of all the words in the corpus: every word appearing in the corpus is represented as a vector consisting of 1s and 0s. This allows a unique representation of every unique word in the vocabulary.
The size of the vector depends on the size of the vocabulary. If the vocabulary contains n words, then each word is represented by an n-sized vector. To understand this better, let us consider an example sentence.
"the cat sat on the mat"
The unique words in the text constitute our vocabulary. Each of them can be encoded into a vector of size 5 (the number of unique words in the vocabulary).
word | the | cat | sat | on | mat |
---|---|---|---|---|---|
the | 1 | 0 | 0 | 0 | 0 |
cat | 0 | 1 | 0 | 0 | 0 |
sat | 0 | 0 | 1 | 0 | 0 |
on | 0 | 0 | 0 | 1 | 0 |
mat | 0 | 0 | 0 | 0 | 1 |
Once we define the unique words in the vocabulary, we can convert the sentence into a sequence of such one-hot vectors, one per word, as sketched below.
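As a minimal sketch in plain Python (the helper names here are chosen for illustration), the vocabulary and the one-hot encoding of the example sentence could look like this:

```python
# One-hot encoding of the example sentence "the cat sat on the mat".
sentence = "the cat sat on the mat".split()

# Vocabulary: the unique words, kept in order of first appearance.
vocab = sorted(set(sentence), key=sentence.index)  # ['the', 'cat', 'sat', 'on', 'mat']

def one_hot(word, vocab):
    """Return an n-sized vector with a 1 at the word's index and 0s elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

# The sentence becomes a sequence of one-hot vectors, one per word.
for word in sentence:
    print(word, one_hot(word, vocab))
```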
While this does convert the text into a numerical representation, it has a few issues.
As One-Hot encoding uses a sparse vector whose size matches the size of the vocabulary, computation costs would be high when processing a large corpus.
It fails to capture contextual and semantic information of the words.
Other alternatives are Bag of Words and TF-IDF, but these two are quite limited as well. A better alternative is to use Word Embeddings.
Word Embeddings are one of the most popular approaches to representing text as numbers when processing a large corpus. They are capable of preserving the syntactic and semantic information of a word along with its context. A word embedding is a learned vector representation of text in which words with the same or similar meaning have similar representations and are placed close together in a vector space. The key goals of word embeddings can be outlined as
Reduce dimensionality
Capture context information
Ensure similar words have similar representation.
Word embeddings are useful in
Computing similar words
Feature Generation
Document Clustering
Text Classification
Natural Language Processing
Before we delve into the common word embedding techniques, let us take a moment to understand the importance of context. In natural language, the meaning of a word, more often than not, can be inferred from the surrounding words. This makes it possible to predict the target word from the surrounding words, or vice versa.
Word2Vec is one of the ways to create efficient word embeddings from a large corpus using a neural network. Word2Vec accepts a large corpus as input and produces feature vectors that efficiently represent the words in the corpus. The model is effective at detecting synonymous words and completing partial sentences due to its ability to capture word associations.
The level of semantic similarity between words can be measured using the cosine similarity between their vectors.
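For illustration, here is a minimal sketch of cosine similarity using NumPy; the toy vectors below are hand-crafted (they anticipate the feature matrix discussed later), not learned embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1 mean very similar."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-crafted toy vectors, for illustration only.
king  = [1, 1.0, -1, 0, 1.0]
queen = [1, 0.9,  1, 0, 1.0]
horse = [0, 0.0,  1, 1, 0.0]

# king should come out noticeably closer to queen than to horse.
print(cosine_similarity(king, queen))
print(cosine_similarity(king, horse))
```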
Word2Vec relies on a two-layer neural network to parse and process the large corpus and produce the vector representations. It is a self-supervised neural network, which trains itself on the large corpus. If, during the training phase, a prediction turns out to be incorrect, the error is backpropagated to adjust the associated weights.
Word2Vec is not a single algorithm; instead, it is a group of related models used to produce word embeddings, first published by Tomas Mikolov and his team at Google in 2013. The core idea behind the Word2Vec approach is to extract the features of a word.
A feature is a measurable property that can describe a particular characteristic of an object efficiently. The feature vector is an n-dimensional numerical vector that describes the features of a word. To understand it better, let us consider an analogy.
Given two houses, how do you compare them? One way would be to compare the features of the houses to one another - the size of the house, the number of bedrooms, the size of the kitchen, and so on. This is exactly what a feature vector intends to capture. Each word in the vocabulary is represented by a vector, which denotes a score for the word against a collection of features.
Let us consider the following collection of words
[king, horse, queen, man, woman]
The feature matrix could look like
feature | king | horse | queen | man | woman |
---|---|---|---|---|---|
royalty | 1 | 0 | 1 | 0 | 0 |
authority | 1 | 0 | 0.9 | 0.2 | 0.2 |
gender | -1 | 1 | 1 | -1 | 1 |
has_4_legs | 0 | 1 | 0 | 0 | 0 |
is_rich | 1 | 0 | 1 | 0.3 | 0.2 |
With this set of features, we can easily see that king and queen are closely related words, and so are man and woman. We could go one step further and use words to derive other words. For example
king | - | man | + | woman | ~ | queen |
---|---|---|---|---|---|---|
1 | | 0 | | 0 | | 1 |
1 | | 0.2 | | 0.2 | | 0.9 |
-1 | | -1 | | 1 | | 1 |
0 | | 0 | | 0 | | 0 |
1 | | 0.3 | | 0.2 | | 1 |
As you can observe, doing this arithmetic on the vector representations of king, man and woman gives a vector that is almost identical to that of queen. This provides tremendous capability for understanding the relationships between words and inferring new ones.
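A small sketch of this arithmetic, using the hand-crafted feature vectors copied from the table above (NumPy assumed):

```python
import numpy as np

# Feature order: [royalty, authority, gender, has_4_legs, is_rich]
king  = np.array([1, 1.0, -1, 0, 1.0])
man   = np.array([0, 0.2, -1, 0, 0.3])
woman = np.array([0, 0.2,  1, 0, 0.2])
queen = np.array([1, 0.9,  1, 0, 1.0])

result = king - man + woman            # -> [1.0, 1.0, 1.0, 0.0, 0.9]

# Cosine similarity between the result and queen comes out close to 1.
cosine = result @ queen / (np.linalg.norm(result) * np.linalg.norm(queen))
print(result, cosine)
```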
In real life, the list of features is not hand-coded and is unknown to the developer; the neural network learns these features on its own.
The two common techniques used in word2vec are
CBOW
Skip-gram
In both approaches, training relies on creating a fake problem for the neural network to solve. The fake problem in both cases is to use the given word(s) to predict the missing word(s) in the training window.
The continuous bag of words (CBOW) approach relies on iterating over the sentences (sliding a window of size n) and using the context words (in their one-hot representation) to predict the target word. In other words, the fake problem in this case is to use the context words to predict the target.
Consider the following example.
"jia brought a box of sweets for her sister"
In the example, with a selected window size of 3, the first two words in each window are used as input to predict the third (target) word. Running this fake problem over the sentence would result in a training set similar to the following.
context | target |
---|---|
jia, brought | a |
brought, a | box |
a, box | of |
.., .. | .. |
With a larger corpus, the training set covers a larger span of the vocabulary and hence the model would be able to predict a lot more words. A minimal sketch of generating such (context, target) pairs is shown below.
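The sketch follows the pairing convention used in the example above (the previous two words predict the next word); note that standard CBOW implementations usually take context words from both sides of the target. Plain Python, names chosen for illustration:

```python
def cbow_pairs(sentence, window_size=3):
    """Slide a window over the sentence; the first window_size - 1 words
    form the context and the last word in the window is the target."""
    words = sentence.split()
    pairs = []
    for i in range(len(words) - window_size + 1):
        window = words[i:i + window_size]
        pairs.append((tuple(window[:-1]), window[-1]))
    return pairs

for context, target in cbow_pairs("jia brought a box of sweets for her sister"):
    print(context, "->", target)
# (('jia', 'brought'), 'a'), (('brought', 'a'), 'box'), (('a', 'box'), 'of'), ...
```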
The skip-gram approach uses the distributed representation of the input word to predict the context words. The fake problem in this case is to use the target word to predict its context words.
For example, consider the following sentence.
"naina thanked her sister for box of sweets"
In this example, similar to CBOW, we use a window of words; again, let us assume the window size is 3. In the case of skip-gram, we use the target word to predict the context words. This produces a training set like the following.
target | context |
---|---|
naina | thanked, her |
thanked | her, sister |
.. | .., .. |
.. | .., .. |
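For completeness, here is a hedged, minimal sketch of training both variants with the gensim library (gensim 4.x API assumed; the toy corpus is just the two example sentences, so the resulting vectors are not meaningful):

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of already preprocessed tokens.
corpus = [
    "jia brought a box of sweets for her sister".split(),
    "naina thanked her sister for box of sweets".split(),
]

# sg=0 selects CBOW, sg=1 selects skip-gram; window is the context size on each side.
cbow_model = Word2Vec(sentences=corpus, vector_size=100, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences=corpus, vector_size=100, window=2, min_count=1, sg=1)

print(skipgram_model.wv["sister"].shape)        # learned embedding, e.g. (100,)
print(skipgram_model.wv.most_similar("sister")) # nearest neighbours in the toy corpus
```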
Dhee.Ai uses skip-gram Word2Vec, fastText and BERT vectors for word embeddings in its different modules. Dhee uses 108-dimensional vectors for optimal performance and accuracy in its semantic analysis routines.