Word Embedding Using Python Gensim Package

Shahid Ahmed


Table of Contents
  • What is Word Embedding?
  • Types of Word Embedding Methods/Algorithms
  • Word2Vec by Google
  • GloVe by Stanford
  • Word2Vec Embedding
  • How to visualize Word Embedding?
  • Complete Implementation Example
In this article, we will learn how to do word embedding. We will use the Gensim package to understand and develop word embeddings.

What is Word Embedding?

      Word embedding is a dense vector representation of words and documents in which words with the same meaning have a similar representation.
You can understand it this way: if two words have very similar neighbours (that is, the contexts in which they are used are about the same), then these words are probably quite similar in meaning, or at least related.
For example, the words happy, cheerful, and joyful are usually used in similar contexts.
The characteristics of the word embedding technique are listed below:
  • It represents individual words as real-valued vectors in a pre-defined vector space.
  • It is widely used in the field of Deep Learning, since neural networks operate on vector values.
  • It is a dense, distributed representation for every word.
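To make the "similar neighbours, similar vectors" idea concrete, here is a toy sketch with made-up 4-dimensional vectors (real embeddings are learned from a corpus and typically have 100+ dimensions); words used in similar contexts end up with vectors pointing in similar directions, which we can measure with cosine similarity:

```python
import numpy as np

# Hypothetical dense vectors, for illustration only -- real embeddings
# are learned from text, not hand-written like this.
happy    = np.array([0.9, 0.1, 0.4, 0.7])
cheerful = np.array([0.8, 0.2, 0.5, 0.6])
table    = np.array([-0.3, 0.9, -0.6, 0.1])

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 means
    # the vectors point in nearly the same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(happy, cheerful))  # high: similar contexts
print(cosine_similarity(happy, table))     # low: unrelated contexts
```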
   

Types of Word Embedding Methods/Algorithms

      Word embedding methods learn a real-valued vector representation for each word from a corpus of text. This learning process can either be coupled with a neural network model on a task such as document classification, or be an unsupervised process based on document statistics.
Many different methods are available to perform word embedding, but in this article we will look at only two methods of learning word embeddings from text.

Word2Vec by Google

       Word2Vec is a statistical method for learning a word embedding from a corpus of text, developed by Tomas Mikolov et al. at Google in 2013. The method was designed mainly to make neural-network-based learning of word embeddings more efficient, and today it has become one of the standard methods for word embedding.
Word embedding with Word2Vec involves analysis of the learned vectors as well as exploration of vector math on the representations of words.
The Word2Vec word embedding methods are:
1.      Skip-gram (SG)
2.      Continuous Bag of Words (CBOW)

GloVe by Stanford

     GloVe (Global Vectors for Word Representation) was developed by Pennington et al. at Stanford. This method is an extension of Word2Vec that combines:
  • The local context-based learning of Word2Vec.
  • The global statistics of matrix factorization techniques such as LSA (Latent Semantic Analysis).
To understand how it works: instead of using a window to define local context, GloVe builds an explicit word co-occurrence matrix from statistics over the whole corpus of text.
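This is not GloVe itself, just a toy sketch of the co-occurrence counting it starts from, using a symmetric window of 1 over a made-up two-sentence corpus:

```python
from collections import Counter

corpus = [['i', 'like', 'nlp'], ['i', 'like', 'python']]
window = 1  # words at distance <= 1 count as co-occurring

cooc = Counter()
for sentence in corpus:
    for i, word in enumerate(sentence):
        # Look at neighbours within the window on each side.
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[(word, sentence[j])] += 1

print(cooc[('i', 'like')])    # 2: 'i' sits next to 'like' in both sentences
print(cooc[('like', 'nlp')])  # 1: only in the first sentence
```

GloVe then fits word vectors so that their dot products approximate the logarithms of such co-occurrence counts over the whole corpus.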

Word2Vec Embedding

Let’s see an example of how we can develop a Word2Vec embedding.
In this example, we will develop Word2Vec using the Python Gensim package. Gensim provides the Word2Vec class, which can be imported from gensim.models.
A Word2Vec implementation normally requires a lot of text, such as an entire Amazon review corpus. Here, however, we will apply it to a small text so we can understand it more easily.

Example

First, we need to import the Word2Vec class from gensim.models as shown below –
from gensim.models import Word2Vec
Next, we need to define the training data. We are using a few sentences to implement the word2vec method.
sentences = [
   ['Word', 'embeddings', 'work', 'by', 'using', 'an', 'algorithm'],
   ['this', 'is', 'the', 'InsideAIML', 'website'],
   ['you', 'can', 'read', 'technical','articles', 'for','free'],
   ['We', 'will', 'use', 'the', 'Gensim', 'library'],
   ['learn', 'python', 'programming', 'on', 'insideaiml']
]
Now, we need to train the model, which can be done as given below −
W2V_model = Word2Vec(sentences, min_count=1)
We can summarize the model by printing it as follows −
print(W2V_model)
Let’s summarize the vocabulary as follows –
words = list(W2V_model.wv.vocab)
print(words)
Next, let’s get the vector for one word from our vocabulary – ‘insideaiml’.
print(W2V_model.wv['insideaiml'])
Let’s now save the model –
W2V_model.save('W2V_model.bin')
Now, if we need to load the model –
Word2Vec_model = Word2Vec.load('W2V_model.bin')
Let’s now print the loaded model as follows –
print(Word2Vec_model)

Complete Implementation of the above Example

from gensim.models import Word2Vec
sentences = [
   ['Word', 'embeddings', 'work', 'by', 'using', 'an', 'algorithm'],
   ['this', 'is', 'the', 'InsideAIML', 'website'],
   ['you', 'can', 'read', 'technical','articles', 'for','free'],
   ['We', 'will', 'use', 'the', 'Gensim', 'library'],
   ['learn', 'python', 'programming', 'on', 'insideaiml']
]
W2V_model = Word2Vec(sentences, min_count=1)
print(W2V_model)
words = list(W2V_model.wv.vocab)
print(words)
print(W2V_model.wv['insideaiml'])
W2V_model.save('W2V_model.bin')
Word2Vec_model = Word2Vec.load('W2V_model.bin')
print(Word2Vec_model)

Output

Word2Vec(vocab=29, size=100, alpha=0.025)
['Word', 'embeddings', 'work', 'by', 'using', 'an', 'algorithm', 'this', 'is', 'the', 'InsideAIML', 'website', 'you', 'can', 'read', 'technical', 'articles', 'for', 'free', 'We', 'will', 'use', 'Gensim', 'library', 'learn', 'python', 'programming', 'on', 'insideaiml']
[ 2.3622620e-03  4.3221487e-04  3.9335857e-03 -1.2020235e-03
 -9.9151593e-04  2.8309512e-03  4.7964812e-03 -3.7568363e-03
 -4.9456498e-03  3.6903718e-03 -4.8737871e-03 -3.9381068e-03
  2.5999357e-03  1.3458870e-03  2.2719600e-03 -1.9624005e-03
  3.5575717e-03  4.6965261e-03 -6.2980008e-04 -3.6406862e-03
 -3.5829267e-03  1.6928543e-03  1.4477138e-03  1.1637001e-03
 -4.7915865e-04  1.3976435e-03 -2.3567895e-03  2.9160331e-03
 -4.4381022e-03  2.0105252e-03 -2.9324128e-03  1.6421793e-03
 -5.0091086e-04  2.3349845e-03 -4.1118253e-04 -9.2874817e-04
  4.4296873e-03 -3.4641903e-03  4.3619485e-03  4.7739753e-03
  4.8495419e-03  3.6664470e-03  3.8093987e-03  3.6490641e-03
 -3.3609912e-04  2.9555541e-03 -1.0483260e-03 -4.3996158e-03
  2.6523159e-03 -3.4169867e-03  1.3806688e-03 -6.9535966e-04
 -5.6781049e-04  3.5429434e-03 -1.8909144e-03  3.0394471e-03
  4.1374662e-03 -4.5139138e-03  4.5683607e-03  2.7697829e-03
  2.3550272e-03  1.3603187e-03 -4.5494111e-03  7.6852361e-04
 -4.8047729e-04 -2.4365645e-03  4.2462661e-03  2.0318357e-03
 -1.9684029e-03  1.5639960e-03 -4.5757894e-03  2.1069648e-03
 -3.5330481e-03 -1.3349410e-03  1.9695498e-03  3.1291901e-03
  4.7138124e-03 -2.2136174e-04 -2.9766995e-03 -4.5496337e-03
 -3.2605783e-03  1.5357189e-03 -1.9210422e-03 -1.8419328e-03
  3.9830280e-05  2.9295796e-04 -4.0149586e-03 -4.4272095e-03
  5.2146171e-04  3.7140078e-03 -3.3862747e-03 -6.4570026e-04
 -4.8357933e-03  3.9663548e-03  3.4471180e-03  3.9999108e-04
  2.2896260e-03  4.4800160e-03  3.8771254e-03 -1.2966482e-03]
Word2Vec(vocab=29, size=100, alpha=0.025)

How to visualize Word Embedding?

Let’s see how we can visualize word embeddings. This can be done with a classical projection method (like PCA), which reduces the high-dimensional word vectors to a low-dimensional 2-D plot.
Plotting Word Vectors Using PCA
We need to retrieve all the vectors from a trained model as shown below−
vec = W2V_model[W2V_model.wv.vocab]
Next, we need to create a 2-D PCA model of the word vectors by using the PCA class from sklearn.decomposition, as shown below −
pca = PCA(n_components=2)
result = pca.fit_transform(vec)
Now, we can plot the resulting projection using matplotlib as shown below –
pyplot.scatter(result[:, 0], result[:, 1])
We can also annotate the points on the graph with the words themselves, as given below –
words = list(W2V_model.wv.vocab)
for i, word in enumerate(words):
   pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))

Complete Implementation Example

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
sentences = [
   ['Word', 'embeddings', 'work', 'by', 'using', 'an', 'algorithm'],
   ['this', 'is', 'the', 'InsideAIML', 'website'],
   ['you', 'can', 'read', 'technical','articles', 'for','free'],
   ['We', 'will', 'use', 'the', 'Gensim', 'library'],
   ['learn', 'python', 'programming', 'on', 'insideaiml']
]
W2V_model = Word2Vec(sentences, min_count=1)
vec = W2V_model[W2V_model.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(vec)
pyplot.scatter(result[:, 0], result[:, 1])
words = list(W2V_model.wv.vocab)
for i, word in enumerate(words):
   pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))

pyplot.show()
Graph output: a 2-D scatter plot of the PCA-projected word vectors, with each point annotated with its word.
I hope you enjoyed reading this article and finally came to know about word embedding using the Python Gensim package.
To know more about the Python programming language, follow the InsideAIML YouTube channel.
For more such blogs/courses on data science, machine learning, artificial intelligence and emerging new technologies, do visit us at InsideAIML.
Thanks for reading…
Happy Learning…
  