The Word2Vec Model

Shreya Datta


Editor's note: This post is only one part of a far more thorough and in-depth original, found here.
Let’s now look at some of the popular word embedding models and engineer features from our corpora!
 

The Word2Vec Model

  This model was created by Google in 2013 and is a predictive, deep-learning-based model that computes and generates high-quality, distributed, continuous dense vector representations of words, which capture contextual and semantic similarity. Essentially, these are unsupervised models that can take in massive textual corpora, create a vocabulary of possible words, and generate dense word embeddings for each word in the vector space representing that vocabulary. You can usually specify the size of the word embedding vectors, and the total number of vectors is essentially the size of the vocabulary. This makes the dimensionality of this dense vector space much lower than that of the high-dimensional sparse vector space built using traditional Bag of Words models.
There are two different model architectures which can be leveraged by Word2Vec to create these word embedding representations. These include,
  • The Continuous Bag of Words (CBOW) Model
  • The Skip-gram Model
These were originally introduced by Mikolov et al., and I recommend interested readers read the original papers around these models, ‘Distributed Representations of Words and Phrases and their Compositionality’ by Mikolov et al. and ‘Efficient Estimation of Word Representations in Vector Space’ by Mikolov et al., to gain some good in-depth perspective.
 

The Continuous Bag of Words (CBOW) Model

  The CBOW model architecture tries to predict the current target word (the center word) based on the source context words (surrounding words). Considering a simple sentence, “the quick brown fox jumps over the lazy dog”, this can be pairs of (context_window, target_word) where if we consider a context window of size 2, we have examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy) and so on. Thus the model tries to predict the target_word based on the context_window words.
   The CBOW model architecture (Source: https://arxiv.org/pdf/1301.3781.pdf, Mikolov et al.)
The Word2Vec family of models is unsupervised in the sense that you can just give it a corpus, without additional labels or information, and it can construct dense word embeddings from that corpus. But you will still need to leverage a supervised classification methodology to get to these embeddings. We will do that from within the corpus itself, without any auxiliary information. We can now model this CBOW architecture as a deep learning classification model where we take in the context words as our input, X, and try to predict the target word, Y. In fact, building this architecture is simpler than the skip-gram model, where we try to predict a whole bunch of context words from a source target word.
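To make this classification framing concrete, here is a minimal NumPy sketch of the CBOW forward pass under that setup: average the context word embeddings, then score every vocabulary word with a softmax. The weights here are random and untrained, purely for illustration; in training, the loss against the true target word would update both matrices.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, embed_size = 10, 4

# Embedding matrix: one dense vector per vocabulary word (untrained, random)
E = rng.normal(size=(vocab_size, embed_size))
# Output layer weights mapping the hidden representation to vocabulary scores
W = rng.normal(size=(embed_size, vocab_size))

def cbow_forward(context_ids):
    """Average the context word embeddings, then produce a softmax
    distribution over the whole vocabulary for the target word."""
    h = E[context_ids].mean(axis=0)      # hidden layer: mean of context embeddings
    scores = h @ W                       # one score per vocabulary word
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

probs = cbow_forward([1, 3, 5, 7])       # predicted distribution over 10 words
```

The key design point is the averaging step: the order of the context words is discarded, which is exactly what "bag of words" means in CBOW.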
 

Implementing the Continuous Bag of Words (CBOW) Model

  While it’s excellent to use robust frameworks which have the Word2Vec model, like gensim, let’s try and implement this from scratch to gain some perspective on how things really work behind the scenes. We will leverage our Bible corpus, contained in the norm_bible variable, for training our model. The implementation will focus on five parts
  • Build the corpus vocabulary
  • Build a CBOW (context, target) generator
  • Build the CBOW model architecture
  • Train the Model
  • Get Word Embeddings
Without further delay, let’s get started!
Build the corpus vocabulary
To start off, we will first build our corpus vocabulary, where we extract each unique word from our corpus and map a unique numeric identifier to it.
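A minimal pure-Python sketch of this step follows (the original likely used a library tokenizer, so the exact ids will differ). A toy corpus stands in here for norm_bible; id 0 is reserved for the PAD token mentioned below.

```python
# Toy stand-in for the norm_bible corpus (hypothetical sample sentences)
norm_bible = [
    'the quick brown fox jumps over the lazy dog',
    'the lazy dog sleeps all day',
]

# Collect every unique word across the corpus
words = [w for sentence in norm_bible for w in sentence.split()]
unique_words = sorted(set(words))

# Reserve id 0 for PAD, then assign each word a unique numeric identifier
word2id = {'PAD': 0}
word2id.update({w: i + 1 for i, w in enumerate(unique_words)})
# Reverse mapping: identifier back to word
id2word = {i: w for w, i in word2id.items()}

vocab_size = len(word2id)
print('Vocabulary Size:', vocab_size)
print('Vocabulary Sample:', list(word2id.items())[:5])
```

Run on the full Bible corpus instead of this toy list, the same logic produces the vocabulary statistics shown below.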
Output
------

Vocabulary Size: 12425
Vocabulary Sample: [('perceived', 1460), ('flagon', 7287), ('gardener', 11641), ('named', 973), ('remain', 732), ('sticketh', 10622), ('abstinence', 11848), ('rufus', 8190), ('adversary', 2018), ('jehoiachin', 3189)]
Thus you can see that we have created a vocabulary of unique words in our corpus and also ways to map a word to its unique identifier and vice versa. The PAD term is typically used to pad context words to a fixed length if needed.
Build a CBOW (context, target) generator
We need pairs which consist of a target centre word and surrounding context words. In our implementation, a target word is of length 1 and the surrounding context is of length 2 × window_size, where we take window_size words before and after the target word.
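A simple sketch of such a generator is shown below (a hypothetical helper, with id lookup and PAD handling omitted for clarity). With window_size=1 it reproduces the ([quick, fox], brown) style pairs from the sentence example earlier.

```python
# Hypothetical sketch of a CBOW (context, target) pair generator.
# Assumes each sentence is already tokenized into a list of words;
# window_size words are taken on each side of the target.
def generate_context_target_pairs(words, window_size=2):
    pairs = []
    for i, target in enumerate(words):
        # Words before the target, then words after it (truncated at edges)
        context = words[max(0, i - window_size):i] + words[i + 1:i + 1 + window_size]
        pairs.append((context, target))
    return pairs

sentence = 'the quick brown fox jumps over the lazy dog'.split()
pairs = generate_context_target_pairs(sentence, window_size=1)
# e.g. pairs[2] is (['quick', 'fox'], 'brown')
```

In a full implementation, each context list would be converted to word ids and padded with PAD to a fixed length of 2 × window_size before being fed to the model.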
