GloVe: Global Vectors for Word Representation

Kajal Pawar

a year ago

GloVe stands for Global Vectors for Word Representation. It is an unsupervised learning method that builds on ideas from the Word2vec model to learn word vectors efficiently. It was developed by Pennington et al. at Stanford.
Pennington et al. argue that the online scanning approach used by the word2vec algorithm is suboptimal because it doesn’t fully exploit statistical information about word co-occurrences. They therefore developed the GloVe model, which combines the benefits of the word2vec skip-gram model on word analogy tasks with the benefits of matrix factorization methods that can exploit global statistical information.
In the classical vector space model, word representations were built with matrix factorization techniques such as Latent Semantic Analysis (LSA). These techniques make good use of global text statistics, but they are not as good as learned methods like word2vec at capturing meaning and demonstrating it on tasks such as calculating analogies.
For example, word2vec learns the male/female relationship automatically: with the induced vector representations, “King – Man + Woman” results in a vector very close to “Queen.”
GloVe is therefore an approach that combines the global statistics of matrix factorization techniques like LSA with the local context-based learning used in the word2vec model.
Instead of scanning local context windows one at a time during training, the GloVe model first constructs an explicit word co-occurrence (word-context) matrix using statistics across the whole text corpus, and then learns from that matrix. This may result in generally better word embeddings.
The GloVe model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a word analogy task reported in the original paper. It also outperforms related models on similarity tasks and named entity recognition.
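The “King – Man + Woman” arithmetic mentioned above can be sketched in a few lines. This is only an illustration: the three-dimensional vectors below are invented by hand (real GloVe vectors have 50 to 300 dimensions), and the helper names cosine and analogy are hypothetical.

```python
import numpy as np

# Hand-made 3-dimensional vectors, invented for illustration only.
# Hypothetical axes: royalty, maleness, femaleness.
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

With real pretrained vectors, the same nearest-neighbour search would run over the full vocabulary instead of five hand-picked words.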

How GloVe finds meaning in statistics

As you may know, almost all unsupervised methods for learning word representations use the statistics of word occurrences in a corpus as their primary source of information.
You may still wonder how meaning can be generated from these statistics, and how the resulting word vectors might represent that meaning.
Let me take the example from the original paper, where Pennington et al. (2014) use the words ice and steam to illustrate it.
They explain that the relationship between these words can be revealed by studying the ratio of their co-occurrence probabilities with various probe words, k.
Let P (k | w) be the probability that the word k appears in the context of word w. The word ice co-occurs more frequently with solid than it does with gas, whereas the word steam co-occurs more frequently with gas than it does with solid. Both words co-occur frequently with water (their shared property) and infrequently with the unrelated word fashion.
In other words, the probability of solid given ice, P (solid | ice), will be relatively high, and the probability of solid given steam, P (solid | steam), will be relatively low. Therefore, the ratio P (solid | ice) / P (solid | steam) will be large. If we take a word such as gas that is related to steam but not to ice, the ratio P (gas | ice) / P (gas | steam) will instead be small. For a word related to both ice and steam, such as water, or to neither, such as fashion, we expect the ratio to be close to one.
Let me show the above example with the help of a diagram for better understanding.
So, from the above figure, we can observe that word vector learning should work with ratios of co-occurrence probabilities rather than with the probabilities themselves.
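Let me also take the table of calculated probability values from the GloVe paper (Pennington et al., 2014, Table 1), which shows the same example in terms of probability values.
Probability and ratio          k = solid      k = gas        k = water      k = fashion
P (k | ice)                    1.9 × 10^-4    6.6 × 10^-5    3.0 × 10^-3    1.7 × 10^-5
P (k | steam)                  2.2 × 10^-5    7.8 × 10^-4    2.2 × 10^-3    1.8 × 10^-5
P (k | ice) / P (k | steam)    8.9            8.5 × 10^-2    1.36           0.96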
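The ratio test described in this section is easy to reproduce on toy numbers. The sketch below uses co-occurrence counts and totals that are made up for illustration (they are not the values from the paper); the names counts, totals and prob are assumptions.

```python
# Toy co-occurrence counts, invented for illustration (not from the paper).
# counts[w][k] = how often probe word k appears in the context of word w.
counts = {
    "ice":   {"solid": 80, "gas": 4,  "water": 300, "fashion": 2},
    "steam": {"solid": 5,  "gas": 90, "water": 280, "fashion": 2},
}
# Hypothetical total number of context words seen for each word.
totals = {"ice": 100_000, "steam": 100_000}

def prob(k, w):
    """Estimate P(k | w) as count(w, k) / total context count of w."""
    return counts[w][k] / totals[w]

ratios = {k: prob(k, "ice") / prob(k, "steam") for k in counts["ice"]}
for k, r in ratios.items():
    print(f"P({k}|ice) / P({k}|steam) = {r:.2f}")
```

The probe words that discriminate between ice and steam (solid, gas) give ratios far from one, while the shared or unrelated probes (water, fashion) stay near one.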

Intuition behind GloVe

Word2vec predicts surrounding words by maximizing the probability of a context word occurring given a center word, in effect performing a dynamic logistic regression. GloVe takes a different route and learns from co-occurrence counts collected over the whole corpus.
With GloVe, before the actual model is trained, a co-occurrence matrix X is constructed, where the entry Xij is a “strength” that represents how often word j appears in the context of word i.
Once X is ready, we need to choose vector values in continuous space for each word in the corpus. In simple terms, we build word vectors that reflect how every pair of words i and j co-occurs.
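To make the construction of X concrete, here is a minimal sketch that counts co-occurrences within a symmetric window. It is a simplification under stated assumptions: the function name build_cooccurrence, the window size and the two-sentence corpus are all made up for illustration, and every co-occurrence counts as 1, whereas the original paper down-weights pairs that are farther apart.

```python
from collections import defaultdict

def build_cooccurrence(sentences, window=2):
    """Count how often word j appears within `window` tokens of word i."""
    vocab = {}                  # word -> integer id, in order of first appearance
    X = defaultdict(float)      # sparse matrix: (i, j) -> count
    for sentence in sentences:
        tokens = sentence.lower().split()
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
        for pos, tok in enumerate(tokens):
            i = vocab[tok]
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    X[(i, vocab[tokens[ctx_pos]])] += 1.0
    return vocab, X

corpus = ["ice is solid water", "steam is water gas"]
vocab, X = build_cooccurrence(corpus, window=2)
print(X[(vocab["is"], vocab["water"])])
```

A dictionary keyed by (i, j) keeps only the non-zero entries, which matters because most word pairs in a real corpus never co-occur.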
We’ll produce word vectors under a soft constraint that should hold for each pair of word i and word j:
$$ w_i^T w_j + b_i + b_j = \log X_{ij} $$
Where,
wi, wj = vectors for word i and word j
bi = scalar bias term wrt word i
bj = scalar bias term wrt word j.
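Now, we will minimize an objective function J, which sums the squared errors of the soft constraint w_i^T w_j + b_i + b_j = log X_ij over all word pairs, weighted by a function f. The equation can be given by:
$$ J = \sum_{i=1}^{V} \sum_{j=1}^{V} f(X_{ij}) \left( w_i^T w_j + b_i + b_j - \log X_{ij} \right)^2 $$
Where, V is the size of the vocabulary.
However, co-occurrences that happen rarely or never are noisy and carry less information than the more frequent ones. To deal with this, a weighted least squares regression model is used, in which f gives rare co-occurrences less influence and stops very frequent ones from dominating.
One class of weighting functions found to work well can be parameterized as shown below:
$$ f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases} $$
The paper uses x_max = 100 and α = 3/4.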
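The weighting function and the weighted least squares objective can be written in a few lines of NumPy. This is a sketch rather than the reference implementation: the function names weight and glove_loss are made up, the defaults x_max = 100 and alpha = 0.75 follow the values reported in the paper, and the two vector sets are named W and W1 to match the notation used in this article.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max) ** alpha below x_max, and 1 at or above it."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W1, b, b1):
    """Weighted squared error summed over the non-zero entries of X."""
    i, j = np.nonzero(X)        # f(0) = 0, so zero entries contribute nothing
    pred = np.sum(W[i] * W1[j], axis=1) + b[i] + b1[j]
    err = pred - np.log(X[i, j])
    return float(np.sum(weight(X[i, j]) * err ** 2))

# Tiny check with two words that co-occur twice (toy values).
X = np.array([[0.0, 2.0],
              [2.0, 0.0]])
zeros_vec = np.zeros((2, 3))
zeros_bias = np.zeros(2)
loss = glove_loss(X, zeros_vec, zeros_vec, zeros_bias, zeros_bias)
print(float(weight(1.0)), float(weight(100.0)), loss)
```

Because f(0) = 0, the sum only needs to visit the non-zero entries of X, which is what keeps the objective tractable on large, sparse co-occurrence matrices.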
The model generates two sets of word vectors, W and W1. When X is symmetric, W and W1 are equivalent and differ only as a result of their random initializations; the two sets of vectors should perform equivalently.
For certain types of neural networks, training multiple instances of the network and then combining the results can help to reduce overfitting and noise (Ciresan et al., 2012). With this in mind, W and W1 are summed to give the final word vectors. Doing so typically gives a small boost in model performance, with the biggest increase on the semantic analogy task.
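Putting the pieces of this section together, the following is a rough training sketch on a toy matrix. It is not the authors' implementation: it uses plain full-batch gradient descent (the paper trains with AdaGrad on sampled non-zero entries), and the function name train_glove, the hyperparameters and the 4 × 4 count matrix are all assumptions made for illustration. It returns W + W1 as the final vectors, as described above.

```python
import numpy as np

def train_glove(X, dim=5, epochs=200, lr=0.05, x_max=100.0, alpha=0.75, seed=0):
    """Fit W, W1 and the biases to log X with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    V = X.shape[0]
    W = rng.normal(scale=0.1, size=(V, dim))     # first set of word vectors
    W1 = rng.normal(scale=0.1, size=(V, dim))    # second (context) set
    b = np.zeros(V)
    b1 = np.zeros(V)
    i, j = np.nonzero(X)                         # train on observed pairs only
    x = X[i, j]
    f = np.where(x < x_max, (x / x_max) ** alpha, 1.0)
    log_x = np.log(x)
    losses = []
    for _ in range(epochs):
        err = np.sum(W[i] * W1[j], axis=1) + b[i] + b1[j] - log_x
        losses.append(float(np.sum(f * err ** 2)))
        g = 2.0 * f * err                        # dJ/d(prediction) for each pair
        grad_W = np.zeros_like(W)
        grad_W1 = np.zeros_like(W1)
        np.add.at(grad_W, i, g[:, None] * W1[j])  # accumulate per-row gradients
        np.add.at(grad_W1, j, g[:, None] * W[i])
        grad_b = np.bincount(i, weights=g, minlength=V)
        grad_b1 = np.bincount(j, weights=g, minlength=V)
        W -= lr * grad_W
        W1 -= lr * grad_W1
        b -= lr * grad_b
        b1 -= lr * grad_b1
    return W + W1, losses

# Toy symmetric co-occurrence counts, invented for illustration.
X = np.array([[0, 8, 1, 2],
              [8, 0, 2, 1],
              [1, 2, 0, 9],
              [2, 1, 9, 0]], dtype=float)
vectors, losses = train_glove(X)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

On a real corpus the matrix would be stored sparsely and the updates done stochastically; the dense full-batch version here is only meant to show how the objective, the weighting and the final summation fit together.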

Benefits of GloVe model

The GloVe model captures global statistics while also capturing the meaningful linear substructures prevalent in recent log-bilinear prediction-based methods like word2vec.
As a result, the GloVe model, also called a global log-bilinear regression model, is an unsupervised method for learning word representations that performs better than other models on tasks such as word analogy, word similarity, and named entity recognition.
Advantages
• Model training is fast.
• It is scalable and can be used on huge corpora.
• It also performs well with a small corpus and small vectors.
• Early stopping: we can stop training once improvements become small.
Disadvantages
• Uses a lot of memory: the fastest way to construct a term co-occurrence matrix is to keep it in RAM as a hash map and perform co-occurrence increments in a global manner.
• The GloVe model is sometimes quite sensitive to the initial learning rate.

Implementation of GloVe model in Python using Keras

The researchers behind the GloVe model provide a suite of pre-trained word embeddings on their website, released under a public domain license, which anyone can download and use. See the link below:
• GloVe: Global Vectors for Word Representation
The smallest package of embeddings is about 822 MB and is called “glove.6B.zip”.
This package was trained on a dataset of six billion tokens (words) with a vocabulary of 400 thousand words. There are a few different embedding vector sizes, including 50, 100, 200 and 300 dimensions.
You can download this collection of embeddings and seed the Keras Embedding layer with the pretrained weights for the words in your training dataset.
This example is taken from an example in the Keras project: pretrained_word_embeddings.py.
First, download the file from the link given above. After downloading and unzipping it, you will see a few files, one of which is “glove.6B.100d.txt”, which contains a 100-dimensional version of the embedding. If you look inside the file, you will see a token (word) followed by its weights (100 numbers) on each line.
For example, below is the first line of the embedding ASCII text file showing the embedding for the word “the”.
the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459 0.28787 -0.06731 0.30906 -0.26384 -0.13231 -0.20757 0.33395 -0.33848 -0.31743 -0.48336 0.1464 -0.37304 0.34577 0.052041 0.44946 -0.46971 0.02628 -0.54155 -0.15518 -0.14107 -0.039722 0.28277 0.14393 0.23464 -0.31021 0.086173 0.20397 0.52624 0.17164 -0.082378 -0.71787 -0.41531 0.20335 -0.12763 0.41367 0.55187 0.57908 -0.33477 -0.36559 -0.54857 -0.062892 0.26584 0.30205 0.99775 -0.80481 -3.0243 0.01254 -0.36942 2.2167 0.72201 -0.24978 0.92136 0.034514 0.46745 1.1079 -0.19358 -0.074575 0.23353 -0.052062 -0.22044 0.057162 -0.15806 -0.30798 -0.41625 0.37972 0.15006 -0.53212 -0.2055 -1.2526 0.071624 0.70565 0.49744 -0.42063 0.26148 -1.538 -0.30223 -0.073438 -0.28312 0.37104 -0.25217 0.016215 -0.017099 -0.38984 0.87424 -0.72569 -0.51058 -0.52028 -0.1459 0.8278 0.27062
Keras comes with a Tokenizer class that can be fit on the training data. It converts text to integer sequences consistently through its texts_to_sequences() method, and it provides access to the dictionary mapping of words to integers in a word_index attribute.
# these import paths match the Keras 2.x API used in this article
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# create documents
documents = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
             'Weak',
             'Poor effort!',
             'not good',
             'poor work',
             'Could have done better.']

# define class labels (1 = positive, 0 = negative)
labels = array([1,1,1,1,1,0,0,0,0,0])

# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(documents)
# index 0 is reserved for padding, hence the + 1
vocab_size = len(t.word_index) + 1

# integer encode the documents
encoded_documents = t.texts_to_sequences(documents)
print(encoded_documents)

# pad documents to a max length of 4 words
max_length = 4
padded_documents = pad_sequences(encoded_documents, maxlen=max_length, padding='post')
print(padded_documents)
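To see what these two preprocessing steps do without installing Keras, here is a dependency-free approximation. It is a sketch under assumptions, not the Keras implementation: the helper names fit_word_index, texts_to_sequences and pad_post are made up, the regular expression is a simplification of the Tokenizer's punctuation filter, and the ordering rule (most frequent word first, ties in order of first appearance) is inferred from the output shown later in this article.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters (a simplification of Keras' filter)."""
    return re.findall(r"[a-z']+", text.lower())

def fit_word_index(texts):
    """Map each word to an integer, most frequent first; index 0 is left for padding."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # sorted() is stable, so ties keep their order of first appearance
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word by its integer id."""
    return [[word_index[w] for w in tokenize(t) if w in word_index] for t in texts]

def pad_post(sequences, maxlen):
    """Right-pad with zeros; longer sequences keep their last `maxlen` items."""
    return [seq[-maxlen:] + [0] * (maxlen - len(seq[-maxlen:])) for seq in sequences]

documents = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
             'Weak', 'Poor effort!', 'not good', 'poor work',
             'Could have done better.']

word_index = fit_word_index(documents)
encoded = texts_to_sequences(documents, word_index)
padded = pad_post(encoded, maxlen=4)
print(encoded)
print(padded)
```

The point to take away is that the integer ids carry no meaning of their own; they are only row numbers that will later index into the embedding matrix.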

Next, we need to load the pretrained GloVe embeddings into a dictionary that maps each word to its embedding array. This can be done as shown below.
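from numpy import asarray

# load the whole embedding into memory
embeddings_index = dict()
# the file contains non-ASCII tokens, so read it explicitly as UTF-8
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]                              # first field is the token
        coefs = asarray(values[1:], dtype='float32')  # remaining fields are the weights
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))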
Because working with the full set of embeddings is slow, we will keep only the embeddings for the unique words in our training data.
To do this, we create a matrix with one embedding for each word in the training data.
We enumerate all the unique words in Tokenizer.word_index and look up the embedding weight vector for each one in the loaded GloVe embeddings.
The result is a matrix of weights only for the words we will see during training.
from numpy import zeros

# create a weight matrix for words in training docs
embedding_matrix = zeros((vocab_size, 100))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words missing from GloVe keep an all-zero row
        embedding_matrix[i] = embedding_vector
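Since words that are missing from the GloVe vocabulary silently end up as all-zero rows, it can be worth checking how many of your words are actually covered. A small sketch, using toy stand-ins for word_index and embeddings_index (the word "gr8" and the all-ones vectors are invented for illustration):

```python
import numpy as np

# Toy stand-ins for the tokenizer's word_index and the loaded GloVe dictionary.
word_index = {"work": 1, "done": 2, "good": 3, "gr8": 4}
embeddings_index = {
    "work": np.ones(100, dtype="float32"),
    "done": np.ones(100, dtype="float32"),
    "good": np.ones(100, dtype="float32"),
}

# Words the tokenizer knows but GloVe does not.
missing = [w for w in word_index if w not in embeddings_index]
coverage = 1.0 - len(missing) / len(word_index)
print(f"coverage: {coverage:.0%}, missing: {missing}")
```

If coverage is low, the frozen embedding layer will give the model little to work with for those words, so it may be worth normalizing the text or allowing the layer to train.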
Now we can define our model, fit it, and evaluate it.
The main difference from an Embedding layer trained from scratch is that here the layer is seeded with the GloVe word embedding weights.
In our example we use the 100-dimensional version, so the Embedding layer must be defined with output_dim equal to 100.
Finally, we do not want to update the learned word weights in this model, so we set the trainable attribute of the Embedding layer to False.
embed = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)
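Conceptually, a frozen embedding layer is just a row lookup into the weight matrix. The NumPy sketch below mimics that lookup and the Flatten step that follows it in the model; the random matrix is a stand-in for the GloVe-seeded weights, and the shapes (15 words, 100 dimensions, length 4) mirror this article's example.

```python
import numpy as np

vocab_size, dim, max_length = 15, 100, 4

# Stand-in for the GloVe-seeded weight matrix (random toy values).
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, dim))
embedding_matrix[0] = 0.0        # row 0 is reserved for the padding index

# Two padded documents, as integer ids.
padded = np.array([[6, 2, 0, 0],
                   [12, 13, 2, 14]])

embedded = embedding_matrix[padded]             # one 100-d vector per token
flattened = embedded.reshape(len(padded), -1)   # what Flatten does afterwards
print(embedded.shape, flattened.shape)
```

Setting trainable to False means these rows stay exactly as loaded, so only the weights of the layers after the embedding are learned.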
Now, let’s combine all the bits and pieces of the above code into one complete example.
from numpy import array
from numpy import asarray
from numpy import zeros
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding

# create documents
documents = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!',
		'Weak',
		'Poor effort!',
		'not good',
		'poor work',
		'Could have done better.']

# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])
# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(documents)
# index 0 is reserved for padding, hence the + 1
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_documents = t.texts_to_sequences(documents)
print(encoded_documents)

# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_documents, maxlen=max_length, padding='post')
print(padded_docs)

# load the whole embedding into memory
embeddings_index = dict()
# the file contains non-ASCII tokens, so read it explicitly as UTF-8
with open('../glove_data/glove.6B/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

# create a weight matrix for words in training docs
embedding_matrix = zeros((vocab_size, 100))
for word, i in t.word_index.items():
	embedding_vector = embeddings_index.get(word)
	if embedding_vector is not None:
		embedding_matrix[i] = embedding_vector

# define model
model = Sequential()
embed = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)
model.add(embed)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# summarize the model
print(model.summary())

# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)

# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
Running the above code produces the following output. Note that it may take a little while to run.
[[6, 2], [3, 1], [7, 4], [8, 1], [9], [10], [5, 4], [11, 3], [5, 1], [12, 13, 2, 14]]
 
[[ 6  2  0  0]
 [ 3  1  0  0]
 [ 7  4  0  0]
 [ 8  1  0  0]
 [ 9  0  0  0]
 [10  0  0  0]
 [ 5  4  0  0]
 [11  3  0  0]
 [ 5  1  0  0]
 [12 13  2 14]]
 
Loaded 400000 word vectors.
 
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 4, 100)            1500
_________________________________________________________________
flatten_1 (Flatten)          (None, 400)               0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 401
=================================================================
Total params: 1,901
Trainable params: 401
Non-trainable params: 1,500
_________________________________________________________________
 
 
Accuracy: 100.000000
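The parameter counts in the model summary above can be checked by hand. The short calculation below reproduces them from the vocabulary size, the embedding dimension and the padded length used in this example.

```python
vocab_size = 15        # 14 distinct words + 1 for the padding index 0
embedding_dim = 100
max_length = 4

# Embedding layer: one 100-d row per vocabulary index (frozen GloVe weights).
embedding_params = vocab_size * embedding_dim
# Dense layer: one weight per flattened input (4 * 100) plus one bias.
dense_params = max_length * embedding_dim * 1 + 1

print(embedding_params, dense_params, embedding_params + dense_params)
```

The 1,500 embedding weights are non-trainable because the layer is frozen, which leaves only the 401 Dense parameters to be learned.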
I recommend going through the official website for the model at http://nlp.stanford.edu/projects/glove/ to understand it in more depth.
If you also want to look at the word2vec model, you may read my article “Word embedding - Word2vec by google”.
After reading this article, you should now understand the importance of the GloVe model and its benefits. For more blogs and courses on data science, machine learning, artificial intelligence, and new technologies, do visit us at InsideAIML.
Thanks for reading…
