word2vec part 2: graph building and training

Posted on Tue 03 April 2018 in blog

In the last post we built the data preprocessing required for word2vec training. In this post we will build the network and perform the training on the text8 dataset (source), a Wikipedia dump of ~17 million tokens.

Note that we are implementing the skip-gram version of word2vec since it has superior performance. The skip-gram model learns to predict the surrounding context words given a center word.
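
For instance, here is a minimal toy sketch of how (center, context) training pairs come out of a text window; the real batch generator from part 1 does this over the whole corpus:

# toy example: skip-gram pairs from a window of +/- 2 words around each center
words = ['the', 'quick', 'brown', 'fox', 'jumps']
window = 2

pairs = []
for i, center in enumerate(words):
    for j in range(max(0, i - window), min(len(words), i + window + 1)):
        if j != i:
            pairs.append((center, words[j]))

# e.g. ('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')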

Goals:

  • understand the skip-gram (word2vec) network structure
  • train model
  • save/load model variables
  • find most similar words
In [1]:
import tensorflow as tf
import math
import collections
import numpy as np
import random
import pickle

import pandas as pd
import matplotlib.pylab as plt

from sys import path
path.append('./w2v_tools')
from w2v_tools import batch_gen_w2v, build_dataset
plt.style.use('/Users/pf494t/.matplotlib/stylelib/plf.mplstyle')

%matplotlib inline

get the source data here

In [2]:
with open('text8') as f:
    text8 = f.read().split(' ')
    
text8 = text8[1:]
In [3]:
len(text8)
Out[3]:
17005207

Use a pre-written function to build the objects needed for word frequencies, indexing, and reverse indexing. These connect word indices to the embeddings we will learn. They also define which words are frequent enough to include in the vocabulary, which keeps computations such as the similarity matrix tractable in time and memory. A rough sketch of what the function does follows.
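
As a refresher, this is roughly what build_dataset does, assuming it follows the standard TensorFlow word2vec tutorial (the real version is in the last post):

import collections

# rough sketch of build_dataset (see part 1 for the real code):
# keep the vocabulary_size - 1 most common words, map the rest to 'UNK'
def build_dataset_sketch(words, vocabulary_size):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = {word: i for i, (word, _) in enumerate(count)}
    data = [dictionary.get(word, 0) for word in words]   # index 0 == 'UNK'
    count[0][1] = data.count(0)                          # fill in the UNK count
    reverse_dictionary = {i: word for word, i in dictionary.items()}
    return data, count, dictionary, reverse_dictionary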

In [4]:
vocabulary_size = 50000

data, count, dictionary, reverse_dictionary = build_dataset(
            text8, vocabulary_size)
In [5]:
del text8  # reduce memory.

data is a list of in-vocabulary word indices, where lower indices correspond to more frequent words

In [6]:
data[0:10]
Out[6]:
[5234, 3081, 12, 6, 195, 2, 3134, 46, 59, 156]

Use the batch generator created in the last post to read in the data. Note that the gen() wrapper binds text=data because tf.data.Dataset.from_generator expects a generator callable that takes no arguments.

In [7]:
def gen():
    yield from batch_gen_w2v(text=data)
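
As an aside, the same binding could be done with functools.partial, since calling the partial also returns a generator. This is just a sketch of an alternative, not what I ran:

from functools import partial

# alternative to the gen() wrapper above: partial(...) is a zero-argument
# callable that returns the generator when called
gen_alt = partial(batch_gen_w2v, text=data)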

Build word2vec network

As we've seen before, we first build the graph, and only later run it in a session for training

In [8]:
# parameters for training
index = 0
batch_size = 32
num_epochs = 50
embedding_size = 64
num_sampled = 2

1: Read in the data

use tf.data.Dataset together with the generator batch_gen_w2v from the last blog post.

2: Represent the data using embeddings

First initialize the word embeddings in the embeddings Variable from a uniform distribution. These are learned during training. Then use tf.nn.embedding_lookup to create a lookup that bridges word indices and embeddings.
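
Conceptually, the lookup is just row indexing into the embedding matrix; a quick NumPy sketch of the same idea:

# embedding lookup is row indexing: one embedding vector per word index
emb_matrix = np.random.uniform(-1.0, 1.0, size=(vocabulary_size, embedding_size))
batch_indices = np.array([5234, 3081, 12])      # word indices, as in data[0:3]
batch_embed = emb_matrix[batch_indices]         # shape (3, embedding_size)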

3: Define the layers.

Word2vec is a single-layer network, so this implementation essentially just pushes the input data through tf.nn.nce_loss. See this explainer on the TensorFlow NCE implementation. NCE loss works by sampling from a noise distribution (words that are not nearby) and computing the loss as a binary classification, so nce_weights and nce_biases play the same role as the learned parameters in logistic regression. We use stochastic gradient descent for learning.
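
A rough NumPy sketch of that idea for a single (center, context) example; this is only the binary-classification core, while tf.nn.nce_loss also corrects for the noise distribution and handles the whole batch:

# sketch of the NCE idea: score the true context word (label 1) against a few
# sampled noise words (label 0) and apply sigmoid cross-entropy
def nce_sketch(center_vec, true_idx, noise_idx, weights, biases):
    cand_idx = np.concatenate([[true_idx], noise_idx])
    logits = weights[cand_idx] @ center_vec + biases[cand_idx]
    labels = np.concatenate([[1.0], np.zeros(len(noise_idx))])
    probs = 1.0 / (1.0 + np.exp(-logits))                 # sigmoid
    return -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

In the graph below, num_sampled controls how many noise words are drawn.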

4: Compute the cosine similarity

This will be useful for finding similar words. Others have also used the normalized embeddings for visualizations, e.g. with t-SNE.

5: Save the model

So we can read in the embeddings later from the checkpoint

In [9]:
g = tf.Graph()
with g.as_default():   
    # 1: read in data 
    dataset = tf.data.Dataset.from_generator(
            gen, (tf.int32, tf.int32), (tf.TensorShape([batch_size]), tf.TensorShape([batch_size, 1])))

    iterator = tf.data.Iterator.from_structure(dataset.output_types,
                                               dataset.output_shapes)

    train_init_op = iterator.make_initializer(dataset,name='training_iterator')
    context, label = iterator.get_next()
    
    # 2: Represent the data using embeddings
    # embeddings for all words
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0),name='word_embeddings')
    # embeddings for the current batch
    embed = tf.nn.embedding_lookup(embeddings, context,name='batch_embeddings')

    # 3: Define the layers
    # variables for the NCE loss
    nce_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size),name='trunc_norm'),name='nce_weights')
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]),name='nce_bias')

    loss = tf.reduce_mean(
      tf.nn.nce_loss(weights=nce_weights,
                     biases=nce_biases,
                     labels=label,
                     inputs=embed,
                     num_sampled=num_sampled,
                     num_classes=vocabulary_size,name='nce_loss'),name='nce_loss_reduced')

    # SGD optimizer
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)
    
    # 4: Compute the cosine similarity
    # normalize, then dot product, then sort
    normed_embedding = tf.nn.l2_normalize(embeddings, axis=1,name='normed_embeddings')
    normed_array = tf.nn.l2_normalize(embed, axis=1,name='normed_batch_embeddings')

    cosine_similarity = tf.matmul(normed_array, tf.transpose(normed_embedding, [1, 0]),name='cos_sim')
    closest_words = tf.nn.top_k(cosine_similarity,k=10,name='k_most_similar')
    
    # 5: Save the model
    saver = tf.train.Saver()

Train the model and save the outputs

Use 1,000,000 epochs (each "epoch" here is really one batch/step). Total $\text{epochs} \times \text{batch size} = 10^6 \times 32 = 3.2 \times 10^7$, so we iterate over the whole ~17-million-token sequence roughly twice.

In [10]:
num_epochs = 1000000
losses = []

with tf.Session(graph=g) as sess:
    sess.run(tf.global_variables_initializer()) # initialize everything
    sess.run(train_init_op)
    
    for epoch in range(num_epochs):
        l, _ = sess.run([loss,optimizer])
        if epoch % 1000 == 0:
            losses.append(l)
    cw, ct, e1, e2 = sess.run([closest_words, context, normed_array, normed_embedding])
    saver.save(sess, "./models/w2v_pt2.ckpt")
    

Plot the loss over the course of training

plot the rolling average (window of 20 recorded points; the loss is recorded every 1,000 epochs, so this smooths over 20,000 steps)

In [11]:
s = pd.Series(losses)
In [12]:
fig, ax = plt.subplots()

ax.plot(s.rolling(window=20).mean())
ax.set_ylabel('NCE loss')
ax.set_xlabel('epochs (thousands)')

plt.show()

Restore variables

this is a great resource

We can get the embeddings and nearest words in a couple of ways. First, the training cell above already computed them: the manually computed cosine similarity plus tf.nn.top_k gave the closest words in cw, and the normalized embeddings are in e2.

Let's load them from the checkpoint instead, for practice. Getting the right tensor name was a bit awkward: 'word_embeddings:0' is the variable name I assigned plus a ':0' suffix (the tensor's output index), and it is not the name shown by the checkpoint inspector below, which lists the variable without the suffix.
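
One quick way to see the stored names (a sketch; I believe tf.train.list_variables is available in this TensorFlow version) is to list the checkpoint's variables, which appear without the ':0' suffix:

# sketch: list what the checkpoint stores; variable names carry no ':0' suffix
print(tf.train.list_variables('models/'))
# expected entries along the lines of ('word_embeddings', [50000, 64]),
# ('nce_weights', [50000, 64]) and ('nce_bias', [50000])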

In [13]:
with tf.Session() as sess:
    new_saver = tf.train.import_meta_graph('models/w2v_pt2.ckpt.meta')
    new_saver.restore(sess, tf.train.latest_checkpoint('models/'))
    saved_embeddings = sess.run(['word_embeddings:0'])[0]
INFO:tensorflow:Restoring parameters from models/w2v_pt2.ckpt

Load in the embeddings

In [14]:
saved_embeddings.shape
Out[14]:
(50000, 64)

normalize the embeddings

using the standard L2 (Euclidean) norm, so that each word vector has unit length
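
Explicitly, each embedding row $e_i$ is divided by its Euclidean length:

$$\hat{e}_i = \frac{e_i}{\lVert e_i \rVert_2} = \frac{e_i}{\sqrt{\sum_j e_{ij}^2}}$$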

In [15]:
def normalize_embeddings(emb):
    emb_shape = emb.shape
    sq_emb = np.square(emb) 
    sos_emb = np.sum(sq_emb,axis=1).reshape(emb_shape[0],1)
    normed_emb = np.divide(emb,np.sqrt(sos_emb))
    with open('test.pickle','wb') as f:
        pickle.dump(normed_emb,f,protocol=pickle.HIGHEST_PROTOCOL)    
    return normed_emb
In [16]:
normed_embedding = normalize_embeddings(saved_embeddings)

Get a similarity matrix

of all pairwise comparisons. Note that this did not fit in memory with a (100000, 100) embedding matrix; it could be handled by computing blocks and writing them to file iteratively rather than keeping everything in memory (see the sketch below). Instead, I used the smaller (50000, 64) embeddings.
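
As a sketch of that write-to-disk workaround (not something I ran here; block_size and the output path are made up for illustration), the similarity matrix could be computed in row blocks:

# sketch: compute cosine similarities in row blocks to limit memory use;
# each block is appended with np.save and can be read back with repeated
# np.load calls on the same file handle
def calc_cos_sim_blocked(emb, block_size=5000, path='cos_sim_blocks.npy'):
    with open(path, 'wb') as f:
        for start in range(0, emb.shape[0], block_size):
            block = emb[start:start + block_size] @ emb.T   # (block_size, vocab)
            np.save(f, block)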

In [17]:
def calc_cos_sim(emb, save=False):
    cos_sim = np.matmul(emb,emb.transpose())
    if save==True:
        with open('cos_sim.pickle','wb') as f:
            pickle.dump(cos_sim,f)
    else:
        return cos_sim
In [18]:
cos_sim = calc_cos_sim(normed_embedding)

Find the words most similar to a given word

In [19]:
def most_similar(sim_matrix,word=None,k_most_similar=10+1):
    if word is None:
        word = random.choice(list(dictionary.keys()))
    word_index = dictionary[word]
    word_vec = sim_matrix[word_index]
    # sort, then find the k–most similar, 
    # then display in reverse order ignoring the original word
    similar_indices = np.argsort(word_vec)[-k_most_similar:][-2::-1] 
    return [reverse_dictionary[sim_idx] for sim_idx in similar_indices]
In [20]:
print('most similar to army:\n')
most_similar(sim_matrix=cos_sim,word='army')
most similar to army:

Out[20]:
['overseas',
 'garment',
 'officer',
 'gradually',
 'gallic',
 'holding',
 'pancreas',
 'gorge',
 'armed',
 'szabo']

Not bad! We see related words like "officer", "armed", and "overseas", though there is also some noise (e.g. "pancreas").

In [21]:
print('most similar to computer:\n')
most_similar(sim_matrix=cos_sim,word='computer')
most similar to computer:

Out[21]:
['programming',
 'graphics',
 'technology',
 'network',
 'operating',
 'hardware',
 'tools',
 'oriented',
 'protocol',
 'electronic']

This one looks very good, pretty much every word is relevant to computers!

I had to try it...

The famous example in word2vec is

$$\text{king} - \text{man} + \text{woman} = \text{queen}$$

How does this model do?

In [22]:
king = normed_embedding[dictionary['king']]
man = normed_embedding[dictionary['man']]
woman = normed_embedding[dictionary['woman']]
kmw = king - man + woman
kmw_sim = np.matmul(normed_embedding,kmw)
In [23]:
print('most similar to king - man + woman:\n')
[reverse_dictionary[word_ix] for word_ix in np.argsort(kmw_sim)[::-1][1:10]]
most similar to king - man + woman:

Out[23]:
['woman',
 'charles',
 'princess',
 'ii',
 'pka',
 'patron',
 'cosmologies',
 'instructions',
 'gandhi']

Close-ish.

"Queen" is not in the most-similar list at all, but "princess" is third. Not bad! To improve the results, it would be best to add more data and tune the gradient descent parameters.

Checkpointing

One last exercise to see what else is saved in the checkpoint

In [25]:
from tensorflow.python.tools.inspect_checkpoint import print_tensors_in_checkpoint_file
print_tensors_in_checkpoint_file(
                                file_name='models/w2v_pt2.ckpt',\
                                tensor_name='k_most_similar',\
                                all_tensors=True,\
                                all_tensor_names=True\
                                )
tensor_name:  nce_bias
[-3.1701143 -2.974355  -3.4008424 ... -1.958728  -2.8562932 -2.3046377]
tensor_name:  nce_weights
[[-0.12722044  0.6649263   0.05840692 ...  0.23179728  0.15671267
   0.06725932]
 [-0.22675377  0.27465826  0.05471431 ...  0.5259495  -0.33893025
   0.00671034]
 [-0.04375689  0.12019156  0.02563075 ...  0.12051834 -0.16042954
   0.078161  ]
 ...
 [-1.0648519   0.90508664  0.34531596 ...  1.115668    0.02313092
  -0.05429195]
 [-0.42834222  1.512869    0.33662498 ...  1.1603873  -0.2990747
  -0.5536745 ]
 [-0.3995957   1.5934845   0.29227847 ...  1.6017785  -0.1760857
  -0.15670015]]
tensor_name:  word_embeddings
[[-0.23668487 -0.4807616  -0.2754368  ... -0.48803067  0.16866484
   0.3305748 ]
 [-0.14941135 -0.24908157  0.08617698 ... -0.3331622  -0.38400456
  -0.11003053]
 [-0.10980351 -0.42675978 -0.1942253  ... -0.67214876 -0.44179285
   0.5734172 ]
 ...
 [ 0.6119225  -0.07231586 -0.3716953  ...  0.19201861  0.6409408
   0.8552471 ]
 [ 0.41137227 -1.1280757   0.35296682 ...  0.41706795 -0.7159017
  -0.6558757 ]
 [-0.297483   -0.58869714 -0.24134146 ... -0.5376434  -0.5510352
  -0.3320802 ]]

Notice that only the tf.Variables are saved. See the link above (in the restore section) for how to work with the restored variables.

Conclusions

Here we implemented a skip-gram model with NCE loss. We trained the model on text8, windowing over the sequence of ~17 million tokens roughly twice. Computing word similarities shows some interpretable structure in the nearest neighbors, but more data would likely give a big improvement in performance.

Also, we could swap NCE loss for negative sampling to approximate the softmax, and add subsampling of frequent words, which effectively widens the context window by preferentially dropping very common tokens. Together these would probably improve the results.
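
For reference, the subsampling heuristic from the original word2vec paper discards each occurrence of a word $w_i$ with probability

$$P(\text{discard } w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where $f(w_i)$ is the word's corpus frequency and $t$ is a small threshold (around $10^{-5}$).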

This task demonstrated how to:

  1. build a graph and feed in a text sequence for learning
  2. save and extract variables
  3. compute word similarities