word2vec part 2: graph building and training
Posted on Tue 03 April 2018 in blog
In the last post we built the data preprocessing pipeline required for word2vec training. In this post we will build the network and train it on the text8 dataset (source), a Wikipedia dump of ~17 million tokens.
Note that we are implementing the skip-gram version of word2vec, since it has superior performance. The skip-gram model learns, given a center word, to predict the surrounding context words.
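As a quick illustration (a toy example, not the actual preprocessing code from part 1), these are the (center, context) training pairs skip-gram produces for a short sentence with a window size of 1:
# toy illustration of skip-gram (center, context) pairs with a window of 1
sentence = ['the', 'quick', 'brown', 'fox']
window = 1
pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))
print(pairs)
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]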
Goals:
- understand the skip-gram (word2vec) network structure
- train model
- save/load model variables
- find most similar words
import tensorflow as tf
import math
import collections
import numpy as np
import random
import pickle
import pandas as pd
import matplotlib.pylab as plt
from sys import path
path.append('./w2v_tools')
from w2v_tools import batch_gen_w2v, build_dataset
plt.style.use('/Users/pf494t/.matplotlib/stylelib/plf.mplstyle')
%matplotlib inline
with open('text8') as f:
    text8 = f.read().split(' ')
text8 = text8[1:]
len(text8)
Use a pre-written function to build the objects needed for word frequencies, reverse indexing, etc. These connect word indices with the embeddings we will learn, and they define which words are frequent enough to include in the vocabulary. Capping the vocabulary keeps computations such as the similarity matrix tractable in time and memory.
vocabulary_size = 50000
data, count, dictionary, reverse_dictionary = build_dataset(
text8, vocabulary_size)
del text8 # reduce memory.
data is a list of in-vocabulary word indices, where the lower the index, the more frequent the word
data[0:10]
Use the batch generator created in the last post to read in the data. Note that I wrapped it in gen() with text=data baked in, because tf.data.Dataset.from_generator expects a generator that takes no arguments.
def gen():
    yield from batch_gen_w2v(text=data)
Build word2vec network¶
As we've seen before, we first build a graph, then run it in a session for training.
# parameters for training
index = 0
batch_size = 32
num_epochs = 50
embedding_size = 64
num_sampled = 2
1: Read in the data¶
Use tf.data.Dataset together with the generator batch_gen_w2v from the last blog post.
2: Represent the data using embeddings¶
First, initialize the word embeddings in the embeddings Variable from a uniform distribution; these are learned during training. Then use tf.nn.embedding_lookup to bridge word indices and their embedding vectors.
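Conceptually, tf.nn.embedding_lookup is just row indexing into the embedding matrix. A rough numpy analogue (the names here are illustrative, not part of the graph built below):
# toy analogue of tf.nn.embedding_lookup: plain row indexing
toy_embeddings = np.random.uniform(-1.0, 1.0, size=(5, 3))  # 5 words, 3-dim embeddings
toy_context = np.array([2, 0, 4])                            # a batch of word indices
toy_embed = toy_embeddings[toy_context]                      # rows for those indices
print(toy_embed.shape)                                       # (3, 3)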
3: Define the layers¶
Word2vec is a single-layer network, so this implementation essentially just pushes the input data through tf.nn.nce_loss. See this explainer on the TensorFlow NCE implementation. NCE works by sampling from a noise distribution (words that are not nearby) and computing the loss as a binary classification of true context words versus noise words. The nce_weights and nce_biases play the same role as the learned parameters in logistic regression. We use stochastic gradient descent for learning.
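To make that concrete, here is a rough numpy sketch of the loss for a single (center, true-context) pair with two sampled noise words. This is only the intuition; tf.nn.nce_loss additionally corrects the logits by the log-probabilities of the noise distribution, so treat the names and numbers below as illustrative:
# rough sketch of NCE-style binary classification loss (illustrative only)
rng = np.random.RandomState(0)
emb_dim = 64
center_vec = rng.randn(emb_dim)                # embedding of the center word
true_w, true_b = rng.randn(emb_dim), 0.0       # weight/bias row for the true context word
noise_ws = rng.randn(2, emb_dim)               # weight rows for 2 sampled noise words
noise_bs = np.zeros(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

pos_logit = center_vec @ true_w + true_b       # want this pushed up
neg_logits = noise_ws @ center_vec + noise_bs  # want these pushed down

loss_sketch = -np.log(sigmoid(pos_logit)) - np.sum(np.log(sigmoid(-neg_logits)))
print(loss_sketch)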
4: Compute the cosine similarity¶
This will be useful for finding similar words. Others have also used it for visualizations, e.g. with t-SNE.
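For reference, the cosine similarity between two embedding vectors $u$ and $v$ is just their dot product after L2 normalization:
$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert_2 \, \lVert v \rVert_2}$$
which is exactly what the normalize-then-matmul steps in the graph compute.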
5: Save the model¶
So we can read the embeddings back in later from the checkpoint.
g = tf.Graph()
with g.as_default():
    # 1: read in data
    dataset = tf.data.Dataset.from_generator(
        gen, (tf.int32, tf.int32),
        (tf.TensorShape([batch_size]), tf.TensorShape([batch_size, 1])))
    iterator = tf.data.Iterator.from_structure(dataset.output_types,
                                               dataset.output_shapes)
    train_init_op = iterator.make_initializer(dataset, name='training_iterator')
    context, label = iterator.get_next()

    # 2: Represent the data using embeddings
    # embeddings for all words
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0),
        name='word_embeddings')
    # embeddings for the current batch
    embed = tf.nn.embedding_lookup(embeddings, context, name='batch_embeddings')

    # 3: Define the layers
    # variables for the NCE loss
    nce_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size),
                            name='trunc_norm'),
        name='nce_weights')
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]), name='nce_bias')

    loss = tf.reduce_mean(
        tf.nn.nce_loss(weights=nce_weights,
                       biases=nce_biases,
                       labels=label,
                       inputs=embed,
                       num_sampled=num_sampled,
                       num_classes=vocabulary_size,
                       name='nce_loss'),
        name='nce_loss_reduced')

    # SGD optimizer
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)

    # 4: Compute the cosine similarity
    # normalize, then dot product, then sort
    normed_embedding = tf.nn.l2_normalize(embeddings, axis=1, name='normed_embeddings')
    normed_array = tf.nn.l2_normalize(embed, axis=1, name='normed_batch_embeddings')
    cosine_similarity = tf.matmul(normed_array, tf.transpose(normed_embedding, [1, 0]), name='cos_sim')
    closest_words = tf.nn.top_k(cosine_similarity, k=10, name='k_most_similar')

    # 5: Save the model
    saver = tf.train.Saver()
Train the model and save the outputs¶
Use 1,000,000 training steps (each "epoch" in the loop below is really one batch). Total tokens seen: $steps \times batchsize = 10^6 \times 32 = 3.2 \times 10^7$, so we iterate over the whole ~17-million-token sequence roughly twice.
num_epochs = 1000000
losses = []
with tf.Session(graph=g) as sess:
    sess.run(tf.global_variables_initializer())  # initialize everything
    sess.run(train_init_op)
    for epoch in range(num_epochs):
        l, _ = sess.run([loss, optimizer])
        if epoch % 1000 == 0:
            losses.append(l)
    cw, ct, e1, e2 = sess.run([closest_words, context, normed_array, normed_embedding])
    saver.save(sess, "./models/w2v_pt2.ckpt")
Plot the loss over the course of training¶
plot the rolling average (window of 20 recorded points; the loss was appended every 1,000 steps)
s = pd.Series(losses)
fig, ax = plt.subplots()
ax.plot(s.rolling(window=20).mean())
ax.set_ylabel('NCE loss')
ax.set_xlabel('epochs (thousands)')
plt.show()
Restore variables¶
This is a great resource on the topic.
We can get at the embeddings in a couple of ways. First, they were already fetched in the training cell above: cw holds the result of manually computing cosine similarity and then taking tf.nn.top_k to get the top words, and e2 holds the normalized embedding matrix itself.
Let's instead load them from the checkpoint, for practice. It was a bit awkward for me to get this working. Notice that 'word_embeddings:0' is not exactly what I named the variable (TensorFlow appends ':0' to refer to the variable's output tensor), nor is it what shows up when inspecting the checkpoint (see below).
with tf.Session() as sess:
    new_saver = tf.train.import_meta_graph('models/w2v_pt2.ckpt.meta')
    new_saver.restore(sess, tf.train.latest_checkpoint('models/'))
    saved_embeddings = sess.run(['word_embeddings:0'])[0]
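If you are unsure what name to pass to sess.run, one way (among several) to find it is to import the meta graph into a scratch graph and list the variables that come back with it. A minimal sketch, assuming the same checkpoint path as above:
# list the variables restored with the meta graph, to find names like 'word_embeddings:0'
inspect_graph = tf.Graph()
with inspect_graph.as_default():
    tf.train.import_meta_graph('models/w2v_pt2.ckpt.meta')
    for v in tf.global_variables():
        print(v.name)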
Check the shape of the loaded embeddings
saved_embeddings.shape
normalize the embeddings¶
Using the standard L2 norm (each row is divided by its Euclidean length).
def normalize_embeddings(emb):
    emb_shape = emb.shape
    sq_emb = np.square(emb)
    sos_emb = np.sum(sq_emb, axis=1).reshape(emb_shape[0], 1)
    normed_emb = np.divide(emb, np.sqrt(sos_emb))
    with open('test.pickle', 'wb') as f:
        pickle.dump(normed_emb, f, protocol=pickle.HIGHEST_PROTOCOL)
    return normed_emb
normed_embedding = normalize_embeddings(saved_embeddings)
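As a quick sanity check (optional), every row of the normalized matrix should now have unit L2 norm:
# sanity check: all rows of normed_embedding should have (numerically) unit L2 norm
print(np.allclose(np.linalg.norm(normed_embedding, axis=1), 1.0))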
Get a similarity matrix¶
of all pairwise comparisons. Note that this did not fit in memory for a (100000, 100) embedding matrix; it could probably be handled by writing chunks to file iteratively rather than keeping everything in memory (a sketch of that idea follows the similarity computation below). Instead, I used the smaller (50000, 64) setup.
def calc_cos_sim(emb, save=False):
    cos_sim = np.matmul(emb, emb.transpose())
    if save:
        with open('cos_sim.pickle', 'wb') as f:
            pickle.dump(cos_sim, f)
    else:
        return cos_sim
cos_sim = calc_cos_sim(normed_embedding)
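Here is a rough sketch of the chunked, write-to-disk approach mentioned above; the chunk size, file name, and function name are arbitrary choices for illustration. Reading the result back is then a matter of repeated pickle.load calls on the same file handle:
def calc_cos_sim_chunked(emb, out_path='cos_sim_chunks.pickle', chunk_size=5000):
    # compute the similarity matrix in row chunks and append each chunk to disk,
    # so the full (vocab x vocab) matrix never sits in memory at once
    with open(out_path, 'wb') as f:
        for start in range(0, emb.shape[0], chunk_size):
            chunk = np.matmul(emb[start:start + chunk_size], emb.transpose())
            pickle.dump(chunk, f, protocol=pickle.HIGHEST_PROTOCOL)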
Find the words most similar to a given word¶
def most_similar(sim_matrix, word=None, k_most_similar=10 + 1):
    if word is None:
        word = random.choice(list(dictionary.keys()))
    word_index = dictionary[word]
    word_vec = sim_matrix[word_index]
    # sort, then take the k most similar,
    # then display in descending order, ignoring the original word
    similar_indices = np.argsort(word_vec)[-k_most_similar:][-2::-1]
    return [reverse_dictionary[sim_idx] for sim_idx in similar_indices]
print('most similar to army:\n')
most_similar(sim_matrix=cos_sim,word='army')
Not bad! We see words like "police" and "navy"; perhaps there were a lot of Cold War articles.
print('most similar to computer:\n')
most_similar(sim_matrix=cos_sim,word='computer')
This one looks very good; pretty much every word is relevant to computers!
I had to try it...¶
The famous example in word2vec is
$$king - man + woman = queen$$
How does this model do?
king = normed_embedding[dictionary['king']]
man = normed_embedding[dictionary['man']]
woman = normed_embedding[dictionary['woman']]
kmw = king - man + woman
kmw_sim = np.matmul(normed_embedding,kmw)
print('most similar to king - man + woman:\n')
[reverse_dictionary[word_ix] for word_ix in np.argsort(kmw_sim)[::-1][1:10]]
close-ish¶
Queen is not in the most-similar list at all, but princess is third most similar. Not bad! To improve results, it would be best to train on more data and tune the gradient descent parameters.
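On the tuning point, one small thing worth experimenting with (just a sketch; the decay_steps and decay_rate values here are arbitrary, untested choices) is replacing the fixed learning rate of 1.0 with a decaying one:
# sketch: decaying learning rate in place of the fixed 1.0
# (this would replace the optimizer line inside the `with g.as_default():` block above)
global_step = tf.Variable(0, trainable=False, name='global_step')
learning_rate = tf.train.exponential_decay(
    learning_rate=1.0, global_step=global_step,
    decay_steps=100000, decay_rate=0.95, staircase=True)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)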
Checkpointing¶
One last exercise to see what else is saved in the checkpoint
from tensorflow.python.tools.inspect_checkpoint import print_tensors_in_checkpoint_file
print_tensors_in_checkpoint_file(
    file_name='models/w2v_pt2.ckpt',
    tensor_name='k_most_similar',
    all_tensors=True,
    all_tensor_names=True)
Notice that only the tf.Variables are saved. See the link above (in the restore section) for how to use the restored variables.
Conclusions¶
Here we implemented a skip-gram model with NCE loss. We trained the model on text8, windowing over the sequence of ~17 million tokens roughly twice. Computing word similarities shows some interpretability in the nearest neighbors, but more data would likely give big improvements in performance.
Also, we could try negative sampling to approximate the softmax instead of NCE loss. Combined with subsampling of frequent words, which effectively widens the context window by dropping very frequent words before windowing, this would probably give better performance.
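If we wanted to try it, TensorFlow's tf.nn.sampled_softmax_loss takes essentially the same arguments as tf.nn.nce_loss, so the swap is a small change in the graph above. Note that sampled softmax is closely related to, but not identical to, the negative-sampling objective in the original word2vec paper; this is just a sketch of the drop-in change:
# sketch: swap tf.nn.nce_loss for tf.nn.sampled_softmax_loss in the graph above
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=nce_weights,
                               biases=nce_biases,
                               labels=label,
                               inputs=embed,
                               num_sampled=num_sampled,
                               num_classes=vocabulary_size))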
This task demonstrated how to:
- build a graph and feed in a text sequence for learning
- save and extract variables
- compute word similarities