word2vec part 1: exploration and defining data flow

Posted on Mon 02 April 2018 in blog

One of the trends I've been seeing is the use of embeddings for similarity/recommendation in non-NLP domains (this talk and many others). While I've used/thought about word2vec for a while, I wanted to implement this to really get a sense of how it works and ways it can be extended to other domains. Intuitively, word2vec makes sense, and there are a lot of packages that will let you compute it without thinking much about the implementation (e.g. gensim or Spark ML). There are a ton of resources/tutorials out there. In working through this, I learned about the type of data tensorflow expects, choices to objective function to use.

Here I am drawing inspiration from the tensorflow repo tensorflow examples and this Stanford course material.


  • explore objective function
  • read in batches of data using a generator
  • validate generator performance on a sample sequence

Objective exploration

I've seen in several places that computing the softmax for a large vocabulary is expensive and so one way around this is to use noise contrastive estimation (NCE) is used as a cheaper version of the softmax.

Essentially, NCE converts a one–vs–rest problem classification into a binary prediction of "is-true-pair", where negative examples are sampled from a noise distribution

First, how expensive is the softmax? What makes it expensive?

In [1]:
import numpy as np
import collections
import random
import math
import tensorflow as tf
/Users/pf494t/anaconda2/envs/py36/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
In [2]:
# say vocab is 100,000 words and computing 200–dim word embeddings
a = np.random.randn(100000,200)
In [3]:
def softmax(arr):
In [4]:
CPU times: user 386 ms, sys: 155 ms, total: 541 ms
Wall time: 541 ms
In [5]:
%time _ = np.sum(a)
%time _ = np.exp(a)
CPU times: user 12.3 ms, sys: 658 µs, total: 13 ms
Wall time: 12.1 ms
CPU times: user 166 ms, sys: 44.1 ms, total: 211 ms
Wall time: 210 ms

Input data in batches of target, context

In order to train our models we need to create a function that batches in data to the computation graph, with size labels ([batch_size,1]) and context words (batch_size,), where the labels are the center words and the context words are words within two (or other) positions of the word. Notice that they are both of dimension batch size, so this tells us we sample the center word and the context word concurrently.

There's a couple ways to do this, I'm drawing inspiration from the official tensorflow code, and this Stanford implementation . Reading the fairly complete code left me scratching my head on multiple occasions, so I'm starting from the ground up. First, write a function that samples from a window both a center word and a context word (could be multiple).

First, create a sequence of ordered integers for testing

In [6]:
w = np.arange(0,128) # sample sequence


In [7]:
index = 0 # track position in sequence
num_samples = 2 # repeat sampling from the window
num_skips = 2 # defines window size to sample from, i.e., 2 * num_skips + 1
batch_size = 32 # size of data batches used at each epoch

Strategy for windowing

To format the data properly, we need to sample the center word and context words from a window. Then move that window across the sequence. I liked the tensorflow implementation which used a deque data structure for the window. This is a fixed length array, so windowing across the text is as easy as .extend()ing the object with the next element.

I will set it up so that for each center word, sample 2 context words randomly

In [8]:
window_size = num_skips * 2 + 1
In [9]:
# create vectors to be populated
context = np.ndarray(batch_size,np.int32)
labels = np.ndarray([batch_size,1],np.int32)
In [10]:
# define the right-most position of the window
leading_edge = index + window_size - 1 

# this is the window
buffer = collections.deque(maxlen = window_size)

# set the initial values
In [11]:
# sample words from the window that are not the center
context_words = [val for ix, val in enumerate(buffer) if ix != num_skips] 

These two for loops capture the logic of the data preparation. Move forward one position, and sample (context, label) _numsamples times. Continue until the batch is complete and the arrays are ready to be fed into the model

In [12]:
index = 0
for arr_ix in range(batch_size//(num_samples)):
    buffer.extend(w[leading_edge:leading_edge + 1])
    for sample in range(num_samples):
        context_words = [val for ix, val in enumerate(buffer) if ix != num_skips]
        context[index + 2*arr_ix + sample] = random.choice(buffer)
        labels[index + 2*arr_ix + sample] = buffer[num_skips]
    leading_edge += 1

Take a look at how this does on our sample sequence

In [13]:
labels[:6], context[:6]
        [4]], dtype=int32), array([2, 3, 5, 1, 3, 5], dtype=int32))

so far so good

This is working as expected. We sample each center word twice (num_samples = 2), then move on. Now, we have what we need to write the batch function.

A couple of notes:

  • use global index to keep track of position in the sequence.
  • for the generator to work, need to put the iteration inside a while True block

Try on a sample text sequence

The way to feed in text to the model is a sequence of integers, that map to both the original words and the learned embeddings. This will be clear in the next blog post

create random text

length = 50,000; vocab = 1,000

In [14]:
data = np.random.choice(np.array(1000),size=50000)
print('sample sequence data:')
print('\ndata length:')
sample sequence data:
[508 655 540 ...  52 422 489]

data length:
In [15]:
def batch_gen_w2v(text = data,num_skips = 2,batch_size = 8, num_samples = 2):
    Generate a batch of (context, labels) in conjunction with tf.data.Dataset
    In this implementation, text must be in memory.
    Note: need to define variables before use
    `index = 0`
    where index is the position in the string
    global index
    window_size = 2 * num_skips + 1
    leading_edge = index + window_size - 1
    buffer = collections.deque(maxlen = window_size)
    context = np.ndarray(batch_size,np.int32)
    labels = np.ndarray([batch_size,1],np.int32)    
    while True:
        for arr_ix in range(batch_size//(num_samples)):
            buffer.extend(text[leading_edge:leading_edge + 1])
            for sample in range(num_samples):
                context[2 * arr_ix + sample] = random.choice(buffer)
                labels[2 * arr_ix + sample] = buffer[num_skips]
            leading_edge = leading_edge + 1 # increment thusly to assign globally
        index = index + batch_size
        yield (context, labels)

feed data to a sample graph

now that we have our generator, test it by recording the center word value as the generator slides across the sequence

In [16]:
batch_size = 8
In [17]:
tf.reset_default_graph() # This avoids conflicts if you run the graph cells multiple times

use data placeholders and iterators, as before

In [18]:
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
In [19]:
dataset = tf.data.Dataset.from_generator(
        batch_gen_w2v, (tf.int32, tf.int32), (tf.TensorShape([batch_size]), tf.TensorShape([batch_size, 1])))

iterator = tf.data.Iterator.from_structure(dataset.output_types,

train_init_op = iterator.make_initializer(dataset,name='training_iterator')
context, labels = iterator.get_next()
In [20]:
iden = tf.identity(labels) # layer to show what data the model "sees". For troubleshooting

Run the graph

Look at the first 3 batches (size: 8) to make sure it matches our expectation

In [21]:
index = 0
iden_list = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer()) # initialize everything
    for _ in range(3):

Check to see the results of the generator

In [22]:
first_batch = [v[0] for v in iden_list[0]]
second_batch = [v[0] for v in iden_list[1]]
third_batch = [v[0] for v in iden_list[2]]

batches_str = 'first batch:{}\nsecond batch:{}\nthird batch:{}'
               #.format(first_batch, second_batch, third_batch)
print(batches_str.format(first_batch, second_batch, third_batch))
first batch:[540, 540, 434, 434, 260, 260, 645, 645]
second batch:[469, 469, 16, 16, 60, 60, 432, 432]
third batch:[207, 207, 686, 686, 622, 622, 375, 375]

Compare to ground truth

In [23]:
data_batches = [[v for v in data[2:6]],
                [v for v in data[6:10]],
                [v for v in data[10:14]]]

data_str = 'data first batch: {}\ndata second batch: {}\ndata third batch: {}'

data first batch: [540, 434, 260, 645]
data second batch: [469, 16, 60, 432]
data third batch: [207, 686, 622, 375]

works as expected

each center word is sampled twice then the generator increments the index up to the next position in the sequence


First, we explored the computational cost of a softmax with embedding layers and found that it is prohibitively expensive. More precisely, it is the exponential part of the softmax that is so expensive.

Next, we built a generator to feed sequence data into a neural network model. This really was the hardest part and building it from scratch really gives a handle on what is happening. The generator works as expected and, with this in hand, we are ready for building the graph and training the word2vec model