# word2vec part 1: exploration and defining data flow

Posted on Mon 02 April 2018 in blog

One of the trends I've been seeing is the use of embeddings for similarity/recommendation in non-NLP domains (this talk and many others). While I've used/thought about word2vec for a while, I wanted to implement this to really get a sense of how it works and ways it can be extended to other domains. Intuitively, word2vec makes sense, and there are a lot of packages that will let you compute it without thinking much about the implementation (e.g. `gensim`

or Spark ML). There are a ton of resources/tutorials out there. In working through this, I learned about the type of data tensorflow expects, choices to objective function to use.

Here I am drawing inspiration from the tensorflow repo tensorflow examples and this Stanford course material.

Goals

- explore objective function
- read in batches of data using a generator
- validate generator performance on a sample sequence

### Objective exploration¶

I've seen in several places that computing the softmax for a large vocabulary is expensive and so one way around this is to use noise contrastive estimation (NCE) is used as a cheaper version of the softmax.

Essentially, NCE converts a one–vs–rest problem classification into a binary prediction of "is-true-pair", where negative examples are sampled from a noise distribution

#### First, how expensive is the softmax? What makes it expensive?¶

```
import numpy as np
import collections
import random
import math
import tensorflow as tf
```

```
# say vocab is 100,000 words and computing 200–dim word embeddings
a = np.random.randn(100000,200)
```

```
def softmax(arr):
np.divide(np.exp(arr),np.sum(np.exp(arr)))
```

```
%%time
softmax(a)
```

```
%time _ = np.sum(a)
%time _ = np.exp(a)
```

### Input data in batches of target, context¶

In order to train our models we need to create a function that batches in data to the computation graph, with size labels ([batch_size,1]) and context words (batch_size,), where the labels are the center words and the context words are words within two (or other) positions of the word. Notice that they are both of dimension batch size, so this tells us we sample the center word and the context word concurrently.

There's a couple ways to do this, I'm drawing inspiration from the official tensorflow code, and this Stanford implementation . Reading the fairly complete code left me scratching my head on multiple occasions, so I'm starting from the ground up. First, write a function that samples from a window both a center word and a context word (could be multiple).

First, create a sequence of ordered integers for testing

```
w = np.arange(0,128) # sample sequence
```

parameters

```
index = 0 # track position in sequence
num_samples = 2 # repeat sampling from the window
num_skips = 2 # defines window size to sample from, i.e., 2 * num_skips + 1
batch_size = 32 # size of data batches used at each epoch
```

### Strategy for windowing¶

To format the data properly, we need to sample the center word and context words from a window. Then move that window across the sequence. I liked the `tensorflow`

implementation which used a `deque`

data structure for the window. This is a fixed length array, so windowing across the text is as easy as `.extend()`

ing the object with the next element.

I will set it up so that for each center word, sample 2 context words randomly

```
window_size = num_skips * 2 + 1
```

```
# create vectors to be populated
context = np.ndarray(batch_size,np.int32)
labels = np.ndarray([batch_size,1],np.int32)
```

```
# define the right-most position of the window
leading_edge = index + window_size - 1
# this is the window
buffer = collections.deque(maxlen = window_size)
# set the initial values
buffer.extend(w[index:leading_edge])
```

```
# sample words from the window that are not the center
context_words = [val for ix, val in enumerate(buffer) if ix != num_skips]
```

These two for loops capture the logic of the data preparation. Move forward one position, and sample (context, label) _num*samples* times. Continue until the batch is complete and the arrays are ready to be fed into the model

```
index = 0
for arr_ix in range(batch_size//(num_samples)):
buffer.extend(w[leading_edge:leading_edge + 1])
for sample in range(num_samples):
context_words = [val for ix, val in enumerate(buffer) if ix != num_skips]
context[index + 2*arr_ix + sample] = random.choice(buffer)
labels[index + 2*arr_ix + sample] = buffer[num_skips]
leading_edge += 1
```

Take a look at how this does on our sample sequence

```
labels[:6], context[:6]
```

### so far so good¶

This is working as expected. We sample each center word twice (`num_samples = 2`

), then move on. Now, we have what we need to write the batch function.

A couple of notes:

- use
`global index`

to keep track of position in the sequence. - for the generator to work, need to put the iteration inside a
`while True`

block

```
data = np.random.choice(np.array(1000),size=50000)
print('sample sequence data:')
print(data)
print('\ndata length:')
print(len(data))
```

```
def batch_gen_w2v(text = data,num_skips = 2,batch_size = 8, num_samples = 2):
"""
Generate a batch of (context, labels) in conjunction with tf.data.Dataset
In this implementation, text must be in memory.
Note: need to define variables before use
`index = 0`
where index is the position in the string
"""
global index
window_size = 2 * num_skips + 1
leading_edge = index + window_size - 1
buffer = collections.deque(maxlen = window_size)
buffer.extend(text[index:leading_edge])
context = np.ndarray(batch_size,np.int32)
labels = np.ndarray([batch_size,1],np.int32)
while True:
for arr_ix in range(batch_size//(num_samples)):
buffer.extend(text[leading_edge:leading_edge + 1])
for sample in range(num_samples):
context[2 * arr_ix + sample] = random.choice(buffer)
labels[2 * arr_ix + sample] = buffer[num_skips]
leading_edge = leading_edge + 1 # increment thusly to assign globally
index = index + batch_size
yield (context, labels)
```

### feed data to a sample graph¶

now that we have our generator, test it by recording the center word value as the generator slides across the sequence

```
batch_size = 8
```

```
tf.reset_default_graph() # This avoids conflicts if you run the graph cells multiple times
```

use data placeholders and iterators, as before

```
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
```

```
dataset = tf.data.Dataset.from_generator(
batch_gen_w2v, (tf.int32, tf.int32), (tf.TensorShape([batch_size]), tf.TensorShape([batch_size, 1])))
iterator = tf.data.Iterator.from_structure(dataset.output_types,
dataset.output_shapes)
train_init_op = iterator.make_initializer(dataset,name='training_iterator')
context, labels = iterator.get_next()
```

```
iden = tf.identity(labels) # layer to show what data the model "sees". For troubleshooting
```

### Run the graph¶

Look at the first 3 batches (size: 8) to make sure it matches our expectation

```
index = 0
iden_list = []
with tf.Session() as sess:
sess.run(tf.global_variables_initializer()) # initialize everything
sess.run(train_init_op)
for _ in range(3):
iden_list.append(sess.run([iden])[0])
```

Check to see the results of the generator

```
first_batch = [v[0] for v in iden_list[0]]
second_batch = [v[0] for v in iden_list[1]]
third_batch = [v[0] for v in iden_list[2]]
batches_str = 'first batch:{}\nsecond batch:{}\nthird batch:{}'
#.format(first_batch, second_batch, third_batch)
print(batches_str.format(first_batch, second_batch, third_batch))
```

Compare to ground truth

```
data_batches = [[v for v in data[2:6]],
[v for v in data[6:10]],
[v for v in data[10:14]]]
data_str = 'data first batch: {}\ndata second batch: {}\ndata third batch: {}'
print(data_str.format(data_batches[0],data_batches[1],data_batches[2]))
```

### works as expected¶

each center word is sampled twice then the generator increments the index up to the next position in the sequence

# Conclusions¶

First, we explored the computational cost of a softmax with embedding layers and found that it is prohibitively expensive. More precisely, it is the exponential part of the softmax that is so expensive.

Next, we built a generator to feed sequence data into a neural network model. This really was the hardest part and building it from scratch really gives a handle on what is happening. The generator works as expected and, with this in hand, we are ready for building the graph and training the word2vec model