Exploring neural networks for text classification
Posted on Fri 17 November 2017 in blog
I've been working on text classification recently. I've found keras to be quite a good high-level library, and great for learning different neural network architectures. In this notebook I will examine tweet classification using CNN and LSTM model architectures. While CNNs are widely used in computer vision, I saw a paper and a poster describing CNNs for use in text classification. Also, I heard at MLConf 2017 that Facebook recently switched to using CNNs because it gave an order-of-magnitude increase in speed. LSTMs, on the other hand, are sequence-based models that account for the recency of contextual data, and so are well suited to text modeling. For an excellent intro, see this post by Edwin Chen.
The data I'm using for this is the "distant labeled" Twitter dataset: a set of 1.6 million tweets that each contain an emoticon. Positive emoticons serve as a proxy for positive sentiment, and vice versa for negative sentiment. As such, it is cheap to collect training data that is potentially good enough for useful sentiment models. While not shown here, I've tested this, and it generalizes very well across datasets, even to the present day, when nobody uses :) or :-( to convey their emotions anymore.
I've tried to include practical tips from my experience, since there are enough tutorials on keras already.
Lessons learned
- weight normalization matters
- the right data pre-processing can make huge differences
- callbacks are essential for model benchmarking and testing
- get a working model and iterate!
import dependencies
import re

import numpy as np
import pandas as pd
import pydot
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.model_selection import train_test_split

from keras.models import Model, load_model
from keras.layers import (Dense, Dropout, Flatten, Input, MaxPooling1D,
                          Convolution1D, Embedding, concatenate, LSTM, Bidirectional)
from keras.layers.normalization import BatchNormalization
from keras.utils.np_utils import to_categorical
from keras.utils import plot_model
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import EarlyStopping, CSVLogger, ModelCheckpoint
np.random.seed(0)
import keras
print keras.__version__
text pre-processing functions
I found these quite helpful for improving performance. If they are not included, the vocabulary gets too big, since each username or URL is counted as a unique word.
def standardize_html(tweet):
"""replace all http links with 'URL'"""
return re.sub(r"http\S+", "URL", tweet)
def standardize_www(tweet):
"""replace all www links with 'URL'"""
return re.sub(r"www\S+", "URL", tweet)
def standardize_mentions(tweet):
"""replace all mentions with 'twittermention'"""
return re.sub(r"(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z_]+[A-Za-z0-9_]+)", "twittermention", tweet)
define dimensions for input data
glove_vec_size = 200 # dimension of pre-trained glove vector
top_n_words_to_process = 100000 # number of words to keep in tokenizer
max_seq_length = 50 # max word length per tweet
define data dirs
glove_dir = '../glove_pretrained/glove.twitter.27B.200d.txt'
labeled_twts_dir = './'
def loadGloveModel(gloveFile):
    """load pre-trained glove vectors into memory as a dictionary"""
    print "Loading Glove Model"
    model = {}
    with open(gloveFile, 'r') as f:
        for line in f:
            splitLine = line.split()
            word = splitLine[0]
            embedding = [float(val) for val in splitLine[1:]]
            model[word] = embedding
    print "Done. %s words loaded!" % len(model)
    return model
glove_word_vecs = loadGloveModel(glove_dir)
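As a quick sanity check (assuming 'cat' appears in the Twitter GloVe vocabulary, which it should for a 27B-token corpus):
print len(glove_word_vecs['cat'])  # 200, matching glove_vec_size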
data processing
Split the data into train and test sets, then clean the text. Note that I'm only using a fraction of the tweets (50,000); with more time and GPU power, scaling to much larger sizes would be feasible.
df = pd.read_csv('lab_twts_1million.csv',error_bad_lines=False) # labeled data
df.reset_index(drop=True,inplace=True)
#df_sub = df.sample(frac=0.001,random_state=42)
df_sub = df.sample(n=int(5e4),random_state=42)
print 'subsampled training examples: %s' % (df_sub.shape[0])
df_sub['normed_text'] = df_sub['SentimentText']\
.apply(standardize_mentions)\
.apply(standardize_html)\
.apply(standardize_www)
df_train, df_test = train_test_split(df_sub,test_size=0.1,random_state=42)
training_texts = df_train['normed_text'].values
training_labels = to_categorical(df_train['Sentiment'].values)
Convert input tweets to padded sequences
Keras expects integer sequences, e.g., [0, 0, 0, 153, 48, ...]. To format the data appropriately, fit a tokenizer on the input tweets, then transform them. To ensure all data has the same dimensions, pad the front of each sequence with zeros until every tweet has length 50.
tokenizer = Tokenizer(num_words=top_n_words_to_process)
tokenizer.fit_on_texts(training_texts)
raw_train_sequences = tokenizer.texts_to_sequences(training_texts)
word_index = tokenizer.word_index
training_seq = pad_sequences(raw_train_sequences, maxlen=max_seq_length)
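A quick shape check; the first dimension is the 45,000 tweets left in the training split after holding out 10% for testing:
print training_seq.shape  # (45000, 50)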
Build embedding matrix
The embedding matrix will be used to bake the GloVe vectors into the keras Embedding layer. If a word is not in the pre-defined GloVe dictionary, its vector is left as all zeros.
embedding_matrix = np.zeros((len(word_index) + 1, glove_vec_size))
for word, i in word_index.items():
    embedding_vector = glove_word_vecs.get(word)
    # words not found in the GloVe vocabulary stay all zeros
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
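# normalize the embedding matrix: mean-center each dimension, then scale to unit variance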
embedding_mean = np.mean(embedding_matrix,axis=0)
embedding_mean_centered = embedding_matrix - embedding_mean
embedding_stdev = np.std(embedding_mean_centered,axis=0)
normed_embedding_mat = np.divide(embedding_mean_centered,embedding_stdev)
Define the Embedding layer
This is the first layer for both the CNN and the LSTM network. It incorporates the GloVe vectors and has the necessary dimensions to define the computation graph.
# define word embedding layer for subsequent plug-and-play in model architectures
embedding_layer = Embedding(input_dim=len(word_index) + 1,
                            output_dim=glove_vec_size,
                            weights=[normed_embedding_mat],
                            input_length=max_seq_length,
                            trainable=True, name='embedding_layer')
Define callbacks
Callbacks are a set of classes in keras that allow various interactions during training. They are especially useful for saving, logging, etc., since training can be quite time consuming.
Use EarlyStopping
to prevent needless training once the model has begun overfitting. With patience=2, this stops training after the loss on the validation set has not improved for 2 consecutive epochs. Note that if you are trying to optimize for performance here, it would probably be best to 1) overfit the data to >99% accuracy on the training set to show your model is sufficiently complex, and then 2) introduce regularization to reduce model variance.
Use CSVLogger
to store all accuracy metrics to file
You can save the best model to file using ModelCheckpoint
early_stopping = EarlyStopping(monitor='val_loss',patience=2)
csv_logger = CSVLogger(filename='cnn_model_initial'+'.csv')
checkpointer = ModelCheckpoint(filepath='cnn_model_initial'+'.hdf5',verbose=1,save_best_only=True)
Define architecture for the CNN network
This is loosely based on the inception architecture described in this video, where the full model is here. In order to implement this architecture, we need to use the functional API to accommodate the non-sequential graph.
The idea behind this architecture is that multiple layers of convolution and pooling are able to learn higher-order features. You can then concatenate all the layers together in order to retain both higher- and lower-order features.
You may be wondering what a 1x1 convolution is. Functionally, it reduces dimensionality along the channel axis and introduces a ReLU operation.
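As a standalone illustration (hypothetical layer names and shapes, not part of the model below), a kernel_size=1 convolution maps each 200-dim word vector to a smaller vector independently at every sequence position:
illus_in = Input(shape=(max_seq_length, glove_vec_size))  # a (50, 200) word-vector sequence
illus_out = Convolution1D(filters=32, kernel_size=1, activation='relu')(illus_in)
print Model(illus_in, illus_out).output_shape  # (None, 50, 32): length unchanged, channels reduced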
Additionally, batch norm is a helpful internal normalization step; for help on how to use it in practice, see here.
# input data as word embeddings
sequence_input = Input(shape=(max_seq_length,), dtype='int32',name='input')
embedded_sequences = embedding_layer(sequence_input)
embedded_sequences = Dropout(0.5)(embedded_sequences)
# define serial and parallel convolutional layers
first_1x1 = Convolution1D(filters=128,kernel_size=1, activation='relu')(embedded_sequences)
first_1x1 = Dropout(0.5)(first_1x1)
second_5x5 = Convolution1D(filters=128,kernel_size=5, activation='relu')(first_1x1)
second_5x5 = Dropout(0.5)(second_5x5)
second_3x3 = Convolution1D(filters=128,kernel_size=3, activation='relu')(first_1x1)
second_3x3 = Dropout(0.5)(second_3x3)
first_pool = MaxPooling1D(pool_size=3)(embedded_sequences)
first_pool = Dropout(0.5)(first_pool)
second_1x1 = Convolution1D(filters=128,kernel_size=3, activation='relu')(first_pool)
second_1x1 = Dropout(0.5)(second_1x1)
flat_first_1x1 = Flatten()(first_1x1)
flat_second_5x5 = Flatten()(second_5x5)
flat_second_3x3 = Flatten()(second_3x3)
flat_second_1x1 = Flatten()(second_1x1)
merged_flat = concatenate([flat_first_1x1,flat_second_5x5,flat_second_3x3,flat_second_1x1])
merged_flat = Dropout(0.5)(merged_flat)
batch_normed = BatchNormalization()(merged_flat)
hidden_layer = Dense(128, activation='relu',name='hidden_layer')(batch_normed)
preds = Dense(training_labels.shape[1], activation='softmax')(hidden_layer)
cnn_model = Model(inputs=sequence_input, outputs=preds)
cnn_model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['acc'],)
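It can be handy to eyeball the layer shapes and parameter counts before committing to a long training run:
# print a per-layer summary of output shapes and trainable parameters
cnn_model.summary()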
Show the CNN architecture
plot_model(cnn_model,to_file='cnn_initial_architecture.png')
fig, ax = plt.subplots(1,figsize=(3,3),dpi=500)
m = plt.imread('cnn_initial_architecture.png')
ax.imshow(m, interpolation='nearest')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()
fit the CNN model on training data
and store relevant files
%%time
cnn_model.fit(training_seq, training_labels,\
epochs=200, batch_size=256,validation_split=0.2,callbacks=[csv_logger,early_stopping,checkpointer],\
verbose=0)
It took a total of 43 minutes to train 12 epochs for the CNN model. We will come back to check the performance after switching gears to the LSTM model.
LSTM model
LSTM models are more complex and slower, but nicely suited to text data. A couple of things to note. First, I'm using bidirectional LSTMs. Consider the sentence: "I'm going to get in my ..... and go to the store." Reading this sentence forward tells you that ..... is likely something you can get into, and reading it backwards suggests that ..... can take you to the store. The bidirectionality captures both.
Second, to stack consecutive LSTM layers, the first LSTM must return its full output sequence. This enables stacking, which allows for increased model complexity.
checkpointer = ModelCheckpoint(filepath='lstm_model_initial'+'.hdf5',verbose=1,save_best_only=True)
csv_logger = CSVLogger(filename='lstm_model_initial'+'.csv')
sequence_input = Input(shape=(max_seq_length,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
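# the first Bidirectional LSTM returns the full sequence (return_sequences=True) so a second LSTM can stack on top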
LSTM_layer = Bidirectional(LSTM(units=128,dropout=0.5, recurrent_dropout=0.5,return_sequences=True))(embedded_sequences)
LSTM_layer = Bidirectional(LSTM(units=128,dropout=0.5, recurrent_dropout=0.5))(LSTM_layer)
batch_normed = BatchNormalization()(LSTM_layer)
hidden_layer = Dense(128, activation='relu')(batch_normed)
hidden_layer = Dropout(0.9)(hidden_layer)
preds = Dense(training_labels.shape[1], activation='softmax')(hidden_layer)
lstm_text_model = Model(inputs=sequence_input, outputs=preds)
lstm_text_model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['acc'],)
fit the LSTM model
and store pertinent files
%%time
lstm_text_model.fit(training_seq, training_labels,\
epochs=200, batch_size=256,validation_split=0.2,callbacks=[early_stopping,csv_logger,checkpointer],\
verbose=0)
The LSTM model took substantially longer than the CNN model: about twice as much total elapsed time for one-third the number of training epochs.
visualize the LSTM architecture
plot_model(lstm_text_model,to_file='lstm_initial_architecture.png')
fig, ax = plt.subplots(1,figsize=(2,2),dpi=500)
m = plt.imread('lstm_initial_architecture.png')
ax.imshow(m)
plt.axis('off')
plt.show()
Model evaluation
Now we compare the performance of the two models. First, pull the best-performing models back into memory.
lstm_best = load_model('lstm_model_initial.hdf5')
cnn_best = load_model('cnn_model_initial.hdf5')
process the testing tweets to padded sequences
testing_texts = df_test['normed_text'].values
testing_labels = to_categorical(df_test['Sentiment'].values)
raw_test_sequences = tokenizer.texts_to_sequences(testing_texts)
testing_seq = pad_sequences(raw_test_sequences, maxlen=max_seq_length)
then predict the labels
lstm_preds = lstm_best.predict(testing_seq)
cnn_preds = cnn_best.predict(testing_seq)
def collapse_one_hot(row):
"""For checking accuracy of multiclass classifiers (e.g., neural nets)
Take binary classification with two columns and collapse to a single column.
Returns `0` for the first class and `1` for the second class"""
return 1 if row[0] < row[1] else 0
convert to binary for accuracy scoring
lstm_preds_binary = np.apply_along_axis(collapse_one_hot, 1, lstm_preds)
cnn_preds_binary = np.apply_along_axis(collapse_one_hot, 1, cnn_preds)
test_labels_normed = np.apply_along_axis(collapse_one_hot, 1, testing_labels)
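As an aside, np.argmax collapses one-hot rows in a single vectorized call; the check below would confirm it gives the same result as collapse_one_hot:
assert (np.argmax(lstm_preds, axis=1) == lstm_preds_binary).all()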
lstm_acc = metrics.accuracy_score(test_labels_normed,lstm_preds_binary)
cnn_acc = metrics.accuracy_score(test_labels_normed,cnn_preds_binary)
print "LSTM model accuracy: %s\nCNN model accuracy: %s" % (lstm_acc,cnn_acc)
The accuracies are essentially the same, with the CNN model slightly outperforming the LSTM.
plot ROC curve for each model
# score the positive class (column 1) so FPR/TPR match the axis labels below
fpr1, tpr1, thresholds1 = metrics.roc_curve(test_labels_normed, lstm_preds[:,1])
fpr2, tpr2, thresholds2 = metrics.roc_curve(test_labels_normed, cnn_preds[:,1])
fig = plt.figure(figsize=(5,4), dpi=100)
ax = fig.add_subplot(111)
plt.plot(fpr1, tpr1, color='green', label='lstm')
plt.plot(fpr2, tpr2, color='darkorange', label='cnn')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.05])
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
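For a single-number comparison, the area under each ROC curve can be computed directly from the same positive-class scores:
lstm_auc = metrics.roc_auc_score(test_labels_normed, lstm_preds[:,1])
cnn_auc = metrics.roc_auc_score(test_labels_normed, cnn_preds[:,1])
print "LSTM model AUC: %s\nCNN model AUC: %s" % (lstm_auc, cnn_auc)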
Conclusions
I hope you found this post helpful. In it I showed a couple of architectures that I spent just a bit of time optimizing. Both models achieved ~79% accuracy, even when trained with only 50,000 examples. I don't show it here, but I've found these accuracy metrics to hold for other datasets, in different domains, and from a more recent time, when people convey their emotions less with emoticons :( and more with gifs, emojis, videos, etc.
The CNN trained about twice as fast and still performed slightly better than the LSTM. Since the data size is so small, this isn't terribly surprising; with more data, the LSTM may flex its complexity to outperform the CNN. Another thing to note is that I did not do any hyperparameter optimization. Take a look at all those dropouts, for example; perhaps the regularization could be improved.