Exploring neural networks for text classification
Posted on Fri 17 November 2017 in blog
I've been working on text classification recently. I've found keras to be quite a good high-level library, and great for learning different neural network architectures. In this notebook I will examine tweet classification using CNN and LSTM model architectures. While CNNs are widely used in computer vision, I saw a paper and a poster describing CNNs for use in text classification. Also, I heard at MLConf 2017 that Facebook recently switched to using CNNs because it gave an order-of-magnitude increase in speed. LSTMs, on the other hand, are sequence-based models that account for the recency of contextual data, and so are well suited to text modeling. For an excellent intro, see this post by Edwin Chen.
The data I'm using for this is the "distant labeled" Twitter dataset: a set of 1.6 million tweets that each contain an emoticon. Positive emoticons serve as a proxy for positive sentiment, and vice versa for negative sentiment. As such, it is cheap to collect training data that is potentially good enough for useful sentiment models. While not shown here, I've tested this, and it generalizes very well across datasets, even to the present day, when nobody uses :) or :-( to convey their emotions anymore.
I've tried to include practical tips from my experience, since there are enough tutorials on keras already.
Lessons learned
- weight normalization matters
- the right data pre-processing can make huge differences
- callbacks are essential for model benchmarking and testing
- get a working model and iterate!
import dependencies
import re

import numpy as np
import pandas as pd
import pydot
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.model_selection import train_test_split

from keras.models import Model, load_model
from keras.layers import (Dense, Dropout, Flatten, Input, MaxPooling1D,
                          Convolution1D, Embedding, concatenate, LSTM, Bidirectional)
from keras.layers.normalization import BatchNormalization
from keras.utils.np_utils import to_categorical
from keras.utils import plot_model
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import EarlyStopping, CSVLogger, ModelCheckpoint
np.random.seed(0)
import keras
print keras.__version__
text pre-processing functions
I found these quite helpful for improving performance. If they are not included, the vocabulary gets too big, since each username or URL is counted as a unique word.
def standardize_html(tweet):
"""replace all http links with 'URL'"""
return re.sub(r"http\S+", "URL", tweet)
def standardize_www(tweet):
"""replace all www links with 'URL'"""
return re.sub(r"www\S+", "URL", tweet)
def standardize_mentions(tweet):
"""replace all mentions with 'twittermention'"""
return re.sub(r"(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z_]+[A-Za-z0-9_]+)", "twittermention", tweet)
define dimensions for input data
glove_vec_size = 200 # dimension of pre-trained glove vector
top_n_words_to_process = 100000 # number of words to keep in tokenizer
max_seq_length = 50 # max word length per tweet
define data dirs
glove_dir = '../glove_pretrained/glove.twitter.27B.200d.txt'
labeled_twts_dir = './'
def loadGloveModel(gloveFile):
    """load pre-trained glove vectors into memory as a dictionary"""
    print "Loading Glove Model"
    model = {}
    with open(gloveFile, 'r') as f:
        for line in f:
            splitLine = line.split()
            word = splitLine[0]
            embedding = [float(val) for val in splitLine[1:]]
            model[word] = embedding
    print "Done. %s words loaded!" % len(model)
    return model
glove_word_vecs = loadGloveModel(glove_dir)
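As a quick sanity check (assuming 'cat' appears in the Twitter GloVe vocabulary, which it should for a 27B-token corpus):
print len(glove_word_vecs['cat'])  # 200, matching glove_vec_size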
data processing
Split the data into train and test sets, then clean the text. Note that I'm only using a fraction of the tweets (50,000); with more time and GPU power, scaling to much larger sizes would be feasible.
df = pd.read_csv('lab_twts_1million.csv',error_bad_lines=False) # labeled data
df.reset_index(drop=True,inplace=True)
#df_sub = df.sample(frac=0.001,random_state=42)
df_sub = df.sample(n=int(5e4),random_state=42)
print 'subsampled training examples: %s' % (df_sub.shape[0])
df_sub['normed_text'] = df_sub['SentimentText']\
.apply(standardize_mentions)\
.apply(standardize_html)\
.apply(standardize_www)
df_train, df_test = train_test_split(df_sub,test_size=0.1,random_state=42)
training_texts = df_train['normed_text'].values
training_labels = to_categorical(df_train['Sentiment'].values)
Convert input tweets to padded sequences
Keras expects integer sequences, e.g., [0, 0, 0, 153, 48, ...]. To format the data appropriately, fit a tokenizer on the input tweets, then transform them. To ensure all data has the same dimensions, pad the front of each sequence with zeros until every tweet has length 50.
tokenizer = Tokenizer(num_words=top_n_words_to_process)
tokenizer.fit_on_texts(training_texts)
raw_train_sequences = tokenizer.texts_to_sequences(training_texts)
word_index = tokenizer.word_index
training_seq = pad_sequences(raw_train_sequences, maxlen=max_seq_length)
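A quick shape check; the first dimension is the 45,000 tweets left in the training split after holding out 10% for testing:
print training_seq.shape  # (45000, 50)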
Build embedding matrix
The embedding matrix will be used to bake the GloVe vectors into the keras Embedding layer. If a word is not in the pre-defined GloVe dictionary, its vector is left as all zeros.
embedding_matrix = np.zeros((len(word_index) + 1, glove_vec_size))
for word, i in word_index.items():
    embedding_vector = glove_word_vecs.get(word)
    # words not found in the GloVe vocabulary stay all zeros
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
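# normalize the embedding matrix: mean-center each dimension, then scale to unit variance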
embedding_mean = np.mean(embedding_matrix,axis=0)
embedding_mean_centered = embedding_matrix - embedding_mean
embedding_stdev = np.std(embedding_mean_centered,axis=0)
normed_embedding_mat = np.divide(embedding_mean_centered,embedding_stdev)
Define the Embedding layer
This is the first layer for both the CNN and the LSTM network. It incorporates the GloVe vectors and has the necessary dimensions to define the computation graph.
# define word embedding layer for subsequent plug-and-play in model architectures
embedding_layer = Embedding(input_dim=len(word_index) + 1,
                            output_dim=glove_vec_size,
                            weights=[normed_embedding_mat],
                            input_length=max_seq_length,
                            trainable=True, name='embedding_layer')
Define callbacks
Callbacks are a set of classes in keras that allow various interactions during training. They are especially useful for saving, logging, etc., since training can be quite time consuming.
Use EarlyStopping
to prevent needless training once the model has begun overfitting. With patience=2, this stops training after the loss on the validation set has not improved for 2 consecutive epochs. Note that if you are trying to optimize for performance here, it would probably be best to 1) overfit the data to >99% accuracy on the training set to show your model is sufficiently complex, and then 2) introduce regularization to reduce model variance.
Use CSVLogger
to store all accuracy metrics to file
You can save the best model to file using ModelCheckpoint
early_stopping = EarlyStopping(monitor='val_loss',patience=2)
csv_logger = CSVLogger(filename='cnn_model_initial'+'.csv')
checkpointer = ModelCheckpoint(filepath='cnn_model_initial'+'.hdf5',verbose=1,save_best_only=True)
Define architecture for the CNN network
This is loosely based on the inception architecture described in this video, where the full model is here. In order to implement this architecture, we need to use the functional API to accommodate the non-sequential graph.
The idea behind this architecture is that multiple layers of convolution and pooling are able to learn higher-order features. You can then concatenate all the layers together in order to retain both higher- and lower-order features.
You may be wondering what a 1x1 convolution is. Functionally, it reduces dimensionality along the channel axis and introduces a ReLU operation.
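As a standalone illustration (hypothetical layer names and shapes, not part of the model below), a kernel_size=1 convolution maps each 200-dim word vector to a smaller vector independently at every sequence position:
illus_in = Input(shape=(max_seq_length, glove_vec_size))  # a (50, 200) word-vector sequence
illus_out = Convolution1D(filters=32, kernel_size=1, activation='relu')(illus_in)
print Model(illus_in, illus_out).output_shape  # (None, 50, 32): length unchanged, channels reduced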
Additionally, batch norm is a helpful internal normalization step; for help on how to use it in practice, see here.
# input data as word embeddings
sequence_input = Input(shape=(max_seq_length,), dtype='int32',name='input')
embedded_sequences = embedding_layer(sequence_input)
embedded_sequences = Dropout(0.5)(embedded_sequences)
# define serial and parallel convolutional layers
first_1x1 = Convolution1D(filters=128,kernel_size=1, activation='relu')(embedded_sequences)
first_1x1 = Dropout(0.5)(first_1x1)
second_5x5 = Convolution1D(filters=128,kernel_size=5, activation='relu')(first_1x1)
second_5x5 = Dropout(0.5)(second_5x5)
second_3x3 = Convolution1D(filters=128,kernel_size=3, activation='relu')(first_1x1)
second_3x3 = Dropout(0.5)(second_3x3)
first_pool = MaxPooling1D(pool_size=3)(embedded_sequences)
first_pool = Dropout(0.5)(first_pool)
second_1x1 = Convolution1D(filters=128,kernel_size=3, activation='relu')(first_pool)
second_1x1 = Dropout(0.5)(second_1x1)
flat_first_1x1 = Flatten()(first_1x1)
flat_second_5x5 = Flatten()(second_5x5)
flat_second_3x3 = Flatten()(second_3x3)
flat_second_1x1 = Flatten()(second_1x1)
merged_flat = concatenate([flat_first_1x1,flat_second_5x5,flat_second_3x3,flat_second_1x1])
merged_flat = Dropout(0.5)(merged_flat)
batch_normed = BatchNormalization()(merged_flat)
hidden_layer = Dense(128, activation='relu',name='hidden_layer')(batch_normed)
preds = Dense(training_labels.shape[1], activation='softmax')(hidden_layer)
cnn_model = Model(inputs=sequence_input, outputs=preds)
cnn_model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['acc'],)
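It can be handy to eyeball the layer shapes and parameter counts before committing to a long training run:
# print a per-layer summary of output shapes and trainable parameters
cnn_model.summary()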
Show the CNN architecture
plot_model(cnn_model,to_file='cnn_initial_architecture.png')
fig, ax = plt.subplots(1,figsize=(3,3),dpi=500)
m = plt.imread('cnn_initial_architecture.png')
ax.imshow(m, interpolation='nearest')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()
fit the CNN model on training data
and store relevant files
%%time
cnn_model.fit(training_seq, training_labels,\
epochs=200, batch_size=256,validation_split=0.2,callbacks=[csv_logger,early_stopping,checkpointer],\
verbose=0)
It took a total of 43 minutes to train 12 epochs for the CNN model. We will come back to check the performance after switching gears to the LSTM model.
LSTM model
LSTM models are more complex and slower, but nicely suited to text data. A couple of things to note. First, I'm using bidirectional LSTMs. Consider the sentence: "I'm going to get in my ..... and go to the store." Reading this sentence forward tells you that ..... is likely something you can get into, and reading it backwards suggests that ..... can take you to the store. The bidirectionality captures both.
Second, to stack consecutive LSTM layers, the first LSTM must return its full output sequence. This enables stacking, which allows for increased model complexity.
checkpointer = ModelCheckpoint(filepath='lstm_model_initial'+'.hdf5',verbose=1,save_best_only=True)
csv_logger = CSVLogger(filename='lstm_model_initial'+'.csv')
sequence_input = Input(shape=(max_seq_length,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
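# the first Bidirectional LSTM returns the full sequence (return_sequences=True) so a second LSTM can stack on top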
LSTM_layer = Bidirectional(LSTM(units=128,dropout=0.5, recurrent_dropout=0.5,return_sequences=True))(embedded_sequences)
LSTM_layer = Bidirectional(LSTM(units=128,dropout=0.5, recurrent_dropout=0.5))(LSTM_layer)
batch_normed = BatchNormalization()(LSTM_layer)
hidden_layer = Dense(128, activation='relu')(batch_normed)
hidden_layer = Dropout(0.9)(hidden_layer)
preds = Dense(training_labels.shape[1], activation='softmax')(hidden_layer)
lstm_text_model = Model(inputs=sequence_input, outputs=preds)
lstm_text_model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['acc'],)
fit the LSTM model
and store pertinent files
%%time
lstm_text_model.fit(training_seq, training_labels,\
epochs=200, batch_size=256,validation_split=0.2,callbacks=[early_stopping,csv_logger,checkpointer],\
verbose=0)
The LSTM model took substantially longer than the CNN model: about twice as much total elapsed time for one-third the number of training epochs.
visualize the LSTM architecture
plot_model(lstm_text_model,to_file='lstm_initial_architecture.png')
fig, ax = plt.subplots(1,figsize=(2,2),dpi=500)
m = plt.imread('lstm_initial_architecture.png')
ax.imshow(m)
plt.axis('off')
plt.show()
Model evaluation
Now we compare the performance of the two models. First, pull the best-performing models back into memory.
lstm_best = load_model('lstm_model_initial.hdf5')
cnn_best = load_model('cnn_model_initial.hdf5')
process the testing tweets to padded sequences
testing_texts = df_test['normed_text'].values
testing_labels = to_categorical(df_test['Sentiment'].values)
raw_test_sequences = tokenizer.texts_to_sequences(testing_texts)
testing_seq = pad_sequences(raw_test_sequences, maxlen=max_seq_length)
then predict the labels
lstm_preds = lstm_best.predict(testing_seq)
cnn_preds = cnn_best.predict(testing_seq)
def collapse_one_hot(row):
"""For checking accuracy of multiclass classifiers (e.g., neural nets)
Take binary classification with two columns and collapse to a single column.
Returns `0` for the first class and `1` for the second class"""
return 1 if row[0] < row[1] else 0
convert to binary for accuracy scoring
lstm_preds_binary = np.apply_along_axis(collapse_one_hot, 1, lstm_preds)
cnn_preds_binary = np.apply_along_axis(collapse_one_hot, 1, cnn_preds)
test_labels_normed = np.apply_along_axis(collapse_one_hot, 1, testing_labels)
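As an aside, np.argmax collapses one-hot rows in a single vectorized call; the check below would confirm it gives the same result as collapse_one_hot:
assert (np.argmax(lstm_preds, axis=1) == lstm_preds_binary).all()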
lstm_acc = metrics.accuracy_score(test_labels_normed,lstm_preds_binary)
cnn_acc = metrics.accuracy_score(test_labels_normed,cnn_preds_binary)
print "LSTM model accuracy: %s\nCNN model accuracy: %s" % (lstm_acc,cnn_acc)
The accuracies are essentially the same, with the CNN model slightly outperforming the LSTM.
plot ROC curve for each model
# score the positive class (column 1) so FPR/TPR match the axis labels below
fpr1, tpr1, thresholds1 = metrics.roc_curve(test_labels_normed, lstm_preds[:,1])
fpr2, tpr2, thresholds2 = metrics.roc_curve(test_labels_normed, cnn_preds[:,1])
fig = plt.figure(figsize=(5,4), dpi=100)
ax = fig.add_subplot(111)
plt.plot(fpr1, tpr1, color='green', label='lstm')
plt.plot(fpr2, tpr2, color='darkorange', label='cnn')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.05])
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
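For a single-number comparison, the area under each ROC curve can be computed directly from the same positive-class scores:
lstm_auc = metrics.roc_auc_score(test_labels_normed, lstm_preds[:,1])
cnn_auc = metrics.roc_auc_score(test_labels_normed, cnn_preds[:,1])
print "LSTM model AUC: %s\nCNN model AUC: %s" % (lstm_auc, cnn_auc)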
Conclusions
I hope you found this post helpful. In it I showed a couple of architectures that I spent just a bit of time optimizing. Both models achieved ~79% accuracy, even when trained with only 50,000 examples. I don't show it here, but I've found these accuracy metrics to hold for other datasets, in different domains, and from a more recent time, when people convey their emotions less with emoticons :( and more with gifs, emojis, videos, etc.
The CNN trained about twice as fast and still performed slightly better than the LSTM. Since the data size is so small, this isn't terribly surprising; with more data, the LSTM may flex its complexity to outperform the CNN. Another thing to note is that I did not do any hyperparameter optimization. Take a look at all those dropouts, for example; perhaps the regularization could be improved.