Deep learning models have achieved remarkable results in computer vision and speech recognition in recent years. Within natural language processing, much of the work with deep learning methods has involved learning word vector representations through neural language models.
In this blog post I will summarize the paper "Convolutional Neural Networks for Sentence Classification" by Yoon Kim (New York University), which shows how a simple CNN with one layer of convolution on top of word vectors obtained from an unsupervised neural language model achieves excellent performance on several benchmarks.
Model
The proposed model architecture, shown below, is a slight variant of the CNN architecture originally presented by Collobert et al. in 2011.

Figure: model architecture with two channels for an example sentence.
Let $\mathbf{x}_i \in \mathbb{R}^k$ be the k-dimensional word vector corresponding to the i-th word in the sentence.
A sentence of length n (padded where necessary) is represented as

$$\mathbf{x}_{1:n} = \mathbf{x}_1 \oplus \mathbf{x}_2 \oplus \cdots \oplus \mathbf{x}_n$$

where $\oplus$ is the concatenation operator.
In general, let $\mathbf{x}_{i:i+j}$ denote the concatenation of the word vectors $\mathbf{x}_i, \mathbf{x}_{i+1}, \ldots, \mathbf{x}_{i+j}$. A convolution filter $\mathbf{w} \in \mathbb{R}^{hk}$ is applied to a window of $h$ words to produce a feature, which can be written as

$$c_i = f(\mathbf{w} \cdot \mathbf{x}_{i:i+h-1} + b)$$
Here $b$ is a bias term and $f$ is a non-linear function such as the hyperbolic tangent. The filter is applied to every possible window of words in the sentence $\{\mathbf{x}_{1:h}, \mathbf{x}_{2:h+1}, \ldots, \mathbf{x}_{n-h+1:n}\}$ to produce a feature map

$$\mathbf{c} = [c_1, c_2, \ldots, c_{n-h+1}]$$
A max-over-time pooling operation is then applied over the feature map, taking the maximum value $\hat{c} = \max\{\mathbf{c}\}$ as the feature corresponding to this particular filter. This is the process for extracting one feature from one filter; in practice the model uses multiple filters with different window sizes to extract multiple features, and these operations run in parallel.
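To make the convolution and max-over-time pooling concrete, here is a minimal NumPy sketch (not the paper's original Theano implementation); the toy sentence size, the number of filters per window size, and the random initialization are purely illustrative.

```python
import numpy as np

def conv_feature_map(x, w, b):
    """Slide a filter over the sentence matrix x (n x k) and return
    the feature map c = [c_1, ..., c_{n-h+1}]."""
    n, k = x.shape
    h = w.shape[0] // k                      # window size implied by the filter length
    return np.array([np.tanh(np.dot(w, x[i:i + h].reshape(-1)) + b)
                     for i in range(n - h + 1)])

# Toy sentence of n = 7 words with k = 5 dimensional word vectors (illustrative sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(7, 5))

features = []
for h in (3, 4, 5):                          # multiple window sizes, as in the paper
    for _ in range(2):                       # a few filters per size (100 each in the paper)
        w = rng.normal(size=h * 5)           # filter w in R^{hk}
        c = conv_feature_map(x, w, b=0.0)    # feature map over all windows
        features.append(c.max())             # max-over-time pooling: one feature per filter

z = np.array(features)                       # penultimate feature vector fed to the classifier
print(z.shape)
```

Note that each filter contributes exactly one number to the final feature vector, regardless of sentence length, which is what makes variable-length sentences easy to handle.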
Regularization
For regularization, the author uses dropout on the penultimate layer together with a constraint on the l2-norms of the weight vectors. Given the penultimate layer $\mathbf{z} = [\hat{c}_1, \ldots, \hat{c}_m]$ (the pooled features from $m$ filters), the usual output unit computes

$$y = \mathbf{w} \cdot \mathbf{z} + b$$
but with dropout it becomes

$$y = \mathbf{w} \cdot (\mathbf{z} \circ \mathbf{r}) + b$$
where $\mathbf{r}$ is a 'masking' vector of Bernoulli random variables, each equal to 1 with probability $p$, and $\circ$ denotes element-wise multiplication with the layer $\mathbf{z}$. At test time the learned weight vectors are scaled by $p$, and after each gradient step the weight vector is rescaled whenever its l2-norm exceeds the constraint $s$.
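Here is a small NumPy sketch of the dropout masking and l2-norm rescaling described above; the vector size is illustrative, and p = 0.5 and s = 3 are taken from the hyperparameter section below.

```python
import numpy as np

rng = np.random.default_rng(0)

z = rng.normal(size=300)              # penultimate layer (one pooled feature per filter)
w = rng.normal(size=300)              # weight vector of one output unit
b = 0.1
p = 0.5                               # probability that a unit is kept (r_i = 1)
s = 3.0                               # l2-norm constraint on the weight vector

# Training-time forward pass with the Bernoulli 'masking' vector r.
r = rng.binomial(1, p, size=z.shape)
y_train = np.dot(w, z * r) + b        # y = w . (z o r) + b

# After a gradient step, rescale w whenever its l2-norm exceeds s.
norm = np.linalg.norm(w)
if norm > s:
    w = w * (s / norm)

# At test time the learned weights are scaled by p instead of sampling r.
y_test = np.dot(p * w, z) + b
```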
Datasets and Experimental Setup
The model was tested on several datasets; a summary of the dataset statistics is given below.
where:
- c : Number of target classes
- l : Average sentence length
- N : Dataset size
- |V| : Vocabulary size
- |Vpre| : Number of words present in the set of pretrained word vectors
- Test : Test set size; if no standard test split is available, the paper reports cross-validation (CV) results
Hyperparameters & Training
For all datasets the model uses rectified linear units, filter windows (h) of 3, 4, 5 with 100 feature maps each, a dropout rate (p) of 0.5, an l2 constraint (s) of 3, and a mini-batch size of 50. These values were chosen via a grid search on the SST-2 dev set.
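Collected as a plain Python dict, purely as a reference when reimplementing the model (the key names are my own shorthand, not taken from the paper's code):

```python
# Hyperparameters reported in the paper.
hparams = {
    "activation": "relu",            # rectified linear units
    "filter_windows": (3, 4, 5),     # convolution window sizes h
    "feature_maps_per_window": 100,  # number of filters per window size
    "dropout_p": 0.5,                # dropout rate on the penultimate layer
    "l2_constraint_s": 3.0,          # max l2-norm of each weight vector
    "mini_batch_size": 50,
}
```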
Pre-trained Word Vectors
The model uses the publicly available word2vec vectors, which were trained on 100 billion words from Google News. The vectors have a dimensionality of 300 (i.e. k = 300) and were trained using the continuous bag-of-words architecture. Words not present in the set of pre-trained vectors are initialized randomly.
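A sketch of how such an embedding matrix could be assembled with gensim, assuming the standard GoogleNews-vectors-negative300.bin file; the toy vocabulary and the uniform range for unknown words are my own illustrative choices, not taken from the paper's code.

```python
import numpy as np
from gensim.models import KeyedVectors

K = 300
vocab = ["the", "movie", "was", "great", "zxqvunseenword"]   # hypothetical task vocabulary

# Load the pre-trained Google News vectors (file path is an assumption).
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

rng = np.random.default_rng(0)
embedding = np.zeros((len(vocab), K), dtype=np.float32)
for i, word in enumerate(vocab):
    if word in w2v:
        embedding[i] = w2v[word]                     # copy the pre-trained vector
    else:
        embedding[i] = rng.uniform(-0.25, 0.25, K)   # random init for out-of-vocabulary words
```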
Model variations
The author evaluates several variations of the CNN model; their performance, compared against a range of other models, is shown in the table below.
- CNN-rand: All words are randomly initialized
- CNN-static: A model with pre-trained vectors from word2vec. All word vectors, including the randomly initialized ones for unknown words, are kept static; only the other parameters of the model are learned.
- CNN-non-static: Same as above but the pretrained vectors are fine-tuned for each task.
- CNN-multichannel: A model with two sets of word vectors. Each set of vectors is treated as a 'channel' and each filter is applied to both channels, but gradients are backpropagated only through one of the channels. Hence the model is able to fine-tune one set of vectors while keeping the other static. Both channels are initialized with word2vec.
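The four variants mostly differ in how the embedding layer is initialized and whether it receives gradients. Below is a minimal PyTorch sketch of those choices (the paper's implementation uses Theano, so this is only an analogy); `weights` stands for the embedding matrix built from word2vec as sketched earlier, and the sizes are placeholders.

```python
import torch
import torch.nn as nn

vocab_size, k = 10000, 300
weights = torch.randn(vocab_size, k)   # placeholder for the word2vec-based embedding matrix

emb_rand = nn.Embedding(vocab_size, k)                               # CNN-rand: random, trainable
emb_static = nn.Embedding.from_pretrained(weights, freeze=True)      # CNN-static: fixed word2vec
emb_nonstatic = nn.Embedding.from_pretrained(weights, freeze=False)  # CNN-non-static: fine-tuned

# CNN-multichannel: two copies of the same vectors; only one channel is fine-tuned.
emb_frozen = nn.Embedding.from_pretrained(weights.clone(), freeze=True)
emb_tuned = nn.Embedding.from_pretrained(weights.clone(), freeze=False)

ids = torch.tensor([[1, 5, 42, 7]])                                   # toy batch of word indices
two_channels = torch.stack([emb_frozen(ids), emb_tuned(ids)], dim=1)  # (batch, 2, n, k)
```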
Results
Analyzing the table, we can see that the models with pre-trained word vectors perform noticeably better than the one with randomly initialized vectors. These results suggest that the pre-trained vectors are good, 'universal' feature extractors and can be utilized across datasets. Fine-tuning the pre-trained vectors for each task gives still further improvements (CNN-non-static).
Some further observations are
- Dropout proved to be such a good regularizer that it was fine to use a larger than necessary network and simply let dropout regularize it. Dropout consistently added 2%–4% relative performance.
- The author also experimented with another set of publicly available word vectors trained on Wikipedia data, but found that word2vec gave far superior results.
- Adadelta gave similar results to Adagrad but required fewer epochs.
Conclusion
The paper shows that a simple CNN with one layer of convolution on top of word vectors from pre-trained models such as word2vec, with little hyperparameter tuning, can match or exceed the results of far more complicated models in the field. The results also add to the well-established evidence that unsupervised pre-training of word vectors is an important ingredient in deep learning for NLP.
You can read the original paper here
Hope you have liked the post. Happy Learning!