Practical Deep Learning for NLP

Maarten Versteegh

NLP Research Engineer

Overview

● Deep Learning Recap
● Text classification:
  – Convnet with word embeddings
● Sentiment analysis:
  – ResNet
● Tips and tricks

What is this deep learning thing again?

[Figure: network diagram with Input, Hidden and Output layers; labels: Activation, Error]

Rectified Linear Units

Backpropagation involves repeated multiplication with the derivative of the activation function
→ Problem if the result is always smaller than 1!
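A minimal sketch of why this matters, ignoring the weights and looking only at the activation derivatives. The depth of 10 and the chosen input values are assumptions for illustration. The sigmoid derivative never exceeds 0.25, so the product shrinks quickly with depth; the ReLU derivative is exactly 1 for positive inputs, so the product is preserved.

```python
import math

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid; its maximum is 0.25 at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def gradient_factor(grad_fn, pre_activation, depth):
    """Product of `depth` activation derivatives, ignoring the weights."""
    factor = 1.0
    for _ in range(depth):
        factor *= grad_fn(pre_activation)
    return factor

# Best case for the sigmoid (x = 0) versus an active ReLU unit (x > 0).
sigmoid_factor = gradient_factor(sigmoid_grad, 0.0, depth=10)  # 0.25 ** 10
relu_factor = gradient_factor(relu_grad, 1.0, depth=10)        # 1.0 ** 10
print(sigmoid_factor, relu_factor)
```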

Text Classification

Traditional approach: BOW + TFIDF

“The car might also need a front end alignment”

Unigrams: "alignment" (0.323), "also" (0.137), "car" (0.110), "end" (0.182), "front" (0.167), "might" (0.178), "need" (0.157), "the" (0.053)

Bigrams: "also need" (0.343), "car might" (0.358), "end alignment" (0.358), "front end" (0.296), "might also" (0.358), "need front" (0.358), "the car" (0.161)
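The feature names above can be reproduced with a few lines of plain Python. This is a sketch under one assumption: tokens shorter than two characters are dropped, which is inferred from the bigram "need front" (the "a" is gone) and matches the default token pattern of scikit-learn's TfidfVectorizer. The TF-IDF weights themselves depend on the whole corpus and are not reproduced here.

```python
import re

def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def ngrams(tokens, n):
    """Return the sorted set of space-joined n-grams."""
    return sorted({" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})

sentence = "The car might also need a front end alignment"
tokens = tokenize(sentence)
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams)
print(bigrams)
```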

20 newsgroups performance

Model            F1-Score*
BOW+TFIDF+SVM    Some number

(*) Scores removed

Deep Learning 1: Replace Classifier

[Figure: BOW Features (x 1000) → Hidden (x 512) → Hidden (x 256) → Output]

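from keras.layers import Input, Dense
from keras.models import Model

# bow: matrix with 1000 bag-of-words features per document
input_layer = Input(shape=(1000,))
fc_1 = Dense(512, activation='relu')(input_layer)
fc_2 = Dense(256, activation='relu')(fc_1)
# the number of output units must match the number of categories loaded
output_layer = Dense(10, activation='softmax')(fc_2)

# Keras 2+ spelling of the keyword arguments
model = Model(inputs=input_layer, outputs=output_layer)
# assuming newsgroups.target holds integer class labels (as in scikit-learn),
# the sparse variant of the loss is needed
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(bow, newsgroups.target)

# features: BOW matrix of the documents to classify
predictions = model.predict(features).argmax(axis=1)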
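What the softmax output layer and the final argmax do can be sketched without Keras. The scores below are made up; in the model they would come from the last Dense layer.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

probabilities = softmax([1.0, 3.0, 0.5])  # made-up scores for three classes
predicted_class = argmax(probabilities)
print(probabilities, predicted_class)
```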

20 newsgroups performance

Model                         F1-Score*
BOW+TFIDF+SVD + 2-layer NN    Some slightly higher number

(*) Scores removed

What about the deep learning promise?

Convolutional Networks

Source: Andrej Karpathy

Pooling layer


Convolutional networks

Source: Y. Kim (2014) Convolutional Networks for Sentence Classification

Word embedding

from keras.layers import Input, Embedding

# embedding_matrix: ndarray(vocab_size, embedding_dim)
input_layer = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
layer = Embedding(
    embedding_matrix.shape[0],   # vocabulary size
    embedding_matrix.shape[1],   # embedding dimension
    weights=[embedding_matrix],  # start from pretrained vectors
    input_length=MAX_SEQUENCE_LENGTH,
    trainable=False              # keep the embeddings fixed
)(input_layer)

from keras.layers import Convolution1D, MaxPooling1D, BatchNormalization, Activation

layer = Embedding(...)(input_layer)
layer = Convolution1D(
    128,  # number of filters
    5,    # filter size
    activation='relu',
)(layer)
layer = MaxPooling1D(5)(layer)
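When stacking several of these layers it helps to track how the sequence length shrinks. A rough sketch, assuming an unpadded stride-1 convolution and non-overlapping pooling (the Keras defaults); the sequence length of 1000 is a hypothetical value for MAX_SEQUENCE_LENGTH.

```python
def conv1d_output_length(length, filter_size):
    """Length after a stride-1 convolution without padding ('valid')."""
    return length - filter_size + 1

def max_pool_output_length(length, pool_size):
    """Length after non-overlapping max pooling (stride = pool size)."""
    return length // pool_size

sequence_length = 1000  # hypothetical MAX_SEQUENCE_LENGTH
after_conv = conv1d_output_length(sequence_length, 5)
after_pool = max_pool_output_length(after_conv, 5)
print(after_conv, after_pool)
```

Each conv + pool pair divides the length by roughly five, which limits how many such pairs fit on a given input length.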

Performance

Model                 F1-Score*
CBOW+TFIDF+SVD+NN     Some slightly higher number
ConvNet (3 layers)    Quite a bit higher now
ConvNet (6 layers)    Look mom, even higher!

(*) Scores removed

Sentiment Analysis

Data Set

Facebook posts from media organizations:

– CNN, MSNBC, NYTimes, The Guardian, Buzzfeed, Breitbart, Politico, The Wall Street Journal, Washington Post, Baltimore Sun

Measure sentiment as “reactions”

Title                                                                    Org          Like  Love   Wow  Haha   Sad Angry
Poll: Clinton up big on Trump in Virginia                                CNN          4176   601    17   211    11    83
It's a fact: Trump has tiny hands. Will this be the one that sinks him?  Guardian      595    17    17   225     2     8
Donald Trump Explains His Obama-Founded-ISIS Claim as ‘Sarcasm’          NYTimes      2059    32   284  1214    80  2167
Can hipsters stomach the unpalatable truth about avocado toast?          Guardian     3655     0   396    44   773    69
Tim Kaine skewers Donald Trump's military policy                         MSNBC        1094   111     6    12     2    26
Top 5 Most Antisemitic Things Hillary Clinton Has Done                   Breitbart    1067     7   134    35    22   372
17 Hilarious Tweets About Donald Trump Explaining Movies                 Buzzfeed    11390   375    16  4121     4     5
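One way to turn such raw counts into a training target is to normalize them into a distribution. This is a sketch, not necessarily the preprocessing used for the talk: leaving out "Like" is an assumption, based on the predicted distributions later in the deck, which list only the other five reactions.

```python
REACTIONS = ["Love", "Wow", "Haha", "Sad", "Angry"]

def reaction_distribution(counts):
    """Normalize raw reaction counts so that they sum to 1."""
    total = sum(counts[name] for name in REACTIONS)
    return {name: counts[name] / total for name in REACTIONS}

# Counts for the Guardian post "It's a fact: Trump has tiny hands. ..."
guardian_post = {"Like": 595, "Love": 17, "Wow": 17, "Haha": 225, "Sad": 2, "Angry": 8}
distribution = reaction_distribution(guardian_post)
print(distribution)
```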

Go deeper: ResNet

Convolutional Layers with shortcuts

He et al. Deep Residual Learning for Image Recognition


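from keras.layers import add  # Keras 2+ replacement for merge(..., mode='sum')

input_layer = ...

# padding='same' keeps the sequence length, so the shortcut shapes match
layer = Convolution1D(128, 5, padding='same', activation='linear')(input_layer)
layer = BatchNormalization()(layer)
layer = Activation('relu')(layer)

layer = Convolution1D(128, 5, padding='same', activation='linear')(layer)
layer = BatchNormalization()(layer)
layer = Activation('relu')(layer)

# shortcut connection: input_layer must also have 128 channels
block_output = add([layer, input_layer])
block_output = Activation('relu')(block_output)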
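The shortcut only works if the block output has the same shape as the block input. A single-channel sketch in plain Python makes this visible; batch normalization is left out, and the signal and kernels are made up. With all-zero kernels the convolutions contribute nothing and the block passes its (rectified) input straight through.

```python
def conv1d_same(signal, kernel):
    """Single-channel 1D convolution (cross-correlation) with zero padding
    so that the output has the same length as the input."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]

def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]

def residual_block(signal, kernel_1, kernel_2):
    """Two convolutions plus a shortcut that adds the block input back in."""
    hidden = relu(conv1d_same(signal, kernel_1))
    hidden = relu(conv1d_same(hidden, kernel_2))
    return relu([h + s for h, s in zip(hidden, signal)])

signal = [1.0, -2.0, 3.0, 0.5, 4.0, -1.0]  # made-up input
zero_kernel = [0.0] * 5                    # filter size 5, as on the slide
output = residual_block(signal, zero_kernel, zero_kernel)
print(output)
```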

[Figure: model architecture. Inputs: Title + Message (e.g. "It's a fact: Trump has tiny hands.", EMBEDDING_DIM=300) and News Org (e.g. "The Guardian", 1-of-K). Layers: ResNet Block … ResNet Block (Conv (128) x 10), MaxPooling, Dense, Dense. Output: %]

Cherry-picked predicted response distribution*

Sentence                   Org         Love  Haha   Wow   Sad  Angry
Trump wins the election    Guardian      3%    9%    7%   32%    49%
Trump wins the election    Breitbart    58%   30%    8%    1%     3%

*Your mileage may vary. By a lot. I mean it.

Tips and Tricks

Initialization

● Break symmetry:
  – Never ever initialize all your weights to the same value
● Let initialization depend on activation function:
  – ReLU/PReLU → He Normal
  – sigmoid/tanh → Glorot Normal
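The two schemes differ only in the standard deviation of the zero-mean normal the weights are drawn from. A sketch using the 1000 → 512 layer from the earlier example; library implementations may differ in detail, and the helper names here are made up for illustration.

```python
import math
import random

def he_normal_std(fan_in):
    """Standard deviation for He initialization: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

def glorot_normal_std(fan_in, fan_out):
    """Standard deviation for Glorot initialization: sqrt(2 / (fan_in + fan_out))."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def init_weights(fan_in, fan_out, std, seed=0):
    """Draw a fan_in x fan_out weight matrix from a zero-mean normal."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

relu_std = he_normal_std(1000)           # for a ReLU layer
tanh_std = glorot_normal_std(1000, 512)  # for a tanh or sigmoid layer
relu_weights = init_weights(1000, 512, relu_std)
print(relu_std, tanh_std)
```

Because every weight is drawn independently, no two units start out identical, which takes care of the symmetry point as well.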

Choose an adaptive optimizer

Source: Alec Radford


Choose the right model size

● Start small and keep adding layers
  – Check if test error keeps going down
● Cross-validate over the number of units
● You want to be able to overfit

Y. Bengio (2012) Practical recommendations for gradient-based training of deep architectures

Don't be scared of overfitting

● If your model can't overfit, it also can't learn enough

● So, check that your model can overfit:

– If not, make it bigger

– If so, get more data and/or regularize

Source: wikipedia

Regularization

● Norm penalties on hidden layer weights, never on first and last

● Dropout

● Early stopping
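Early stopping is the easiest of the three to sketch. The version below assumes a simple patience rule: stop once the validation loss has not improved on its best value for a given number of consecutive epochs. The loss values and the patience of 2 are made up.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Never triggered: training ran through all epochs.
    return len(validation_losses) - 1

losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]  # made-up validation losses
stop_epoch = early_stopping_epoch(losses, patience=2)
print(stop_epoch)
```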

Size of data set

● Just get more data already
● Augment data:
  – Textual replacements
  – Word vector perturbation
  – Noise Contrastive Estimation
● Semi-supervised learning:
  – Adapt word embeddings to your domain
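Word vector perturbation can be as simple as adding a little Gaussian noise to the embedded input. A sketch under assumptions: the noise level of 0.01 and the 4-dimensional vectors are made up, and the right amount of noise has to be tuned per task.

```python
import random

def perturb_vectors(vectors, noise_std=0.01, seed=0):
    """Return a copy of the word vectors with Gaussian noise added."""
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, noise_std) for v in vector] for vector in vectors]

# Made-up 4-dimensional embeddings for a three-word sentence.
sentence_vectors = [
    [0.10, -0.30, 0.25, 0.00],
    [0.40, 0.20, -0.10, 0.05],
    [-0.20, 0.15, 0.30, -0.45],
]
augmented = perturb_vectors(sentence_vectors)
print(augmented[0])
```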

Monitor your model

Training loss:
– Does the model converge?
– Is the learning rate too low or too high?

Training loss and learning rate



Training and validation accuracy
– Is there a large gap?
– Does the training accuracy increase while the validation accuracy decreases?

Training and validation accuracy



● Ratio of weights to updates
● Distribution of activations and gradients (per layer)
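The first check can be computed per layer from the weights and the update that was just applied. The sketch below assumes plain SGD and reports the ratio as update norm over weight norm; the numbers are made up. A commonly quoted rule of thumb (not from these slides) is that this ratio should be somewhere around 1e-3.

```python
import math

def l2_norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))

def update_to_weight_ratio(weights, gradients, learning_rate):
    """Norm of a plain SGD update divided by the norm of the weights."""
    updates = [-learning_rate * g for g in gradients]
    return l2_norm(updates) / l2_norm(weights)

weights = [0.5, -0.3, 0.8, 0.1]       # made-up layer weights
gradients = [0.02, -0.01, 0.05, 0.0]  # made-up gradients
ratio = update_to_weight_ratio(weights, gradients, learning_rate=0.01)
print(ratio)
```

A ratio far below the rule of thumb suggests the learning rate is too low; a ratio far above it suggests it is too high.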

Hyperparameter optimization

After network architecture, continue with:
– Regularization strength
– Initial learning rate
– Optimization strategy (and LR decay schedule)

Friends don't let friends do a full grid search!
– Use a smart strategy like Bayesian optimization or Particle Swarm Optimization (Spearmint, SMAC, Hyperopt, Optunity)
– Even random search often beats grid search
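Random search takes only a few lines. In the sketch below the search ranges are assumptions, and `validation_loss` is a hypothetical stand-in for training a model and scoring it on held-out data. Learning rate and regularization strength are sampled log-uniformly, since their useful values span several orders of magnitude.

```python
import math
import random

def sample_config(rng):
    """Sample a learning rate and L2 strength log-uniformly."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),
        "l2_strength": 10 ** rng.uniform(-6, -2),
    }

def validation_loss(config):
    """Hypothetical stand-in for training a model and scoring it."""
    lr_term = (math.log10(config["learning_rate"]) + 3) ** 2
    l2_term = (math.log10(config["l2_strength"]) + 4) ** 2
    return lr_term + l2_term

def random_search(n_trials, seed=0):
    """Evaluate n_trials random configurations and return the best one."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    return min(trials, key=validation_loss)

best = random_search(n_trials=20)
print(best, validation_loss(best))
```

Unlike a grid, every trial tries a new value for each hyperparameter, so the budget is not wasted on repeating values of a parameter that turns out not to matter.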

Keep up to date: arxiv-sanity.com

We are hiring!

DevOps & Front-end
NLP engineers
Full-stack Python engineers

www.textkernel.com/jobs

Questions?

Source: http://visualqa.org/
