textkernel
Practical Deep Learning for NLP
Maarten Versteegh
NLP Research Engineer
Overview
● Deep Learning Recap
● Text classification:
– Convnet with word embeddings
● Sentiment analysis:
– ResNet
● Tips and tricks
What is this deep learning thing again?
[Diagram: network with Input, Hidden and Output layers; labels: Activation, Error]
Rectified Linear Units
Backpropagation involves repeated multiplication with the derivative of the activation function
→ Problem if the result is always smaller than 1!
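To make the problem concrete, here is a minimal NumPy sketch (not from the original slides). It tracks only the activation-derivative factor that backpropagation accumulates and ignores the weight matrices. The depth of 20 layers and the pre-activation value 0.5 are arbitrary illustrative assumptions.

```python
import numpy as np

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid; never larger than 0.25."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return np.where(x > 0, 1.0, 0.0)

depth = 20            # illustrative number of layers (assumption)
pre_activation = 0.5  # illustrative positive pre-activation (assumption)

# Backpropagation multiplies in one activation derivative per layer.
sigmoid_factor = float(sigmoid_grad(pre_activation)) ** depth
relu_factor = float(relu_grad(pre_activation)) ** depth

print("sigmoid factor after", depth, "layers:", sigmoid_factor)
print("ReLU factor after", depth, "layers:", relu_factor)
```

With a sigmoid the factor shrinks towards zero as depth grows, while for a unit with positive input the ReLU factor stays at 1.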
Text Classification
Traditional approach: BOW + TFIDF
“The car might also need a front end alignment”
Unigrams: "alignment" (0.323), "also" (0.137), "car" (0.110), "end" (0.182), "front" (0.167), "might" (0.178), "need" (0.157), "the" (0.053)
Bigrams: "also need" (0.343), "car might" (0.358), "end alignment" (0.358), "front end" (0.296), "might also" (0.358), "need front" (0.358), "the car" (0.161)
20 newsgroups performance
Model | F1-Score*
BOW+TFIDF+SVM | Some number
(*) Scores removed
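The BOW + TFIDF features of the traditional approach can be sketched in plain Python. This is a rough illustration, not the pipeline behind the slide: the toy corpus (apart from the first sentence), the smoothed IDF formula, the L2 normalisation and the rule of dropping one-character tokens are all assumptions, so the weights will not match the numbers shown earlier.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, drop one-character tokens such as "a"."""
    return [tok for tok in text.lower().split() if len(tok) > 1]

def ngrams(tokens, n):
    """Return all n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs, n=1):
    """Return one {term: weight} dict per document, L2-normalised."""
    bags = [Counter(ngrams(tokenize(doc), n)) for doc in docs]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        # Smoothed IDF, so terms found in every document keep a small weight.
        raw = {term: tf * (math.log((1 + len(docs)) / (1 + doc_freq[term])) + 1)
               for term, tf in bag.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({term: w / norm for term, w in raw.items()})
    return weights

# Toy corpus: only the first sentence comes from the slide, the rest is made up.
docs = [
    "The car might also need a front end alignment",
    "The car is parked in front of the house",
    "You might also like the other car",
]
unigram_weights = tfidf(docs, n=1)[0]
bigram_weights = tfidf(docs, n=2)[0]
for term, weight in sorted(unigram_weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

Rare terms such as "alignment" end up with a higher weight than terms that occur in every document, such as "the".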
Deep Learning 1: Replace Classifier
[Diagram: BOW Features (x 1000) → Hidden (x 512) → Hidden (x 256) → Output]
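from keras.layers import Input, Dense
from keras.models import Model

# 1000 BOW features in, two hidden layers, softmax over the classes
input_layer = Input(shape=(1000,))
fc_1 = Dense(512, activation='relu')(input_layer)
fc_2 = Dense(256, activation='relu')(fc_1)
output_layer = Dense(10, activation='softmax')(fc_2)  # must match the number of classes

model = Model(inputs=input_layer, outputs=output_layer)
# sparse_categorical_crossentropy accepts integer class ids such as newsgroups.target
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(bow, newsgroups.target)
# features: BOW matrix of the documents to classify
predictions = model.predict(features).argmax(axis=1)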
20 newsgroups performance
Model | F1-Score*
BOW+TFIDF+SVM | Some number
BOW+TFIDF+SVD + 2-layer NN | Some slightly higher number
(*) Scores removed
What about the deep learning promise?
Convolutional Networks
Source: Andrej Karpathy
Pooling layer
Source: Andrej Karpathy
Convolutional networks
Source: Y. Kim (2014) Convolutional Neural Networks for Sentence Classification
Word embedding
from keras.layers import Input, Embedding

# embedding_matrix: ndarray(vocab_size, embedding_dim)
input_layer = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
layer = Embedding(
    embedding_matrix.shape[0],         # vocabulary size
    embedding_matrix.shape[1],         # embedding dimension
    weights=[embedding_matrix],        # initialise with the pretrained vectors
    input_length=MAX_SEQUENCE_LENGTH,
    trainable=False                    # keep the pretrained vectors fixed
)(input_layer)
from keras.layers import Convolution1D, MaxPooling1D, BatchNormalization, Activation

layer = Embedding(...)(input_layer)
layer = Convolution1D(
    128,  # number of filters
    5,    # filter size
    activation='relu',
)(layer)
layer = MaxPooling1D(5)(layer)
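To see what the convolution and pooling layers compute, here is a NumPy sketch of the same two steps on a single embedded sentence. It is an illustration under assumptions, not the Keras implementation: the sequence length of 100, the random "embeddings" and the random filters are made up, and only the filter count (128), filter size (5) and pool size (5) come from the code above.

```python
import numpy as np

def conv1d_relu(x, filters, bias):
    """'Valid' 1D convolution over time followed by ReLU.

    x:       (seq_len, embedding_dim) embedded sentence
    filters: (filter_size, embedding_dim, n_filters)
    bias:    (n_filters,)
    """
    filter_size = filters.shape[0]
    n_steps = x.shape[0] - filter_size + 1
    out = np.empty((n_steps, filters.shape[2]))
    for t in range(n_steps):
        window = x[t:t + filter_size]  # (filter_size, embedding_dim)
        # Each filter is a weighted sum over the whole window.
        out[t] = np.tensordot(window, filters, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)

def max_pool1d(x, pool_size):
    """Non-overlapping max pooling over time; a trailing remainder is dropped."""
    n_windows = x.shape[0] // pool_size
    trimmed = x[:n_windows * pool_size]
    return trimmed.reshape(n_windows, pool_size, x.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
seq_len, embedding_dim = 100, 300  # illustrative sizes (assumptions)
sentence = rng.normal(size=(seq_len, embedding_dim))
filters = rng.normal(size=(5, embedding_dim, 128)) * 0.01

features = conv1d_relu(sentence, filters, np.zeros(128))
pooled = max_pool1d(features, 5)
print(features.shape, pooled.shape)
```

Each of the 128 filters slides over windows of 5 consecutive word vectors, and pooling then keeps the strongest response per filter in every block of 5 positions.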
Performance
Model | F1-Score*
BOW+TFIDF+SVM | Some number
CBOW+TFIDF+SVD+NN | Some slightly higher number
ConvNet (3 layers) | Quite a bit higher now
ConvNet (6 layers) | Look mom, even higher!
(*) Scores removed
Sentiment Analysis
Data Set
Facebook posts from media organizations:
– CNN, MSNBC, NYTimes, The Guardian, Buzzfeed, Breitbart, Politico, The Wall Street Journal, Washington Post, Baltimore Sun
Measure sentiment as “reactions”
Title | Org | Like | Love | Wow | Haha | Sad | Angry
Poll: Clinton up big on Trump in Virginia | CNN | 4176 | 601 | 17 | 211 | 11 | 83
It's a fact: Trump has tiny hands. Will this be the one that sinks him? | Guardian | 595 | 17 | 17 | 225 | 2 | 8
Donald Trump Explains His Obama-Founded-ISIS Claim as ‘Sarcasm’ | NYTimes | 2059 | 32 | 284 | 1214 | 80 | 2167
Can hipsters stomach the unpalatable truth about avocado toast? | Guardian | 3655 | 0 | 396 | 44 | 773 | 69
Tim Kaine skewers Donald Trump's military policy | MSNBC | 1094 | 111 | 6 | 12 | 2 | 26
Top 5 Most Antisemitic Things Hillary Clinton Has Done | Breitbart | 1067 | 7 | 134 | 35 | 22 | 372
17 Hilarious Tweets About Donald Trump Explaining Movies | Buzzfeed | 11390 | 375 | 16 | 4121 | 4 | 5
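One plausible way to turn such reaction counts into a training target is to normalise them into a distribution. The slides do not say how the targets were built, so this is an assumption: the sketch leaves out "Like" (the predicted distributions shown later do not include it) and falls back to a uniform distribution for posts without reactions.

```python
import numpy as np

REACTIONS = ["Love", "Wow", "Haha", "Sad", "Angry"]

def reaction_distribution(counts):
    """Turn raw reaction counts into proportions that sum to 1."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        # No reactions at all: fall back to a uniform distribution (assumption).
        return np.full(len(counts), 1.0 / len(counts))
    return counts / total

# Counts copied from the NYTimes and Buzzfeed rows, "Like" left out.
nytimes = reaction_distribution([32, 284, 1214, 80, 2167])
buzzfeed = reaction_distribution([375, 16, 4121, 4, 5])

for name, share in zip(REACTIONS, nytimes):
    print(f"{name}: {share:.1%}")
```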
Go deeper: ResNet
Convolutional Layers with shortcuts
He et al. Deep Residual Learning for Image Recognition
Go deeper: ResNet
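from keras.layers import add

input_layer = ...

# padding='same' keeps the sequence length so the shortcut can be added;
# input_layer must already have 128 channels
layer = Convolution1D(128, 5, padding='same', activation='linear')(input_layer)
layer = BatchNormalization()(layer)
layer = Activation('relu')(layer)

layer = Convolution1D(128, 5, padding='same', activation='linear')(layer)
layer = BatchNormalization()(layer)
layer = Activation('relu')(layer)

# shortcut: add the block input to the convolution output
# (add replaces merge([...], mode='sum') from Keras 1)
block_output = add([layer, input_layer])
block_output = Activation('relu')(block_output)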
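The effect of the shortcut is easiest to see in a stripped-down NumPy version of the block. This sketch makes simplifying assumptions that are not in the slides: it uses position-wise linear maps (filter size 1) instead of width-5 convolutions and leaves out batch normalization. The shapes are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F is two linear maps with a ReLU in between.

    x:      (seq_len, channels)
    w1, w2: (channels, channels), so F(x) has the same shape as x
    """
    hidden = relu(x @ w1)
    return relu(hidden @ w2 + x)  # the "+ x" is the shortcut connection

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 128))  # illustrative sequence of 128-channel features

# With all-zero weights the block simply passes (rectified) input through.
zero = np.zeros((128, 128))
passthrough = residual_block(x, zero, zero)

# With random weights the block adds a learned correction to its input.
w1 = rng.normal(size=(128, 128)) * 0.01
w2 = rng.normal(size=(128, 128)) * 0.01
out = residual_block(x, w1, w2)
print(passthrough.shape, out.shape)
```

Because the block only has to learn a correction on top of its input, stacking many of them does not force the signal through every transformation.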
[Diagram: Title + Message ("It's a fact: Trump has tiny hands.", EMBEDDING_DIM=300) → ResNet Block … ResNet Block (Conv (128) x 10) → MaxPooling; News Org ("The Guardian", 1-of-K); Dense → Dense → %]
Cherry-picked predicted response distribution*
Sentence | Org | Love | Haha | Wow | Sad | Angry
Trump wins the election | Guardian | 3% | 9% | 7% | 32% | 49%
Trump wins the election | Breitbart | 58% | 30% | 8% | 1% | 3%
*Your mileage may vary. By a lot. I mean it.
Tips and Tricks
Initialization
● Break symmetry:
– Never ever initialize all your weights to the same value
● Let initialization depend on activation function:
– ReLU/PReLU → He Normal
– sigmoid/tanh → Glorot Normal
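The two initializers named above differ only in the standard deviation of the random weights. A minimal NumPy sketch, assuming a plain (untruncated) normal distribution and the layer sizes 1000 and 512 from the earlier dense network:

```python
import numpy as np

def he_normal(fan_in, fan_out, rng):
    """He Normal: std = sqrt(2 / fan_in), intended for ReLU/PReLU layers."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def glorot_normal(fan_in, fan_out, rng):
    """Glorot Normal: std = sqrt(2 / (fan_in + fan_out)), for sigmoid/tanh layers."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
w_relu = he_normal(1000, 512, rng)
w_tanh = glorot_normal(1000, 512, rng)
print("He Normal std:", w_relu.std())
print("Glorot Normal std:", w_tanh.std())
```

Both draw every weight independently, which also takes care of breaking symmetry.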
Choose an adaptive optimizer
Source: Alec Radford
Choose the right model size
● Start small and keep adding layers
– Check if test error keeps going down
● Cross-validate over the number of units
● You want to be able to overfit
Y. Bengio (2012) Practical recommendations for gradient-based training of deep architectures
Don't be scared of overfitting
● If your model can't overfit, it also can't learn enough
● So, check that your model can overfit:
– If not, make it bigger
– If so, get more data and/or regularize
Source: wikipedia
Regularization
● Norm penalties on hidden layer weights, never on first and last
● Dropout
● Early stopping
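Early stopping, the last item above, can be sketched as a small function over per-epoch validation losses. The patience value and the loss curve are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Training stops once the loss has not improved for `patience` epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improving at first, then getting worse.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.64, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print("stop after epoch", stop_epoch, "and keep the weights from epoch", best_epoch)
```

In Keras the same behaviour is available through the `EarlyStopping` callback in `keras.callbacks`, which takes `monitor` and `patience` arguments.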
Size of data set
● Just get more data already
● Augment data:
– Textual replacements
– Word vector perturbation
– Noise Contrastive Estimation
● Semi-supervised learning:
– Adapt word embeddings to your domain
Monitor your model
Training loss:
– Does the model converge?
– Is the learning rate too low or too high?
Training loss and learning rate
Source: Andrej Karpathy
Monitor your model
Training and validation accuracy
– Is there a large gap?
– Does the training accuracy increase while the validation accuracy decreases?
Training and validation accuracy
Source: Andrej Karpathy
Monitor your model
● Ratio of weights to updates
● Distribution of activations and gradients (per layer)
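The first quantity can be computed per layer in a few lines of NumPy. The sketch below reports the update norm divided by the weight norm, which is the inverse of the ratio as worded above; a frequently quoted rule of thumb is that this value should be somewhere around 1e-3. The plain SGD update rule and the example numbers are assumptions.

```python
import numpy as np

def update_ratio(weights, gradients, learning_rate):
    """Norm of a plain SGD update divided by the norm of the weights."""
    update = -learning_rate * gradients
    return np.linalg.norm(update) / np.linalg.norm(weights)

# Made-up layer: all weights and gradients equal to 1, learning rate 1e-3.
weights = np.ones((100, 50))
gradients = np.ones((100, 50))
ratio = update_ratio(weights, gradients, learning_rate=1e-3)
print("update/weight ratio:", ratio)
```

A ratio far below the rule of thumb suggests the learning rate may be too low, a ratio far above it suggests it may be too high.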
Hyperparameter optimization
After network architecture, continue with:
– Regularization strength
– Initial learning rate
– Optimization strategy (and LR decay schedule)
Friends don't let friends do a full grid search!
– Use a smart strategy like Bayesian optimization or Particle Swarm Optimization (Spearmint, SMAC, Hyperopt, Optunity)
– Even random search often beats grid search
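Random search itself takes only a few lines. In this sketch the parameter names, the sampling ranges and the toy objective are all hypothetical; in practice the objective would train a model and return its validation loss.

```python
import math
import random

def random_search(objective, n_trials, rng):
    """Sample hyperparameters at random and keep the best-scoring set."""
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {
            # Log-uniform sampling for parameters that vary over orders of magnitude.
            "learning_rate": 10 ** rng.uniform(-5, -1),
            "l2_strength": 10 ** rng.uniform(-6, -2),
            "dropout": rng.uniform(0.0, 0.7),
        }
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_objective(params):
    """Made-up stand-in for 'train a model and return the validation loss'."""
    return ((math.log10(params["learning_rate"]) + 3) ** 2
            + (params["dropout"] - 0.5) ** 2)

best_params, best_score = random_search(toy_objective, n_trials=50,
                                        rng=random.Random(0))
print(best_params, best_score)
```

Unlike a grid, every trial tries a new value for every parameter, so the budget is not wasted on repeating the same few values of the parameters that matter most.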
Keep up to date: arxiv-sanity.com
We are hiring!
DevOps & Front-end
NLP engineers
Full-stack Python engineers
www.textkernel.com/jobs
Questions?
Source: http://visualqa.org/