Recurrent Neural Networks
May 22, 2019
http://cross-entropy.net/ML410/Deep_Learning_5.pdf
Agenda
• Homework Review
• [IDL] Word Embeddings and Recurrent Neural Networks
• [DLP] Deep Learning for Text and Sequences
[IDL] Word Embeddings and Recurrent NNs
1. Word Embeddings for Language Models
2. Building Feed-Forward Language Models
3. Improving Feed-Forward Language Models
4. Overfitting
5. Recurrent Networks
6. Long Short-Term Memory
Language Model
• A language model is a probability distribution over all strings in a language
• Let 𝐄𝟏,𝐧 = 𝐸1⋯𝐸𝑛 be a sequence of 𝑛 random variables denoting a string of 𝑛 words and 𝐞𝟏,𝐧 be one candidate value
Word Embeddings
Penn TreeBank Corpus
• A treebank is a parsed text corpus that annotates syntactic or semantic sentence structure
• Penn TreeBank Corpus consists of about 1,000,000 words of news articles from the Wall Street Journal
• It has been tokenized but not “unked”, so the vocabulary is close to 50,000 words
• Counts
• 2,312 articles in parsed/mrg/wsj/##/wsj_####.mrg (text) files
• 49,206 distinct words [mixed-case (upper-case and lower-case letters allowed)]
• 1,173,766 total words
• We replace all words that occur 10 times or fewer with *UNK*
Word Embeddings
Example Article: wsj_0001.mrg
( (S
(NP-SBJ
(NP (NNP Pierre) (NNP Vinken) )
(, ,)
(ADJP
(NP (CD 61) (NNS years) )
(JJ old) )
(, ,) )
(VP (MD will)
(VP (VB join)
(NP (DT the) (NN board) )
(PP-CLR (IN as)
(NP (DT a) (JJ nonexecutive) (NN director) ))
(NP-TMP (NNP Nov.) (CD 29) )))
(. .) ))
( (S
(NP-SBJ (NNP Mr.) (NNP Vinken) )
(VP (VBZ is)
(NP-PRD
(NP (NN chairman) )
(PP (IN of)
(NP
(NP (NNP Elsevier) (NNP N.V.) )
(, ,)
(NP (DT the) (NNP Dutch) (VBG publishing) (NN group) )))))
(. .) ))
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Word Embeddings
Bigram Model
• If we had a very large amount of English text, we might be able to estimate the first two or three probabilities on the right-hand side simply by counting how often we see, e.g., “We live” and how often “in” appears next, and then dividing the second count by the first (i.e., using the maximum likelihood estimate) to give us an estimate of, e.g., P(in | We live)
• But as ‘n’ gets large this becomes impossible: the training corpus will contain no examples of a particular, say, fifty-word sequence
• The standard response is to assume the probability of the next word only depends on the previous one or two words
Bigram Model
Word Embeddings
Sentence Padding
We can simplify the expression for the bigram model if we place an imaginary “STOP” at the beginning of the corpus, and then after every sentence [this is called “sentence padding”]
Word Embeddings
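To make the bigram counting concrete, here is a minimal sketch (not from the book) of maximum likelihood estimation with sentence padding; the two-sentence corpus is purely illustrative.

from collections import Counter

corpus = [["we", "live", "in", "seattle"], ["we", "live", "well"]]  # toy sentences (illustrative)
tokens = ["STOP"]                      # imaginary STOP at the beginning of the corpus ...
for sentence in corpus:
    tokens += sentence + ["STOP"]      # ... and after every sentence

unigram_counts, bigram_counts = Counter(), Counter()
for prev, curr in zip(tokens[:-1], tokens[1:]):
    unigram_counts[prev] += 1
    bigram_counts[(prev, curr)] += 1

def bigram_prob(prev, curr):
    # maximum likelihood estimate: count("we live") / count("we"), etc.
    return bigram_counts[(prev, curr)] / unigram_counts[prev]

print(bigram_prob("we", "live"))  # 1.0 in this toy corpus
print(bigram_prob("live", "in"))  # 0.5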
Bad Language Model
• If there are 𝑉 = 10,000 words in our vocabulary, we could use the uniform distribution; i.e. predict 1/10,000 for all words [read 𝑉 as “size of Vocabulary set V”]
• Perplexity = exp(mean(-log(probabilityEst))) = exp(-log(1/10,000)) = 10,000
Word Embeddings
Word Embeddings
• We need to turn words into the sorts of things that deep networks can manipulate; i.e. floating-point numbers
• A standard solution is to associate each word with a vector of floating point numbers
• The vectors are called word embeddings
• Can be pretrained using some other task; e.g. predict whether a word will appear “near” another word
• Can be trained directly as part of the neural net model
Word Embeddings
Feedforward Net for Language Modeling
Input > Embedding > Dense Layer > Softmax Activation
Word Embeddings
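As a rough sketch of the Input > Embedding > Dense > Softmax pipeline (using the 7,500-word vocabulary and 30-dimensional embeddings mentioned on the next slide; everything else, such as the optimizer, is an assumption rather than the book’s exact code):

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

vocab_size, embed_size = 7500, 30
model = Sequential()
model.add(Embedding(vocab_size, embed_size, input_length=1))  # previous word index -> embedding
model.add(Flatten())                                          # (1, 30) -> (30,)
model.add(Dense(vocab_size, activation='softmax'))            # probability distribution over the next word
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()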
Cosine Similarity
• Constructing a language model using the Penn TreeBank with a vocabulary of 7,500 words and an embedding size of 30
• Cosine similarity can be used to measure the similarity of embeddings for the words, with values in the interval [-1, 1]
Word Embeddings
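Cosine similarity between two embedding vectors can be computed as follows; this NumPy sketch (and the embedding_matrix / word_index names) is mine, not the book’s code.

import numpy as np

def cosine_similarity(u, v):
    # dot product divided by the product of the vector lengths; result lies in [-1, 1]
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# embedding_matrix: the learned (vocab_size x 30) weights of the Embedding layer
# e.g. cosine_similarity(embedding_matrix[word_index["under"]],
#                        embedding_matrix[word_index["above"]])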
Cosine Similarity Examples
• Example of inability to distinguish between synonyms and antonyms; e.g. “under” and “above” are arguably antonyms
• Five pairs of “similar” words: “odd” words are most similar to the other member of the pair
Word Embeddings
Perplexity for a Test Corpus
exp(mean(-log(probability(actualWord))))
if we’re always assigning probability = 1/𝑉 to each word, then exp(−log(1/𝑉)) = 𝑉
Perplexity can also be written as 𝑒^(𝑥_𝑑 / 𝑑), where 𝑒 is Euler’s number, 𝑥_𝑑 is the sum of the log loss, and 𝑑 is the number of words in the corpus
Language Models
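A small sketch of the perplexity computation itself (my own helper, with assumed per-word probabilities):

import numpy as np

def perplexity(word_probs):
    # exp of the mean negative log probability assigned to the actual words
    return float(np.exp(np.mean(-np.log(word_probs))))

print(perplexity([1 / 10000] * 5))  # uniform model over 10,000 words -> 10000.0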
Improving the Feedforward Language Model
Most straightforward way to improve the model: move from a bigram model (previous word used to predict the next word) to a trigram model (two previous words used to predict the next word)
Language Models
Larger n-grams Improve Language Model Performance
• For an n-gram model, (n-1) words are used to predict the next word
• This is used to produce a language model that can be used to estimate the probability of a string, such as a sentence
Language Models
Overfitting in a Language Model
Just another example requiring early stopping: after epoch 6, the training perplexity continues to decrease (get better) but the validation perplexity starts to increase (get worse)
Reminder: it’s common for folks in the Natural Language Processing (NLP) community to refer to a validation set as a “development” (dev) set [and to refer to a “testing” set as an “evaluation” (eval) set]
Overfitting
Methods to Prevent Overfitting
Revised perplexity values for the dev set, when applying dropout [preventing memorization] and regularization [keeping weights smaller]
Compare to dev results when using neither dropout nor regularization:
Overfitting
Regularization Reminder
The first term in the loss below represents cross entropy, while the second term represents the L2 regularization penalty [with lower-case phi being the individual weights of the upper-case phi weight set]
When we differentiate the loss function with respect to 𝜙, the second term adds 𝛼 ∗ 𝜙 to 𝜕ℒ/𝜕𝜙
Overfitting
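In Keras, dropout and L2 weight regularization could be added roughly as follows; the layer sizes and rates here are placeholder assumptions, not the model from the slides.

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dropout, Dense
from keras.regularizers import l2

model = Sequential()
model.add(Embedding(7500, 30, input_length=2))   # trigram model: two previous words
model.add(Flatten())
model.add(Dropout(0.5))                          # randomly zero activations [prevents memorization]
model.add(Dense(7500, activation='softmax',
                kernel_regularizer=l2(0.01)))    # alpha * sum(phi^2) penalty [keeps weights smaller]
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')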
BasicRNNCell
lines 447 – 454 of rnn_cell_impl.py
fancy: concat replaces addition
https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/python/ops/rnn_cell_impl.py
Note: using a dot for concatenation is … ummm … suboptimal notation
Recurrent Networks
[fancy can be hard to read]
Allocating Words When Batch Size = 2 and Window Size = 3
𝑆 = (𝑐 − 1) / 𝑏, where S is the number of sections, c is the size of the corpus, and b is the batch size
Recurrent Networks
Tensorflow Code for Creating the RNN
rnnSz: the size of the RNN cell, also known as the number of units
Recurrent Networks
LSTM Cell’s Forget Gate
If the Sigmoid value is zero, when we do the element-wise multiplication of the forget gate f and the carry c, we’ll be forgetting values from the carry
Note: using a dot for concatenation is … ummm … suboptimal notation
Long Short-Term Memory
LSTM Cell Output
• Equations 4.17 and 4.18 (repeated below) are not correct …
• Compare to Figure 4.9, which was correct
• a1 [from the slide before last] is an “input” gate
• Sigmoid(h’’), below, gives us an “output” gate
• ht+1 = Sigmoid(h’’) * tanh(ct+1) … element-wise multiplication
• Note the LSTM Cell has two outputs: c and h
• We’ll review the code in just a few slides
Long Short-Term Memory
word2vec
• Two models [see the gensim sketch after this list]
• Skip-gram (with negative sampling): predict whether one word is within a “window” of the other word [grab some random words for negative sampling]
• Continuous Bag-Of-Words (CBOW): predict the current word from a window of context words
• Geometric interpretations
• King – Man = Queen – Woman
• Consider the possibility of masculine and feminine dimensions
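A minimal sketch of training both word2vec variants with gensim (assuming gensim 4.x, where the dimensionality argument is named vector_size); the toy corpus is illustrative and far too small to produce meaningful analogies.

from gensim.models import Word2Vec

sentences = [["the", "king", "rules"], ["the", "queen", "rules"]]  # toy corpus
skipgram = Word2Vec(sentences, vector_size=30, window=2, sg=1, negative=5, min_count=1)
cbow = Word2Vec(sentences, vector_size=30, window=2, sg=0, min_count=1)

# with vectors trained on a real corpus, King - Man + Woman lands near Queen:
# skipgram.wv.most_similar(positive=["king", "woman"], negative=["man"])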
Simple Recurrent Neural Network (RNN) Cell
• Under “class SimpleRNNCell”, look for “def call”
# recurrent.py (link below), line 885: 2019-05-12
h = K.dot(inputs, self.kernel)
h = K.bias_add(h, self.bias)
output = h + K.dot(prev_output, self.recurrent_kernel)
output = self.activation(output) # default: tanh()
return output, [output]
• For each sequence position: features for previous output added to features for current input
2 weight matrices and 1 bias vector [same 2 weight matrices and 1 bias vector used for all positions in the sequence]
https://github.com/keras-team/keras/blob/master/keras/layers/recurrent.py
Long Short-Term Memory (LSTM) Cell
Under “class LSTMCell”, look for “def call” [8 weight matrices and 4 bias vectors]
# recurrent.py (link below), line 1935: 2019-05-12
x_i = K.dot(inputs, self.kernel_i)
x_f = K.dot(inputs, self.kernel_f)
x_c = K.dot(inputs, self.kernel_c)
x_o = K.dot(inputs, self.kernel_o)
x_i = K.bias_add(x_i, self.bias_i)
x_f = K.bias_add(x_f, self.bias_f)
x_c = K.bias_add(x_c, self.bias_c)
x_o = K.bias_add(x_o, self.bias_o)
i = self.recurrent_activation(x_i + K.dot(h_tm1, self.recurrent_kernel_i)) # input gate
f = self.recurrent_activation(x_f + K.dot(h_tm1, self.recurrent_kernel_f)) # forget gate
c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1, self.recurrent_kernel_c)) # carry
o = self.recurrent_activation(x_o + K.dot(h_tm1, self.recurrent_kernel_o)) # output gate
h = o * self.activation(c)
return h, [h, c]
https://github.com/keras-team/keras/blob/master/keras/layers/recurrent.py
recurrent_activation: hard_sigmoid()
activation: tanh()
tm1: time ‘t’ minus 1
source for image: http://shop.oreilly.com/product/0636920052289.do
Gated Recurrent Unit (GRU) Cell
Under “class GRUCell”, look for “def call” [6 weight matrices and 3 bias vectors]
# recurrent.py (link below), line 1346: 2019-05-12
x_z = K.dot(inputs_z, self.kernel_z)
x_r = K.dot(inputs_r, self.kernel_r)
x_h = K.dot(inputs_h, self.kernel_h)
x_z = K.bias_add(x_z, self.input_bias_z)
x_r = K.bias_add(x_r, self.input_bias_r)
x_h = K.bias_add(x_h, self.input_bias_h)
recurrent_z = K.dot(h_tm1_z, self.recurrent_kernel_z)
recurrent_r = K.dot(h_tm1_r, self.recurrent_kernel_r)
z = self.recurrent_activation(x_z + recurrent_z) # update gate
r = self.recurrent_activation(x_r + recurrent_r) # reset gate
recurrent_h = K.dot(r * h_tm1_h, self.recurrent_kernel_h)
hh = self.activation(x_h + recurrent_h)
h = z * h_tm1 + (1 - z) * hh
return h, [h]
https://github.com/keras-team/keras/blob/master/keras/layers/recurrent.py
recurrent_activation: hard_sigmoid()
activation: tanh()
tm1: time ‘t’ minus 1
source for image: http://shop.oreilly.com/product/0636920052289.do (super close to reflecting source code: “1-”)
[DLP] Deep Learning for Text and Sequences
1. Working with Text Data
2. Understanding Recurrent Networks
3. Advanced Use of Recurrent Neural Networks
4. Sequence Processing with ConvNets
Applications
• Document classification and timeseries classification, such as identifying the topic of an article or the author of a book
• Timeseries comparisons, such as estimating how closely related two documents or two stock tickers are
• Sequence-to-sequence learning, such as decoding an English sentence into French
• Sentiment analysis, such as classifying the sentiment of tweets or movie reviews as positive or negative
• Timeseries forecasting, such as predicting the future weather at a certain location, given recent weather data
Working with Text Data
Vectorizing the Text
• Transforming text into numeric tensors
• Multiple possibilities exist for tokenization …
• Segment text into words, and transform each word into a vector
• Segment text into characters, and transform each character into a vector
• Extract n-grams of words or characters, and transform each n-gram into a vector [n-grams are overlapping groups of multiple consecutive words or characters]
• Methods for encoding include … [see the Tokenizer sketch after this list]
• Multi-hot encoding [indicators for presence of tokens]
• Term Frequency – Inverse Document Frequency encoding (TF-IDF)
• Token embeddings (typically used for words and called word embeddings)
Working with Text Data
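The Keras Tokenizer covers the first two encodings; a short sketch with made-up example texts:

from keras.preprocessing.text import Tokenizer

samples = ["The cat sat on the mat.", "The dog ate my homework."]  # toy documents
tokenizer = Tokenizer(num_words=1000)   # keep only the 1,000 most frequent words
tokenizer.fit_on_texts(samples)

multi_hot = tokenizer.texts_to_matrix(samples, mode='binary')  # indicators for presence of tokens
tf_idf = tokenizer.texts_to_matrix(samples, mode='tfidf')      # TF-IDF weighted counts
sequences = tokenizer.texts_to_sequences(samples)              # integer indices, ready for an Embedding layer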
N-Grams
• Names
• 1-grams are called unigrams
• 2-grams are called bigrams
• 3-grams are called trigrams
• 4-grams are called … 4-grams
• “The cat sat on the mat”
• Unigrams: { “The”, “cat”, “sat”, “on”, “the”, “mat” }
• Bigrams: { “The cat”, “cat sat”, “sat on”, “on the”, “the mat” }
• Trigrams: { “The cat sat”, “cat sat on”, “sat on the”, “on the mat” }
• Note: the term “bag” refers to an unordered set rather than a sequence
Working with Text Data
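Extracting word n-grams is a one-liner; a quick sketch:

def ngrams(text, n):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("The cat sat on the mat", 2))
# ['The cat', 'cat sat', 'sat on', 'on the', 'the mat']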
Sparse versus Dense Representation
The primary curse of dimensionality is sparsity
Working with Text Data
Two Ways to Obtain Word Embeddings
• Learn word embeddings jointly with the main task you care about (such as document classification or sentiment prediction). In this setup, you start with random word vectors and then learn word vectors in the same way you learn the weights of a neural network.
• Load into your model word embeddings that were precomputed using a different machine-learning task than the one you’re trying to solve. These are called pretrained word embeddings.
Working with Text Data
Toy Example of a Word Embedding Space
• Vertical dimension could be interpreted as a “wild” index
• Horizontal dimension could be interpreted as a “feline” index
• Another popular example:
• Queen – Woman == King – Man
• Consider the possibility that the dimensions include feminine and masculine indexes
Working with Text Data
Loading the Internet Movie DataBase (IMDB) Data for Use with an Embedding Layer
Only the first 20 words of each review are used
Working with Text Data
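A sketch of the setup, close to the book’s listing but with training details (epochs, validation split) treated as assumptions:

from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

max_features, maxlen = 10000, 20   # vocabulary size; keep only 20 words per review
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
model.add(Embedding(max_features, 8, input_length=maxlen))  # 8-dimensional embeddings
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)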
Pretrained Word Embeddings
• Word2Vec: skip-gram and continuous bag-of-words architectures
• https://code.google.com/archive/p/word2vec
• Global Vectors (GloVe): matrix factorization
• https://nlp.stanford.edu/projects/glove
• Embedding() arguments [see the sketch below]
• embeddings_initializer=keras.initializers.Constant(embedding_matrix)
• trainable=False
Working with Text Data
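Used together, those two arguments load the pretrained vectors and keep them frozen during training; a short sketch (max_words, maxlen, and embedding_matrix are assumed to come from the tokenization and GloVe-parsing steps):

from keras.layers import Embedding
from keras.initializers import Constant

embedding_layer = Embedding(max_words, 100,
                            embeddings_initializer=Constant(embedding_matrix),  # pretrained GloVe vectors
                            input_length=maxlen,
                            trainable=False)                                    # do not update the vectors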
Processing the Labels of the Raw IMDB Data
http://mng.bz/0tIo
Working with Text Data
Parsing the GloVe Word-Embeddings File
http://nlp.stanford.edu/data/glove.6B.zip
Working with Text Data
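A sketch of parsing glove.6B.100d.txt into a word-to-vector index and then into an embedding matrix; the variable names (word_index from the fitted Tokenizer, max_words as the vocabulary cap) follow common usage rather than quoting the book’s listing verbatim.

import numpy as np

embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        embeddings_index[word] = np.asarray(values[1:], dtype='float32')

embedding_dim = 100
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():       # word_index comes from the fitted Tokenizer
    if i < max_words:
        vector = embeddings_index.get(word)
        if vector is not None:           # words missing from GloVe stay all-zeros
            embedding_matrix[i] = vector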
Evaluating the Model with Pretrained Embeddings
56% accuracy [okay, given only 200 training observations]
Working with Text Data
Wrapping Up
• Turn raw text into something a neural network can process
• Use the Embedding layer in a Keras model to learn task-specific token embeddings
• Use pretrained word embeddings to get an extra boost on small-data natural language-processing problems
Working with Text Data
Wrapping Up
• What RNNs are and how they work
• What LSTM is, and why it works better on long sequences than a naive RNN
• How to use Keras RNN layers to process sequence data
Recurrent Networks
Techniques
• Recurrent Dropout
• Stacking Recurrent Layers
• Bidirectional Recurrent Layers
Advanced Use
Jena Weather Columns
1. Date Time
2. p (mbar): atmospheric pressure in millibars
3. T (degC): temperature in degrees Celsius
4. Tpot (K): potential temperature (for reference pressure) on Kelvin scale
5. Tdew (degC): dewpoint temperature in degrees Celsius
6. rh (%): relative humidity
7. VPmax (mbar): maximum water vapor pressure
8. VPact (mbar): actual water vapor pressure
9. VPdef (mbar): water vapor pressure deficit
10. sh (g/kg): specific humidity
11. H2OC (mmol/mol): water vapor concentration
12. rho (g/m**3): air density
13. wv (m/s): wind velocity
14. max. wv (m/s): maximum wind velocity
15. wd (deg): wind direction
Advanced Use
Preparing the Data
• Given: 10 minutes between consecutive observations
• Parameters:
• lookback = 720—Observations will go back 5 days
• steps = 6—Observations will be sampled at one data point per hour
• delay = 144—Targets will be 24 hours in the future
• Preprocess the data to a format a neural network can ingest: normalize each timeseries independently so that they all take small values on a similar scale
• Write a Python generator that takes the current array of float data and yields batches of data from the recent past, along with a target temperature in the future
Advanced Use
Estimating a Baseline
[last temperature from observations]
MAE of 0.29: celsius_mae = 0.29 * std[1] = 2.57 degrees Celsius
Advanced Use
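The common-sense baseline just predicts that the temperature 24 hours from now equals the last observed temperature; a sketch of the evaluation, assuming a generator that yields (samples, targets) batches of normalized data as in the book:

import numpy as np

def evaluate_naive_method(val_gen, val_steps):
    batch_maes = []
    for _ in range(val_steps):
        samples, targets = next(val_gen)
        preds = samples[:, -1, 1]  # last observed temperature (column 1) in each lookback window
        batch_maes.append(np.mean(np.abs(preds - targets)))
    return np.mean(batch_maes)

# yields about 0.29 on the normalized data; 0.29 * std[1] is roughly 2.57 degrees Celsius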
Training and Validation Loss with Dropout
Not much better than before; but no longer overfitting
Advanced Use
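Recurrent dropout is specified directly on the layer; a sketch close to the book’s GRU model, with the exact rates treated as assumptions (float_data is the normalized Jena array):

from keras.models import Sequential
from keras.layers import GRU, Dense
from keras.optimizers import RMSprop

model = Sequential()
model.add(GRU(32,
              dropout=0.2,             # time-constant dropout mask on the inputs
              recurrent_dropout=0.2,   # dropout mask on the recurrent state
              input_shape=(None, float_data.shape[-1])))
model.add(Dense(1))                    # linear output for temperature regression
model.compile(optimizer=RMSprop(), loss='mae')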
Training and Validation Loss for Stacked GRU-Based Model
Adding a layer did not help much: diminishing returns from increasing network capacity
Advanced Use
Training and Validation Loss for Reversed Sequences Using a GRU Cell
Reversed-order sequences underperform: the last values processed by the GRU are the furthest away from the temperature prediction time
Advanced Use
Training an LSTM Using Reversed Sequences
Nearly identical performance compared to an LSTM with chronologically ordered sequences
Advanced Use
Bidirectional LSTM
Sometimes useful when applied to text
• Forward: tokens that come “before” are useful for understanding the current token
• Backward: tokens that come “after” are useful for understanding the current token
model.add(Bidirectional(LSTM(64))) # creates 2 LSTM cells
Advanced Use
Training a Bidirectional GRU for Temperature Prediction
Performs about as well as the model with the forward GRU layer
Advanced Use
Suggestions for Improving Temperature Predictions
• Adjust the number of units in each recurrent layer in the stacked setup. The current choices are largely arbitrary and thus probably suboptimal.
• Adjust the learning rate used by the RMSprop optimizer.
• Try using LSTM layers instead of GRU layers.
• Try using a bigger densely connected regressor on top of the recurrent layers: that is, a bigger Dense layer or even a stack of Dense layers.
• Don’t forget to eventually run the best-performing models (in terms of validation MAE) on the test set! Otherwise, you’ll develop architectures that are overfitting to the validation set.
Advanced Use
Wrapping Up
• When approaching a new problem, it’s good to first establish common-sense baselines for your metric of choice. If you don’t have a baseline to beat, you can’t tell whether you’re making real progress.
• Try simple models before expensive ones, to justify the additional expense. Sometimes a simple model will turn out to be your best option.
• When you have data where temporal ordering matters, recurrent networks are a great fit and easily outperform models that first flatten the temporal data.
• To use dropout with recurrent networks, you should use a time-constant dropout mask and recurrent dropout mask. These are built into Keras recurrent layers, so all you have to do is use the dropout and recurrent_dropout arguments of recurrent layers.
• Stacked RNNs provide more representational power than a single RNN layer. They’re also much more expensive and thus not always worth it. Although they offer clear gains on complex problems (such as machine translation), they may not always be relevant to smaller, simpler problems.
• Bidirectional RNNs, which look at a sequence both ways, are useful on natural-language processing problems. But they aren’t strong performers on sequence data where the recent past is much more informative than the beginning of the sequence.
Advanced Use
Markets and Machine Learning
• Markets have very different statistical characteristics than natural phenomena such as weather patterns. Trying to use machine learning to beat markets, when you only have access to publicly available data, is a difficult endeavor, and you’re likely to waste your time and resources with nothing to show for it.
• Always remember that when it comes to markets, past performance is not a good predictor of future returns—looking in the rear-view mirror is a bad way to drive. Machine learning, on the other hand, is applicable to datasets where the past is a good predictor of the future.
1D ConvNet
1D Pooling
• This is the 1D equivalent of the 2D versions
• Used for subsampling: giving access to the bigger picture [all pun intended]
• Common flavors include:
• Max pooling
• Average pooling
1D ConvNet
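A sketch of a 1D convnet for the IMDB task, following the general pattern (stacks of Conv1D and MaxPooling1D, ending in a global pooling operation); the specific filter counts and kernel sizes are assumptions.

from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense

model = Sequential()
model.add(Embedding(10000, 128, input_length=500))
model.add(Conv1D(32, 7, activation='relu'))   # 1D convolution along the word dimension
model.add(MaxPooling1D(5))                    # 1D subsampling
model.add(Conv1D(32, 7, activation='relu'))
model.add(GlobalMaxPooling1D())               # collapse the remaining time dimension
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])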
Loss and Accuracy for the IMDB ConvNet
Accuracy is not as good as the LSTM, but it runs faster
1D ConvNet
MAE Loss on the Jena Weather Data
Model has no knowledge of temporal position; e.g. toward the beginning, toward the end, etc.
1D ConvNet
MAE Loss for the Jena Weather Data
Not as good as the GRU alone, but it’s significantly faster
1D ConvNet
Wrapping Up
• In the same way that 2D convnets perform well for processing visual patterns in 2D space, 1D convnets perform well for processing temporal patterns. They offer a faster alternative to RNNs on some problems, in particular natural language processing tasks.
• Typically, 1D convnets are structured much like their 2D equivalents from the world of computer vision: they consist of stacks of Conv1D layers and Max-Pooling1D layers, ending in a global pooling operation or flattening operation.
• Because RNNs are extremely expensive for processing very long sequences, but 1D convnets are cheap, it can be a good idea to use a 1D convnet as a preprocessing step before an RNN, shortening the sequence and extracting useful representations for the RNN to process.
1D ConvNet
Parameter Counts for Recurrent Cells
• Number of Parameters = (1 + numGates) * recurrentCellSize * (previousLayerElementSize + 1 + recurrentCellSize) [see the check after this list]
• numGates =
• 0 for Simple Recurrent Neural Network (RNN) Cell
• 2 for Gated Recurrent Unit (GRU) Cell
• 3 for Long Short-Term Memory (LSTM) Cell
• “+ 1”: assumes we’re including a bias term for the cell’s features
• Runtime Complexity: the cell is invoked for each element in a sequence and each sequence in a batch
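A quick check of the parameter-count formula against Keras (assuming the 2019-era defaults referenced earlier, e.g. a GRU with reset_after=False; newer versions add an extra bias vector to the GRU):

from keras.models import Sequential
from keras.layers import SimpleRNN, GRU, LSTM

prev_size, cell_size = 30, 64
for num_gates, cell in [(0, SimpleRNN), (2, GRU), (3, LSTM)]:
    model = Sequential([cell(cell_size, input_shape=(None, prev_size))])
    expected = (1 + num_gates) * cell_size * (prev_size + 1 + cell_size)
    print(cell.__name__, model.count_params(), expected)  # the two counts should match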