Recurrent Neural Networks
May 22, 2019
http://cross-entropy.net/ML410/Deep_Learning_5.pdf
Agenda
• Homework Review
• [IDL] Word Embeddings and Recurrent Neural Networks
• [DLP] Deep Learning for Text and Sequences
[IDL] Word Embeddings and Recurrent NNs
1. Word Embeddings for Language Models
2. Building Feed-Forward Language Models
3. Improving Feed-Forward Language Models
4. Overfitting
5. Recurrent Networks
6. Long Short-Term Memory
Language Model
• A language model is a probability distribution over all strings in a language
• Let 𝐄𝟏,𝐧 = 𝐸1⋯𝐸𝑛 be a sequence of 𝑛 random variables denoting a string of 𝑛 words and 𝐞𝟏,𝐧 be one candidate value
Word Embeddings
Penn TreeBank Corpus
• A treebank is a parsed text corpus that annotates syntactic or semantic sentence structure
• Penn TreeBank Corpus consists of about 1,000,000 words of news articles from the Wall Street Journal
• It has been tokenized but not “unked”, so the vocabulary is close to 50,000 words
• Counts
• 2,312 articles in parsed/mrg/wsj/##/wsj_####.mrg (text) files
• 49,206 distinct words [mixed-case (upper-case and lower-case letters allowed)]
• 1,173,766 total words
• We replace all words that occur 10 times or fewer with *UNK*
Word Embeddings
Example Article: wsj_0001.mrg
( (S
(NP-SBJ
(NP (NNP Pierre) (NNP Vinken) )
(, ,)
(ADJP
(NP (CD 61) (NNS years) )
(JJ old) )
(, ,) )
(VP (MD will)
(VP (VB join)
(NP (DT the) (NN board) )
(PP-CLR (IN as)
(NP (DT a) (JJ nonexecutive) (NN director) ))
(NP-TMP (NNP Nov.) (CD 29) )))
(. .) ))
( (S
(NP-SBJ (NNP Mr.) (NNP Vinken) )
(VP (VBZ is)
(NP-PRD
(NP (NN chairman) )
(PP (IN of)
(NP
(NP (NNP Elsevier) (NNP N.V.) )
(, ,)
(NP (DT the) (NNP Dutch) (VBG publishing) (NN group) )))))
(. .) ))
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Word Embeddings
Bigram Model
• If we had a very large amount of English text, we might be able to estimate the first two or three probabilities on the right-hand side simply by counting how often we see, e.g., “We live” and how often “in” appears next, and then dividing the second count by the first (i.e., using the maximum likelihood estimate) to give us an estimate of, e.g., P(in | We live)
• But as ‘n’ gets large this becomes impossible: the training corpus will contain no examples of a particular, say, fifty-word sequence
• The standard response is to assume the probability of the next word only depends on the previous one or two words
Bigram Model
Word Embeddings
Sentence Padding
We can simplify the expression for the bigram model if we place an imaginary “STOP” at the beginning of the corpus, and then after every sentence [this is called “sentence padding”]
Word Embeddings
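To make the bigram counting concrete, here is a minimal sketch (not from the book) of maximum likelihood estimation with sentence padding; the two-sentence corpus is purely illustrative.

from collections import Counter

corpus = [["we", "live", "in", "seattle"], ["we", "live", "well"]]  # toy sentences (illustrative)
tokens = ["STOP"]                      # imaginary STOP at the beginning of the corpus ...
for sentence in corpus:
    tokens += sentence + ["STOP"]      # ... and after every sentence

unigram_counts, bigram_counts = Counter(), Counter()
for prev, curr in zip(tokens[:-1], tokens[1:]):
    unigram_counts[prev] += 1
    bigram_counts[(prev, curr)] += 1

def bigram_prob(prev, curr):
    # maximum likelihood estimate: count("we live") / count("we"), etc.
    return bigram_counts[(prev, curr)] / unigram_counts[prev]

print(bigram_prob("we", "live"))  # 1.0 in this toy corpus
print(bigram_prob("live", "in"))  # 0.5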
Bad Language Model
• If there are 𝑉 = 10,000 words in our vocabulary, we could use the uniform distribution; i.e. predict 1/10,000 for all words [read 𝑉 as “size of Vocabulary set V”]
• Perplexity = exp(mean(-log(probabilityEst))) = exp(-log(1/10,000)) = 10,000
Word Embeddings
Word Embeddings
• We need to turn words into the sorts of things that deep networks can manipulate; i.e. floating-point numbers
• A standard solution is to associate each word with a vector of floating point numbers
• The vectors are called word embeddings
• Can be pretrained using some other task; e.g. predict whether a word will appear “near” another word
• Can be trained directly as part of the neural net model
Word Embeddings
Feedforward Net for Language Modeling
Input > Embedding > Dense Layer > Softmax Activation
Word Embeddings
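As a rough sketch of the Input > Embedding > Dense > Softmax pipeline (using the 7,500-word vocabulary and 30-dimensional embeddings mentioned on the next slide; everything else, such as the optimizer, is an assumption rather than the book’s exact code):

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

vocab_size, embed_size = 7500, 30
model = Sequential()
model.add(Embedding(vocab_size, embed_size, input_length=1))  # previous word index -> embedding
model.add(Flatten())                                          # (1, 30) -> (30,)
model.add(Dense(vocab_size, activation='softmax'))            # probability distribution over the next word
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()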
Cosine Similarity
• Constructing a language model using the Penn TreeBank with a vocabulary of 7,500 words and an embedding size of 30
• Cosine similarity can be used to measure the similarity of embeddings for the words, with values in the interval [-1, 1]
Word Embeddings
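Cosine similarity between two embedding vectors can be computed as follows; this NumPy sketch (and the embedding_matrix / word_index names) is mine, not the book’s code.

import numpy as np

def cosine_similarity(u, v):
    # dot product divided by the product of the vector lengths; result lies in [-1, 1]
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# embedding_matrix: the learned (vocab_size x 30) weights of the Embedding layer
# e.g. cosine_similarity(embedding_matrix[word_index["under"]],
#                        embedding_matrix[word_index["above"]])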
Cosine Similarity Examples
• Example of inability to distinguish between synonyms and antonyms; e.g. “under” and “above” are arguably antonyms
• Five pairs of “similar” words: “odd” words are most similar to the other member of the pair
Word Embeddings
Perplexity for a Test Corpus
exp(mean(-log(probability(actualWord))))
if we’re always assigning probability = 1/𝑉 to each word, then exp(−log(1/𝑉)) = 𝑉
Perplexity can also be written as 𝑒^(𝑥_𝑑 / 𝑑), where 𝑒 is Euler’s number, 𝑥_𝑑 is the sum of the log loss, and 𝑑 is the number of words in the corpus
Language Models
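A small sketch of the perplexity computation itself (my own helper, with assumed per-word probabilities):

import numpy as np

def perplexity(word_probs):
    # exp of the mean negative log probability assigned to the actual words
    return float(np.exp(np.mean(-np.log(word_probs))))

print(perplexity([1 / 10000] * 5))  # uniform model over 10,000 words -> 10000.0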
Improving the Feedforward Language Model
Most straightforward way to improve the model: move from a bigram model (previous word used to predict the next word) to a trigram model (two previous words used to predict the next word)
Language Models
Larger n-grams Improve Language Model Performance
• For an n-gram model, (n-1) words are used to predict the next word
• This is used to produce a language model that can be used to estimate the probability of a string, such as a sentence
Language Models
Overfitting in a Language Model
Just another example requiring early stopping: after epoch 6, the training perplexity continues to decrease (get better) but the validation perplexity starts to increase (get worse)
Reminder: it’s common for folks in the Natural Language Processing (NLP) community to refer to a validation set as a “development” (dev) set [and to refer to a “testing” set as an “evaluation” (eval) set]
Overfitting
Methods to Prevent Overfitting
Revised perplexity values for the dev set, when applying dropout [preventing memorization] and regularization [keeping weights smaller]
Compare to dev results when using neither dropout nor regularization:
Overfitting
Regularization Reminder
The first term in the loss below represents cross entropy, while the second term represents the L2 regularization penalty [with lower-case phi being the individual weights of the upper-case phi weight set]
When we differentiate the loss function with respect to 𝜙, the second term adds 𝛼 ∗ 𝜙 to 𝜕ℒ/𝜕𝜙
Overfitting
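In Keras, dropout and L2 weight regularization could be added roughly as follows; the layer sizes and rates here are placeholder assumptions, not the model from the slides.

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dropout, Dense
from keras.regularizers import l2

model = Sequential()
model.add(Embedding(7500, 30, input_length=2))   # trigram model: two previous words
model.add(Flatten())
model.add(Dropout(0.5))                          # randomly zero activations [prevents memorization]
model.add(Dense(7500, activation='softmax',
                kernel_regularizer=l2(0.01)))    # alpha * sum(phi^2) penalty [keeps weights smaller]
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')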
BasicRNNCell
lines 447 – 454 of rnn_cell_impl.py
fancy: concat replaces addition
https://github.com/tensorflow/tensorflow/blob/r1.13/tensorflow/python/ops/rnn_cell_impl.py
Note: using a dot for concatenation is … ummm … suboptimal notation
Recurrent Networks
[fancy can be hard to read]
Allocating Words When Batch Size = 2 and Window Size = 3
𝑆 = (𝑐 − 1) / 𝑏, where S is the number of sections, c is the size of the corpus, and b is the batch size
Recurrent Networks
Tensorflow Code for Creating the RNN
rnnSz: the size of the RNN cell, also known as the number of units
Recurrent Networks
LSTM Cell’s Forget Gate
If the Sigmoid value is zero, when we do the element-wise multiplication of the forget gate f and the carry c, we’ll be forgetting values from the carry
Note: using a dot for concatenation is … ummm … suboptimal notation
Long Short-Term Memory
LSTM Cell Output
• Equations 4.17 and 4.18 (repeated below) are not correct …
• Compare to Figure 4.9, which was correct
• a1 [from the slide before last] is an “input” gate
• Sigmoid(h’’), below, gives us an “output” gate
• ht+1 = Sigmoid(h’’) * tanh(ct+1) … element-wise multiplication
• Note the LSTM Cell has two outputs: c and h
• We’ll review the code in just a few slides
Long Short-Term Memory
word2vec
• Two models [see the gensim sketch after this list]
• Skip-gram (with negative sampling): predict whether one word is within a “window” of the other word [grab some random words for negative sampling]
• Continuous Bag-Of-Words (CBOW): predict the current word from a window of context words
• Geometric interpretations
• King – Man = Queen – Woman
• Consider the possibility of masculine and feminine dimensions
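A minimal sketch of training both word2vec variants with gensim (assuming gensim 4.x, where the dimensionality argument is named vector_size); the toy corpus is illustrative and far too small to produce meaningful analogies.

from gensim.models import Word2Vec

sentences = [["the", "king", "rules"], ["the", "queen", "rules"]]  # toy corpus
skipgram = Word2Vec(sentences, vector_size=30, window=2, sg=1, negative=5, min_count=1)
cbow = Word2Vec(sentences, vector_size=30, window=2, sg=0, min_count=1)

# with vectors trained on a real corpus, King - Man + Woman lands near Queen:
# skipgram.wv.most_similar(positive=["king", "woman"], negative=["man"])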
Simple Recurrent Neural Network (RNN) Cell
• Under “class SimpleRNNCell”, look for “def call”
# recurrent.py (link below), line 885: 2019-05-12
h = K.dot(inputs, self.kernel)
h = K.bias_add(h, self.bias)
output = h + K.dot(prev_output, self.recurrent_kernel)
output = self.activation(output) # default: tanh()
return output, [output]
• For each sequence position: features for previous output added to features for current input
2 weight matrices and 1 bias vector [same 2 weight matrices and 1 bias vector used for all positions in the sequence]
https://github.com/keras-team/keras/blob/master/keras/layers/recurrent.py
Long Short-Term Memory (LSTM) Cell
Under “class LSTMCell”, look for “def call” [8 weight matrices and 4 bias vectors]
# recurrent.py (link below), line 1935: 2019-05-12
x_i = K.dot(inputs, self.kernel_i)
x_f = K.dot(inputs, self.kernel_f)
x_c = K.dot(inputs, self.kernel_c)
x_o = K.dot(inputs, self.kernel_o)
x_i = K.bias_add(x_i, self.bias_i)
x_f = K.bias_add(x_f, self.bias_f)
x_c = K.bias_add(x_c, self.bias_c)
x_o = K.bias_add(x_o, self.bias_o)
i = self.recurrent_activation(x_i + K.dot(h_tm1, self.recurrent_kernel_i)) # input gate
f = self.recurrent_activation(x_f + K.dot(h_tm1, self.recurrent_kernel_f)) # forget gate
c = f * c_tm1 + i * self.activation(x_c + K.dot(h_tm1, self.recurrent_kernel_c)) # carry
o = self.recurrent_activation(x_o + K.dot(h_tm1, self.recurrent_kernel_o)) # output gate
h = o * self.activation(c)
return h, [h, c]
https://github.com/keras-team/keras/blob/master/keras/layers/recurrent.py
recurrent_activation: hard_sigmoid()
activation: tanh()
tm1: time ‘t’ minus 1
source for image: http://shop.oreilly.com/product/0636920052289.do
Gated Recurrent Unit (GRU) Cell
Under “class GRUCell”, look for “def call” [6 weight matrices and 3 bias vectors]
# recurrent.py (link below), line 1346: 2019-05-12
x_z = K.dot(inputs_z, self.kernel_z)
x_r = K.dot(inputs_r, self.kernel_r)
x_h = K.dot(inputs_h, self.kernel_h)
x_z = K.bias_add(x_z, self.input_bias_z)
x_r = K.bias_add(x_r, self.input_bias_r)
x_h = K.bias_add(x_h, self.input_bias_h)
recurrent_z = K.dot(h_tm1_z, self.recurrent_kernel_z)
recurrent_r = K.dot(h_tm1_r, self.recurrent_kernel_r)
z = self.recurrent_activation(x_z + recurrent_z) # update gate
r = self.recurrent_activation(x_r + recurrent_r) # reset gate
recurrent_h = K.dot(r * h_tm1_h, self.recurrent_kernel_h)
hh = self.activation(x_h + recurrent_h)
h = z * h_tm1 + (1 - z) * hh
return h, [h]
https://github.com/keras-team/keras/blob/master/keras/layers/recurrent.py
recurrent_activation: hard_sigmoid()
activation: tanh()
tm1: time ‘t’ minus 1
source for image: http://shop.oreilly.com/product/0636920052289.do (super close to reflecting source code: “1-”)
[DLP] Deep Learning for Text and Sequences
1. Working with Text Data
2. Understanding Recurrent Networks
3. Advanced Use of Recurrent Neural Networks
4. Sequence Processing with ConvNets
Applications
• Document classification and timeseries classification, such as identifying the topic of an article or the author of a book
• Timeseries comparisons, such as estimating how closely related two documents or two stock tickers are
• Sequence-to-sequence learning, such as decoding an English sentence into French
• Sentiment analysis, such as classifying the sentiment of tweets or movie reviews as positive or negative
• Timeseries forecasting, such as predicting the future weather at a certain location, given recent weather data
Working with Text Data
Vectorizing the Text
• Transforming text into numeric tensors
• Multiple possibilities exist for tokenization …
• Segment text into words, and transform each word into a vector
• Segment text into characters, and transform each character into a vector
• Extract n-grams of words or characters, and transform each n-gram into a vector [n-grams are overlapping groups of multiple consecutive words or characters]
• Methods for encoding include … [see the Tokenizer sketch after this list]
• Multi-hot encoding [indicators for presence of tokens]
• Term Frequency – Inverse Document Frequency encoding (TF-IDF)
• Token embeddings (typically used for words and called word embeddings)
Working with Text Data
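The Keras Tokenizer covers the first two encodings; a short sketch with made-up example texts:

from keras.preprocessing.text import Tokenizer

samples = ["The cat sat on the mat.", "The dog ate my homework."]  # toy documents
tokenizer = Tokenizer(num_words=1000)   # keep only the 1,000 most frequent words
tokenizer.fit_on_texts(samples)

multi_hot = tokenizer.texts_to_matrix(samples, mode='binary')  # indicators for presence of tokens
tf_idf = tokenizer.texts_to_matrix(samples, mode='tfidf')      # TF-IDF weighted counts
sequences = tokenizer.texts_to_sequences(samples)              # integer indices, ready for an Embedding layer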
N-Grams
• Names
• 1-grams are called unigrams
• 2-grams are called bigrams
• 3-grams are called trigrams
• 4-grams are called … 4-grams
• “The cat sat on the mat”
• Unigrams: { “The”, “cat”, “sat”, “on”, “the”, “mat” }
• Bigrams: { “The cat”, “cat sat”, “sat on”, “on the”, “the mat” }
• Trigrams: { “The cat sat”, “cat sat on”, “sat on the”, “on the mat” }
• Note: the term “bag” refers to an unordered set rather than a sequence
Working with Text Data
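Extracting word n-grams is a one-liner; a quick sketch:

def ngrams(text, n):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("The cat sat on the mat", 2))
# ['The cat', 'cat sat', 'sat on', 'on the', 'the mat']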
Sparse versus Dense Representation
The primary curse of dimensionality is sparsity
Working with Text Data
Two Ways to Obtain Word Embeddings
• Learn word embeddings jointly with the main task you care about (such as document classification or sentiment prediction). In this setup, you start with random word vectors and then learn word vectors in the same way you learn the weights of a neural network.
• Load into your model word embeddings that were precomputed using a different machine-learning task than the one you’re trying to solve. These are called pretrained word embeddings.
Working with Text Data
Toy Example of a Word Embedding Space
• Vertical dimension could be interpreted as a “wild” index
• Horizontal dimension could be interpreted as a “feline” index
• Another popular example:
• Queen – Woman == King – Man
• Consider the possibility that the dimensions include feminine and masculine indexes
Working with Text Data
Loading the Internet Movie DataBase (IMDB) Data for Use with an Embedding Layer
Only the first 20 words of each review are used
Working with Text Data
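A sketch of the setup, close to the book’s listing but with training details (epochs, validation split) treated as assumptions:

from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

max_features, maxlen = 10000, 20   # vocabulary size; keep only 20 words per review
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
model.add(Embedding(max_features, 8, input_length=maxlen))  # 8-dimensional embeddings
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)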
Pretrained Word Embeddings
• Word2Vec: skip-gram and continuous bag-of-words architectures
• https://code.google.com/archive/p/word2vec
• Global Vectors (GloVe): matrix factorization
• https://nlp.stanford.edu/projects/glove
• Embedding() arguments [see the sketch below]
• embeddings_initializer=keras.initializers.Constant(embedding_matrix)
• trainable=False
Working with Text Data
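Used together, those two arguments load the pretrained vectors and keep them frozen during training; a short sketch (max_words, maxlen, and embedding_matrix are assumed to come from the tokenization and GloVe-parsing steps):

from keras.layers import Embedding
from keras.initializers import Constant

embedding_layer = Embedding(max_words, 100,
                            embeddings_initializer=Constant(embedding_matrix),  # pretrained GloVe vectors
                            input_length=maxlen,
                            trainable=False)                                    # do not update the vectors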
Processing the Labels of the Raw IMDB Data
http://mng.bz/0tIo
Working with Text Data
Parsing the GloVe Word-Embeddings File
http://nlp.stanford.edu/data/glove.6B.zip
Working with Text Data
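A sketch of parsing glove.6B.100d.txt into a word-to-vector index and then into an embedding matrix; the variable names (word_index from the fitted Tokenizer, max_words as the vocabulary cap) follow common usage rather than quoting the book’s listing verbatim.

import numpy as np

embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        embeddings_index[word] = np.asarray(values[1:], dtype='float32')

embedding_dim = 100
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():       # word_index comes from the fitted Tokenizer
    if i < max_words:
        vector = embeddings_index.get(word)
        if vector is not None:           # words missing from GloVe stay all-zeros
            embedding_matrix[i] = vector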
Evaluating the Model with Pretrained Embeddings
56% accuracy [okay, given only 200 training observations]
Working with Text Data
Wrapping Up
• Turn raw text into something a neural network can process
• Use the Embedding layer in a Keras model to learn task-specific token embeddings
• Use pretrained word embeddings to get an extra boost on small-data natural language-processing problems
Working with Text Data
Wrapping Up
• What RNNs are and how they work
• What LSTM is, and why it works better on long sequences than a naive RNN
• How to use Keras RNN layers to process sequence data
Recurrent Networks
Techniques
• Recurrent Dropout
• Stacking Recurrent Layers
• Bidirectional Recurrent Layers
Advanced Use
Jena Weather Columns
1. Date Time
2. p (mbar): atmospheric pressure in millibars
3. T (degC): temperature in degrees Celsius
4. Tpot (K): potential temperature (for reference pressure) on Kelvin scale
5. Tdew (degC): dewpoint temperature in degrees Celsius
6. rh (%): relative humidity
7. VPmax (mbar): maximum water vapor pressure
8. VPact (mbar): actual water vapor pressure
9. VPdef (mbar): water vapor pressure deficit
10. sh (g/kg): specific humidity
11. H2OC (mmol/mol): water vapor concentration
12. rho (g/m**3): air density
13. wv (m/s): wind velocity
14. max. wv (m/s): maximum wind velocity
15. wd (deg): wind direction
Advanced Use
Preparing the Data
• Given: 10 minutes between consecutive observations
• Parameters:
• lookback = 720—Observations will go back 5 days
• steps = 6—Observations will be sampled at one data point per hour
• delay = 144—Targets will be 24 hours in the future
• Preprocess the data to a format a neural network can ingest: normalize each timeseries independently so that they all take small values on a similar scale
• Write a Python generator that takes the current array of float data and yields batches of data from the recent past, along with a target temperature in the future
Advanced Use
Estimating a Baseline
[last temperature from observations]
MAE of 0.29: celsius_mae = 0.29 * std[1] = 2.57 degrees Celsius
Advanced Use
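The common-sense baseline just predicts that the temperature 24 hours from now equals the last observed temperature; a sketch of the evaluation, assuming a generator that yields (samples, targets) batches of normalized data as in the book:

import numpy as np

def evaluate_naive_method(val_gen, val_steps):
    batch_maes = []
    for _ in range(val_steps):
        samples, targets = next(val_gen)
        preds = samples[:, -1, 1]  # last observed temperature (column 1) in each lookback window
        batch_maes.append(np.mean(np.abs(preds - targets)))
    return np.mean(batch_maes)

# yields about 0.29 on the normalized data; 0.29 * std[1] is roughly 2.57 degrees Celsius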
Training and Validation Loss with Dropout
Not much better than before; but no longer overfitting
Advanced Use
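Recurrent dropout is specified directly on the layer; a sketch close to the book’s GRU model, with the exact rates treated as assumptions (float_data is the normalized Jena array):

from keras.models import Sequential
from keras.layers import GRU, Dense
from keras.optimizers import RMSprop

model = Sequential()
model.add(GRU(32,
              dropout=0.2,             # time-constant dropout mask on the inputs
              recurrent_dropout=0.2,   # dropout mask on the recurrent state
              input_shape=(None, float_data.shape[-1])))
model.add(Dense(1))                    # linear output for temperature regression
model.compile(optimizer=RMSprop(), loss='mae')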
Training and Validation Loss for Stacked GRU-Based Model
Adding a layer did not help much: diminishing returns from increasing network capacity
Advanced Use
Training and Validation Loss for Reversed Sequences Using a GRU Cell
Reversed-order sequences underperform: the last values processed by the GRU are the furthest away from the temperature prediction time
Advanced Use
Training an LSTM Using Reversed Sequences
Nearly identical performance compared to an LSTM with chronologically ordered sequences
Advanced Use
Bidirectional LSTM
Sometimes useful when applied to text
• Forward: tokens that come “before” are useful for understanding the current token
• Backward: tokens that come “after” are useful for understanding the current token
model.add(Bidirectional(LSTM(64))) # creates 2 LSTM cells
Advanced Use
Training a Bidirectional GRU for Temperature Prediction
Performs about as well as the model with the forward GRU layer
Advanced Use
Suggestions for Improving Temperature Predictions
• Adjust the number of units in each recurrent layer in the stacked setup. The current choices are largely arbitrary and thus probably suboptimal.
• Adjust the learning rate used by the RMSprop optimizer.
• Try using LSTM layers instead of GRU layers.
• Try using a bigger densely connected regressor on top of the recurrent layers: that is, a bigger Dense layer or even a stack of Dense layers.
• Don’t forget to eventually run the best-performing models (in terms of validation MAE) on the test set! Otherwise, you’ll develop architectures that are overfitting to the validation set.
Advanced Use
Wrapping Up
• When approaching a new problem, it’s good to first establish common-sense baselines for your metric of choice. If you don’t have a baseline to beat, you can’t tell whether you’re making real progress.
• Try simple models before expensive ones, to justify the additional expense. Sometimes a simple model will turn out to be your best option.
• When you have data where temporal ordering matters, recurrent networks are a great fit and easily outperform models that first flatten the temporal data.
• To use dropout with recurrent networks, you should use a time-constant dropout mask and recurrent dropout mask. These are built into Keras recurrent layers, so all you have to do is use the dropout and recurrent_dropout arguments of recurrent layers.
• Stacked RNNs provide more representational power than a single RNN layer. They’re also much more expensive and thus not always worth it. Although they offer clear gains on complex problems (such as machine translation), they may not always be relevant to smaller, simpler problems.
• Bidirectional RNNs, which look at a sequence both ways, are useful on natural-language processing problems. But they aren’t strong performers on sequence data where the recent past is much more informative than the beginning of the sequence.
Advanced Use
Markets and Machine Learning
• Markets have very different statistical characteristics than natural phenomena such as weather patterns. Trying to use machine learning to beat markets, when you only have access to publicly available data, is a difficult endeavor, and you’re likely to waste your time and resources with nothing to show for it.
• Always remember that when it comes to markets, past performance is not a good predictor of future returns—looking in the rear-view mirror is a bad way to drive. Machine learning, on the other hand, is applicable to datasets where the past is a good predictor of the future.
1D ConvNet
1D Pooling
• This is the 1D equivalent of the 2D versions
• Used for subsampling: giving access to the bigger picture [all pun intended]
• Common flavors include:
• Max pooling
• Average pooling
1D ConvNet
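A sketch of a 1D convnet for the IMDB task, following the general pattern (stacks of Conv1D and MaxPooling1D, ending in a global pooling operation); the specific filter counts and kernel sizes are assumptions.

from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense

model = Sequential()
model.add(Embedding(10000, 128, input_length=500))
model.add(Conv1D(32, 7, activation='relu'))   # 1D convolution along the word dimension
model.add(MaxPooling1D(5))                    # 1D subsampling
model.add(Conv1D(32, 7, activation='relu'))
model.add(GlobalMaxPooling1D())               # collapse the remaining time dimension
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])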
Loss and Accuracy for the IMDB ConvNet
Accuracy is not as good as the LSTM, but it runs faster
1D ConvNet
MAE Loss on the Jena Weather Data
Model has no knowledge of temporal position; e.g. toward the beginning, toward the end, etc.
1D ConvNet
MAE Loss for the Jena Weather Data
Not as good as the GRU alone, but it’s significantly faster
1D ConvNet
Wrapping Up
• In the same way that 2D convnets perform well for processing visual patterns in 2D space, 1D convnets perform well for processing temporal patterns. They offer a faster alternative to RNNs on some problems, in particular natural language processing tasks.
• Typically, 1D convnets are structured much like their 2D equivalents from the world of computer vision: they consist of stacks of Conv1D layers and Max-Pooling1D layers, ending in a global pooling operation or flattening operation.
• Because RNNs are extremely expensive for processing very long sequences, but 1D convnets are cheap, it can be a good idea to use a 1D convnet as a preprocessing step before an RNN, shortening the sequence and extracting useful representations for the RNN to process.
1D ConvNet
Parameter Counts for Recurrent Cells
• Number of Parameters = (1 + numGates) * recurrentCellSize * (previousLayerElementSize + 1 + recurrentCellSize) [see the check after this list]
• numGates =
• 0 for Simple Recurrent Neural Network (RNN) Cell
• 2 for Gated Recurrent Unit (GRU) Cell
• 3 for Long Short-Term Memory (LSTM) Cell
• “+ 1”: assumes we’re including a bias term for the cell’s features
• Runtime Complexity: the cell is invoked for each element in a sequence and each sequence in a batch
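A quick check of the parameter-count formula against Keras (assuming the 2019-era defaults referenced earlier, e.g. a GRU with reset_after=False; newer versions add an extra bias vector to the GRU):

from keras.models import Sequential
from keras.layers import SimpleRNN, GRU, LSTM

prev_size, cell_size = 30, 64
for num_gates, cell in [(0, SimpleRNN), (2, GRU), (3, LSTM)]:
    model = Sequential([cell(cell_size, input_shape=(None, prev_size))])
    expected = (1 + num_gates) * cell_size * (prev_size + 1 + cell_size)
    print(cell.__name__, model.count_params(), expected)  # the two counts should match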