Recurrent Neural Networks
COMP-550
Oct 5, 2017
Outline
Introduction to neural networks and deep learning
Feedforward neural networks
Recurrent neural networks
Classification Review
$y = f(\vec{x})$
Represent input $\vec{x}$ as a list of features
[Figure: an input document (placeholder text) is represented as a feature vector $x_1, x_2, \ldots, x_8, \ldots$ = 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, … and fed to a classifier, which produces an output label]
Logistic Regression
Linear regression:
$y = a_1 x_1 + a_2 x_2 + \ldots + a_n x_n + b$
Intuition: Linear regression gives us continuous values in $(-\infty, \infty)$; let's squish the values to be in $[0, 1]$!
Function that does this: the logistic function (the inverse of the logit function)
$P(y \mid \vec{x}) = \frac{1}{Z} e^{a_1 x_1 + a_2 x_2 + \ldots + a_n x_n + b}$
This $Z$ is a normalizing constant to ensure this is a probability distribution.
(a.k.a. maximum entropy or MaxEnt classifier)
N.B.: Don't be confused by the name; this method is most often used to solve classification problems.
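A minimal Python sketch of the squashing step, assuming the binary case with labels 0 and 1. In that case the normalizer reduces to $Z = 1 + e^{z}$, so $P(y = 1 \mid \vec{x}) = 1 / (1 + e^{-z})$, where $z$ is the linear part. The function names and the example weights are hypothetical, chosen only for illustration.

```python
import math

def logistic(z):
    """Squash a real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Return P(y = 1 | x) for a binary logistic regression model."""
    # Linear part: a_1*x_1 + ... + a_n*x_n + b
    z = sum(a * x for a, x in zip(weights, features)) + bias
    return logistic(z)

# Hypothetical weights, bias, and feature values (not from the slides).
p = predict_proba([2.0, -1.0], 0.5, [1.0, 0.0])
print(p)
```

A score of zero maps to probability 0.5; large positive scores approach 1 and large negative scores approach 0.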
Linear Model
Logistic regression, support vector machines, etc. are examples of linear models.
$P(y \mid \vec{x}) = \frac{1}{Z} e^{a_1 x_1 + a_2 x_2 + \ldots + a_n x_n + b}$
(The exponent is a linear combination of feature weights and values.)
Cannot learn complex, non-linear functions from input features to output labels (without adding features)
e.g., Starts with a capital AND not at beginning of sentence -> proper noun
(Artificial) Neural Networks
A kind of learning model which automatically learns non-linear functions from input to output
Biologically inspired metaphor:
• Network of computational units called neurons
• Each neuron takes scalar inputs, and produces a scalar output, very much like a logistic regression model
$\mathrm{Neuron}(\vec{x}) = g(a_1 x_1 + a_2 x_2 + \ldots + a_n x_n + b)$
As a whole, the network can theoretically compute any computable function, given enough neurons. (These notions can be formalized.)
Responsible For:
AlphaGo (Google) (2015)
• Beat Go champion Lee Sedol in a series of 5 matches, 4-1 (the match itself was played in March 2016)
Atari game-playing bot (Google) (2015)
Above results use NNs in conjunction with reinforcement learning
State of the art in:
• Speech recognition
• Machine translation
• Object detection
• Other NLP tasks
Feedforward Neural Networks
All connections flow forward (no loops); each layer of hidden units is fully connected to the next.
Figure from Goldberg (2015)
Inference in a FF Neural Network
Perform computations forwards through the graph:
$\mathbf{h}_1 = g_1(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)$
$\mathbf{h}_2 = g_2(\mathbf{h}_1\mathbf{W}_2 + \mathbf{b}_2)$
$\mathbf{y} = \mathbf{h}_2\mathbf{W}_3$
Note that we are now representing each layer as a vector, and combining all of the weights in a layer across the units into a weight matrix.
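The three equations for inference can be sketched in plain Python as follows. This is an illustration only: it assumes tanh for both $g_1$ and $g_2$, uses nested lists instead of a matrix library, and the layer sizes and weight values are hypothetical.

```python
import math

def vec_mat(x, W):
    """Multiply row vector x (length m) by matrix W (m rows, n columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def layer(x, W, b, g):
    """Compute g(xW + b), applying g to each component."""
    return [g(z + bj) for z, bj in zip(vec_mat(x, W), b)]

def forward(x, W1, b1, W2, b2, W3):
    h1 = layer(x, W1, b1, math.tanh)   # h1 = g1(x W1 + b1)
    h2 = layer(h1, W2, b2, math.tanh)  # h2 = g2(h1 W2 + b2)
    y = vec_mat(h2, W3)                # y = h2 W3 (no bias, no non-linearity)
    return h1, h2, y

# Hypothetical sizes: 2 inputs, 3 and 2 hidden units, 2 outputs.
x = [1.0, -2.0]
W1 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.5, -0.5], [0.25, 0.75], [-1.0, 1.0]]
b2 = [0.1, 0.2]
W3 = [[1.0, 0.0], [0.0, 1.0]]
h1, h2, y = forward(x, W1, b1, W2, b2, W3)
print(y)
```

In practice the same computation would be written with a matrix library so that each layer is a single matrix product.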
Activation Function
In one unit:
$\mathbf{h}_1 = g_1(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)$
Here $\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1$ is the linear combination of inputs and weight values, and $g_1$ is the non-linearity.
Popular choices:
Sigmoid function (just like logistic regression!)
tanh function
Rectifier/ramp function: $g(x) = \max(0, x)$
Why do we need the non-linearity?
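One way to see why the non-linearity matters: without it, stacked layers collapse into a single linear layer, so depth adds no expressive power. The sketch below defines the three activation functions named above and checks the collapse in the scalar case. The function names and parameter values are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectifier/ramp function: max(0, z)."""
    return max(0.0, z)

# tanh is available directly as math.tanh.

def two_linear_layers(x, w1, b1, w2, b2):
    """Two stacked scalar layers with NO non-linearity in between."""
    return w2 * (w1 * x + b1) + b2

def one_linear_layer(x, w1, b1, w2, b2):
    """The single linear layer that computes exactly the same function."""
    return (w2 * w1) * x + (w2 * b1 + b2)

# Hypothetical parameter values.
args = (2.0, 1.0, 3.0, -4.0)
print(two_linear_layers(5.0, *args), one_linear_layer(5.0, *args))
```

Inserting any of the three activation functions between the two layers breaks this equivalence, which is what lets the network represent non-linear functions.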
Softmax Layer
In NLP, we often care about discrete outcomes
• e.g., words, POS tags, topic label
Output layer can be constructed such that the output values sum to one:
Let $\mathbf{x} = x_1 \ldots x_n$
$\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}$
Interpretation: $\mathrm{softmax}(x_i)$ represents the probability that the outcome is $i$.
Essentially, the last layer is like a multinomial logistic regression
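A short Python sketch of the softmax computation. Subtracting the maximum score before exponentiating is a common numerical-stability measure that is not in the slides; it leaves the result unchanged because the shift cancels between numerator and denominator. The input scores are hypothetical.

```python
import math

def softmax(xs):
    """Turn a list of real-valued scores into probabilities that sum to one."""
    m = max(xs)                            # shift by the max to avoid overflow
    exps = [math.exp(x - m) for x in xs]   # the shift cancels in the ratio
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # hypothetical output-layer scores
print(probs)
```

The largest score receives the largest probability, and equal scores receive equal probabilities.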
Loss Function
A neural network is optimized with respect to a loss function, which measures how much error it is making on predictions:
$\mathbf{y}$: correct, gold-standard distribution over class labels
$\hat{\mathbf{y}}$: system predicted distribution over class labels
$L(\mathbf{y}, \hat{\mathbf{y}})$: loss function between the two
Popular choice for classification (usually with a softmax output layer): cross entropy
$L_{ce}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{i} y_i \log(\hat{y}_i)$
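The cross-entropy formula can be sketched in a few lines of Python. With a one-hot gold distribution the sum reduces to the negative log-probability that the model assigns to the correct class. The example distributions are hypothetical.

```python
import math

def cross_entropy(gold, predicted):
    """L_ce(y, y_hat) = -sum_i y_i * log(y_hat_i)."""
    # Terms with y_i == 0 contribute nothing, so skip them (this avoids log(0)).
    return -sum(y * math.log(p) for y, p in zip(gold, predicted) if y > 0.0)

gold = [0.0, 1.0, 0.0]          # one-hot gold label: class 1 (hypothetical)
predicted = [0.25, 0.5, 0.25]   # hypothetical softmax output
print(cross_entropy(gold, predicted))
```

The loss falls as the predicted probability of the gold class rises, and reaches zero when that probability is 1.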
Training Neural Networks
Typically done by stochastic gradient descent
• For one training example, find the gradient of the loss function with respect to the parameters of the network (i.e., the weights of each layer); "travel along" the opposite direction, which is the direction that decreases the loss.
Network has very many parameters!
Efficient algorithm to compute the gradient with respect to all parameters: backpropagation (Rumelhart et al., 1986)
• Boils down to an efficient way to use the chain rule of derivatives to propagate the error signal from the loss function backwards through the network back to the inputs
SGD Overview
Inputs:
• Function computed by neural network, $f(\mathbf{x}; \theta)$
• Training samples $\{\mathbf{x}_k, \mathbf{y}_k\}$
• Loss function $L$
Repeat for a while:
  Sample a training case, $\mathbf{x}_k, \mathbf{y}_k$
  Compute loss $L(f(\mathbf{x}_k; \theta), \mathbf{y}_k)$ (forward pass)
  Compute gradient $\nabla L(\mathbf{x}_k)$ wrt the parameters $\theta$ (by backpropagation)
  Update $\theta \leftarrow \theta - \eta \nabla L(\mathbf{x}_k)$
Return $\theta$
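The SGD loop can be sketched as below. To keep the example short, a single logistic unit stands in for the network, so the gradient is written in closed form rather than computed by backpropagation. The toy data set, learning rate, step count, and function names are all assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """f(x; theta): logistic unit; theta holds the weights followed by a bias."""
    return sigmoid(sum(w * xi for w, xi in zip(theta, x)) + theta[-1])

def sgd(samples, n_features, lr=0.5, steps=5000, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * (n_features + 1)
    for _ in range(steps):                  # "repeat for a while"
        x, y = rng.choice(samples)          # sample a training case
        p = predict(theta, x)               # forward pass
        # Gradient of the cross-entropy loss wrt theta for this model:
        # (p - y) * x_j for each weight, and (p - y) for the bias.
        grad = [(p - y) * xj for xj in x] + [p - y]
        # Update: theta <- theta - eta * gradient
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# Toy training set (hypothetical): the logical OR of two binary features.
samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
           ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
theta = sgd(samples, n_features=2)
print([round(predict(theta, x), 3) for x, _ in samples])
```

For a multi-layer network the loop stays the same; only the gradient computation is replaced by backpropagation.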
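Example: Forward Pass
$\mathbf{h}_1 = g_1(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)$
$\mathbf{h}_2 = g_2(\mathbf{h}_1\mathbf{W}_2 + \mathbf{b}_2)$
$f(\mathbf{x}) = \mathbf{y} = g_3(\mathbf{h}_2) = \mathbf{h}_2\mathbf{W}_3$
Loss function: $L(\mathbf{y}, \mathbf{y}_{gold})$
Save the values for $\mathbf{h}_1$, $\mathbf{h}_2$, $\mathbf{y}$ too!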
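Example Cont'd: Backpropagation
$f(\mathbf{x}) = g_3(g_2(g_1(\mathbf{x})))$
Need to compute: $\frac{\partial L}{\partial W_3}$, $\frac{\partial L}{\partial W_2}$, $\frac{\partial L}{\partial W_1}$
By calculus and chain rule:
• $\frac{\partial L}{\partial W_3} = \frac{\partial L}{\partial g_3}\frac{\partial g_3}{\partial W_3}$
• $\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial g_3}\frac{\partial g_3}{\partial g_2}\frac{\partial g_2}{\partial W_2}$
• $\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial g_3}\frac{\partial g_3}{\partial g_2}\frac{\partial g_2}{\partial g_1}\frac{\partial g_1}{\partial W_1}$
Notice the overlapping computations? Be sure to do this in a smart order to avoid redundant computations!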
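A scalar sketch of the chain-rule computation, working from the output backwards so that each shared factor is computed once and reused. It assumes tanh for $g_1$ and $g_2$ and a squared-error loss; neither choice is fixed by the slides, and all names and parameter values are hypothetical. A finite-difference estimate is included only as a sanity check on the analytic gradients.

```python
import math

def forward(x, p):
    """Forward pass; returns the saved values h1, h2, y."""
    h1 = math.tanh(p["w1"] * x + p["b1"])
    h2 = math.tanh(p["w2"] * h1 + p["b2"])
    y = p["w3"] * h2
    return h1, h2, y

def loss(y, y_gold):
    """Squared error (an assumption; the slides leave L unspecified here)."""
    return 0.5 * (y - y_gold) ** 2

def backward(x, y_gold, p):
    """Gradients of the loss wrt w3, w2, w1, from the output backwards."""
    h1, h2, y = forward(x, p)
    dL_dy = y - y_gold
    grads = {"w3": dL_dy * h2}
    # Error signal at layer 2's pre-activation; reused for w2 AND for w1.
    delta2 = dL_dy * p["w3"] * (1.0 - h2 ** 2)
    grads["w2"] = delta2 * h1
    # Error signal at layer 1's pre-activation, built from delta2.
    delta1 = delta2 * p["w2"] * (1.0 - h1 ** 2)
    grads["w1"] = delta1 * x
    return grads

def numeric_grad(name, x, y_gold, p, eps=1e-6):
    """Central finite difference, used only to check the analytic gradient."""
    hi = dict(p, **{name: p[name] + eps})
    lo = dict(p, **{name: p[name] - eps})
    return (loss(forward(x, hi)[2], y_gold)
            - loss(forward(x, lo)[2], y_gold)) / (2 * eps)

# Hypothetical parameters and a single training pair (x = 0.7, y_gold = 1.0).
params = {"w1": 0.5, "b1": 0.1, "w2": -0.8, "b2": 0.2, "w3": 1.5}
grads = backward(0.7, 1.0, params)
for name in ("w3", "w2", "w1"):
    print(name, grads[name], numeric_grad(name, 0.7, 1.0, params))
```

The variable `delta2` is the overlapping computation: it appears in the gradients for both `w2` and `w1`, and computing it once is the "smart order".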
Example: Time Delay Neural Network
Let's draw a neural network architecture for POS tagging using a feedforward neural network.
We'll construct a context window around each word, and predict the POS tag of that word as the output.
Limitations of this approach?
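Constructing the context windows might look like the sketch below. The padding token, the window size, and the function name are assumptions; each window would then be turned into features and passed to the feedforward network to predict the tag of its centre word.

```python
def context_windows(tokens, size=2, pad="<PAD>"):
    """Return one window of 2*size + 1 tokens centred on each input token."""
    padded = [pad] * size + list(tokens) + [pad] * size
    return [padded[i:i + 2 * size + 1] for i in range(len(tokens))]

# Hypothetical three-word sentence with a window of one word on each side.
for window in context_windows(["the", "dog", "barks"], size=1):
    print(window)
```

The fixed window size makes one limitation concrete: anything outside the window is invisible to the classifier.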
Recurrent Neural Networks
A neural network sequence model:
$\mathrm{RNN}(\mathbf{s}_0, \mathbf{x}_{1:n}) = \mathbf{s}_{1:n}, \mathbf{y}_{1:n}$
$\mathbf{s}_i = R(\mathbf{s}_{i-1}, \mathbf{x}_i)$ # $\mathbf{s}_i$: state vector
$\mathbf{y}_i = O(\mathbf{s}_i)$ # $\mathbf{y}_i$: output vector
$R$ and $O$ are parts of the neural network that compute the next state vector and the output vector
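The recurrence can be sketched as a loop that threads the state through the sequence. The `rnn` function follows the definition above; the particular scalar `R` and `O` shown here (a tanh recurrence and a scaled output) are hypothetical choices, as are their weights.

```python
import math

def rnn(s0, xs, R, O):
    """Unroll the recurrence: s_i = R(s_{i-1}, x_i), y_i = O(s_i)."""
    states, outputs = [], []
    s = s0
    for x in xs:
        s = R(s, x)             # next state from previous state and input
        states.append(s)
        outputs.append(O(s))    # output computed from the current state
    return states, outputs

# Hypothetical scalar choices for R and O.
def R(s, x, w_s=0.5, w_x=1.0, b=0.0):
    return math.tanh(w_s * s + w_x * x + b)

def O(s, w_o=2.0):
    return w_o * s

states, outputs = rnn(0.0, [1.0, 0.0, -1.0], R, O)
print(states, outputs)
```

The same `R` and `O`, with the same weights, are applied at every position, which is why the model handles sequences of any length.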
Long-Term Dependencies in Language
There can be dependencies between words that are arbitrarily far apart.
• I will look the word that you have described that doesn't make sense to her up.
• Can you think of some other examples in English of long-range dependencies?
Cannot easily model with HMMs or even LC-CRFs, but can with RNNs
Vanishing and Exploding Gradients
If $R$ and $O$ are simple fully connected layers, we have a problem. In the unrolled network, the gradient signal can get lost on its way back to the words far in the past:
• Suppose it is $W_1$ that we want to modify, and there are $N$ layers between that and the loss function.
$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial g_N}\frac{\partial g_N}{\partial g_{N-1}} \ldots \frac{\partial g_2}{\partial g_1}\frac{\partial g_1}{\partial W_1}$
• If the norms of the factors in this product are consistently small (<1), the gradient shrinks exponentially in $N$ and vanishes to near-zero (or it explodes towards infinity if they are consistently >1)
• This happens especially because we have repeated applications of the same weight matrices in the recurrence
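A small numeric illustration of the effect, under the simplifying assumption that every factor in the chain-rule product has the same magnitude. The factor values and the depth of 50 are hypothetical.

```python
def gradient_scale(factor, n_layers):
    """Product of n_layers identical per-layer factors in the chain rule."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= factor
    return scale

# Slightly below 1, exactly 1, and slightly above 1, over 50 layers.
for factor in (0.9, 1.0, 1.1):
    print(factor, gradient_scale(factor, 50))
```

Even factors close to 1 shrink or grow the gradient by orders of magnitude once the sequence is a few dozen steps long.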
Long Short-Term Memory Networks
Currently one of the most popular RNN architectures for NLP (Hochreiter and Schmidhuber, 1997)
• Explicitly models a "memory" cell (i.e., a hidden-layer vector), and how it is updated as a response to the current input and the previous state of the memory cell.
Visual step-by-step explanation:
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Fix for Vanishing Gradients
In the LSTM, the cell state can be propagated directly from one time step to the next; this is what addresses the vanishing gradient problem:
There is no repeated weight application between the internal states across time!
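A scalar sketch of one LSTM step, assuming the commonly used variant with a forget gate (the slides do not give the equations, so the gate layout, names, and parameter values here are assumptions). The line to notice is the cell update: the previous cell state is scaled by a gate and added to, with no weight matrix applied to it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(c_prev, h_prev, x, p):
    """One scalar LSTM step; p maps a gate name to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to write
    c = f * c_prev + i * g            # additive update of the cell state
    h = o * math.tanh(c)
    return c, h

# Hypothetical parameters: forget gate held near 1, input gate held near 0.
params = {"forget": (0.0, 0.0, 10.0), "input": (0.0, 0.0, -10.0),
          "output": (0.5, 0.5, 0.0), "candidate": (1.0, 1.0, 0.0)}
c, h = 1.0, 0.0
for x in [0.3, -0.7, 0.2] * 20:       # 60 time steps of arbitrary input
    c, h = lstm_step(c, h, x, params)
print(c)
```

With the gates set this way the cell state survives 60 steps almost unchanged, whereas a plain recurrence would pass it through a weight and a squashing function at every step.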
Hardware for NNs
Common operations in inference and learning:
• Matrix multiplication
• Component-wise operations (e.g., activation functions)
These operations are highly parallelizable!
Graphics processing units (GPUs) are specifically designed to perform this type of computation efficiently
Packages for Implementing NNs
TensorFlow https://www.tensorflow.org/
PyTorch http://pytorch.org/
Caffe http://caffe.berkeleyvision.org/
Theano http://deeplearning.net/software/theano/
• These packages support GPU and CPU computations
• Write interface code in a high-level programming language, like Python
Summary: Advantages of NNs
Learn relationships between inputs and outputs:
• Complex features and dependencies between inputs and states over long ranges with no fixed horizon assumption (i.e., non-Markovian)
• Reduces need for feature engineering
• More efficient use of input data via weight sharing
Highly flexible, generic architecture
• Multi-task learning: jointly train model that solves multiple tasks simultaneously
• Transfer learning: Take part of a neural network used for an initial task, use that as an initialization for a second, related task
Summary: Challenges of NNs
Complex models may need a lot of training data
Many fiddly hyperparameters to tune, little guidance on how to do so, except empirically or through experience:
• Learning rate, number of hidden units, number of hidden layers, how to connect units, non-linearity, loss function, how to sample data, training procedure, etc.
Can be difficult to interpret the output of a system
• Why did the model predict a certain label? Have to examine weights in the network.
• Important to convince people to act on the outputs of the model!
NNs for NLP
Neural networks have "taken over" mainstream NLP since 2014; most empirical work at recent conferences uses them in some way
Lots of interesting open research questions:
• How to use linguistic structure (e.g., word senses, parses, other resources) with NNs, either as input or output?
• When is linguistic feature engineering a good idea, rather than just giving the NN more data in a simple representation and letting it learn the features?
• Multitask and transfer learning for NLP
• Defining and solving new, challenging NLP tasks