Lecture 7 | 2015. 08. 02 | Do Hoerin

Deep Learning: Modeling Sequences / RNN



Page 1: Deep Learning: Modeling Sequences / RNN

Lecture 7 | 2015. 08. 02 | Do Hoerin

Page 2: Deep Learning: Modeling Sequences / RNN

Modeling sequences: A brief overview (Lecture 7a)

Page 3: Deep Learning: Modeling Sequences / RNN

Getting targets when modeling sequences

• Teach the network by trying to predict the next term in the input sequence.
• This blurs the distinction between supervised and unsupervised learning.

Page 4: Deep Learning: Modeling Sequences / RNN

Memoryless models

• Autoregressive models: predict the next term, input_t, from a fixed number of previous terms (delay taps such as input_{t-2} and input_{t-1}), for example as a weighted average of the individual values or of whole vectors.

• Feed-forward neural nets: generalize autoregressive models by adding one or more non-linear hidden units between the delay taps and the prediction (a sketch of both models follows below).

[Diagram: input_{t-2}, input_{t-1} → input_t for the autoregressive model; input_{t-2}, input_{t-1} → hidden → input_t for the feed-forward net.]
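A minimal sketch (not from the slides) of the two memoryless models above: a linear autoregressive predictor that combines the two delay taps with fixed weights, and a tiny feed-forward net that inserts one layer of non-linear hidden units. All names, sizes, and weights are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def autoregressive_predict(x_prev2, x_prev1, w=(0.5, 0.5)):
    # Memoryless autoregressive model: the prediction is a fixed
    # weighted combination of a window of previous inputs (delay taps).
    return w[0] * x_prev2 + w[1] * x_prev1

def feedforward_predict(x_prev2, x_prev1, W_in, w_out):
    # Feed-forward net: same delay taps, but with one layer of
    # non-linear hidden units between the inputs and the prediction.
    x = np.array([x_prev2, x_prev1])
    hidden = np.tanh(W_in @ x)
    return w_out @ hidden

# Illustrative usage on a toy scalar sequence.
seq = [0.1, 0.4, 0.3, 0.8]
W_in = rng.normal(size=(3, 2))   # 2 delay taps -> 3 hidden units
w_out = rng.normal(size=3)
print(autoregressive_predict(seq[-2], seq[-1]))
print(feedforward_predict(seq[-2], seq[-1], W_in, w_out))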

Page 5: Deep Learning: Modeling Sequences / RNN

Beyond Memoryless Models

• With a hidden state, we get a more interesting kind of model.
• It can store information in its hidden state for a long time.
• If the dynamics is noisy and the way it generates outputs from its hidden state is noisy, we can never know its exact hidden state.
• The best we can do is to infer a probability distribution over the space of hidden state vectors.
• This inference is only tractable for two types of hidden state model.

Page 6: Deep Learning: Modeling Sequences / RNN

Linear Dynamical Systems

[Diagram: a linear dynamical system unrolled over time; at each step a driving input feeds the real-valued hidden state, which produces an output.]

• Real-valued hidden states.
• Linear dynamics with Gaussian noise.
• Driving inputs.
• To predict the next output, we need to infer the hidden state.
• A linearly transformed Gaussian is still Gaussian, so the distribution over the hidden state can be computed exactly using Kalman filtering (a sketch follows below).
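Because a linearly transformed Gaussian stays Gaussian, the posterior over the hidden state can be tracked exactly. Below is a minimal scalar Kalman-filter step (predict, then update); the system coefficients A, B, C and noise variances Q, R are arbitrary illustrative choices, not values from the lecture.

def kalman_step(mu, var, y, A=1.0, B=1.0, C=1.0, Q=0.1, R=0.2, u=0.0):
    # One Kalman-filter step for a scalar linear dynamical system:
    #   h_t = A * h_{t-1} + B * u_t + process noise with variance Q
    #   y_t = C * h_t + observation noise with variance R
    # (mu, var) is the Gaussian belief over the hidden state.
    mu_pred = A * mu + B * u                     # predict: push the Gaussian through the dynamics
    var_pred = A * A * var + Q
    K = var_pred * C / (C * C * var_pred + R)    # Kalman gain
    mu_new = mu_pred + K * (y - C * mu_pred)     # update: condition on the observed output
    var_new = (1.0 - K * C) * var_pred
    return mu_new, var_new

# Illustrative usage: filter a short sequence of noisy observations.
mu, var = 0.0, 1.0
for y in [0.9, 1.1, 1.0]:
    mu, var = kalman_step(mu, var, y)
print(mu, var)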

Page 7: Deep Learning: Modeling Sequences / RNN

Hidden Markov Model

[Diagram: an HMM unrolled over time; at each step the hidden state (one of states A, B, C) produces an output.]

• HMMs have a discrete one-of-N hidden state.
• Transitions between states are stochastic.
• To predict the next output, we need to infer the probability distribution over hidden states (a sketch follows below).

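For an HMM, inferring the distribution over hidden states given the outputs so far is the forward (filtering) algorithm. A minimal sketch with made-up parameters for three states A, B, C and two output symbols; the transition and emission tables are purely illustrative.

import numpy as np

# Illustrative 3-state HMM (states A, B, C) with 2 output symbols.
T = np.array([[0.7, 0.2, 0.1],    # T[i, j] = P(next state j | current state i)
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
E = np.array([[0.9, 0.1],         # E[i, k] = P(output symbol k | state i)
              [0.5, 0.5],
              [0.1, 0.9]])
pi = np.array([1 / 3, 1 / 3, 1 / 3])   # initial state distribution

def forward_filter(outputs):
    # Returns P(state at the last step | all outputs so far):
    # the belief we need in order to predict the next output.
    belief = pi * E[:, outputs[0]]
    belief /= belief.sum()
    for y in outputs[1:]:
        belief = (belief @ T) * E[:, y]   # stochastic transition, then condition on the output
        belief /= belief.sum()
    return belief

print(forward_filter([0, 0, 1]))          # probabilities of being in states A, B, C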

Page 8: Deep Learning: Modeling Sequences / RNN

Limitations of HMMs

• Consider what happens when an HMM generates data: with N hidden states, it can remember only log N bits about what it has generated so far.
• When it generates the second half of an utterance, that half has to fit the first half in syntax, semantics, intonation, accent, rate, volume, the speaker's characteristics, and so on.
• If the first half of an utterance carries 100 bits of information about the second half, an HMM would need 2^100 states to carry it (a quick check follows below).
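The capacity argument in numbers, as a trivial check (nothing here comes from the lecture beyond the figures quoted above): N discrete states carry at most log2(N) bits, so carrying 100 bits forward would need about 2^100 states.

import math

n_states = 1024
print(math.log2(n_states))   # an HMM with 1024 states remembers at most 10 bits
print(2 ** 100)              # states needed to carry 100 bits from the first half of an utterance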

Page 9: Deep Learning: Modeling Sequences / RNN

Recurrent Neural Networks

• RNNs combine two properties:
• A distributed hidden state that can store a lot of information about the past.
• Non-linear dynamics to update that hidden state.
• The updates are deterministic (not stochastic), unlike the two probabilistic models above (a minimal sketch of the dynamics follows below).

[Diagram: an RNN unrolled over time; at each step the input feeds the hidden state, which produces an output and also feeds the next hidden state.]
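A minimal sketch of the dynamics in the diagram: a real-valued, distributed hidden state updated deterministically by a non-linear function of the previous hidden state and the current input. The sizes, the tanh non-linearity, and the linear output are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 3
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input -> hidden
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden -> hidden (the recurrence)
W_hy = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden -> output

def rnn_forward(inputs, h=None):
    # Deterministic, non-linear update of a distributed real-valued hidden state.
    h = np.zeros(n_hidden) if h is None else h
    outputs = []
    for x in inputs:
        h = np.tanh(W_xh @ x + W_hh @ h)   # non-linear dynamics
        outputs.append(W_hy @ h)           # output read off the hidden state
    return outputs, h

outputs, h_final = rnn_forward([rng.normal(size=n_in) for _ in range(5)])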

Page 10: Deep Learning: Modeling Sequences / RNN

Recurrent Neural Network

• What kinds of behaviour can an RNN exhibit?
• It can oscillate.
• It can settle to point attractors.
• It can behave chaotically.
• This computational power also makes RNNs very hard to train (discussed in Lecture 7d).

Page 11: Deep Learning: Modeling Sequences / RNN

Training RNNs with backpropagation (Lecture 7b)

Page 12: Deep Learning: Modeling Sequences / RNN

Training RNNs with backpropagation

• An RNN unrolled in time is just a layered feed-forward net with shared weights (one layer per time step).
• So we can use an ordinary training algorithm in the time domain:
• The forward pass builds up a stack of the activities of all the units at each time step.
• The backward pass peels activities off the stack to compute the error derivatives.
• After the backward pass, we add together the derivatives at all the different time steps for each weight (a sketch follows below).

[Diagram: the same weights w1, w2, w3, w4 replicated at every time step of the unrolled network.]
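A self-contained sketch of backpropagation through time matching the description above: the forward pass stacks the hidden activities, the backward pass peels them off, and the gradients for the shared weights are summed over all time steps. The tanh hidden units, linear outputs, squared error, and all sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 2, 3, 1, 4
W_xh = rng.normal(scale=0.5, size=(n_hid, n_in))
W_hh = rng.normal(scale=0.5, size=(n_hid, n_hid))
W_hy = rng.normal(scale=0.5, size=(n_out, n_hid))

def bptt(xs, ys, h0):
    # Forward pass: build up a stack of the hidden activities.
    hs, preds = [h0], []
    for x in xs:
        hs.append(np.tanh(W_xh @ x + W_hh @ hs[-1]))
        preds.append(W_hy @ hs[-1])
    # Backward pass: peel activities off the stack and accumulate
    # the derivatives for each shared weight over all time steps.
    gW_xh, gW_hh, gW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
    dh_next = np.zeros(n_hid)
    for t in reversed(range(len(xs))):
        dy = preds[t] - ys[t]                  # squared-error gradient at the output
        gW_hy += np.outer(dy, hs[t + 1])
        dh = W_hy.T @ dy + dh_next             # gradient arriving at h_t
        dpre = (1.0 - hs[t + 1] ** 2) * dh     # back through the tanh
        gW_xh += np.outer(dpre, xs[t])         # same weights at every step: add them up
        gW_hh += np.outer(dpre, hs[t])
        dh_next = W_hh.T @ dpre
    return gW_xh, gW_hh, gW_hy, dh_next        # dh_next is also dLoss/d(initial state)

xs = [rng.normal(size=n_in) for _ in range(T)]
ys = [rng.normal(size=n_out) for _ in range(T)]
gW_xh, gW_hh, gW_hy, g_h0 = bptt(xs, ys, h0=np.zeros(n_hid))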

Page 13: Deep Learning: Modeling Sequences / RNN

An irritating extra issue

• We need to specify the initial state of all the units.
• It is better to treat the initial states as learned parameters, learned in the same way as the weights:
• Start off with an initial random guess for the initial states.
• At the end of each training sequence, backpropagate through time all the way to the initial states to get the gradient of the error function with respect to each initial state.
• Adjust the initial states by following the negative gradient (a sketch follows below).
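Continuing the BPTT sketch above (so numpy, bptt, xs, and ys are assumed to be in scope): the same backward pass already delivers the gradient with respect to the initial state, so the initial state can be adjusted like any other parameter. The learning rate and loop length are arbitrary.

# Hypothetical continuation of the BPTT sketch above: learn the initial state h0.
learning_rate = 0.1
h0 = np.zeros(3)                                   # initial guess for the initial state
for _ in range(100):
    gW_xh, gW_hh, gW_hy, g_h0 = bptt(xs, ys, h0)   # backpropagate all the way to h0
    h0 -= learning_rate * g_h0                     # follow the negative gradient
    # (the weight gradients would be applied in the same way; omitted here)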

Page 14: Deep Learning: Modeling Sequences / RNN

Providing input to recurrent networks

• We can specify inputs in several ways:
• The initial states of all the units.
• The initial states of a subset of the units.
• The states of a subset of the units at every time step (the natural way to model most sequential data).

[Diagram: the unrolled network with shared weights w1, w2, w3, w4, showing where inputs can be injected.]

Page 15: Deep Learning: Modeling Sequences / RNN

Teaching signals for recurrent networks

• We can specify targets in several ways:
• The desired final activities of all the units.
• The desired activities of all the units for the last few time steps.
• The desired activity of a subset of the units.

[Diagram: the unrolled network with shared weights w1, w2, w3, w4, showing where target activities can be attached.]

Page 16: Deep Learning: Modeling Sequences / RNN

A toy example of training an RNN (Lecture 7c)

Page 17: Deep Learning: Modeling Sequences / RNN

The algorithm for binary addition

• Binary addition can be done by a finite state automaton with four states (a sketch of the automaton follows below):
• no carry, print 1
• carry, print 1
• no carry, print 0
• carry, print 0

[Diagram: transitions between the four states, labelled by the pair of bits read from the current column: 0 0, 0 1 / 1 0, or 1 1.]
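A minimal sketch of that automaton: reading one column of bits at a time, the pair (carry, bit to print) is exactly one of the four states above. The function name and the example are illustrative.

def binary_add_fsa(columns):
    # Finite state automaton for binary addition, reading the two numbers
    # one column at a time, least-significant column first. After each
    # column, the pair (carry, printed bit) is one of the four states above.
    carry, printed = 0, []
    for a, b in columns:
        total = a + b + carry
        printed.append(total % 2)   # the state says which bit to print
        carry = total // 2          # and whether to carry into the next column
    return printed

# 011 + 011 = 110, fed least-significant column first: prints 0, 1, 1.
print(binary_add_fsa([(1, 1), (1, 1), (0, 0)]))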

Page 18: Deep Learning: Modeling Sequences / RNN

A recurrent net for binary addition

• 2 input units and 1 output unit, with 3 fully interconnected hidden units in between.
• The desired output at each time step is the sum digit for the column that was provided as input two time steps ago:
• It takes one time step to update the hidden units based on the two input digits.
• It takes another time step for the hidden units to cause the output.
(a sketch of how such training data could be generated follows below)

[Diagram: example input streams 0 0 1 1 0 1 0 0 and 0 1 0 0 1 1 0 1 with target output stream 1 0 0 0 0 0 0 1 over time; the inputs feed the 3 fully interconnected hidden units, which feed the output.]
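A sketch of how training examples with this two-step delay could be generated (the original experiment's code is not available, so everything here is an illustrative assumption): two input bits per step, and a target that is the sum bit for the column shown two steps earlier.

import numpy as np

rng = np.random.default_rng(0)

def make_example(n_columns=8, delay=2):
    # Two streams of input bits plus a target stream: the target at step t is
    # the sum bit for the column that was presented at step t - delay.
    a = rng.integers(0, 2, size=n_columns)
    b = rng.integers(0, 2, size=n_columns)
    sum_bits, carry = [], 0
    for x, y in zip(a, b):                        # least-significant column first
        total = x + y + carry
        sum_bits.append(total % 2)
        carry = total // 2
    inputs = np.stack([a, b], axis=1)             # shape (n_columns, 2): the 2 input units
    targets = np.full(n_columns, -1)              # -1 marks steps with no target yet
    targets[delay:] = sum_bits[:n_columns - delay]
    return inputs, targets

inputs, targets = make_example()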

Page 19: Deep Learning: Modeling Sequences / RNN

What the network learns

• It learns four distinct patterns of activity for its 3 hidden units.
• These patterns correspond to the nodes of the finite state automaton for binary addition.
• The automaton is restricted to being in exactly one state at each time, and the hidden units are restricted to having exactly one vector of activity at each time.
• But with N hidden neurons, the net has 2^N possible binary activity vectors, so its distributed state is far more compact than the one-of-N state of an automaton.

Page 20: Deep Learning: Modeling Sequences / RNN

Why it is difficult to train an RNN (Lecture 7d)

Page 21: Deep Learning: Modeling Sequences / RNN

The backward pass is linear

• In the forward pass, the squashing functions prevent the activity vectors from exploding.
• The backward pass is completely linear: once the forward pass is done, the slope of the tangent at each unit (the blue line in the slide's figure) is fixed, and the error derivatives are repeatedly multiplied by it and by the recurrent weights.
• In an RNN trained on long sequences, the gradients can therefore easily explode or vanish.
• So RNNs have difficulty dealing with long-range dependencies (see the numerical illustration below).
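The effect can be seen numerically: the backward pass repeatedly multiplies the error signal by the transpose of the recurrent weight matrix (times the fixed squashing-function slopes, ignored here), so over many steps it shrinks or grows geometrically. The sizes and weight scales below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n_hid, n_steps = 8, 50
error_signal = rng.normal(size=n_hid)

for scale in [0.1, 1.0]:                 # small vs. large recurrent weights
    W_hh = rng.normal(scale=scale, size=(n_hid, n_hid))
    g = error_signal.copy()
    for _ in range(n_steps):             # the linear backward pass through 50 steps
        g = W_hh.T @ g
    print(scale, np.linalg.norm(g))      # vanishes for the small scale, explodes for the large one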

Page 22: Deep Learning: Modeling Sequences / RNN

Why the back-propagated gradient blows up

• If we start a trajectory within an attractor, small changes in where we start make no difference to where we end up.

• But if we start almost exactly on the boundary, tiny changes can make a huge difference.

Page 23: Deep Learning: Modeling Sequences / RNN

Four effective ways to learn an RNN

• Long Short-Term Memory (discussed in Lecture 7e): designed to remember values for a long time.
• Hessian-free optimization: a fancy optimizer that can detect directions with a tiny gradient but even smaller curvature.
• Echo state networks.
• Good initialization with momentum.

Page 24: Deep Learning: Modeling Sequences / RNN

Long Short-Term Memory (LSTM) (Lecture 7e)

Page 25: Deep Learning: Modeling Sequences / RNN

Long Short-Term Memory

• A memory cell is designed using logistic and linear units with multiplicative interactions.
• Information gets into the cell whenever its write gate is on.
• Information stays in the cell as long as its keep gate is on.
• Information can be read from the cell by turning on its read gate (a sketch follows below).
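A minimal sketch of such a memory cell, using the slide's write/keep/read gate names (these play the roles of the input, forget, and output gates of a standard LSTM): logistic gates multiplicatively control what enters, stays in, and leaves a linear cell. Weight shapes and the exact equations are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_cell_step(x, h, c, p):
    # One step of an LSTM-style memory cell: logistic gates with
    # multiplicative interactions around a linear cell state c.
    z = np.concatenate([x, h])
    write = sigmoid(p["W_write"] @ z)     # lets new information into the cell
    keep = sigmoid(p["W_keep"] @ z)       # keeps the stored value while it is on
    read = sigmoid(p["W_read"] @ z)       # lets the stored value be read out
    candidate = np.tanh(p["W_cell"] @ z)  # information proposed for storage
    c = keep * c + write * candidate      # linear cell: can hold a value for a long time
    h = read * np.tanh(c)                 # what the rest of the network sees
    return h, c

# Illustrative usage with random weights (no biases, for brevity).
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
p = {name: rng.normal(scale=0.5, size=(n_hid, n_in + n_hid))
     for name in ["W_write", "W_keep", "W_read", "W_cell"]}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = memory_cell_step(rng.normal(size=n_in), h, c, p)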

Page 26: Deep Learning: Modeling Sequences / RNN

Reading Cursive Handwriting

• Input: a sequence of (x, y, p) values, the pen coordinates plus the pen status (up or down).
• Output: a sequence of characters.
• Graves & Schmidhuber (2009) showed that RNNs with LSTM are currently the best systems for reading cursive writing.
• Demo: online handwriting recognition by an RNN with Long Short-Term Memory (from Alex Graves).
• Reference: https://www.youtube.com/watch?v=-yX1SYeDHbg