Lecture 7 | 2015. 08. 02 | Do Hoerin

Deep Learning: Modeling Sequences / RNN



Page 1: Deep Learning: Modeling Sequences / RNN

Lecture 7 | 2015. 08. 02 | Do Hoerin

Page 2: Deep Learning: Modeling Sequences / RNN

Modeling sequences: A brief overview (Lecture 7a)

Page 3: Deep Learning: Modeling Sequences / RNN

Getting targets when modeling sequences

• Teach the network by trying to predict the next term in the input sequence.
• This blurs the distinction between supervised and unsupervised learning.

Page 4: Deep Learning: Modeling Sequences / RNN

Memoryless models

• Autoregressive models: predict the next term, input_t, from a fixed number of previous terms (delay taps such as input_{t-2} and input_{t-1}), for example as a weighted average of the individual values or of whole vectors.

• Feed-forward neural nets: generalize autoregressive models by adding one or more non-linear hidden units between the delay taps and the prediction (a sketch of both models follows below).

[Diagram: input_{t-2}, input_{t-1} → input_t for the autoregressive model; input_{t-2}, input_{t-1} → hidden → input_t for the feed-forward net.]
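A minimal sketch (not from the slides) of the two memoryless models above: a linear autoregressive predictor that combines the two delay taps with fixed weights, and a tiny feed-forward net that inserts one layer of non-linear hidden units. All names, sizes, and weights are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def autoregressive_predict(x_prev2, x_prev1, w=(0.5, 0.5)):
    # Memoryless autoregressive model: the prediction is a fixed
    # weighted combination of a window of previous inputs (delay taps).
    return w[0] * x_prev2 + w[1] * x_prev1

def feedforward_predict(x_prev2, x_prev1, W_in, w_out):
    # Feed-forward net: same delay taps, but with one layer of
    # non-linear hidden units between the inputs and the prediction.
    x = np.array([x_prev2, x_prev1])
    hidden = np.tanh(W_in @ x)
    return w_out @ hidden

# Illustrative usage on a toy scalar sequence.
seq = [0.1, 0.4, 0.3, 0.8]
W_in = rng.normal(size=(3, 2))   # 2 delay taps -> 3 hidden units
w_out = rng.normal(size=3)
print(autoregressive_predict(seq[-2], seq[-1]))
print(feedforward_predict(seq[-2], seq[-1], W_in, w_out))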

Page 5: Deep Learning: Modeling Sequences / RNN

Beyond Memoryless Models

• With a hidden state, we get a more interesting kind of model.
• It can store information in its hidden state for a long time.
• If the dynamics is noisy and the way it generates outputs from its hidden state is noisy, we can never know its exact hidden state.
• The best we can do is to infer a probability distribution over the space of hidden state vectors.
• This inference is only tractable for two types of hidden state model.

Page 6: Deep Learning: Modeling Sequences / RNN

Linear Dynamical Systems

[Diagram: a linear dynamical system unrolled over time; at each step a driving input feeds the real-valued hidden state, which produces an output.]

• Real-valued hidden states.
• Linear dynamics with Gaussian noise.
• Driving inputs.
• To predict the next output, we need to infer the hidden state.
• A linearly transformed Gaussian is still Gaussian, so the distribution over the hidden state can be computed exactly using Kalman filtering (a sketch follows below).
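Because a linearly transformed Gaussian stays Gaussian, the posterior over the hidden state can be tracked exactly. Below is a minimal scalar Kalman-filter step (predict, then update); the system coefficients A, B, C and noise variances Q, R are arbitrary illustrative choices, not values from the lecture.

def kalman_step(mu, var, y, A=1.0, B=1.0, C=1.0, Q=0.1, R=0.2, u=0.0):
    # One Kalman-filter step for a scalar linear dynamical system:
    #   h_t = A * h_{t-1} + B * u_t + process noise with variance Q
    #   y_t = C * h_t + observation noise with variance R
    # (mu, var) is the Gaussian belief over the hidden state.
    mu_pred = A * mu + B * u                     # predict: push the Gaussian through the dynamics
    var_pred = A * A * var + Q
    K = var_pred * C / (C * C * var_pred + R)    # Kalman gain
    mu_new = mu_pred + K * (y - C * mu_pred)     # update: condition on the observed output
    var_new = (1.0 - K * C) * var_pred
    return mu_new, var_new

# Illustrative usage: filter a short sequence of noisy observations.
mu, var = 0.0, 1.0
for y in [0.9, 1.1, 1.0]:
    mu, var = kalman_step(mu, var, y)
print(mu, var)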

Page 7: Deep Learning: Modeling Sequences / RNN

Hidden Markov Model

[Diagram: an HMM unrolled over time; at each step the hidden state (one of states A, B, C) produces an output.]

• HMMs have a discrete one-of-N hidden state.
• Transitions between states are stochastic.
• To predict the next output, we need to infer the probability distribution over hidden states (a sketch follows below).

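For an HMM, inferring the distribution over hidden states given the outputs so far is the forward (filtering) algorithm. A minimal sketch with made-up parameters for three states A, B, C and two output symbols; the transition and emission tables are purely illustrative.

import numpy as np

# Illustrative 3-state HMM (states A, B, C) with 2 output symbols.
T = np.array([[0.7, 0.2, 0.1],    # T[i, j] = P(next state j | current state i)
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
E = np.array([[0.9, 0.1],         # E[i, k] = P(output symbol k | state i)
              [0.5, 0.5],
              [0.1, 0.9]])
pi = np.array([1 / 3, 1 / 3, 1 / 3])   # initial state distribution

def forward_filter(outputs):
    # Returns P(state at the last step | all outputs so far):
    # the belief we need in order to predict the next output.
    belief = pi * E[:, outputs[0]]
    belief /= belief.sum()
    for y in outputs[1:]:
        belief = (belief @ T) * E[:, y]   # stochastic transition, then condition on the output
        belief /= belief.sum()
    return belief

print(forward_filter([0, 0, 1]))          # probabilities of being in states A, B, C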

Page 8: Deep Learning: Modeling Sequences / RNN

Limitations of HMMs

• Consider what happens when an HMM generates data: with N hidden states, it can remember only log N bits about what it has generated so far.
• When it generates the second half of an utterance, that half has to fit the first half in syntax, semantics, intonation, accent, rate, volume, the speaker's characteristics, and so on.
• If the first half of an utterance carries 100 bits of information about the second half, an HMM would need 2^100 states to carry it (a quick check follows below).
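The capacity argument in numbers, as a trivial check (nothing here comes from the lecture beyond the figures quoted above): N discrete states carry at most log2(N) bits, so carrying 100 bits forward would need about 2^100 states.

import math

n_states = 1024
print(math.log2(n_states))   # an HMM with 1024 states remembers at most 10 bits
print(2 ** 100)              # states needed to carry 100 bits from the first half of an utterance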

Page 9: Deep Learning: Modeling Sequences / RNN

Recurrent Neural Networks

• RNNs combine two properties:
• A distributed hidden state that can store a lot of information about the past.
• Non-linear dynamics to update that hidden state.
• The updates are deterministic (not stochastic), unlike the two probabilistic models above (a minimal sketch of the dynamics follows below).

[Diagram: an RNN unrolled over time; at each step the input feeds the hidden state, which produces an output and also feeds the next hidden state.]
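A minimal sketch of the dynamics in the diagram: a real-valued, distributed hidden state updated deterministically by a non-linear function of the previous hidden state and the current input. The sizes, the tanh non-linearity, and the linear output are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 3
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input -> hidden
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden -> hidden (the recurrence)
W_hy = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden -> output

def rnn_forward(inputs, h=None):
    # Deterministic, non-linear update of a distributed real-valued hidden state.
    h = np.zeros(n_hidden) if h is None else h
    outputs = []
    for x in inputs:
        h = np.tanh(W_xh @ x + W_hh @ h)   # non-linear dynamics
        outputs.append(W_hy @ h)           # output read off the hidden state
    return outputs, h

outputs, h_final = rnn_forward([rng.normal(size=n_in) for _ in range(5)])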

Page 10: Deep Learning: Modeling Sequences / RNN

Recurrent Neural Network

• What kinds of behaviour can an RNN exhibit?
• It can oscillate.
• It can settle to point attractors.
• It can behave chaotically.
• This computational power also makes RNNs very hard to train (discussed in Lecture 7d).

Page 11: Deep Learning: Modeling Sequences / RNN

Training RNNs with backpropagation (Lecture 7b)

Page 12: Deep Learning: Modeling Sequences / RNN

Training RNNs with backpropagation

• An RNN unrolled in time is just a layered feed-forward net with shared weights (one layer per time step).
• So we can use an ordinary training algorithm in the time domain:
• The forward pass builds up a stack of the activities of all the units at each time step.
• The backward pass peels activities off the stack to compute the error derivatives.
• After the backward pass, we add together the derivatives at all the different time steps for each weight (a sketch follows below).

[Diagram: the same weights w1, w2, w3, w4 replicated at every time step of the unrolled network.]
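A self-contained sketch of backpropagation through time matching the description above: the forward pass stacks the hidden activities, the backward pass peels them off, and the gradients for the shared weights are summed over all time steps. The tanh hidden units, linear outputs, squared error, and all sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 2, 3, 1, 4
W_xh = rng.normal(scale=0.5, size=(n_hid, n_in))
W_hh = rng.normal(scale=0.5, size=(n_hid, n_hid))
W_hy = rng.normal(scale=0.5, size=(n_out, n_hid))

def bptt(xs, ys, h0):
    # Forward pass: build up a stack of the hidden activities.
    hs, preds = [h0], []
    for x in xs:
        hs.append(np.tanh(W_xh @ x + W_hh @ hs[-1]))
        preds.append(W_hy @ hs[-1])
    # Backward pass: peel activities off the stack and accumulate
    # the derivatives for each shared weight over all time steps.
    gW_xh, gW_hh, gW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
    dh_next = np.zeros(n_hid)
    for t in reversed(range(len(xs))):
        dy = preds[t] - ys[t]                  # squared-error gradient at the output
        gW_hy += np.outer(dy, hs[t + 1])
        dh = W_hy.T @ dy + dh_next             # gradient arriving at h_t
        dpre = (1.0 - hs[t + 1] ** 2) * dh     # back through the tanh
        gW_xh += np.outer(dpre, xs[t])         # same weights at every step: add them up
        gW_hh += np.outer(dpre, hs[t])
        dh_next = W_hh.T @ dpre
    return gW_xh, gW_hh, gW_hy, dh_next        # dh_next is also dLoss/d(initial state)

xs = [rng.normal(size=n_in) for _ in range(T)]
ys = [rng.normal(size=n_out) for _ in range(T)]
gW_xh, gW_hh, gW_hy, g_h0 = bptt(xs, ys, h0=np.zeros(n_hid))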

Page 13: Deep Learning: Modeling Sequences / RNN

An irritating extra issue

• We need to specify the initial state of all the units.
• It is better to treat the initial states as learned parameters, learned in the same way as the weights:
• Start off with an initial random guess for the initial states.
• At the end of each training sequence, backpropagate through time all the way to the initial states to get the gradient of the error function with respect to each initial state.
• Adjust the initial states by following the negative gradient (a sketch follows below).
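Continuing the BPTT sketch above (so numpy, bptt, xs, and ys are assumed to be in scope): the same backward pass already delivers the gradient with respect to the initial state, so the initial state can be adjusted like any other parameter. The learning rate and loop length are arbitrary.

# Hypothetical continuation of the BPTT sketch above: learn the initial state h0.
learning_rate = 0.1
h0 = np.zeros(3)                                   # initial guess for the initial state
for _ in range(100):
    gW_xh, gW_hh, gW_hy, g_h0 = bptt(xs, ys, h0)   # backpropagate all the way to h0
    h0 -= learning_rate * g_h0                     # follow the negative gradient
    # (the weight gradients would be applied in the same way; omitted here)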

Page 14: Deep Learning: Modeling Sequences / RNN

Providing input to recurrent networks

• We can specify inputs in several ways:
• The initial states of all the units.
• The initial states of a subset of the units.
• The states of a subset of the units at every time step (the natural way to model most sequential data).

[Diagram: the unrolled network with shared weights w1, w2, w3, w4, showing where inputs can be injected.]

Page 15: Deep Learning: Modeling Sequences / RNN

Teaching signals for recurrent networks

• We can specify targets in several ways:
• The desired final activities of all the units.
• The desired activities of all the units for the last few time steps.
• The desired activity of a subset of the units.

[Diagram: the unrolled network with shared weights w1, w2, w3, w4, showing where target activities can be attached.]

Page 16: Deep Learning: Modeling Sequences / RNN

A toy example of training an RNN (Lecture 7c)

Page 17: Deep Learning: Modeling Sequences / RNN

The algorithm for binary addition

• Binary addition can be done by a finite state automaton with four states (a sketch of the automaton follows below):
• no carry, print 1
• carry, print 1
• no carry, print 0
• carry, print 0

[Diagram: transitions between the four states, labelled by the pair of bits read from the current column: 0 0, 0 1 / 1 0, or 1 1.]
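A minimal sketch of that automaton: reading one column of bits at a time, the pair (carry, bit to print) is exactly one of the four states above. The function name and the example are illustrative.

def binary_add_fsa(columns):
    # Finite state automaton for binary addition, reading the two numbers
    # one column at a time, least-significant column first. After each
    # column, the pair (carry, printed bit) is one of the four states above.
    carry, printed = 0, []
    for a, b in columns:
        total = a + b + carry
        printed.append(total % 2)   # the state says which bit to print
        carry = total // 2          # and whether to carry into the next column
    return printed

# 011 + 011 = 110, fed least-significant column first: prints 0, 1, 1.
print(binary_add_fsa([(1, 1), (1, 1), (0, 0)]))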

Page 18: Deep Learning: Modeling Sequences / RNN

A recurrent net for binary addition

• 2 input units and 1 output unit, with 3 fully interconnected hidden units in between.
• The desired output at each time step is the sum digit for the column that was provided as input two time steps ago:
• It takes one time step to update the hidden units based on the two input digits.
• It takes another time step for the hidden units to cause the output.
(a sketch of how such training data could be generated follows below)

[Diagram: example input streams 0 0 1 1 0 1 0 0 and 0 1 0 0 1 1 0 1 with target output stream 1 0 0 0 0 0 0 1 over time; the inputs feed the 3 fully interconnected hidden units, which feed the output.]
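A sketch of how training examples with this two-step delay could be generated (the original experiment's code is not available, so everything here is an illustrative assumption): two input bits per step, and a target that is the sum bit for the column shown two steps earlier.

import numpy as np

rng = np.random.default_rng(0)

def make_example(n_columns=8, delay=2):
    # Two streams of input bits plus a target stream: the target at step t is
    # the sum bit for the column that was presented at step t - delay.
    a = rng.integers(0, 2, size=n_columns)
    b = rng.integers(0, 2, size=n_columns)
    sum_bits, carry = [], 0
    for x, y in zip(a, b):                        # least-significant column first
        total = x + y + carry
        sum_bits.append(total % 2)
        carry = total // 2
    inputs = np.stack([a, b], axis=1)             # shape (n_columns, 2): the 2 input units
    targets = np.full(n_columns, -1)              # -1 marks steps with no target yet
    targets[delay:] = sum_bits[:n_columns - delay]
    return inputs, targets

inputs, targets = make_example()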

Page 19: Deep Learning: Modeling Sequences / RNN

What the network learns

• It learns four distinct patterns of activity for its 3 hidden units.
• These patterns correspond to the nodes of the finite state automaton for binary addition.
• The automaton is restricted to being in exactly one state at each time, and the hidden units are restricted to having exactly one vector of activity at each time.
• But with N hidden neurons, the net has 2^N possible binary activity vectors, so its distributed state is far more compact than the one-of-N state of an automaton.

Page 20: Deep Learning: Modeling Sequences / RNN

Why it is difficult to train an RNN (Lecture 7d)

Page 21: Deep Learning: Modeling Sequences / RNN

The backward pass is linear

• In the forward pass, the squashing functions prevent the activity vectors from exploding.
• The backward pass is completely linear: once the forward pass is done, the slope of the tangent at each unit (the blue line in the slide's figure) is fixed, and the error derivatives are repeatedly multiplied by it and by the recurrent weights.
• In an RNN trained on long sequences, the gradients can therefore easily explode or vanish.
• So RNNs have difficulty dealing with long-range dependencies (see the numerical illustration below).
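The effect can be seen numerically: the backward pass repeatedly multiplies the error signal by the transpose of the recurrent weight matrix (times the fixed squashing-function slopes, ignored here), so over many steps it shrinks or grows geometrically. The sizes and weight scales below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n_hid, n_steps = 8, 50
error_signal = rng.normal(size=n_hid)

for scale in [0.1, 1.0]:                 # small vs. large recurrent weights
    W_hh = rng.normal(scale=scale, size=(n_hid, n_hid))
    g = error_signal.copy()
    for _ in range(n_steps):             # the linear backward pass through 50 steps
        g = W_hh.T @ g
    print(scale, np.linalg.norm(g))      # vanishes for the small scale, explodes for the large one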

Page 22: Deep Learning: Modeling Sequences / RNN

Why the back-propagated gradient blows up

• If we start a trajectory within an attractor, small changes in where we start make no difference to where we end up.

• But if we start almost exactly on the boundary, tiny changes can make a huge difference.

Page 23: Deep Learning: Modeling Sequences / RNN

Four effective ways to learn an RNN

• Long Short-Term Memory (discussed in Lecture 7e): designed to remember values for a long time.
• Hessian-free optimization: a fancy optimizer that can detect directions with a tiny gradient but even smaller curvature.
• Echo state networks.
• Good initialization with momentum.

Page 24: Deep Learning: Modeling Sequences / RNN

Long Short-Term Memory (LSTM) (Lecture 7e)

Page 25: Deep Learning: Modeling Sequences / RNN

Long Short-Term Memory

• A memory cell is designed using logistic and linear units with multiplicative interactions.
• Information gets into the cell whenever its write gate is on.
• Information stays in the cell as long as its keep gate is on.
• Information can be read from the cell by turning on its read gate (a sketch follows below).
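A minimal sketch of such a memory cell, using the slide's write/keep/read gate names (these play the roles of the input, forget, and output gates of a standard LSTM): logistic gates multiplicatively control what enters, stays in, and leaves a linear cell. Weight shapes and the exact equations are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_cell_step(x, h, c, p):
    # One step of an LSTM-style memory cell: logistic gates with
    # multiplicative interactions around a linear cell state c.
    z = np.concatenate([x, h])
    write = sigmoid(p["W_write"] @ z)     # lets new information into the cell
    keep = sigmoid(p["W_keep"] @ z)       # keeps the stored value while it is on
    read = sigmoid(p["W_read"] @ z)       # lets the stored value be read out
    candidate = np.tanh(p["W_cell"] @ z)  # information proposed for storage
    c = keep * c + write * candidate      # linear cell: can hold a value for a long time
    h = read * np.tanh(c)                 # what the rest of the network sees
    return h, c

# Illustrative usage with random weights (no biases, for brevity).
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
p = {name: rng.normal(scale=0.5, size=(n_hid, n_in + n_hid))
     for name in ["W_write", "W_keep", "W_read", "W_cell"]}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = memory_cell_step(rng.normal(size=n_in), h, c, p)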

Page 26: Deep Learning: Modeling Sequences / RNN

Reading Cursive Handwriting

• Input: a sequence of (x, y, p) values, the pen coordinates plus the pen status (up or down).
• Output: a sequence of characters.
• Graves & Schmidhuber (2009) showed that RNNs with LSTM are currently the best systems for reading cursive writing.
• Demo: online handwriting recognition by an RNN with Long Short-Term Memory (from Alex Graves).
• Reference: https://www.youtube.com/watch?v=-yX1SYeDHbg