
  • On Recurrent and Deep Neural Networks

    Razvan Pascanu
    Advisor: Yoshua Bengio

    PhD Defence, Université de Montréal, LISA lab

    September 2014

  • Motivation

    “A computer once beat me at chess, but it was no match for me at kick boxing”

    — Emo Phillips

    Studying the mechanism behind learning provides a meta-solution for solving tasks.

  • Supervised Learning

    - $f_{\mathcal{F}} : \Theta \times D \to T$
    - $f_\theta(x) = f_{\mathcal{F}}(\theta, x)$
    - $f^\star = \arg\min_{\theta \in \Theta} \mathbb{E}_{x,t \sim \pi}\,[d(f_\theta(x), t)]$

  • Supervised Learing

    I fF : Θ× D→ TI fθ(x) = fF(θ, x)

    I f ? = arg minθ∈Θ EEx,t∼π [d(fθ(x), t)]

    Pascanu On Recurrent and Deep Neural Networks 3/ 38

  • Supervised Learing

    I fF : Θ× D→ TI fθ(x) = fF(θ, x)I f ? = arg minθ∈Θ EEx,t∼π [d(fθ(x), t)]

    Pascanu On Recurrent and Deep Neural Networks 3/ 38

  • Optimization for learning

    [Figure: one step of an iterative optimizer, moving from θ[k] to θ[k+1]]

    (A minimal gradient-descent sketch for the objective on the previous slide follows below.)
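
    A minimal sketch (all names hypothetical) tying the last two slides together: the empirical risk $\mathbb{E}_{x,t\sim\pi}[d(f_\theta(x),t)]$ for a linear $f_\theta$ with squared error $d$, minimized by the iterative update from θ[k] to θ[k+1].

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                                      # samples x ~ pi
    t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)    # targets

    def f(theta, x):              # f_theta(x) = f_F(theta, x); here the family F is linear
        return x @ theta

    def risk(theta):              # empirical estimate of E_{x,t}[ d(f_theta(x), t) ]
        return np.mean((f(theta, X) - t) ** 2)

    theta, lr = np.zeros(3), 0.05
    for k in range(200):          # theta[k] -> theta[k+1]
        grad = 2.0 * X.T @ (f(theta, X) - t) / len(t)
        theta = theta - lr * grad

    print(round(risk(theta), 4))
    ```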

  • Neural networks

    [Figure: feedforward network with an input layer, first, second, ..., last hidden layers, output neurons, and a bias = 1 unit]

  • Recurrent neural networks

    [Figure: (a) feedforward network (input layer, hidden layers, output neurons, bias = 1); (b) recurrent network (input layer, recurrent layer, output neurons, bias = 1). A minimal forward-pass sketch for both follows below.]
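
    A minimal sketch of the two computations in the figure, with made-up sizes: the feedforward network applies a fixed stack of layers once per input, while the recurrent network reuses one layer at every time step and feeds its own previous state back in.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    sigma = np.tanh                                     # any elementwise nonlinearity

    # feedforward: input layer -> two hidden layers -> output neurons (biases play the "bias = 1" role)
    W1, W2, W3 = rng.normal(size=(5, 8)), rng.normal(size=(8, 8)), rng.normal(size=(8, 2))
    b1, b2, b3 = np.zeros(8), np.zeros(8), np.zeros(2)

    def feedforward(x):
        h1 = sigma(x @ W1 + b1)
        h2 = sigma(h1 @ W2 + b2)
        return h2 @ W3 + b3

    # recurrent: the same recurrent layer is applied at every time step
    W_in, W_rec, W_out = rng.normal(size=(5, 8)), 0.1 * rng.normal(size=(8, 8)), rng.normal(size=(8, 2))

    def recurrent(xs):                                  # xs: a sequence of inputs, shape (T, 5)
        h, ys = np.zeros(8), []
        for x in xs:
            h = sigma(x @ W_in + h @ W_rec)             # h(t) depends on h(t-1)
            ys.append(h @ W_out)
        return np.array(ys)

    print(feedforward(rng.normal(size=5)).shape)        # (2,)
    print(recurrent(rng.normal(size=(7, 5))).shape)     # (7, 2)
    ```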

  • On the number of linear regions of Deep Neural Networks

    Razvan Pascanu, Guido Montufar, Kyunghyun Cho and Yoshua Bengio

    International Conference on Learning Representations 2014
    Submitted to the Conference on Neural Information Processing Systems 2014

  • Big picture

    - $\mathrm{rect}(x) = \begin{cases} 0, & x < 0 \\ x, & x \geq 0 \end{cases}$
    - Idea: the composition of piece-wise linear functions is itself a piece-wise linear function
    - Approach: count the number of pieces for a deep versus a shallow model (a small counting sketch follows below)
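
    A small sketch of the counting idea, with random weights chosen purely for illustration: on a 1-D input, every linear piece of a rect network corresponds to a distinct pattern of active units, so counting distinct patterns along a dense grid counts the pieces of a one-layer versus a two-layer model.

    ```python
    import numpy as np

    rect = lambda z: np.maximum(z, 0.0)
    rng = np.random.default_rng(1)

    # two rect layers of 4 units each on a scalar input (illustrative weights)
    W1, b1 = rng.normal(size=4), rng.normal(size=4)
    W2, b2 = rng.normal(size=(4, 4)), rng.normal(size=4)

    def activation_pattern(x, depth):
        """Which rect units are active at each input x; one pattern per linear region."""
        a1 = np.outer(x, W1) + b1                 # layer-1 pre-activations, shape (n, 4)
        pattern = a1 > 0
        if depth == 2:
            a2 = rect(a1) @ W2 + b2               # layer-2 pre-activations
            pattern = np.concatenate([pattern, a2 > 0], axis=1)
        return pattern

    def count_regions(depth, lo=-5.0, hi=5.0, n=200001):
        p = activation_pattern(np.linspace(lo, hi, n), depth)
        return len(np.unique(p, axis=0))          # distinct joint patterns = linear regions hit by the grid

    print(count_regions(depth=1), count_regions(depth=2))
    ```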

  • Single Layer models

    [Figure: arrangements of hyperplanes L1, L2 (and L3) splitting the input space into regions R∅, R1, R2, R12, R13, R23, R123]

    Zaslavsky's Theorem (1975): an arrangement of $n_{hid}$ hyperplanes in $\mathbb{R}^{n_{inp}}$ cuts the input space into at most $\sum_{s=0}^{n_{inp}} \binom{n_{hid}}{s}$ regions (evaluated numerically below).
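
    The bound itself is a one-liner. A sketch evaluating $\sum_{s=0}^{n_{inp}} \binom{n_{hid}}{s}$ for a few widths, using the slide's notation:

    ```python
    from math import comb

    def max_regions_single_layer(n_inp, n_hid):
        """Zaslavsky's bound on the regions cut out of R^{n_inp} by n_hid hyperplanes."""
        return sum(comb(n_hid, s) for s in range(n_inp + 1))

    for n_hid in (2, 4, 8, 16):
        print(n_hid, max_regions_single_layer(n_inp=2, n_hid=n_hid))
    # for a fixed input dimension the count grows only polynomially with the number of hidden units
    ```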

  • Multi-Layer models: how would it work?

    [Figure: the input space (axes x0 and x1) partitioned by the first layer into regions PS1, PS2, PS3, PS4]

  • Multi-Layer models: how would it work?

    [Figure: the first layer maps the input-space regions S1, S2, S3, S4 onto a common first-layer space (images S'1, S'2, S'3, S'4), which the second layer then partitions again. Panels: Input Space, First Layer Space, Second Layer Space. Steps: 1. fold along the horizontal axis; 2. fold along the vertical axis.]

    (A one-dimensional folding sketch follows below.)
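
    A minimal one-dimensional sketch of the folding argument (the functions are invented for illustration, not taken from the paper): a pair of rect units computes |x|, which maps the negative and positive halves of the input onto the same range, so every piece the next layer creates shows up on both sides.

    ```python
    import numpy as np

    rect = lambda z: np.maximum(z, 0.0)
    fold = lambda x: rect(x) + rect(-x)                   # = |x|: folds the input space in half
    g = lambda h: rect(h - 1.0) - 2.0 * rect(h - 2.0)     # a toy second layer with kinks at h = 1 and h = 2

    def local_slope(f, x0, eps=1e-3):
        x = np.array([x0, x0 + eps])
        y = f(x)
        return float((y[1] - y[0]) / eps)

    # slope of g(fold(x)) well inside each interval of the input space
    for x0 in (-3.0, -1.5, 0.0, 1.5, 3.0):
        print(x0, round(local_slope(lambda x: g(fold(x)), x0), 3))
    # g has 3 pieces as a function of h, but g(fold(x)) has 5 pieces as a function of x:
    # the non-constant pieces of g each appear twice, mirrored around 0.
    ```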

  • Multi-Layer models: how would it work?

  • Visualizing units

  • Revisiting Natural Gradient for Deep Networks

    Razvan Pascanu and Yoshua Bengio

    International Conference on Learning Representations 2014

  • Gist of this work

    - Natural Gradient is a generalized Trust Region method (a toy natural-gradient step is sketched below)
    - Hessian-Free Optimization is Natural Gradient [1]
    - Using the Empirical Fisher (TONGA) is not equivalent to the same trust region method as natural gradient
    - Natural Gradient can be accelerated if we add second order information of the error
    - Natural Gradient can use unlabeled data
    - Natural Gradient is more robust to changes in the order of the training set

    [1] For particular pairs of activation functions and error functions.
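
    A toy sketch of a single natural-gradient step for softmax regression, not the thesis' implementation: the Fisher matrix is built from gradients whose labels are drawn from the model's own predictive distribution (using the training labels instead would give the empirical Fisher of TONGA), and a damped linear solve plays the role of $F^{-1}\nabla L$. Real implementations avoid forming $F$ and use truncated iterative solvers.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 256, 5, 3
    X = rng.normal(size=(n, d))
    y = rng.integers(0, k, size=n)
    theta = np.zeros((d, k))

    def softmax(Z):
        Z = Z - Z.max(axis=1, keepdims=True)
        E = np.exp(Z)
        return E / E.sum(axis=1, keepdims=True)

    def per_example_grads(theta, X, labels):
        """Gradients of -log p(label | x) for softmax regression, one flattened row per example."""
        P = softmax(X @ theta)                           # (n, k)
        E = P.copy()
        E[np.arange(len(labels)), labels] -= 1.0         # d(-log p)/d(logits)
        return (X[:, :, None] * E[:, None, :]).reshape(len(labels), -1)

    # Fisher: outer products of gradients with labels sampled from the model itself
    P = softmax(X @ theta)
    y_model = np.array([rng.choice(k, p=p) for p in P])
    G = per_example_grads(theta, X, y_model)
    F = G.T @ G / n + 1e-3 * np.eye(d * k)               # damping keeps the solve well posed

    # natural-gradient step: theta <- theta - lr * F^{-1} grad
    grad = per_example_grads(theta, X, y).mean(axis=0)
    theta = theta - 0.1 * np.linalg.solve(F, grad).reshape(d, k)
    print(np.round(theta[:, 0], 3))
    ```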

  • On the saddle point problem for non-convex optimization

    Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli and Yoshua Bengio

    Submitted to the Conference on Neural Information Processing Systems 2014

  • Existing evidence

    - Statistical physics (on random Gaussian fields)

    [Figure: error as a function of the index of a critical point, and the corresponding distribution of eigenvalues]

  • Existing evidence

    - Empirical evidence

    [Figure: train error ε (%) versus index of critical point α; eigenvalue distribution p(λ) at critical points of networks with train error 0.32%, 23.49% and 28.23%]

  • Problem

    - Saddle points are attractors of second-order dynamics

    [Figure: trajectories of Newton, SFN and SGD from a common starting point (START) near a saddle point]

  • Solution

    $$\arg\min_{\Delta\theta} \; \mathcal{T}_1\{L(\theta)\} \quad \text{s.t.} \quad \left\| \mathcal{T}_2\{L(\theta)\} - \mathcal{T}_1\{L(\theta)\} \right\| \leq \epsilon$$

    Using Lagrange multipliers:

    $$\Delta\theta = -\,|\mathbf{H}|^{-1}\,\frac{\partial L(\theta)}{\partial \theta}$$

    (A toy implementation of this saddle-free step follows below.)
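
    A toy transcription of the resulting update, assuming we can afford a full eigendecomposition (the paper works in a low-dimensional Krylov subspace instead): replace the Hessian's eigenvalues by their absolute values and take a Newton-like step. On a pure saddle this moves away from the critical point instead of converging to it.

    ```python
    import numpy as np

    def saddle_free_step(grad, hessian, damping=1e-3):
        """Delta-theta = -|H|^{-1} grad, with |H| built from the absolute eigenvalues."""
        eigval, eigvec = np.linalg.eigh(hessian)
        abs_h = eigvec @ np.diag(np.abs(eigval) + damping) @ eigvec.T
        return -np.linalg.solve(abs_h, grad)

    # a 2-D saddle: L(u, v) = u^2 - v^2, with a critical point at the origin
    theta = np.array([0.5, 0.5])
    for _ in range(5):
        grad = np.array([2.0 * theta[0], -2.0 * theta[1]])
        hess = np.array([[2.0, 0.0], [0.0, -2.0]])
        theta = theta + saddle_free_step(grad, hess)
    print(np.round(theta, 3))   # the u coordinate is driven to 0 while v escapes the saddle
    ```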

  • Experiments

    [Figure: CIFAR-10 experiments. Train error ε (%) versus number of hidden units (5, 25, 50), train error ε (%) versus number of epochs, and |most negative λ| versus number of epochs, for MSGD, Damped Newton and SFN]

  • Experiments

    [Figure: Deep Autoencoder and Recurrent Neural Network training curves, comparing MSGD and SFN]

  • A Neurodynamical Model for Working Memory

    Razvan Pascanu, Herbert Jaeger

    Neural Networks, 2011

  • Gist of this work

    [Figure: input units (u) driving a reservoir (x), with output units (y) and working-memory (WM) units (m). A generic echo-state sketch follows below.]
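
    A generic echo-state sketch with the four groups of units from the figure; it is not the paper's exact model. The reservoir x is fixed and random, the readouts for y and m would normally be trained (here they are just random), and the assumption that the WM units m feed back into the reservoir is what lets them hold state.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_res, n_out, n_wm = 3, 100, 2, 6

    W_in = rng.uniform(-0.1, 0.1, size=(n_res, n_in))
    W_res = rng.normal(size=(n_res, n_res))
    W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # spectral radius below 1 (echo-state property)
    W_fb = rng.uniform(-0.5, 0.5, size=(n_res, n_wm))         # feedback from the WM units (assumption)
    W_out = 0.01 * rng.normal(size=(n_out, n_res))            # readout, normally fit by ridge regression
    W_wm = 0.01 * rng.normal(size=(n_wm, n_res))              # WM readout, likewise

    x, m = np.zeros(n_res), np.zeros(n_wm)
    for t in range(50):
        u = rng.normal(size=n_in)                             # input units
        x = np.tanh(W_in @ u + W_res @ x + W_fb @ m)          # reservoir update
        y = W_out @ x                                         # output units
        m = np.sign(W_wm @ x)                                 # WM units, thresholded to +/-1
    print(np.round(y, 3), m)
    ```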

  • On the difficulty of training recurrent neural networks

    Razvan Pascanu, Tomas Mikolov, Yoshua Bengio

    International Conference on Machine Learning 2013

  • The exploding gradients problem

    [Figure: a recurrent network unrolled in time, with inputs x(t-1), x(t), x(t+1), hidden states h(t-1), h(t), h(t+1), costs C(t-1), C(t), C(t+1), and the backpropagated factors ∂C(t)/∂h(t) and ∂h(t)/∂h(t-1)]

    $$\frac{\partial C}{\partial W} = \sum_t \frac{\partial C(t)}{\partial W} = \sum_t \sum_{k=0}^{t} \frac{\partial C(t)}{\partial h(t)} \, \frac{\partial h(t)}{\partial h(t-k)} \, \frac{\partial h(t-k)}{\partial W}$$

    $$\frac{\partial h(t)}{\partial h(t-k)} = \prod_{j=k+1}^{t} \frac{\partial h(j)}{\partial h(j-1)}$$

    (The behaviour of this Jacobian product is illustrated numerically below.)
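
    A numerical illustration with hypothetical weights: for $h(t) = \tanh(W_{rec}\, h(t-1) + W_{in}\, x(t))$ each factor $\partial h(j)/\partial h(j-1)$ equals $\mathrm{diag}(1 - h(j)^2)\, W_{rec}$, so around the origin (where the tanh derivative is 1) the norm of the product grows or shrinks geometrically with the spectral radius of $W_{rec}$: it explodes above 1 and vanishes below it, and saturation only shrinks the product further.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n = 20

    def jacobian_product_norm(radius, T=50):
        """|| prod_j dh(j)/dh(j-1) || along the trajectory h = 0 (zero input)."""
        W = rng.normal(size=(n, n))
        W_rec = radius * W / np.max(np.abs(np.linalg.eigvals(W)))   # fix the spectral radius
        J = np.eye(n)
        for _ in range(T):
            h = np.zeros(n)                          # tanh(W_rec @ 0) = 0, so diag(1 - h**2) = I
            J = np.diag(1.0 - h ** 2) @ W_rec @ J    # one factor dh(j)/dh(j-1)
        return np.linalg.norm(J)

    for radius in (0.5, 1.0, 1.5):
        print(radius, f"{jacobian_product_norm(radius):.2e}")
    ```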

  • Possible geometric interpretation and norm clipping

    Classical view:

    [Figure: the error plotted against the parameters θ. The error is $(h(50) - 0.7)^2$ for $h(t) = w\,\sigma(h(t-1)) + b$ with $h(0) = 0.5$.]

    (A transcription of the norm-clipping rule follows below.)
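
    The norm-clipping rule itself is only a few lines: if the gradient norm exceeds a threshold, rescale the gradient to that threshold so a step taken near a steep region stays bounded.

    ```python
    import numpy as np

    def clip_gradient(grad, threshold=1.0):
        """Rescale grad so that its norm never exceeds threshold."""
        norm = np.linalg.norm(grad)
        if norm > threshold:
            grad = grad * (threshold / norm)
        return grad

    g = np.array([30.0, -40.0])                 # an "exploded" gradient with norm 50
    print(clip_gradient(g, threshold=5.0))      # [ 3. -4.], norm 5
    ```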

  • The vanishing gradients problem

    [The same unrolled network and gradient factorization as on the exploding-gradients slide:]

    $$\frac{\partial C}{\partial W} = \sum_t \frac{\partial C(t)}{\partial W} = \sum_t \sum_{k=0}^{t} \frac{\partial C(t)}{\partial h(t)} \, \frac{\partial h(t)}{\partial h(t-k)} \, \frac{\partial h(t-k)}{\partial W}, \qquad \frac{\partial h(t)}{\partial h(t-k)} = \prod_{j=k+1}^{t} \frac{\partial h(j)}{\partial h(j-1)}$$

  • Regularization term

    $$\Omega = \sum_k \Omega_k = \sum_k \left( \frac{\left\| \frac{\partial C}{\partial h_{k+1}} \frac{\partial h_{k+1}}{\partial h_k} \right\|}{\left\| \frac{\partial C}{\partial h_{k+1}} \right\|} - 1 \right)^2$$

    (A direct numpy transcription follows below.)
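
    A direct numpy transcription of Ω, assuming the per-step quantities are already available from backpropagation (in practice the regularizer is built inside the training graph so it can itself be differentiated):

    ```python
    import numpy as np

    def omega(dC_dh, J):
        """Omega = sum_k ( ||dC/dh_{k+1} . dh_{k+1}/dh_k|| / ||dC/dh_{k+1}|| - 1 )^2.
        dC_dh: array (T, n), gradients of the cost w.r.t. each hidden state h_k
        J:     array (T-1, n, n), with J[k] = dh_{k+1}/dh_k."""
        total = 0.0
        for k in range(len(J)):
            num = np.linalg.norm(dC_dh[k + 1] @ J[k])
            den = np.linalg.norm(dC_dh[k + 1])
            total += (num / den - 1.0) ** 2
        return total

    rng = np.random.default_rng(0)
    T, n = 10, 4
    print(round(omega(rng.normal(size=(T, n)), rng.normal(size=(T - 1, n, n))), 3))
    ```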

  • Temporal Order

    Important symbols: A, B    Distractor symbols: c, d, e, f

    Each sequence contains exactly two important symbols among distractors, and the target is their ordered pair (segment lengths marked on the first example: T/10, 4T/10, T/10, 4T/10):

    de..fAef (T/10)  ccefc..e (4T/10)  fAef..e (T/10)  ef..c (4T/10)  → AA
    edefcAccfef..ceceBedef..fedef → AB
    feBefccde..efddcAfccee..cedcd → BA
    Bfffede..cffecdBedfd..cedfedc → BB

    (A small generator for this task is sketched below.)
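
    A hedged generator for this task, assuming the segment layout above (one important symbol placed uniformly inside each of the two short segments, distractors everywhere else); the exact position ranges used in the paper may differ slightly.

    ```python
    import random

    IMPORTANT, DISTRACTORS = "AB", "cdef"

    def temporal_order_example(T=100, seed=0):
        """One sequence of length T with two important symbols; the target is their ordered pair."""
        rng = random.Random(seed)
        seq = [rng.choice(DISTRACTORS) for _ in range(T)]
        seg = T // 10
        first, second = rng.choice(IMPORTANT), rng.choice(IMPORTANT)
        seq[rng.randrange(0, seg)] = first               # inside the first short segment (length T/10)
        seq[rng.randrange(5 * seg, 6 * seg)] = second    # inside the second short segment
        return "".join(seq), first + second

    sequence, target = temporal_order_example()
    print(target, sequence)
    ```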

  • Results - Temporal order task

    [Figure: rate of success versus sequence length (50 to 250) for sigmoid units, comparing MSGD, MSGD-C and MSGD-CR]

  • Results - Temporal order task

    [Figure: rate of success versus sequence length (50 to 250) for basic tanh units, comparing MSGD, MSGD-C and MSGD-CR]

  • Results - Temporal order task

    [Figure: rate of success versus sequence length (50 to 250) for smart tanh units, comparing MSGD, MSGD-C and MSGD-CR]

  • Results - Natural tasks

  • How to construct Deep Recurrent Neural Networks

    Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Yoshua Bengio

    International Conference on Learning Representations 2014

  • Gist of this work

    [Figure: operator view of a standard RNN (x_t and h_{t-1} feeding h_t, which feeds y_t) and its deep variants: DT-RNN, DOT-RNN, stacked RNNs, DOT(s)-RNN. A sketch of a deep transition follows below.]
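
    A sketch of what "deep transition" and "deep output" mean, with invented sizes; the exact parameterizations in the paper differ. In a conventional RNN the map from (h_{t-1}, x_t) to h_t is a single layer; a DT-RNN makes that map a small MLP, and a DOT-RNN additionally deepens the map from h_t to y_t.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_x, n_h, n_z, n_y = 4, 8, 16, 2
    sigma = np.tanh

    # conventional transition: one layer from (h_{t-1}, x_t) to h_t
    W, U = 0.1 * rng.normal(size=(n_h, n_h)), rng.normal(size=(n_h, n_x))
    shallow_transition = lambda h_prev, x: sigma(W @ h_prev + U @ x)

    # deep transition (DT-RNN flavour): an intermediate layer between h_{t-1} and h_t
    W1, U1 = 0.1 * rng.normal(size=(n_z, n_h)), rng.normal(size=(n_z, n_x))
    W2 = 0.1 * rng.normal(size=(n_h, n_z))

    def deep_transition(h_prev, x):
        z = sigma(W1 @ h_prev + U1 @ x)
        return sigma(W2 @ z)

    # deep output (the "O" in DOT-RNN): an extra layer between h_t and y_t
    V1, V2 = rng.normal(size=(n_z, n_h)), rng.normal(size=(n_y, n_z))
    deep_output = lambda h: V2 @ sigma(V1 @ h)

    h = np.zeros(n_h)
    for x in rng.normal(size=(5, n_x)):
        h = deep_transition(h, x)
    print(np.round(deep_output(h), 3))
    print(np.round(shallow_transition(h, rng.normal(size=n_x))[:3], 3))
    ```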

  • Overview of contributions

    - The efficiency of deep feedforward models with piece-wise linear activation functions
    - The relationship between a few optimization techniques for deep learning, with a focus on understanding natural gradient
    - The importance of saddle points for optimization algorithms when applied to deep learning
    - Training Echo-State Networks to exhibit short term memory
    - Training Recurrent Networks with gradient based methods to exhibit short term memory
    - How one can construct deep Recurrent Networks

  • Thank you!