On Recurrent and Deep Neural Networks
Razvan Pascanu
Advisor: Yoshua Bengio
PhD Defence, Université de Montréal, LISA lab
September 2014
Motivation
“A computer once beat me at chess, but it was no match for me at kick boxing”
— Emo Phillips
Studying the mechanism behind learning provides a meta-solution for solving tasks.
Supervised Learning

- $f_{\mathcal{F}} : \Theta \times \mathcal{D} \to \mathcal{T}$
- $f_\theta(x) = f_{\mathcal{F}}(\theta, x)$
- $f^\star = \arg\min_{\theta \in \Theta} \; \mathbb{E}_{x,t \sim \pi}\left[ d(f_\theta(x), t) \right]$
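As a toy illustration of this setup (not from the thesis; the data and model family are hypothetical), take a one-parameter family $f_\theta(x) = \theta x$ with the squared distance $d$, and approximate the expectation over $\pi$ by an average over samples:

```python
import numpy as np

# Toy instance of the setup above: f_theta(x) = theta * x, d = squared error,
# with the expectation over pi replaced by an average over samples drawn from it.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
t = 3.0 * x + 0.1 * rng.normal(size=1000)   # hypothetical data distribution pi

def risk(theta):
    # Monte-Carlo estimate of E_{x,t ~ pi}[ d(f_theta(x), t) ]
    return np.mean((theta * x - t) ** 2)

# Approximate f* = argmin over Theta by scanning a grid of candidate parameters
thetas = np.linspace(-5.0, 5.0, 1001)
theta_star = thetas[np.argmin([risk(th) for th in thetas])]
print(theta_star)  # close to the generating value 3.0
```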
Optimization for learning
[Figure: one optimization step updating the parameters from θ[k] to θ[k+1]]
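A minimal sketch of one such update, assuming plain gradient descent with a fixed learning rate (only one of the update rules discussed later in the talk):

```python
def gradient_step(theta_k, grad_fn, lr=0.1):
    """One update theta[k] -> theta[k+1] following the negative gradient."""
    return theta_k - lr * grad_fn(theta_k)

# Example on L(theta) = (theta - 2)^2, whose gradient is 2 * (theta - 2)
grad = lambda th: 2.0 * (th - 2.0)
theta = 0.0
for _ in range(50):
    theta = gradient_step(theta, grad)
print(theta)  # approaches the minimiser at theta = 2
```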
Neural networks
[Figure: a feedforward network with an input layer, first, second and last hidden layers, output neurons, and a bias unit fixed to 1]
Recurrent neural networks
[Figure: (a) a feedforward network with an input layer, hidden layers, output neurons and a bias unit fixed to 1; (b) a recurrent network where a recurrent layer sits between the input layer and the output neurons and feeds back into itself]
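A minimal sketch of the recurrent network in panel (b), assuming the common tanh transition $h(t) = \tanh(W h(t-1) + U x(t) + b)$ with a linear readout; the weight names and sizes are illustrative, not taken from the slides:

```python
import numpy as np

def rnn_forward(x_seq, W, U, b, V, c):
    """Run a simple recurrent layer over a sequence and return the outputs."""
    h = np.zeros(W.shape[0])
    outputs = []
    for x_t in x_seq:
        h = np.tanh(W @ h + U @ x_t + b)   # recurrent layer: the state feeds back into itself
        outputs.append(V @ h + c)          # output neurons read the current state
    return np.array(outputs)

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 2
params = (rng.normal(scale=0.1, size=(n_hid, n_hid)),
          rng.normal(scale=0.1, size=(n_hid, n_in)),
          np.zeros(n_hid),
          rng.normal(scale=0.1, size=(n_out, n_hid)),
          np.zeros(n_out))
y = rnn_forward(rng.normal(size=(7, n_in)), *params)
print(y.shape)  # (7, 2): one output per time step
```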
On the number of linear regions of Deep Neural Networks
Razvan Pascanu, Guido Montufar, Kyunghyun Cho and Yoshua Bengio
International Conference on Learning Representations 2014
Submitted to Conference on Neural Information Processing Systems 2014
Big picture
- $\mathrm{rect}(x) = \begin{cases} 0, & x < 0 \\ x, & x \geq 0 \end{cases}$
- Idea: a composition of piece-wise linear functions is a piece-wise linear function
- Approach: count the number of pieces for a deep versus a shallow model
Single Layer models
[Figure: the input regions R∅, R1, R2, R12, R13, R23, R123 carved out by the hyperplanes L1, L2, L3 of a single-layer model]
Zaslavsky's Theorem (1975): the maximal number of regions cut out by $n_{hid}$ hyperplanes in $\mathbb{R}^{n_{inp}}$ is $\sum_{s=0}^{n_{inp}} \binom{n_{hid}}{s}$
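A small sketch of this bound, counting the maximal number of regions produced by a single layer of $n_{hid}$ rectifier units on $n_{inp}$ inputs (the function name is mine):

```python
from math import comb

def max_regions_single_layer(n_inp, n_hid):
    """Zaslavsky's bound: sum_{s=0}^{n_inp} C(n_hid, s) regions for n_hid
    hyperplanes (one per rectifier unit) in an n_inp-dimensional input space."""
    return sum(comb(n_hid, s) for s in range(n_inp + 1))

print(max_regions_single_layer(2, 3))   # 7, matching the 3-line arrangement in the figure
print(max_regions_single_layer(2, 20))  # grows only polynomially with the number of units
```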
Multi-Layer models: how would it work?
[Figure: the input space (axes x0, x1) partitioned by the first layer into the regions PS1, PS2, PS3, PS4]
Multi-Layer models: how would it work?
[Figure: the regions S1–S4 of the input space are mapped into the first layer space, which is folded along the horizontal axis and then along the vertical axis, so each region S'1–S'4 of the second layer space corresponds to several input regions]
Visualizing units
Revisiting Natural Gradient for Deep Networks
Razvan Pascanu and Yoshua Bengio
International Conference on Learning Representations 2014
Gist of this work
- Natural Gradient is a generalized Trust Region method
- Hessian-Free Optimization is Natural Gradient (for particular pairs of activation functions and error functions)
- Using the Empirical Fisher (TONGA) is not equivalent to the same trust region method as natural gradient
- Natural Gradient can be accelerated if we add second order information of the error
- Natural Gradient can use unlabeled data
- Natural Gradient is more robust to changes in the order of the training set
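A minimal numeric sketch of one natural gradient step for logistic regression, assuming the metric is the Fisher matrix computed under the model's own predictive distribution (which is the distinction with the empirical Fisher mentioned above); the data, sizes and damping constant are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
y = (X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def natural_gradient_step(w, lr=0.5, damping=1e-4):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / n                     # gradient of the negative log-likelihood
    # Fisher metric: expectation over inputs and over y drawn from the *model*,
    # E[ p(1-p) x x^T ]; the empirical Fisher (TONGA) would use the observed y instead
    F = (X * (p * (1.0 - p))[:, None]).T @ X / n
    return w - lr * np.linalg.solve(F + damping * np.eye(d), grad)

w = np.zeros(d)
for _ in range(100):
    w = natural_gradient_step(w)
print(w)  # roughly aligned with the generating direction [1, -2, 0.5]
```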
On the saddle point problem for non-convex optimization
Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli and Yoshua Bengio
Submitted to the Conference on Neural Information Processing Systems 2014
Existing evidence
- Statistical physics (on random Gaussian fields)
[Figure: for random Gaussian fields, the error of a critical point versus its index, and the distribution of the Hessian eigenvalues]
Existing evidence
- Empirical evidence
[Figure: train error ε (%) versus the index α of critical points, and the Hessian eigenvalue distribution p(λ) at critical points with train errors 0.32%, 23.49% and 28.23%]
Problem
- Saddle points are attractors of second order dynamics
[Figure: trajectories of Newton's method, the saddle-free Newton method (SFN) and SGD from a common START point near a saddle]
Solution
$\arg\min_{\Delta\theta} T_1\{L(\theta)\}$ s.t. $\left\| T_2\{L(\theta)\} - T_1\{L(\theta)\} \right\| \leq \epsilon$, where $T_1$ and $T_2$ are the first and second order Taylor expansions of $L$ around $\theta$.

Using Lagrange multipliers:

$$\Delta\theta = -\frac{\partial L(\theta)}{\partial \theta} \, |\mathbf{H}|^{-1}$$
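A minimal sketch of the resulting saddle-free Newton (SFN) step, assuming the Hessian is small enough to eigendecompose exactly; $|\mathbf{H}|$ replaces each eigenvalue by its absolute value, so the update descends along negative-curvature directions instead of being attracted to the saddle:

```python
import numpy as np

def saddle_free_newton_step(grad, hessian, damping=1e-4):
    """Delta_theta = -|H|^{-1} grad, where |H| has the absolute eigenvalues of H."""
    eigvals, eigvecs = np.linalg.eigh(hessian)
    return -(eigvecs / (np.abs(eigvals) + damping)) @ (eigvecs.T @ grad)

# Toy saddle: L(a, b) = a^2 - b^2 has a saddle point at the origin
theta = np.array([0.5, 0.1])
for _ in range(5):
    grad = np.array([2.0 * theta[0], -2.0 * theta[1]])
    hessian = np.diag([2.0, -2.0])
    theta = theta + saddle_free_newton_step(grad, hessian)
print(theta)  # the positive-curvature coordinate collapses to ~0 while the other grows:
              # the iterate escapes the saddle, whereas plain Newton would converge to it
```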
Experiments
[Figure: CIFAR-10 experiments: train error ε (%) versus the number of hidden units (5, 25, 50), train error versus epochs, and |most negative λ| versus epochs, comparing MSGD, Damped Newton and SFN]
Experiments
[Figure: training curves for a deep autoencoder and for a recurrent neural network, comparing MSGD and SFN]
A Neurodynamical Model for Working Memory
Razvan Pascanu, Herbert Jaeger
Neural Networks, 2011
Gist of this work
[Figure: an echo state network with input units (u), a reservoir (x), output units (y) and working memory (WM) units (m)]
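A rough sketch of the architecture in the figure, assuming a standard echo state network update in which the working memory (WM) units are extra readouts fed back into the reservoir; all weight names are illustrative, and the training of the readouts (typically a linear regression on reservoir states) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, n_out, n_wm = 2, 100, 1, 3

W_in  = rng.uniform(-0.1, 0.1, size=(n_res, n_in))       # input u -> reservoir
W_res = rng.normal(size=(n_res, n_res))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # spectral radius below 1
W_fb  = rng.uniform(-0.5, 0.5, size=(n_res, n_wm))        # WM units fed back into the reservoir
W_out = rng.normal(scale=0.1, size=(n_out, n_res))        # readout (training not shown)
W_wm  = rng.normal(scale=0.1, size=(n_wm, n_res))         # WM readout (training not shown)

def step(x, u, m):
    """One update of the reservoir state x given input u(t) and previous WM state m(t-1)."""
    x = np.tanh(W_res @ x + W_in @ u + W_fb @ m)
    y = W_out @ x              # task output y(t)
    m = np.tanh(W_wm @ x)      # working-memory units m(t), fed back at the next step
    return x, y, m

x, m = np.zeros(n_res), np.zeros(n_wm)
for _ in range(20):
    x, y, m = step(x, rng.normal(size=n_in), m)
print(y.shape, m.shape)  # (1,) (3,)
```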
On the difficulty of training recurrent neural networks
Razvan Pascanu, Tomas Mikolov, Yoshua Bengio
International Conference on Machine Learning 2013
The exploding gradients problem
[Figure: an unrolled recurrent network with inputs x(t−1), x(t), x(t+1), states h(t−1), h(t), h(t+1) and costs C(t−1), C(t), C(t+1); the gradient flows back through the factors ∂h(t)/∂h(t−1)]

$$\frac{\partial C}{\partial W} = \sum_t \frac{\partial C(t)}{\partial W} = \sum_t \sum_{k=0}^{t} \frac{\partial C(t)}{\partial h(t)} \, \frac{\partial h(t)}{\partial h(t-k)} \, \frac{\partial h(t-k)}{\partial W}$$

$$\frac{\partial h(t)}{\partial h(t-k)} = \prod_{j=k+1}^{t} \frac{\partial h(j)}{\partial h(j-1)}$$
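A small numeric sketch of the product in the last equation, simplified to a linear transition $h(j) = W h(j-1)$ so that every factor equals $W$ (a tanh nonlinearity would only multiply in extra factors of at most 1); the norm of the product grows or shrinks roughly geometrically with $k$, which is the exploding and vanishing gradient behaviour:

```python
import numpy as np

def product_norm(W, k):
    """Spectral norm of the product of k identical Jacobians, i.e. ||W^k|| for the
    linear transition h(j) = W h(j-1); a tanh nonlinearity only adds factors <= 1."""
    return np.linalg.norm(np.linalg.matrix_power(W, k), 2)

rng = np.random.default_rng(0)
Q = np.linalg.qr(rng.normal(size=(20, 20)))[0]     # random orthogonal directions
for rho, label in [(0.9, "singular values 0.9"), (1.1, "singular values 1.1")]:
    W = rho * Q                                    # all singular values of W equal rho
    print(label, [round(product_norm(W, k), 3) for k in (1, 10, 50)])
# 0.9^k decays toward 0 (vanishing gradients); 1.1^k blows up (exploding gradients)
```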
Possible geometric interpretation and norm clipping
Classical view:
[Figure: the error plotted against the parameters θ]

The error is $(h(50) - 0.7)^2$ for $h(t) = w\,\sigma(h(t-1)) + b$ with $h(0) = 0.5$.
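A minimal sketch of the norm clipping heuristic referred to in the slide title: if the gradient norm exceeds a threshold, rescale the gradient to that threshold before the update, so an exploded gradient cannot throw the parameters far across the error surface:

```python
import numpy as np

def clip_gradient_norm(grad, threshold):
    """Rescale the gradient if its norm exceeds the threshold (norm clipping)."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

print(clip_gradient_norm(np.array([30.0, -40.0]), 1.0))   # [0.6, -0.8], norm clipped to 1
print(clip_gradient_norm(np.array([0.1, 0.2]), 1.0))      # unchanged, norm already below 1
```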
The vanishing gradients problem
[Figure: the same unrolled recurrent network and gradient flow as on the previous slide]

$$\frac{\partial C}{\partial W} = \sum_t \sum_{k=0}^{t} \frac{\partial C(t)}{\partial h(t)} \, \frac{\partial h(t)}{\partial h(t-k)} \, \frac{\partial h(t-k)}{\partial W}, \qquad \frac{\partial h(t)}{\partial h(t-k)} = \prod_{j=k+1}^{t} \frac{\partial h(j)}{\partial h(j-1)}$$
Regularization term
$$\Omega = \sum_k \Omega_k = \sum_k \left( \frac{\left\| \frac{\partial C}{\partial h_{k+1}} \frac{\partial h_{k+1}}{\partial h_k} \right\|}{\left\| \frac{\partial C}{\partial h_{k+1}} \right\|} - 1 \right)^{2}$$
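A numeric sketch of this regularizer for a toy tanh recurrence, with the backpropagated errors $\partial C/\partial h_{k+1}$ and the Jacobians $\partial h_{k+1}/\partial h_k$ computed explicitly (in practice they come from backprop); the term penalises any change in the norm of the error signal as it travels one step back:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W = rng.normal(scale=0.3, size=(n, n))
h = [rng.normal(size=n)]
for _ in range(5):                        # unrolled tanh recurrence h_{k+1} = tanh(W h_k)
    h.append(np.tanh(W @ h[-1]))

dC_dh = [None] * 6
dC_dh[5] = h[5] - rng.normal(size=n)      # toy error signal dC/dh at the last step
omega = 0.0
for k in range(4, -1, -1):
    J = np.diag(1.0 - h[k + 1] ** 2) @ W              # Jacobian dh_{k+1}/dh_k for tanh
    back = dC_dh[k + 1] @ J                           # (dC/dh_{k+1}) dh_{k+1}/dh_k = dC/dh_k
    omega += (np.linalg.norm(back) / np.linalg.norm(dC_dh[k + 1]) - 1.0) ** 2
    dC_dh[k] = back
print(omega)   # Omega from the slide; adding it to the cost discourages a shrinking error norm
```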
Temporal Order
Important symbols: A, B. Distractor symbols: c, d, e, f.

$\underbrace{\texttt{de..fAef}}_{\frac{1}{10}T}\;\underbrace{\texttt{ccefc..e}}_{\frac{4}{10}T}\;\underbrace{\texttt{fAef..e}}_{\frac{1}{10}T}\;\underbrace{\texttt{ef..c}}_{\frac{4}{10}T} \rightarrow \texttt{AA}$

edefcAccfef..ceceBedef..fedef → AB
feBefccde..efddcAfccee..cedcd → BA
Bfffede..cffecdBedfd..cedfedc → BB
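A minimal sketch of how such a sequence could be generated, following the description above (one important symbol placed in the first tenth of the sequence, another just after the middle, distractors elsewhere; the target is the ordered pair of important symbols). The construction details are my reading of the slide, not code from the thesis:

```python
import random

def temporal_order_example(T, rng=random.Random(0)):
    """One sequence of length T and its target pair, e.g. ('de..A..B..c', 'AB')."""
    important, distractors = "AB", "cdef"
    seq = [rng.choice(distractors) for _ in range(T)]
    first, second = rng.choice(important), rng.choice(important)
    seq[rng.randrange(0, max(1, T // 10))] = first                   # inside the first 1/10 T
    seq[rng.randrange(T // 2, T // 2 + max(1, T // 10))] = second    # inside 1/10 T after the middle
    return "".join(seq), first + second

seq, target = temporal_order_example(50)
print(target, seq)   # the network must output the pair after reading the whole sequence
```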
Results - Temporal order task
[Figure: rate of success versus sequence length (50 to 250) for sigmoid units, comparing MSGD, MSGD-C and MSGD-CR]
Results - Temporal order task
[Figure: rate of success versus sequence length (50 to 250) for basic tanh units, comparing MSGD, MSGD-C and MSGD-CR]
Results - Temporal order task
[Figure: rate of success versus sequence length (50 to 250) for smart tanh units, comparing MSGD, MSGD-C and MSGD-CR]
Results - Natural tasks
How to construct Deep Recurrent Neural Networks
Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Yoshua Bengio
International Conference on Learning Representations 2014
Gist of this work
[Figure: operator view of a recurrent network (x_t and h_{t-1} produce h_t, which produces y_t), and the deep variants DT-RNN, DOT-RNN, stacked RNNs and DOT(s)-RNN]
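A minimal sketch of the DT-RNN variant from the figure, where the transition from $h_{t-1}, x_t$ to $h_t$ is itself a small deep network; the DOT-RNN additionally deepens the output, and the stacked variant composes several recurrent layers. Layer sizes and weight names are illustrative:

```python
import numpy as np

def dt_rnn_step(h_prev, x_t, params):
    """DT-RNN: the transition from (h_{t-1}, x_t) to h_t is itself a two-layer MLP."""
    W1, U1, b1, W2, b2, V, c = params
    z = np.tanh(W1 @ h_prev + U1 @ x_t + b1)   # intermediate layer of the deep transition
    h_t = np.tanh(W2 @ z + b2)                 # new hidden state
    y_t = V @ h_t + c                          # shallow output; DOT-RNN would deepen this too
    return h_t, y_t

rng = np.random.default_rng(0)
n_in, n_mid, n_hid, n_out = 3, 8, 5, 2
params = (rng.normal(scale=0.1, size=(n_mid, n_hid)), rng.normal(scale=0.1, size=(n_mid, n_in)),
          np.zeros(n_mid), rng.normal(scale=0.1, size=(n_hid, n_mid)), np.zeros(n_hid),
          rng.normal(scale=0.1, size=(n_out, n_hid)), np.zeros(n_out))
h = np.zeros(n_hid)
for x_t in rng.normal(size=(6, n_in)):
    h, y = dt_rnn_step(h, x_t, params)
print(h.shape, y.shape)   # (5,) (2,)
```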
Overview of contributions

- The efficiency of deep feedforward models with piece-wise linear activation functions
- The relationship between a few optimization techniques for deep learning, with a focus on understanding natural gradient
- The importance of saddle points for optimization algorithms when applied to deep learning
- Training Echo-State Networks to exhibit short term memory
- Training Recurrent Networks with gradient based methods to exhibit short term memory
- How one can construct deep Recurrent Networks
Thank you !