
  • On Recurrent and Deep Neural Networks

    Razvan Pascanu
    Advisor: Yoshua Bengio

    PhD Defence, Université de Montréal, LISA lab

    September 2014

  • Motivation

    “A computer once beat me at chess, but it was no match for me at kick boxing”

    — Emo Phillips

    Studying the mechanism behind learning provides a meta-solution for solving tasks.

  • Supervised Learning

    - $f_{\mathcal{F}} : \Theta \times D \to T$
    - $f_\theta(x) = f_{\mathcal{F}}(\theta, x)$
    - $f^\star = \arg\min_{\theta \in \Theta} \mathbb{E}_{x,t \sim \pi}\,[d(f_\theta(x), t)]$

  • Supervised Learing

    I fF : Θ× D→ TI fθ(x) = fF(θ, x)

    I f ? = arg minθ∈Θ EEx,t∼π [d(fθ(x), t)]

    Pascanu On Recurrent and Deep Neural Networks 3/ 38

  • Supervised Learing

    I fF : Θ× D→ TI fθ(x) = fF(θ, x)I f ? = arg minθ∈Θ EEx,t∼π [d(fθ(x), t)]

    Pascanu On Recurrent and Deep Neural Networks 3/ 38

  • Optimization for learning

    [Figure: one step of an iterative optimizer, moving from θ[k] to θ[k+1]]

    (A minimal gradient-descent sketch for the objective on the previous slide follows below.)
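
    A minimal sketch (all names hypothetical) tying the last two slides together: the empirical risk $\mathbb{E}_{x,t\sim\pi}[d(f_\theta(x),t)]$ for a linear $f_\theta$ with squared error $d$, minimized by the iterative update from θ[k] to θ[k+1].

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                                      # samples x ~ pi
    t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)    # targets

    def f(theta, x):              # f_theta(x) = f_F(theta, x); here the family F is linear
        return x @ theta

    def risk(theta):              # empirical estimate of E_{x,t}[ d(f_theta(x), t) ]
        return np.mean((f(theta, X) - t) ** 2)

    theta, lr = np.zeros(3), 0.05
    for k in range(200):          # theta[k] -> theta[k+1]
        grad = 2.0 * X.T @ (f(theta, X) - t) / len(t)
        theta = theta - lr * grad

    print(round(risk(theta), 4))
    ```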

  • Neural networks

    [Figure: feedforward network with an input layer, first, second, ..., last hidden layers, output neurons, and a bias = 1 unit]

  • Recurrent neural networks

    [Figure: (a) feedforward network (input layer, hidden layers, output neurons, bias = 1); (b) recurrent network (input layer, recurrent layer, output neurons, bias = 1). A minimal forward-pass sketch for both follows below.]
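
    A minimal sketch of the two computations in the figure, with made-up sizes: the feedforward network applies a fixed stack of layers once per input, while the recurrent network reuses one layer at every time step and feeds its own previous state back in.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    sigma = np.tanh                                     # any elementwise nonlinearity

    # feedforward: input layer -> two hidden layers -> output neurons (biases play the "bias = 1" role)
    W1, W2, W3 = rng.normal(size=(5, 8)), rng.normal(size=(8, 8)), rng.normal(size=(8, 2))
    b1, b2, b3 = np.zeros(8), np.zeros(8), np.zeros(2)

    def feedforward(x):
        h1 = sigma(x @ W1 + b1)
        h2 = sigma(h1 @ W2 + b2)
        return h2 @ W3 + b3

    # recurrent: the same recurrent layer is applied at every time step
    W_in, W_rec, W_out = rng.normal(size=(5, 8)), 0.1 * rng.normal(size=(8, 8)), rng.normal(size=(8, 2))

    def recurrent(xs):                                  # xs: a sequence of inputs, shape (T, 5)
        h, ys = np.zeros(8), []
        for x in xs:
            h = sigma(x @ W_in + h @ W_rec)             # h(t) depends on h(t-1)
            ys.append(h @ W_out)
        return np.array(ys)

    print(feedforward(rng.normal(size=5)).shape)        # (2,)
    print(recurrent(rng.normal(size=(7, 5))).shape)     # (7, 2)
    ```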

  • On the number of linear regions of Deep Neural Networks

    Razvan Pascanu, Guido Montufar, Kyunghyun Cho and Yoshua Bengio

    International Conference on Learning Representations 2014
    Submitted to the Conference on Neural Information Processing Systems 2014

  • Big picture

    - $\mathrm{rect}(x) = \begin{cases} 0, & x < 0 \\ x, & x \geq 0 \end{cases}$
    - Idea: the composition of piece-wise linear functions is itself a piece-wise linear function
    - Approach: count the number of pieces for a deep versus a shallow model (a small counting sketch follows below)
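
    A small sketch of the counting idea, with random weights chosen purely for illustration: on a 1-D input, every linear piece of a rect network corresponds to a distinct pattern of active units, so counting distinct patterns along a dense grid counts the pieces of a one-layer versus a two-layer model.

    ```python
    import numpy as np

    rect = lambda z: np.maximum(z, 0.0)
    rng = np.random.default_rng(1)

    # two rect layers of 4 units each on a scalar input (illustrative weights)
    W1, b1 = rng.normal(size=4), rng.normal(size=4)
    W2, b2 = rng.normal(size=(4, 4)), rng.normal(size=4)

    def activation_pattern(x, depth):
        """Which rect units are active at each input x; one pattern per linear region."""
        a1 = np.outer(x, W1) + b1                 # layer-1 pre-activations, shape (n, 4)
        pattern = a1 > 0
        if depth == 2:
            a2 = rect(a1) @ W2 + b2               # layer-2 pre-activations
            pattern = np.concatenate([pattern, a2 > 0], axis=1)
        return pattern

    def count_regions(depth, lo=-5.0, hi=5.0, n=200001):
        p = activation_pattern(np.linspace(lo, hi, n), depth)
        return len(np.unique(p, axis=0))          # distinct joint patterns = linear regions hit by the grid

    print(count_regions(depth=1), count_regions(depth=2))
    ```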

  • Single Layer models

    [Figure: arrangements of hyperplanes L1, L2 (and L3) splitting the input space into regions R∅, R1, R2, R12, R13, R23, R123]

    Zaslavsky's Theorem (1975): an arrangement of $n_{hid}$ hyperplanes in $\mathbb{R}^{n_{inp}}$ cuts the input space into at most $\sum_{s=0}^{n_{inp}} \binom{n_{hid}}{s}$ regions (evaluated numerically below).
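
    The bound itself is a one-liner. A sketch evaluating $\sum_{s=0}^{n_{inp}} \binom{n_{hid}}{s}$ for a few widths, using the slide's notation:

    ```python
    from math import comb

    def max_regions_single_layer(n_inp, n_hid):
        """Zaslavsky's bound on the regions cut out of R^{n_inp} by n_hid hyperplanes."""
        return sum(comb(n_hid, s) for s in range(n_inp + 1))

    for n_hid in (2, 4, 8, 16):
        print(n_hid, max_regions_single_layer(n_inp=2, n_hid=n_hid))
    # for a fixed input dimension the count grows only polynomially with the number of hidden units
    ```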

  • Multi-Layer models: how would it work?

    [Figure: the input space (axes x0 and x1) partitioned by the first layer into regions PS1, PS2, PS3, PS4]

  • Multi-Layer models: how would it work?

    [Figure: the first layer maps the input-space regions S1, S2, S3, S4 onto a common first-layer space (images S'1, S'2, S'3, S'4), which the second layer then partitions again. Panels: Input Space, First Layer Space, Second Layer Space. Steps: 1. fold along the horizontal axis; 2. fold along the vertical axis.]

    (A one-dimensional folding sketch follows below.)
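
    A minimal one-dimensional sketch of the folding argument (the functions are invented for illustration, not taken from the paper): a pair of rect units computes |x|, which maps the negative and positive halves of the input onto the same range, so every piece the next layer creates shows up on both sides.

    ```python
    import numpy as np

    rect = lambda z: np.maximum(z, 0.0)
    fold = lambda x: rect(x) + rect(-x)                   # = |x|: folds the input space in half
    g = lambda h: rect(h - 1.0) - 2.0 * rect(h - 2.0)     # a toy second layer with kinks at h = 1 and h = 2

    def local_slope(f, x0, eps=1e-3):
        x = np.array([x0, x0 + eps])
        y = f(x)
        return float((y[1] - y[0]) / eps)

    # slope of g(fold(x)) well inside each interval of the input space
    for x0 in (-3.0, -1.5, 0.0, 1.5, 3.0):
        print(x0, round(local_slope(lambda x: g(fold(x)), x0), 3))
    # g has 3 pieces as a function of h, but g(fold(x)) has 5 pieces as a function of x:
    # the non-constant pieces of g each appear twice, mirrored around 0.
    ```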

  • Multi-Layer models: how would it work?

  • Visualizing units

  • Revisiting Natural Gradient for Deep Networks

    Razvan Pascanu and Yoshua Bengio

    International Conference on Learning Representations 2014

  • Gist of this work

    - Natural Gradient is a generalized Trust Region method (a toy natural-gradient step is sketched below)
    - Hessian-Free Optimization is Natural Gradient [1]
    - Using the Empirical Fisher (TONGA) is not equivalent to the same trust region method as natural gradient
    - Natural Gradient can be accelerated if we add second order information of the error
    - Natural Gradient can use unlabeled data
    - Natural Gradient is more robust to changes in the order of the training set

    [1] For particular pairs of activation functions and error functions.
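
    A toy sketch of a single natural-gradient step for softmax regression, not the thesis' implementation: the Fisher matrix is built from gradients whose labels are drawn from the model's own predictive distribution (using the training labels instead would give the empirical Fisher of TONGA), and a damped linear solve plays the role of $F^{-1}\nabla L$. Real implementations avoid forming $F$ and use truncated iterative solvers.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 256, 5, 3
    X = rng.normal(size=(n, d))
    y = rng.integers(0, k, size=n)
    theta = np.zeros((d, k))

    def softmax(Z):
        Z = Z - Z.max(axis=1, keepdims=True)
        E = np.exp(Z)
        return E / E.sum(axis=1, keepdims=True)

    def per_example_grads(theta, X, labels):
        """Gradients of -log p(label | x) for softmax regression, one flattened row per example."""
        P = softmax(X @ theta)                           # (n, k)
        E = P.copy()
        E[np.arange(len(labels)), labels] -= 1.0         # d(-log p)/d(logits)
        return (X[:, :, None] * E[:, None, :]).reshape(len(labels), -1)

    # Fisher: outer products of gradients with labels sampled from the model itself
    P = softmax(X @ theta)
    y_model = np.array([rng.choice(k, p=p) for p in P])
    G = per_example_grads(theta, X, y_model)
    F = G.T @ G / n + 1e-3 * np.eye(d * k)               # damping keeps the solve well posed

    # natural-gradient step: theta <- theta - lr * F^{-1} grad
    grad = per_example_grads(theta, X, y).mean(axis=0)
    theta = theta - 0.1 * np.linalg.solve(F, grad).reshape(d, k)
    print(np.round(theta[:, 0], 3))
    ```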

  • On the saddle point problem for non-convex optimization

    Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli and Yoshua Bengio

    Submitted to the Conference on Neural Information Processing Systems 2014

  • Existing evidence

    - Statistical physics (on random Gaussian fields)

    [Figure: error as a function of the index of a critical point, and the corresponding distribution of eigenvalues]

  • Existing evidence

    - Empirical evidence

    [Figure: train error ε (%) versus index of critical point α; eigenvalue distribution p(λ) at critical points of networks with train error 0.32%, 23.49% and 28.23%]

  • Problem

    - Saddle points are attractors of second-order dynamics

    [Figure: trajectories of Newton, SFN and SGD from a common starting point (START) near a saddle point]

  • Solution

    $$\arg\min_{\Delta\theta} \; \mathcal{T}_1\{L(\theta)\} \quad \text{s.t.} \quad \left\| \mathcal{T}_2\{L(\theta)\} - \mathcal{T}_1\{L(\theta)\} \right\| \leq \epsilon$$

    Using Lagrange multipliers:

    $$\Delta\theta = -\,|\mathbf{H}|^{-1}\,\frac{\partial L(\theta)}{\partial \theta}$$

    (A toy implementation of this saddle-free step follows below.)
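
    A toy transcription of the resulting update, assuming we can afford a full eigendecomposition (the paper works in a low-dimensional Krylov subspace instead): replace the Hessian's eigenvalues by their absolute values and take a Newton-like step. On a pure saddle this moves away from the critical point instead of converging to it.

    ```python
    import numpy as np

    def saddle_free_step(grad, hessian, damping=1e-3):
        """Delta-theta = -|H|^{-1} grad, with |H| built from the absolute eigenvalues."""
        eigval, eigvec = np.linalg.eigh(hessian)
        abs_h = eigvec @ np.diag(np.abs(eigval) + damping) @ eigvec.T
        return -np.linalg.solve(abs_h, grad)

    # a 2-D saddle: L(u, v) = u^2 - v^2, with a critical point at the origin
    theta = np.array([0.5, 0.5])
    for _ in range(5):
        grad = np.array([2.0 * theta[0], -2.0 * theta[1]])
        hess = np.array([[2.0, 0.0], [0.0, -2.0]])
        theta = theta + saddle_free_step(grad, hess)
    print(np.round(theta, 3))   # the u coordinate is driven to 0 while v escapes the saddle
    ```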

  • Experiments

    [Figure: CIFAR-10 experiments. Train error ε (%) versus number of hidden units (5, 25, 50), train error ε (%) versus number of epochs, and |most negative λ| versus number of epochs, for MSGD, Damped Newton and SFN]

  • Experiments

    [Figure: Deep Autoencoder and Recurrent Neural Network training curves, comparing MSGD and SFN]

  • A Neurodynamical Model for Working Memory

    Razvan Pascanu, Herbert Jaeger

    Neural Networks, 2011

  • Gist of this work

    [Figure: input units (u) driving a reservoir (x), with output units (y) and working-memory (WM) units (m). A generic echo-state sketch follows below.]
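
    A generic echo-state sketch with the four groups of units from the figure; it is not the paper's exact model. The reservoir x is fixed and random, the readouts for y and m would normally be trained (here they are just random), and the assumption that the WM units m feed back into the reservoir is what lets them hold state.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_res, n_out, n_wm = 3, 100, 2, 6

    W_in = rng.uniform(-0.1, 0.1, size=(n_res, n_in))
    W_res = rng.normal(size=(n_res, n_res))
    W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # spectral radius below 1 (echo-state property)
    W_fb = rng.uniform(-0.5, 0.5, size=(n_res, n_wm))         # feedback from the WM units (assumption)
    W_out = 0.01 * rng.normal(size=(n_out, n_res))            # readout, normally fit by ridge regression
    W_wm = 0.01 * rng.normal(size=(n_wm, n_res))              # WM readout, likewise

    x, m = np.zeros(n_res), np.zeros(n_wm)
    for t in range(50):
        u = rng.normal(size=n_in)                             # input units
        x = np.tanh(W_in @ u + W_res @ x + W_fb @ m)          # reservoir update
        y = W_out @ x                                         # output units
        m = np.sign(W_wm @ x)                                 # WM units, thresholded to +/-1
    print(np.round(y, 3), m)
    ```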

  • On the difficulty of training recurrent neural networks

    Razvan Pascanu, Tomas Mikolov, Yoshua Bengio

    International Conference on Machine Learning 2013

  • The exploding gradients problem

    [Figure: a recurrent network unrolled in time, with inputs x(t-1), x(t), x(t+1), hidden states h(t-1), h(t), h(t+1), costs C(t-1), C(t), C(t+1), and the backpropagated factors ∂C(t)/∂h(t) and ∂h(t)/∂h(t-1)]

    $$\frac{\partial C}{\partial W} = \sum_t \frac{\partial C(t)}{\partial W} = \sum_t \sum_{k=0}^{t} \frac{\partial C(t)}{\partial h(t)} \, \frac{\partial h(t)}{\partial h(t-k)} \, \frac{\partial h(t-k)}{\partial W}$$

    $$\frac{\partial h(t)}{\partial h(t-k)} = \prod_{j=k+1}^{t} \frac{\partial h(j)}{\partial h(j-1)}$$

    (The behaviour of this Jacobian product is illustrated numerically below.)
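
    A numerical illustration with hypothetical weights: for $h(t) = \tanh(W_{rec}\, h(t-1) + W_{in}\, x(t))$ each factor $\partial h(j)/\partial h(j-1)$ equals $\mathrm{diag}(1 - h(j)^2)\, W_{rec}$, so around the origin (where the tanh derivative is 1) the norm of the product grows or shrinks geometrically with the spectral radius of $W_{rec}$: it explodes above 1 and vanishes below it, and saturation only shrinks the product further.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n = 20

    def jacobian_product_norm(radius, T=50):
        """|| prod_j dh(j)/dh(j-1) || along the trajectory h = 0 (zero input)."""
        W = rng.normal(size=(n, n))
        W_rec = radius * W / np.max(np.abs(np.linalg.eigvals(W)))   # fix the spectral radius
        J = np.eye(n)
        for _ in range(T):
            h = np.zeros(n)                          # tanh(W_rec @ 0) = 0, so diag(1 - h**2) = I
            J = np.diag(1.0 - h ** 2) @ W_rec @ J    # one factor dh(j)/dh(j-1)
        return np.linalg.norm(J)

    for radius in (0.5, 1.0, 1.5):
        print(radius, f"{jacobian_product_norm(radius):.2e}")
    ```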

  • Possible geometric interpretation and norm clipping

    Classical view:

    [Figure: the error plotted against the parameters θ. The error is $(h(50) - 0.7)^2$ for $h(t) = w\,\sigma(h(t-1)) + b$ with $h(0) = 0.5$.]

    (A transcription of the norm-clipping rule follows below.)
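
    The norm-clipping rule itself is only a few lines: if the gradient norm exceeds a threshold, rescale the gradient to that threshold so a step taken near a steep region stays bounded.

    ```python
    import numpy as np

    def clip_gradient(grad, threshold=1.0):
        """Rescale grad so that its norm never exceeds threshold."""
        norm = np.linalg.norm(grad)
        if norm > threshold:
            grad = grad * (threshold / norm)
        return grad

    g = np.array([30.0, -40.0])                 # an "exploded" gradient with norm 50
    print(clip_gradient(g, threshold=5.0))      # [ 3. -4.], norm 5
    ```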

  • The vanishing gradients problem

    [The same unrolled network and gradient factorization as on the exploding-gradients slide:]

    $$\frac{\partial C}{\partial W} = \sum_t \frac{\partial C(t)}{\partial W} = \sum_t \sum_{k=0}^{t} \frac{\partial C(t)}{\partial h(t)} \, \frac{\partial h(t)}{\partial h(t-k)} \, \frac{\partial h(t-k)}{\partial W}, \qquad \frac{\partial h(t)}{\partial h(t-k)} = \prod_{j=k+1}^{t} \frac{\partial h(j)}{\partial h(j-1)}$$

  • Regularization term

    $$\Omega = \sum_k \Omega_k = \sum_k \left( \frac{\left\| \frac{\partial C}{\partial h_{k+1}} \frac{\partial h_{k+1}}{\partial h_k} \right\|}{\left\| \frac{\partial C}{\partial h_{k+1}} \right\|} - 1 \right)^2$$

    (A direct numpy transcription follows below.)
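
    A direct numpy transcription of Ω, assuming the per-step quantities are already available from backpropagation (in practice the regularizer is built inside the training graph so it can itself be differentiated):

    ```python
    import numpy as np

    def omega(dC_dh, J):
        """Omega = sum_k ( ||dC/dh_{k+1} . dh_{k+1}/dh_k|| / ||dC/dh_{k+1}|| - 1 )^2.
        dC_dh: array (T, n), gradients of the cost w.r.t. each hidden state h_k
        J:     array (T-1, n, n), with J[k] = dh_{k+1}/dh_k."""
        total = 0.0
        for k in range(len(J)):
            num = np.linalg.norm(dC_dh[k + 1] @ J[k])
            den = np.linalg.norm(dC_dh[k + 1])
            total += (num / den - 1.0) ** 2
        return total

    rng = np.random.default_rng(0)
    T, n = 10, 4
    print(round(omega(rng.normal(size=(T, n)), rng.normal(size=(T - 1, n, n))), 3))
    ```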

  • Temporal Order

    Important symbols: A, B    Distractor symbols: c, d, e, f

    Each sequence contains exactly two important symbols among distractors, and the target is their ordered pair (segment lengths marked on the first example: T/10, 4T/10, T/10, 4T/10):

    de..fAef (T/10)  ccefc..e (4T/10)  fAef..e (T/10)  ef..c (4T/10)  → AA
    edefcAccfef..ceceBedef..fedef → AB
    feBefccde..efddcAfccee..cedcd → BA
    Bfffede..cffecdBedfd..cedfedc → BB

    (A small generator for this task is sketched below.)
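
    A hedged generator for this task, assuming the segment layout above (one important symbol placed uniformly inside each of the two short segments, distractors everywhere else); the exact position ranges used in the paper may differ slightly.

    ```python
    import random

    IMPORTANT, DISTRACTORS = "AB", "cdef"

    def temporal_order_example(T=100, seed=0):
        """One sequence of length T with two important symbols; the target is their ordered pair."""
        rng = random.Random(seed)
        seq = [rng.choice(DISTRACTORS) for _ in range(T)]
        seg = T // 10
        first, second = rng.choice(IMPORTANT), rng.choice(IMPORTANT)
        seq[rng.randrange(0, seg)] = first               # inside the first short segment (length T/10)
        seq[rng.randrange(5 * seg, 6 * seg)] = second    # inside the second short segment
        return "".join(seq), first + second

    sequence, target = temporal_order_example()
    print(target, sequence)
    ```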

  • Results - Temporal order task

    [Figure: rate of success versus sequence length (50 to 250) for sigmoid units, comparing MSGD, MSGD-C and MSGD-CR]

  • Results - Temporal order task

    [Figure: rate of success versus sequence length (50 to 250) for basic tanh units, comparing MSGD, MSGD-C and MSGD-CR]

  • Results - Temporal order task

    [Figure: rate of success versus sequence length (50 to 250) for smart tanh units, comparing MSGD, MSGD-C and MSGD-CR]

  • Results - Natural tasks

  • How to construct Deep Recurrent Neural Networks

    Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Yoshua Bengio

    International Conference on Learning Representations 2014

  • Gist of this work

    [Figure: operator view of a standard RNN (x_t and h_{t-1} feeding h_t, which feeds y_t) and its deep variants: DT-RNN, DOT-RNN, stacked RNNs, DOT(s)-RNN. A sketch of a deep transition follows below.]
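
    A sketch of what "deep transition" and "deep output" mean, with invented sizes; the exact parameterizations in the paper differ. In a conventional RNN the map from (h_{t-1}, x_t) to h_t is a single layer; a DT-RNN makes that map a small MLP, and a DOT-RNN additionally deepens the map from h_t to y_t.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_x, n_h, n_z, n_y = 4, 8, 16, 2
    sigma = np.tanh

    # conventional transition: one layer from (h_{t-1}, x_t) to h_t
    W, U = 0.1 * rng.normal(size=(n_h, n_h)), rng.normal(size=(n_h, n_x))
    shallow_transition = lambda h_prev, x: sigma(W @ h_prev + U @ x)

    # deep transition (DT-RNN flavour): an intermediate layer between h_{t-1} and h_t
    W1, U1 = 0.1 * rng.normal(size=(n_z, n_h)), rng.normal(size=(n_z, n_x))
    W2 = 0.1 * rng.normal(size=(n_h, n_z))

    def deep_transition(h_prev, x):
        z = sigma(W1 @ h_prev + U1 @ x)
        return sigma(W2 @ z)

    # deep output (the "O" in DOT-RNN): an extra layer between h_t and y_t
    V1, V2 = rng.normal(size=(n_z, n_h)), rng.normal(size=(n_y, n_z))
    deep_output = lambda h: V2 @ sigma(V1 @ h)

    h = np.zeros(n_h)
    for x in rng.normal(size=(5, n_x)):
        h = deep_transition(h, x)
    print(np.round(deep_output(h), 3))
    print(np.round(shallow_transition(h, rng.normal(size=n_x))[:3], 3))
    ```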

  • Overview of contributions

    - The efficiency of deep feedforward models with piece-wise linear activation functions
    - The relationship between a few optimization techniques for deep learning, with a focus on understanding natural gradient
    - The importance of saddle points for optimization algorithms when applied to deep learning
    - Training Echo-State Networks to exhibit short term memory
    - Training Recurrent Networks with gradient based methods to exhibit short term memory
    - How one can construct deep Recurrent Networks

  • Thank you!