Review Series of Recent Deep Learning Papers: Parameter Prediction

Paper: Decoupled Neural Interfaces Using Synthetic Gradients

Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, Koray Kavukcuoglu

Reviewed by: Arshdeep Sekhon

Department of Computer Science, University of Virginia

https://qdata.github.io/deep2Read/

August 25, 2018



Locking in Neural Networks

To update Layer 1, the network must first complete:

1. Forward propagation through Layer 2 and Layer 3
2. Backward propagation through Layer 2 and Layer 3
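The dependency described above can be made concrete with a minimal sketch. It assumes three scalar layers (h1 = w1*x, h2 = w2*h1, h3 = w3*h2) and a squared-error loss; the function name `layer1_gradient` and all values are hypothetical and chosen only for illustration. The recorded step list shows that the gradient for Layer 1 is available only after the forward and backward passes through Layers 2 and 3.

```python
def layer1_gradient(w1, w2, w3, x, y):
    """Return dL/dw1 and the ordered steps needed to compute it.

    Toy network (an assumption for this sketch): three scalar layers
    and loss L = 0.5 * (h3 - y)^2.
    """
    steps = []
    h1 = w1 * x
    steps.append("forward layer 1")
    h2 = w2 * h1
    steps.append("forward layer 2")
    h3 = w3 * h2
    steps.append("forward layer 3")

    d_h3 = h3 - y              # dL/dh3
    d_h2 = d_h3 * w3           # backward through layer 3
    steps.append("backward layer 3")
    d_h1 = d_h2 * w2           # backward through layer 2
    steps.append("backward layer 2")

    grad_w1 = d_h1 * x         # dL/dw1 = dL/dh1 * dh1/dw1
    steps.append("update layer 1")
    return grad_w1, steps


grad, steps = layer1_gradient(w1=0.5, w2=2.0, w3=-1.0, x=3.0, y=1.0)
for s in steps:
    print(s)
```

Layer 1 is "locked" in the sense that its update is the last entry in `steps`, after every other layer has run forwards and backwards.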


Why is locking a problem?

1. Updates happen in a sequential and synchronous manner.
2. In a distributed system, updates depend on the slowest part.
3. Parallelizing the training of neural network modules can speed up training.


Decoupled Neural Interface

1. Layer 1 will be updated before Layer 2 and Layer 3 have even been executed.
2. Layer 1 is no longer locked to the rest of the network.


A Decoupled Neural Interface

1. Decoupled Neural Interfaces predict gradients (synthetic gradients) from the previous layer's outputs or activations.
2. They do not rely on backpropagation to get error gradients.


Gradients for FeedForward Networks

1. A network with $N$ layers $f_i$, $i \in \{1, \dots, N\}$
2. For the $i$th layer, input $h_{i-1}$, output $h_i = f_i(h_{i-1})$
3. The complete graph is represented by $\mathcal{F}_1^N$


Gradients for FeedForward Networks

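$$\theta_i \leftarrow \theta_i - \alpha\,\delta_i\,\frac{\partial h_i}{\partial \theta_i}; \qquad \delta_i = \frac{\partial L}{\partial h_i} \tag{1}$$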
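As a worked step, the error gradient in the update above can be unrolled with the chain rule. This is the standard backpropagation recursion, written with the slide's symbols; it shows why $\delta_i$ cannot be computed until every later layer has run forwards and backwards.

```latex
\begin{align*}
\delta_N &= \frac{\partial L}{\partial h_N}, \\
\delta_i &= \frac{\partial L}{\partial h_i}
          = \delta_{i+1}\,\frac{\partial h_{i+1}}{\partial h_i},
          \qquad h_{i+1} = f_{i+1}(h_i), \quad i = N-1, \dots, 1 .
\end{align*}
```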


Synthetic Gradients for FeedForward Networks

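$$\theta_n \leftarrow \theta_n - \alpha\,\hat{\delta}_i\,\frac{\partial h_i}{\partial \theta_n}; \qquad \hat{\delta}_i = M_{i+1}(h_i) \tag{2}$$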


Synthetic Gradients for FeedForward Networks

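1. Update rule:

$$\theta_n \leftarrow \theta_n - \alpha\,\hat{\delta}_i\,\frac{\partial h_i}{\partial \theta_n}; \qquad \hat{\delta}_i = M_{i+1}(h_i) \tag{3}$$

2. $n \in \{1, \dots, i\}$
3. To train $M_{i+1}(h_i)$:
4. Wait for the true error gradient $\delta_i$ to be computed
5. after a full forward and backward pass of $\mathcal{F}_{i+1}^N$.
6. Minimize $\lVert \hat{\delta}_i - \delta_i \rVert_2^2$.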
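The two steps above (update the layer from the predicted gradient, then regress the predictor onto the true gradient) can be sketched in a few lines. Everything here is an assumption made for illustration and not the paper's setup: both layers are scalars, the downstream weight `w2` is held fixed, the regression target is `3 * x`, and the synthetic gradient model is a hypothetical linear function `M(h) = a * h + b`.

```python
import random

random.seed(0)

w2 = 1.5                   # downstream layer, held fixed to keep the sketch small
w1 = 1.0                   # upstream layer being trained: h = w1 * x
a, b = 0.0, 0.0            # synthetic gradient model M(h) = a * h + b (hypothetical)
lr_layer, lr_model = 0.01, 0.05


def true_delta(h, x):
    """True dL/dh for L = 0.5 * (w2 * h - 3 * x)^2; needs the downstream pass."""
    return (w2 * h - 3.0 * x) * w2


for _ in range(5000):
    x = random.uniform(-1.0, 1.0)
    h = w1 * x

    # Update layer 1 right away from the predicted gradient (dh/dw1 = x).
    delta_hat = a * h + b
    w1 -= lr_layer * delta_hat * x

    # Once the true gradient is available, take one gradient step on
    # 0.5 * (delta_hat - delta)^2 with respect to a and b.
    err = delta_hat - true_delta(h, x)
    a -= lr_model * err * h
    b -= lr_model * err

print("w1 * w2 =", w1 * w2)  # the loss is zero when w1 * w2 equals 3
```

In this toy problem the true gradient is computed immediately; in a decoupled system it would arrive later, which is why the layer update and the predictor update are written as separate steps.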


Every Layer DNI for FeedForward Networks

Use the backpropagated synthetic gradient $\hat{\delta}_{i+1}$, instead of the true gradient, as the training target for each synthetic gradient model.
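One way to write this out, assuming a DNI sits between every pair of layers and that only the final layer sees the true loss gradient $\partial L / \partial h_N$, is the following sketch in the slide's notation. The "target" and "loss" labels are added here for illustration.

```latex
\begin{align*}
\hat{\delta}_i &= M_{i+1}(h_i), \\
\text{target for } \hat{\delta}_i &= \hat{\delta}_{i+1}\,\frac{\partial h_{i+1}}{\partial h_i}, \\
\text{loss for } M_{i+1} &= \left\lVert \hat{\delta}_i - \hat{\delta}_{i+1}\,\frac{\partial h_{i+1}}{\partial h_i} \right\rVert_2^2 .
\end{align*}
```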


Synthetic Gradients for RNNs

The task: stream prediction; possibly infinite

Unrolling the recurrent network:

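Forward graph: $\mathcal{F}_1^{\infty}$, made up of $f_i$ where $i$ varies from $1$ to $\infty$

At a particular point in time $t$, minimise the loss over the next steps:

$$\sum_{\tau=t}^{\infty} L_\tau \tag{4}$$

$$\theta \leftarrow \theta - \alpha \sum_{\tau=t}^{\infty} \frac{\partial L_\tau}{\partial \theta} \tag{5}$$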


Synthetic Gradients for RNNs

Truncated Backpropagation

At a particular point in time t, minimise Loss over the next steps

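$$\theta \leftarrow \theta - \alpha\left(\sum_{\tau=t}^{T} \frac{\partial L_\tau}{\partial \theta} + \left(\sum_{\tau=T+1}^{\infty} \frac{\partial L_\tau}{\partial h_T}\right)\frac{\partial h_T}{\partial \theta}\right) \tag{6}$$

$$\theta \leftarrow \theta - \alpha\left(\sum_{\tau=t}^{T} \frac{\partial L_\tau}{\partial \theta} + \delta_T\,\frac{\partial h_T}{\partial \theta}\right) \tag{7}$$

where $\delta_T = \sum_{\tau=T+1}^{\infty} \frac{\partial L_\tau}{\partial h_T}$ is the gradient of all future losses with respect to $h_T$.

Truncated BPTT sets $\delta_T = 0$, which limits the temporal dependency the RNN can learn.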
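The split in equations (6) and (7) can be checked numerically with a minimal sketch. It assumes a scalar RNN `h_tau = w * h_{tau-1} + x_tau` with loss `L_tau = 0.5 * (h_tau - y_tau)^2`, and it treats the weight used in steps 1..T as a separate copy from the weight used afterwards, so that the gradient is with respect to the first window only. All function names and data values are hypothetical.

```python
def run(w_first, w_rest, xs, ys, T, h0=0.0):
    """Unroll the RNN, using w_first for steps 1..T and w_rest afterwards."""
    h, hs, loss = h0, [], 0.0
    for tau, (x, y) in enumerate(zip(xs, ys), start=1):
        w = w_first if tau <= T else w_rest
        h = w * h + x
        hs.append(h)
        loss += 0.5 * (h - y) ** 2
    return hs, loss


def future_delta(w, xs, ys, T, h0=0.0):
    """True delta_T: gradient of all losses after step T with respect to h_T."""
    hs, _ = run(w, w, xs, ys, T, h0)
    d_h = 0.0
    for tau in range(len(xs), T, -1):
        d_h += hs[tau - 1] - ys[tau - 1]   # dL_tau/dh_tau
        d_h *= w                           # dh_tau/dh_{tau-1} = w
    return d_h


def window_gradient(w, xs, ys, T, delta_T, h0=0.0):
    """Equation (7): gradient for the weight of steps 1..T, given delta_T."""
    hs, _ = run(w, w, xs, ys, T, h0)
    grad, d_h = 0.0, delta_T
    for tau in range(T, 0, -1):
        d_h += hs[tau - 1] - ys[tau - 1]
        h_prev = hs[tau - 2] if tau > 1 else h0
        grad += d_h * h_prev               # dh_tau/dw = h_{tau-1}
        d_h *= w
    return grad


xs = [1.0, -0.5, 0.3, 0.8, -1.2, 0.4]
ys = [0.5, 0.0, 0.2, 0.1, -0.3, 0.6]
w, T = 0.9, 3

truncated = window_gradient(w, xs, ys, T, delta_T=0.0)
full = window_gradient(w, xs, ys, T, delta_T=future_delta(w, xs, ys, T))

# Finite-difference check of the full gradient with respect to the first-window weight.
eps = 1e-6
numeric = (run(w + eps, w, xs, ys, T)[1] - run(w - eps, w, xs, ys, T)[1]) / (2 * eps)
print(truncated, full, numeric)
```

With `delta_T=0.0` the window ignores every loss after step T, which is the truncation described above; passing the true `delta_T` recovers the finite-difference gradient.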


Synthetic Gradients for RNNs

Truncated Backpropagation

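$$\theta \leftarrow \theta - \alpha\left(\sum_{\tau=t}^{T} \frac{\partial L_\tau}{\partial \theta} + \hat{\delta}_T\,\frac{\partial h_T}{\partial \theta}\right) \tag{8}$$

$\hat{\delta}_T = M_T(h_T)$: a learned approximation of the future loss gradients

Divide the unrolled RNN into subnetworks of length $T$

Insert a DNI between $\mathcal{F}_t^{t+T-1}$ and $\mathcal{F}_{t+T}^{t+2T-1}$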


Synthetic Gradients for RNNs

Truncated Backpropagation

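$$\theta \leftarrow \theta - \alpha\left(\sum_{\tau=t}^{T} \frac{\partial L_\tau}{\partial \theta} + \hat{\delta}_T\,\frac{\partial h_T}{\partial \theta}\right) \tag{9}$$

Train $M_T$ by minimizing $d(\delta_T, \hat{\delta}_T)$

The true $\delta_T$ is not available, so the target is bootstrapped:

$$\delta_T = \sum_{\tau=T+1}^{2T} \frac{\partial L_\tau}{\partial h_T} + \hat{\delta}_{2T+1}\,\frac{\partial h_{2T}}{\partial h_T}$$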
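The bootstrapped target can be illustrated on a toy problem. The sketch assumes a scalar RNN `h_tau = w * h_{tau-1} + x_tau` with loss `0.5 * (h_tau - y_tau)^2`, and it reads the slide's $\hat{\delta}_{2T+1}$ as a predicted gradient with respect to $h_{2T}$ that is pushed back through steps T+1..2T. The helper names and data are hypothetical. If the prediction at the 2T boundary were exact, the bootstrapped target would equal the true $\delta_T$; with a prediction of zero it collapses to a truncated gradient.

```python
def unroll(w, xs, h0=0.0):
    """Return the hidden states h_1..h_n of the scalar RNN."""
    hs, h = [], h0
    for x in xs:
        h = w * h + x
        hs.append(h)
    return hs


def grad_wrt_state(w, hs, ys, start, end, incoming=0.0):
    """Gradient of the losses at steps start+1..end with respect to h_start.

    `incoming` is a gradient with respect to h_end (for example a synthetic
    gradient) that is backpropagated through the same steps. Steps are 1-indexed.
    """
    d_h = incoming
    for tau in range(end, start, -1):
        d_h += hs[tau - 1] - ys[tau - 1]   # dL_tau/dh_tau
        d_h *= w                           # dh_tau/dh_{tau-1} = w
    return d_h


w, T = 0.8, 2
xs = [0.5, -1.0, 0.7, 0.2, -0.4, 0.9]
ys = [0.0, 0.3, -0.2, 0.5, 0.1, -0.6]
hs = unroll(w, xs)

true_delta_T = grad_wrt_state(w, hs, ys, T, len(xs))
true_delta_2T = grad_wrt_state(w, hs, ys, 2 * T, len(xs))

# Bootstrapped targets for delta_T under two different boundary predictions.
target_perfect = grad_wrt_state(w, hs, ys, T, 2 * T, incoming=true_delta_2T)
target_untrained = grad_wrt_state(w, hs, ys, T, 2 * T, incoming=0.0)
print(true_delta_T, target_perfect, target_untrained)
```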


Other applications

1. Add an auxiliary task
2. Combine with true backpropagation gradients
3. Arbitrary network graphs


Results: Penn Treebank Language Modeling


Results: Copy Repeat Copy Task

1. Copy: copy a sentence of length N.
2. Repeat Copy: copy a sentence of length N, R times.
3. The maximum sequence length successfully modeled increases with DNI for the same T in BPTT.
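The two tasks above can be sketched as data generators. The encoding is an assumption made for illustration (integer symbols drawn from a small vocabulary, no delimiter tokens), and the function names are hypothetical; the paper's exact input format is not given on the slide.

```python
import random


def copy_example(n, vocab_size=8, rng=random):
    """Copy task: the target is the input sequence of length n."""
    seq = [rng.randrange(vocab_size) for _ in range(n)]
    return seq, list(seq)


def repeat_copy_example(n, r, vocab_size=8, rng=random):
    """Repeat Copy task: the target is the input sequence repeated r times."""
    seq = [rng.randrange(vocab_size) for _ in range(n)]
    return seq, seq * r


rng = random.Random(0)
copy_in, copy_out = copy_example(5, rng=rng)
rep_in, rep_out = repeat_copy_example(4, 3, rng=rng)
print(copy_in, copy_out)
print(rep_in, rep_out)
```

Longer sequences (larger `n`, or larger `n * r`) require the model to carry information across more time steps than the truncation length T.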
