Upload
david-balduzzi
View
229
Download
1
Embed Size (px)
Citation preview
Game theory for Neural Networks
David Balduzzi VUW
C3 Research Camp
Goals
“a justification is objective if, in principle, it can be tested and understood by anybody”
— Karl Popper
Objective knowledge
• How is knowledge created?
• How is knowledge shared?
Goals
• Deductive reasoning (a priori knowledge)• foundation of mathematics • formal: mathematical logic • operational: Turing machines, lambda calculus
Background
Tiny Turing Machine
Alvy Ray Smith, based on [Yurii Rogozhin 1996]
Tiny Turing Machine
Alvy Ray Smith, based on [Yurii Rogozhin 1996]
• Deductive reasoning (a priori knowledge)• foundation of mathematics • formal: mathematical logic • operational: Turing machines, lambda calculus
• Inductive reasoning (empirical knowledge)• foundation of science • formal: learning theory • operational: SVMs, neural networks, etc.
Background
• What’s missing? • To be objective, knowledge must be shared • How can agents share knowledge?
• A theory of distributed reasoning • induction builds on deduction • distribution should build on both
Neural Networks• Superhuman performance on
object recognition (ImageNet) • Outperformed humans at
recognising street-signs (Google streetview).
• Superhuman performance on Atari games (Google).
• Real-time translation: English voice to Chinese voice (Microsoft).
30 years of trial-and-error• Architecture
• Weight-tying (convolutions) • Max-pooling
• Nonlinearity • Rectilinear units (Jarrett 2009)
• Credit assignment • Backpropagation
• Optimization • Stochastic Gradient Descent • Nesterov momentum, BFGS, AdaGrad, etc. • RMSProp (Riedmiller 1993, Hinton & Tieleman 200x)
• Regularization • Dropout (Srivastava 2014) or • Batch-Normalization (Szegedy & Ioffe 2015)
• Concrete problem:• Why does gradient descent work on neural nets? • Not just gradient descent; many methods designed
for convex problems work surprisingly well on neural nets.
• Big picture:• Neural networks are populations of learners • that distribute knowledge extremely effectively.
We should study them as such.
This tutorial
• Deduction • Induction • Gradients • Games • Neural networks
Outline
Deductive reasoning
“we are regarding the function of the mathematician as simply to determine the truth or falsity of propositions”
All men are mortal Socrates is a man
Deductive reasoning
Socrates is mortal
All men are mortal Socrates is a man
Deductive reasoning
Socrates is mortal
Idea: Intelligence is deductive reasoning!
Articles
WINTER 2006 13
Photo courtesy Dartmouth College.
Page 1 of the Original Proposal.
Trenchard More, John McCarthy, Marvin Minsky, Oliver Selfridge,
Ray Solomonoff
50 years later
“To understand the real world, we must have a different set of primitives from the relatively
simple line trackers suitable and sufficient for the blocks world”
— Patrick Winston (1975) Director of MIT’s AI lab from 1972-1997
A bump in the road
The AI winterhttp://en.wikipedia.org/wiki/AI_winter
Reductio ad absurdum
“Intelligence is 10 million rules” — Doug Lenat
Deductive reasoning• Structure:
• axioms; • logical laws; • production system
• Derive complicated (empirical) truths from simple (self-evident) truths
• Deductive reasoning provides no guidance about which axiom(s) to eliminate when “derived truths” contradict experience
• In general, deductive reasoning has nothing to say about mistakes except “avoid them”
• In retrospect, it’s hard to understand why anyone thought deduction would yield artificial intelligence.
Inductive reasoning
“this book is composed […] upon one very simple theme […] that we can learn from our mistakes”
Inductive reasoningSocrates is mortal
Plato is mortal [ … more examples …]
All men are mortal
Inductive reasoning
All observed As are Bs a is an A
a is a B
The problem of induction
All observed As are Bs a is an A
a is a B
This is clearly false. The conclusion does not follow from the premises.
“If a machine is […] infallible, it cannot also be intelligent. There are […] theorems which say almost exactly that. But
these theorems say nothing about how much intelligence may be displayed if a machine makes no pretence at infallibility.”
[ it’s ok to make mistakes ]
Let’s look at some examples
Sequential predictionScenario: At time t, Forecaster predicts 0 or 1.
Nature then reveals the truth.
Forecaster has access to N experts. One of them is always correct.
Goal: Predict as accurately as possible.
Halving Algorithm
While t>0:Predict by majority vote.Step 1.Remove experts that are wrong.Step 2.t ← t+1Step 3.
Set t = 1.
While t>0:
Question:
Predict by majority vote.Step 1.Remove experts that are wrong.Step 2.t ← t+1Step 3.
How long to find correct expert?
Set t = 1.Halving Algorithm
While t>0:
BAD!!!
Predict by majority vote.Step 1.Remove experts that are wrong.Step 2.t ← t+1Step 3.
How long to find correct expert?
Set t = 1.Halving Algorithm
While t>0:
Question:
Predict by majority vote.Step 1.Remove experts that are wrong.Step 2.t ← t+1Step 3.
How many errors?
Set t = 1.Halving Algorithm
Predict by majority vote.Step 1.Remove experts that are wrong.Step 2.
How many errors?
Halving Algorithm
Predict by majority vote.Step 1.Remove experts that are wrong.Step 2.
How many errors?
When algorithm makes a mistake,
it removes ≥ half of experts
Halving Algorithm
≤ log N
Predict by majority vote.Step 1.Remove experts that are wrong.Step 2.
How many errors?
When algorithm makes a mistake,
it removes ≥ half of experts
Halving Algorithm
What’s going on?
Didn’t we just use deductive reasoning!?!
What’s going on?
Didn’t we just use deductive reasoning!?!
Yes… but No!
What’s going on?
Algorithm: makes educated guesses about Nature
Analysis: proves theorem about number of errors
(inductive)
(deductive)
What’s going on?
Algorithm: makes educated guesses about Nature
Analysis: proves theorem about number of errors
(inductive)
(deductive)
The algorithm learns — but it does not deduce!
Adversarial predictionScenario: At time t, Forecaster predicts 0 or 1.
Nature then reveals the truth.
Forecaster has access to N experts. One of them is always correct. Nature is adversarial.
Goal: Predict as accurately as possible.
At time t, Forecaster predicts 0 or 1. Nature then reveals the truth.
Forecaster has access to N experts. One of them is always correct. Nature is adversarial.
Goal: Predict as accurately as possible.
Seriously?!?!
Regret
Let m* be the best expert in hindsight. regret := # mistakes(Forecaster) - # mistakes(m*)
Goal: Predict as accurately as possible. Minimize regret.
While t>0:
Question:
Predict by majority vote.Step 1.Remove experts that are wrong.Step 2.t ← t+1Step 3.
What is the regret?
Set t = 1.Halving Algorithm
ADVERSARIAL SETTING
While t>0:
Question:
Predict by majority vote.Step 1.Remove experts that are wrong.Step 2.t ← t+1Step 3.
What is the regret?
Set t = 1.Halving Algorithm
ADVERSARIAL SETTING
BAD!
While t ≤ T:
Question:
Predict by weighted majority vote.Step 1.Multiply incorrect experts by β.Step 2.t ← t+1Step 3.
What is the regret?
Set t = 1.Weighted Majority Alg
Pick β in (0,1). Assign 1 to experts.
While t ≤ T:Predict by weighted majority vote.Step 1.Multiply incorrect experts by β.Step 2.t ← t+1Step 3.
Set t = 1.
What is the regret? [ choose β carefully ]r
T · logN2
Pick β in (0,1). Assign 1 to experts.
Weighted Majority Alg
Loss functions
`(y, y0) =1
2
⇣y � y0
⌘2
`(p, i) =
(� log(1� p) if i = 0
� log(p) else
`(y, y0) =
(0 if y = y0
1 else
• 0/1 loss
• Mean square error
• Logistic loss
Loss functions
`(y, y0) =1
2
⇣y � y0
⌘2
`(p, i) =
(� log(1� p) if i = 0
� log(p) else
`(y, y0) =
(0 if y = y0
1 else
• 0/1 loss
• Mean square error
• Logistic loss
not convex
convex
Exp. Weights Alg.
While t ≤ T:Predict the weighted sum.Step 1.Multiply expert i byStep 2.t ← t+1Step 3.
Set t = 1.
What is the regret? [ choose β carefully ]
Pick β > 0. Assign 1 to experts.
e��·`(fti ,yt)
logN+
p2T logN
Inductive reasoning• Structure:
• game played over a series of rounds • on each round, experts make predictions • Forecaster takes weighted average of predictions • Nature reveals the truth and a loss is incurred • Forecaster adjusts weights
• correct experts —> increase weights; • incorrect —> decrease
• Goal:• average loss per round converges to best-in-
hindsight
Inductive reasoning• How it works
• Forecaster adjusts strategy according to prior experience
• Current strategy encodes previous mistakes
• No guarantee on any particular example • Performance is optimal on average
• Still to come• How can Forecaster communicate its knowledge
with others, in a way that is useful to them?
Gradient descent
“the great watershed in optimisation isn’t between linearity and nonlinearity, but convexity and nonconvexity”
— Tyrrell Rockafellar
Gradient descent
• Everyone knows (?) gradient descent
• Take a fresh look: • gradient descent is a no-regret algorithm • —> game theory • —> neural networks
Online Convex Opt.Scenario: Convex set K;
differentiable loss L(a,b) that is convex function of a
At time t, Forecaster picks at in K Nature responds with bt in K [ Nature is adversarial ]
Forecaster’s loss is L(a,b)
Goal: Minimize regret.
�
�� �� �� � � � ��
�
��
�
Follow the Leader
Idea: Predict the at that would have worked best on { b1, … ,bt-1 }
While t ≤ T:Step 1.Step 2.
Set t = 1.Follow the Leader
Idea:
t ← t+1
Pick a1 at random.
at := argmina2K
"t�1X
i=1
L(a, bi)#
Predict the at that would have worked best on { b1, … ,bt-1 }
While t ≤ T:Step 1.Step 2.
Set t = 1.Follow the LeaderBAD!
Problem: Nature pulls Forecaster back-and-forth No memory!
t ← t+1
Pick a1 at random.
at := argmina2K
"t�1X
i=1
L(a, bi)#
While t ≤ T:Step 1.Step 2.
Set t = 1.
t ← t+1
FTRLPick a1 at random.
regularize
at := argmina2K
"t�1X
i=1
L(a, bi) +�
2· kak22
#
While t ≤ T:Step 1.Step 2.
Set t = 1.FTRL
t ← t+1
Pick a1 at random.
gradient descent
at at�1 � � · @
@aL(at�1, bt�1)
While t ≤ T:Step 1.Step 2.
Set t = 1.FTRL
Intuition: β controls memory
t ← t+1
Pick a1 at random.
at at�1 � � · @
@aL(at�1, bt�1)
While t ≤ T:Step 1.Step 2.
Set t = 1.FTRL
What is the regret? [ choose β carefully ]
diam(K) · Lipschitz(L) ·pT
t ← t+1
Pick a1 at random.
at at�1 � � · @
@aL(at�1, bt�1)
Logarithmic regret
regret
[ choose β carefully ]
Convergence at rate is actually quite slow
• For many common loss functions • (e.g. mean-square error, logistic loss)
• it is possible to do much better
pT
T
logT
T
Information theory
Two connections
• Entropic regularization <—> • Exponential Weights Algorithm
• Analogy between gradients and Fano information • “the information represented by y about x”
log
P(x|y)P(x)
Gradient descent
• Gradient descent is fast, cheap, and guaranteed to converge — even against adversaries
• “a hammer that smashes convex nails”
�
�� �� �� � � � ��
�
��
�
• GD is well-understood in convex settings
• Convergence rate depends on the structure of the loss, size of search space, etc.
Convex optimizationConvex methods
• Stochastic gradient descent
• Nesterov momentum
• BFGS (Broyden-Fletcher-Goldfarb-Shanno)
• Conjugate gradient methods
• AdaGrad
Online Convex Opt. (deep learning)
Apply Gradient Descent to nonconvex optimization (neural networks).
• Theorems don’t work (not convex)
• tons of engineering (decades of tweaking)
Amazing performance.
self-driving cars60,61. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision sys-tems for cars. Other applications gaining importance involve natural language understanding14 and speech recognition7.
Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spec-tacular results, almost halving the error rates of the best compet-ing approaches1. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout62, and tech-niques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks4,58,59,63–65 and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).
Recent ConvNet architectures have 10 to 20 layers of ReLUs, hun-dreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization have reduced training times to a few hours.
The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook,
Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups to initiate research and development projects and to deploy ConvNet-based image understanding products and services.
ConvNets are easily amenable to efficient hardware implemen-tations in chips or field-programmable gate arrays66,67. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.
Distributed representations and language processing Deep-learning theory shows that deep nets have two different expo-nential advantages over classic learning algorithms that do not use distributed representations21. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure40. First, learning distributed representations enable generalization to new combinations of the values of learned features beyond those seen during training (for example, 2n combinations are possible with n binary features)68,69. Second, composing layers of representation in a deep net brings the potential for another exponential advantage70 (exponential in the depth).
The hidden layers of a multilayer neural network learn to repre-sent the network’s inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local
Figure 3 | From image to text. Captions generated by a recurrent neural network (RNN) taking, as extra input, the representation extracted by a deep convolution neural network (CNN) from a test image, with the RNN trained to ‘translate’ high-level representations of images into captions (top). Reproduced
with permission from ref. 102. When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches were given more attention) as it generates each word (bold), we found86 that it exploits this to achieve better ‘translation’ of images into captions.
VisionDeep CNN
LanguageGenerating RNN
A group of people shopping at an outdoor
market.
There are many vegetables at the
fruit stand.
A woman is throwing a frisbee in a park.
A little girl sitting on a bed with a teddy bear. A group of people sitting on a boat in the water. A giraffe standing in a forest withtrees in the background.
A dog is standing on a hardwood floor. A stop sign is on a road with amountain in the background
4 4 0 | N A T U R E | V O L 5 2 1 | 2 8 M A Y 2 0 1 5
REVIEWINSIGHT
© 2015 Macmillan Publishers Limited. All rights reserved
Deep Learning
• Superhuman performance on object recognition (ImageNet)
• Outperformed humans at recognising street-signs (Google streetview).
• Superhuman performance on Atari games (Google).
• Real-time translation: English voice to Chinese voice (Microsoft).
Game theory
“An equilibrium is not always an optimum; it might not even be good. This may be the most
important discovery of game theory.” — Ivar Ekeland
Game theory• So far
• Seen how Forecaster can learn to use “expert” advice.
• Forecaster modifies the weights of experts such that it converges to optimal-in-hindsight predictions.
• Who are these experts?• We will model them as Forecasters in their own right • First, let’s develop tools for modeling populations.
Setup• Players A, B with actions {a1, a2, … am} and {b1, b2, … bn}
• Two payoff matrices, describing the losses/gains of the players for every combination of moves
• Goal of players is to maximise-gain / minimise-loss
• Neither player knows what the other will do
Setup• Players A, B with actions {a1, a2, … am} and {b1, b2, … bn}
• Two payoff matrices, describing the losses/gains of the players for every combination of moves
AB
RP
R P
0
Rock-paper -scissors (zero-sum game)
S
S0
0
-1 11 -1-1 1
AB
RP
R P
0
S
S0
0
1 -1-1 11 -1
Minimax theoreminfa2K
supb2K
L(a, b) = supb2K
infa2K
L(a, b)
Minimax theorem
Forecaster picks a, Nature responds b
infa2K
supb2K
L(a, b) = supb2K
infa2K
L(a, b)
Minimax theorem
Forecaster picks a, Nature responds b
Nature picks b, Forecaster responds a
infa2K
supb2K
L(a, b) = supb2K
infa2K
L(a, b)
Minimax theorem
Forecaster picks a, Nature responds b
Nature picks b, Forecaster responds a
infa2K
supb2K
L(a, b) = supb2K
infa2K
L(a, b)
AB
RP
R P
0
S
S0
0
-1 11 -1-1 1
AB
RP
R P
0
S
S0
0
1 -1-1 11 -1
Minimax theorem
Forecaster picks a, Nature responds b
Nature picks b, Forecaster responds a
infa2K
supb2K
L(a, b) = supb2K
infa2K
L(a, b)
infa2K
supb2K
L(a, b) � supb2K
infa2K
L(a, b)going first hurts Forecaster, so
Minimax theorem
Proof idea:No-regret algorithm →
→
→
Forecaster can asymptotically match hindsight
Order of players doesn’t matter asymptotically
Convert series of moves into average via online-to-batch.
Let m* be the best move in hindsight. regret := loss(Forecaster) - loss(m*)
infa2K
supb2K
L(a, b) supb2K
infa2K
L(a, b)
Minimax theorem
Proof idea:No-regret algorithm →
→
→
Forecaster can asymptotically match hindsight
Order of players doesn’t matter asymptotically
Convert series of moves into average via online-to-batch.
Let m* be the best move in hindsight. regret := loss(Forecaster) - loss(m*)
infa2K
supb2K
L(a, b) supb2K
infa2K
L(a, b)
Minimax theorem
Proof idea:No-regret algorithm →
→
→
Forecaster can asymptotically match hindsight
Order of players doesn’t matter asymptotically
Convert series of moves into average via online-to-batch.
Let m* be the best move in hindsight. regret := loss(Forecaster) - loss(m*)
infa2K
supb2K
L(a, b) supb2K
infa2K
L(a, b)
Minimax theorem
Proof idea:No-regret algorithm →
→
→
Forecaster can asymptotically match hindsight
Order of players doesn’t matter asymptotically
Convert series of moves into average via online-to-batch.
Let m* be the best move in hindsight. regret := loss(Forecaster) - loss(m*)
a =1
T
TX
t=1
at
infa2K
supb2K
L(a, b) supb2K
infa2K
L(a, b)
Nash equilibrium
Nash equilibrium <—> no player benefits by deviating
Nash equilibrium
Nash equilibrium <—> no player benefits by deviating
AB
Deny
Confess
Deny Confess
-2/-2 -6/0
0/-6 -6/-6
Prisoner’s dilemma:
Nash equilibrium
Problems:
• Computationally hard (PPAD-complete, Daskalakis 2008)
• Some outcomes are very bad
Nash equilibrium
AB
Stop
Go
Stop Go
0/0 10/0
0/10 -∞/-∞
Blind-intersection dilemma:
Nash equilibrium
AB
Stop
Go
Stop Go
0/0 10/0
0/10 -∞/-∞
Blind-intersection dilemma:
Equilibrium: both drivers STOP; no-one uses intersection
Correlated equilibrium
AB
Stop
Go
Stop Go
0/0 10/0
0/10 -∞/-∞
Traffic light:P(x, y) =
8><
>:
12 if x = S and y = G
12 if x = G and y = S
0 else
Blind-intersection dilemma:
Correlated equilibrium
AB
Stop
Go
Stop Go
0/0 10/0
0/10 -∞/-∞
Blind-intersection dilemma:
• Better outcomes for players:Correlated equilibria can be much better than Nash equilibria (depending on the signal)
Convex games• Players with convex action sets
• Loss functions where and each loss is separately convex in each argument
• Players aim to minimize their losses
• Player do not know what other players will do
`i(w1, . . . ,wN)
{Pi}Ni=1 {Ki}Ni=1
wi 2 Ki
Sketch of proof:
Since player j applies a no-regret algorithm, it follows that after T rounds, its sequence of moves {wtj} has a loss within of the optimal, wj*.
Interpreting the historical sequence of plays as a signal obtains
The result follows by letting —> 0.
No-regret —> Corr. eqm
1
T
X`i(w
tj ,w
t�j)� `i(w
⇤j ,w
t�j) ✏ 8w⇤
j
✏
✏
Eh`(w)
i E
h`(w⇤
j ,w�j)i+ ✏
Nash —> AumannFrom Nash equilibrium to correlated equilibrium:
• Computationally tractable:No-regret algorithms converge on coarse correlated equilibria
• Better outcomes:Signal is the past actions of players, chosen according to an optimal strategy
• Every game has a Nash equilibrium • Economists typically assume that economy is in
Nash equilibrium • But Nash equilibria are hard to find and are often
unpleasant
• Correlated equilibria are easier to compute and can be better for the players
• A signal is required to coordinate players: • set signally externally
• e.g. traffic light • in repeated games, construct signal as the
average of historical behavior • e.g. no-regret learning
Game theory
• Correlated equilibrium• Players converge on mutually agreeable
outcomes by adapting to each other’s behavior over repeated rounds
• The signal provides a “strong hint” about where to find equilibria.
• Nash equilibrium• It’s hard to guess what others will do when
playing de novo (intractability); and • You haven’t established a track record to
build on (undesirability)
Game theory
Gradients and GamesGradient descent is a no-regret algorithm
• Insert players using gradient descent into any convex game, and they will converge on a correlated equilibrium
• Global convergence to correlated equilibrium is controlled by convergence of players to their best-in-hindsight local optimum
Next: • Gradient descent + chain rule = backprop
Neural networks
“Anything a human can do in a fraction of a second, a deep network can do too” — Ilya Sutskever
• Origins • Backpropagation (Werbos 1974, Rumelhart 1986) • Convolutional nets (Fukushima 1980, LeCun 1990)
• Mid-1990s — Mid-2000s • back to convex methods; neural networks ignored
• Unsupervised pre-training • restricted Boltzmann machines (Hinton 2006) • autoencoders (Bengio 2007)
• Convolutional nets • sharp improvement on ImageNET challenge
(Krizhevsky 2012) • back to fully-supervised training
History
• Convolutional nets • sharp improvement on ImageNET challenge
(Krizhevsky 2012) • back to fully-supervised training
History
30 years of trial-and-error• Architecture
• Weight-tying (convolutions) • Max-pooling
• Nonlinearity • Rectilinear units (Jarrett 2009)
• Credit assignment • Backpropagation
• Optimization • Stochastic Gradient Descent • Nesterov momentum, BFGS, AdaGrad, etc. • RMSProp (Riedmiller 1993, Hinton & Tieleman 200x)
• Regularization • Dropout (Srivastava et al 2014) or • Batch-Normalization (Szegedy & Ioffe 2015)
Why (these) NNs?
• Why neural networks? • GPUs • Good local minima (no-one knows why) • “Mirrors the world’s compositional structure”
• Why ReLUs, Why Max-Pooling, Why convex methods? • “locally” convex optimization (this talk)
Neural networksInput
Layer1
W1 Matrix mult
Layer2
W2 Matrix mult
Output
W3 Matrix mult
Nonlinearity
Nonlinearity
fW(x)
Neural networksInput
Layer1
W1 Matrix mult
Layer2
W2 Matrix mult
Output
W3 Matrix mult
Nonlinearity
Nonlinearity
`(fW(x), y)
label
loss
Univ Function ApproxTheorem (Leshno 1993): A two-layer neural network with infinite width and nonpolynomial nonlinearity is
dense in the space of continuous functions.
�
��� �� � � ������
����
���
���
���
�
�
��� �� � � ������
����
���
���
���
�
Sigmoid Tanh
�(x) =1
1+ e
�x
⌧(x) =e
x � e
�x
e
x + e
�x
�
��� �� � � �������������������
�
ReLU
⇢(x) = max(0, x)
�
��� �� � � ������
����
���
���
���
�
�
��� �� � � ������
����
���
���
���
�
�
��� �� � � ������
����
���
���
���
�
d⇢
dx
d�
dx
d⌧
dx
�
��� �� � � ������
����
���
���
���
�
�
��� �� � � ������
����
���
���
���
�
�
��� �� � � �������������������
�
⌧� ⇢
The chain rule
h
g
f
df
dx
=df
dg
· dgdh
· dhdx
f � g � h(x) = f(g(h(x)))
Linear networksInput
W1
W2
W3
Feedforward sweepInput
W1
W2
W3
x
out
=
i=#LY
1
Wi
!· x
in
Input
W1
W2
W3
Weight updates
j
Gradient descent
Wt+1ij Wt
ij � ⌘ · @`
@Wij
Input
W1
W2
W3
Weight updates
j
Gradient descent
Wt+1ij Wt
ij � ⌘ · @`
@Wij
by the chain rule@`
@Wij=
@`
@xj· @xj@Wij
Input
W1
W2
W3
Weight updates
j
Gradient descent
Wt+1ij Wt
ij � ⌘ · @`
@Wij
by the chain rule
since@xj@Wij
= xi
@`
@Wij=
@`
@xj· xi
Input
W1
W2
W3
Weight updates
j
Gradient descent
Wt+1ij Wt
ij � ⌘ · @`
@Wij
by the chain rule
also, define
�j :=@`
@xj
@`
@Wij= �j · xi
Input
W1
W2
W3
Weight updates
j
Gradient descent
Wt+1ij Wt
ij � ⌘ · @`
@Wij
by the chain rule
How to compute?
�j :=@`
@xj
@`
@Wij= �j · xi
Input
�i =
#LY
k=i+1
W|k
!·⇣r
x
out
`⌘
x
out
=
i=#LY
1
Wi
!· x
in
W|1
W|2
W|3
Feedforward sweep
Backpropagation
Backpropagation
Max-pooling
Pick element with max output in each block;
zero out the rest
DropoutInput
W1
W2
W3
During training, randomly zero out units with
probability 0.5
ReLU networksInput
W1
W2
W3
�
��� �� � � �������������������
�
ReLU
⇢(x) = max(0, x)
Path sumsInput
W1
W2
W3
i
j
@`
@xi=
@`
@xj· @xj@xi
consider the internal structure
Path sumsInput
W1
W2
W3
i
j
@`
@xi=
@`
@xj· @xj@xi
@xj@xi
=X
p2{i!j}
Y
↵2p
w↵
Active path sumsInput
W1
W2
W3
i
j
@`
@xi=
@`
@xj· @xj@xi
@xj@xi
=X
p2{i!j}active
Y
↵2p
w↵
Knowledge distribution• Units in NN <—> Players in a game
• All units have the same loss
• Loss is convex function of units’ weights when unit is active
• Units in output layer compute error-gradients
• Other units distribute signals of the form (error-gradients) x (path-sums)
Circadian games• Units have convex losses when active
• Neural networks are circadian games • i.e. game that is convex for active (awake) players
Main Theorem. Convergence of neural network to correlated equilibrium is upper-bounded by convergence rates of individual units.
Corollary. Units perform convex optimization when active. Convex methods and guarantees (on convergence rates) apply to neural networks.
—> Explains why ReLUs work so well.
Philosophical Corollary• Neural networks are games
Units are no-regret learners
• Unlike many games (e.g. in micro-economic theory), the equilibria in NNs are extremely desirable
• Error-backpropagation distributes gradient-information across units
• Claim: There is a theory of distributed reasoning implicit in gradient descent and the chain rule.
The End. Questions?