Game theory for Neural Networks

Page 1: Game theory for neural networks

Game theory for Neural Networks

David Balduzzi, VUW

C3 Research Camp

Page 2: Game theory for neural networks

Goals

“a justification is objective if, in principle, it can be tested and understood by anybody”

— Karl Popper

Page 3: Game theory for neural networks

Objective knowledge

• How is knowledge created?

• How is knowledge shared?

Goals

Page 4: Game theory for neural networks

• Deductive reasoning (a priori knowledge)
  • foundation of mathematics
  • formal: mathematical logic
  • operational: Turing machines, lambda calculus

Background

Page 5: Game theory for neural networks

Tiny Turing Machine

Alvy Ray Smith, based on [Yurii Rogozhin 1996]

Page 6: Game theory for neural networks

Tiny Turing Machine

Alvy Ray Smith, based on [Yurii Rogozhin 1996]

Page 7: Game theory for neural networks

• Deductive reasoning (a priori knowledge)
  • foundation of mathematics
  • formal: mathematical logic
  • operational: Turing machines, lambda calculus

• Inductive reasoning (empirical knowledge)
  • foundation of science
  • formal: learning theory
  • operational: SVMs, neural networks, etc.

Background

Page 8: Game theory for neural networks

• What’s missing?
  • To be objective, knowledge must be shared.
  • How can agents share knowledge?

• A theory of distributed reasoning
  • induction builds on deduction
  • distribution should build on both

Page 9: Game theory for neural networks

Neural Networks

• Superhuman performance on object recognition (ImageNet).
• Outperformed humans at recognising street-signs (Google Streetview).
• Superhuman performance on Atari games (Google).
• Real-time translation: English voice to Chinese voice (Microsoft).

Page 10: Game theory for neural networks

30 years of trial-and-error

• Architecture
  • Weight-tying (convolutions)
  • Max-pooling
• Nonlinearity
  • Rectified linear units (Jarrett 2009)
• Credit assignment
  • Backpropagation
• Optimization
  • Stochastic Gradient Descent
  • Nesterov momentum, BFGS, AdaGrad, etc.
  • RMSProp (Riedmiller 1993, Hinton & Tieleman 200x)
• Regularization
  • Dropout (Srivastava 2014) or
  • Batch-Normalization (Ioffe & Szegedy 2015)

Page 11: Game theory for neural networks

• Concrete problem:
  • Why does gradient descent work on neural nets?
  • Not just gradient descent; many methods designed for convex problems work surprisingly well on neural nets.

• Big picture:
  • Neural networks are populations of learners that distribute knowledge extremely effectively.
  • We should study them as such.

This tutorial

Page 12: Game theory for neural networks

• Deduction
• Induction
• Gradients
• Games
• Neural networks

Outline

Page 13: Game theory for neural networks

Deductive reasoning

“we are regarding the function of the mathematician as simply to determine the truth or falsity of propositions”

Page 14: Game theory for neural networks

Deductive reasoning

All men are mortal.
Socrates is a man.
Socrates is mortal.

Page 15: Game theory for neural networks

Deductive reasoning

All men are mortal.
Socrates is a man.
Socrates is mortal.

Idea: Intelligence is deductive reasoning!

Page 16: Game theory for neural networks

[Image: page 1 of the original Dartmouth proposal. Photo courtesy Dartmouth College; reproduced in AI Magazine, Winter 2006.]

Page 17: Game theory for neural networks

50 years later

[Photo] Trenchard More, John McCarthy, Marvin Minsky, Oliver Selfridge, Ray Solomonoff

Page 18: Game theory for neural networks

“To understand the real world, we must have a different set of primitives from the relatively simple line trackers suitable and sufficient for the blocks world”

— Patrick Winston (1975), Director of MIT’s AI lab from 1972 to 1997

A bump in the road

Page 19: Game theory for neural networks

The AI winter
http://en.wikipedia.org/wiki/AI_winter

Page 20: Game theory for neural networks

Reductio ad absurdum

“Intelligence is 10 million rules” — Doug Lenat

Page 21: Game theory for neural networks

Deductive reasoning

• Structure:
  • axioms;
  • logical laws;
  • production system
• Derive complicated (empirical) truths from simple (self-evident) truths.
• Deductive reasoning provides no guidance about which axiom(s) to eliminate when “derived truths” contradict experience.
• In general, deductive reasoning has nothing to say about mistakes except “avoid them”.
• In retrospect, it’s hard to understand why anyone thought deduction would yield artificial intelligence.

Page 22: Game theory for neural networks

Inductive reasoning

“this book is composed […] upon one very simple theme […] that we can learn from our mistakes”

— Karl Popper

Page 23: Game theory for neural networks

Inductive reasoning

Socrates is mortal.
Plato is mortal.
[ … more examples … ]
All men are mortal.

Page 24: Game theory for neural networks

Inductive reasoning

All observed As are Bs.
a is an A.
a is a B.

Page 25: Game theory for neural networks

The problem of induction

All observed As are Bs.
a is an A.
a is a B.

This inference is clearly invalid: the conclusion does not follow from the premises.

Page 26: Game theory for neural networks

“If a machine is […] infallible, it cannot also be intelligent. There are […] theorems which say almost exactly that. But these theorems say nothing about how much intelligence may be displayed if a machine makes no pretence at infallibility.”

— Alan Turing

[ it’s ok to make mistakes ]

Page 27: Game theory for neural networks

Let’s look at some examples

Page 28: Game theory for neural networks

Sequential prediction

Scenario: At time t, Forecaster predicts 0 or 1. Nature then reveals the truth.

Forecaster has access to N experts. One of them is always correct.

Goal: Predict as accurately as possible.

Page 29: Game theory for neural networks

Halving Algorithm

Set t = 1.
While t > 0:
  Step 1. Predict by majority vote.
  Step 2. Remove experts that are wrong.
  Step 3. t ← t+1
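Below is a minimal Python sketch of the algorithm, purely for illustration; the expert table and truth sequence are hypothetical stand-ins for the N experts and Nature.

```python
def halving(experts, truths):
    """Halving Algorithm: predict by majority vote, then drop every wrong expert.
    Assumes (as on the slide) that at least one expert is always correct."""
    alive = list(range(len(experts)))
    mistakes = 0
    for t, y in enumerate(truths):
        preds = [experts[i](t) for i in alive]
        guess = int(2 * sum(preds) >= len(preds))            # Step 1: majority vote
        mistakes += (guess != y)
        alive = [i for i, p in zip(alive, preds) if p == y]  # Step 2: remove wrong experts
    return mistakes

# Hypothetical demo: 8 experts, expert 3 is always correct.
truths = [1, 0, 1, 1, 0, 1, 0, 0]
table = [[(t + i) % 2 for t in range(8)] for i in range(8)]
table[3] = truths[:]
experts = [lambda t, row=row: row[t] for row in table]
print(halving(experts, truths))  # at most log2(8) = 3 mistakes
```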

Page 30: Game theory for neural networks

Halving Algorithm

Set t = 1.
While t > 0:
  Step 1. Predict by majority vote.
  Step 2. Remove experts that are wrong.
  Step 3. t ← t+1

Question: How long to find the correct expert?

Page 31: Game theory for neural networks

Halving Algorithm

Set t = 1.
While t > 0:
  Step 1. Predict by majority vote.
  Step 2. Remove experts that are wrong.
  Step 3. t ← t+1

How long to find the correct expert? BAD!!!

Page 32: Game theory for neural networks

Halving Algorithm

Set t = 1.
While t > 0:
  Step 1. Predict by majority vote.
  Step 2. Remove experts that are wrong.
  Step 3. t ← t+1

Question: How many errors?

Page 33: Game theory for neural networks

Halving Algorithm

Step 1. Predict by majority vote.
Step 2. Remove experts that are wrong.

How many errors?

Page 34: Game theory for neural networks

Halving Algorithm

Step 1. Predict by majority vote.
Step 2. Remove experts that are wrong.

How many errors? When the algorithm makes a mistake, it removes ≥ half of the experts.

Page 35: Game theory for neural networks

Halving Algorithm

Step 1. Predict by majority vote.
Step 2. Remove experts that are wrong.

How many errors? When the algorithm makes a mistake, it removes ≥ half of the experts, so the number of errors is ≤ log₂ N.

Page 36: Game theory for neural networks

What’s going on?

Didn’t we just use deductive reasoning!?!

Page 37: Game theory for neural networks

What’s going on?

Didn’t we just use deductive reasoning!?!

Yes… but No!

Page 38: Game theory for neural networks

What’s going on?

Algorithm: makes educated guesses about Nature (inductive)

Analysis: proves a theorem about the number of errors (deductive)

Page 39: Game theory for neural networks

What’s going on?

Algorithm: makes educated guesses about Nature (inductive)

Analysis: proves a theorem about the number of errors (deductive)

The algorithm learns, but it does not deduce!

Page 40: Game theory for neural networks

Adversarial prediction

Scenario: At time t, Forecaster predicts 0 or 1. Nature then reveals the truth.

Forecaster has access to N experts. One of them is always correct. Nature is adversarial.

Goal: Predict as accurately as possible.

Page 41: Game theory for neural networks

At time t, Forecaster predicts 0 or 1. Nature then reveals the truth.

Forecaster has access to N experts. One of them is always correct. Nature is adversarial.

Goal: Predict as accurately as possible. Seriously?!?!

Page 42: Game theory for neural networks

Regret

Let m* be the best expert in hindsight.
regret := #mistakes(Forecaster) − #mistakes(m*)

Goal: Predict as accurately as possible, i.e. minimize regret.

Page 43: Game theory for neural networks

ADVERSARIAL SETTING

Halving Algorithm

Set t = 1.
While t > 0:
  Step 1. Predict by majority vote.
  Step 2. Remove experts that are wrong.
  Step 3. t ← t+1

Question: What is the regret?

Page 44: Game theory for neural networks

ADVERSARIAL SETTING

Halving Algorithm

Set t = 1.
While t > 0:
  Step 1. Predict by majority vote.
  Step 2. Remove experts that are wrong.
  Step 3. t ← t+1

What is the regret? BAD!

Page 45: Game theory for neural networks

Weighted Majority Algorithm

Pick β in (0,1). Assign weight 1 to every expert.
Set t = 1.
While t ≤ T:
  Step 1. Predict by weighted majority vote.
  Step 2. Multiply the weights of incorrect experts by β.
  Step 3. t ← t+1

Question: What is the regret?
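A minimal Python sketch of the update, with hypothetical inputs (`expert_preds[t][i]` is expert i’s 0/1 prediction at round t): wrong experts are down-weighted rather than eliminated, which is what survives the adversarial setting.

```python
def weighted_majority(expert_preds, truths, beta=0.5):
    """Weighted Majority: vote with weights, shrink the weights of wrong experts."""
    w = [1.0] * len(expert_preds[0])
    mistakes = 0
    for preds, y in zip(expert_preds, truths):
        vote_for_1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        guess = int(2 * vote_for_1 >= sum(w))    # Step 1: weighted majority vote
        mistakes += (guess != y)
        w = [wi * (beta if p != y else 1.0)      # Step 2: multiply wrong experts by beta
             for wi, p in zip(w, preds)]
    return mistakes
```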

Page 46: Game theory for neural networks

Weighted Majority Algorithm

Pick β in (0,1). Assign weight 1 to every expert.
Set t = 1.
While t ≤ T:
  Step 1. Predict by weighted majority vote.
  Step 2. Multiply the weights of incorrect experts by β.
  Step 3. t ← t+1

What is the regret? [ choose β carefully ]

$\sqrt{\frac{T \cdot \log N}{2}}$

Page 47: Game theory for neural networks

Loss functions

• 0/1 loss:
  $\ell(y, y') = \begin{cases} 0 & \text{if } y = y' \\ 1 & \text{else} \end{cases}$

• Mean square error:
  $\ell(y, y') = \frac{1}{2}\left(y - y'\right)^2$

• Logistic loss:
  $\ell(p, i) = \begin{cases} -\log(1-p) & \text{if } i = 0 \\ -\log(p) & \text{else} \end{cases}$

Page 48: Game theory for neural networks

Loss functions

• 0/1 loss (not convex):
  $\ell(y, y') = \begin{cases} 0 & \text{if } y = y' \\ 1 & \text{else} \end{cases}$

• Mean square error (convex):
  $\ell(y, y') = \frac{1}{2}\left(y - y'\right)^2$

• Logistic loss (convex):
  $\ell(p, i) = \begin{cases} -\log(1-p) & \text{if } i = 0 \\ -\log(p) & \text{else} \end{cases}$
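For concreteness, the three losses transcribed directly into Python:

```python
import math

def zero_one_loss(y, y_prime):       # not convex
    return 0.0 if y == y_prime else 1.0

def mean_square_error(y, y_prime):   # convex
    return 0.5 * (y - y_prime) ** 2

def logistic_loss(p, i):             # convex in p, for p in (0, 1) and i in {0, 1}
    return -math.log(1.0 - p) if i == 0 else -math.log(p)
```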

Page 49: Game theory for neural networks

Exponential Weights Algorithm

Pick β > 0. Assign weight 1 to every expert.
Set t = 1.
While t ≤ T:
  Step 1. Predict the weighted sum.
  Step 2. Multiply the weight of expert i by $e^{-\beta \cdot \ell(f_i^t, y^t)}$.
  Step 3. t ← t+1

What is the regret? [ choose β carefully ]

$\log N + \sqrt{2T \log N}$
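A minimal sketch in Python; `loss` can be any of the losses defined above, and predicting the normalized weighted average of the experts’ predictions stands in for the slide’s weighted sum.

```python
import math

def exp_weights(expert_preds, truths, loss, beta=0.1):
    """Exponential Weights: expert_preds[t][i] is expert i's prediction f_i^t."""
    w = [1.0] * len(expert_preds[0])
    total = 0.0
    for preds, y in zip(expert_preds, truths):
        forecast = sum(wi * p for wi, p in zip(w, preds)) / sum(w)  # Step 1
        total += loss(forecast, y)
        w = [wi * math.exp(-beta * loss(p, y))                      # Step 2
             for wi, p in zip(w, preds)]
    return total
```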

Page 50: Game theory for neural networks

Inductive reasoning

• Structure:
  • game played over a series of rounds
  • on each round, experts make predictions
  • Forecaster takes a weighted average of the predictions
  • Nature reveals the truth and a loss is incurred
  • Forecaster adjusts the weights:
    • correct experts —> increase weights;
    • incorrect —> decrease

• Goal:
  • average loss per round converges to best-in-hindsight

Page 51: Game theory for neural networks

Inductive reasoning

• How it works
  • Forecaster adjusts its strategy according to prior experience
  • The current strategy encodes previous mistakes
  • No guarantee on any particular example
  • Performance is optimal on average

• Still to come
  • How can Forecaster communicate its knowledge to others, in a way that is useful to them?

Page 52: Game theory for neural networks

Gradient descent

“the great watershed in optimisation isn’t between linearity and nonlinearity, but convexity and nonconvexity”

— Tyrrell Rockafellar

Page 53: Game theory for neural networks

Gradient descent

• Everyone knows (?) gradient descent

• Take a fresh look:
  • gradient descent is a no-regret algorithm
  • —> game theory
  • —> neural networks

Page 54: Game theory for neural networks

Online Convex Optimization

Scenario: Convex set K; differentiable loss L(a, b) that is a convex function of a.

At time t, Forecaster picks a_t in K. Nature responds with b_t in K. [ Nature is adversarial ]

Forecaster’s loss is L(a_t, b_t).

Goal: Minimize regret.

Page 55: Game theory for neural networks

Follow the Leader

Idea: Predict the a_t that would have worked best on { b_1, …, b_{t-1} }.

Page 56: Game theory for neural networks

Follow the Leader

Pick a_1 at random. Set t = 1.
While t ≤ T:
  Step 1. $a_t := \operatorname{argmin}_{a \in K} \left[ \sum_{i=1}^{t-1} L(a, b_i) \right]$
  Step 2. t ← t+1

Idea: Predict the a_t that would have worked best on { b_1, …, b_{t-1} }.

Page 57: Game theory for neural networks

Follow the Leader: BAD!

Pick a_1 at random. Set t = 1.
While t ≤ T:
  Step 1. $a_t := \operatorname{argmin}_{a \in K} \left[ \sum_{i=1}^{t-1} L(a, b_i) \right]$
  Step 2. t ← t+1

Problem: Nature pulls Forecaster back and forth. No memory!

Page 58: Game theory for neural networks

FTRL (Follow the Regularized Leader)

Pick a_1 at random. Set t = 1.
While t ≤ T:
  Step 1. $a_t := \operatorname{argmin}_{a \in K} \left[ \sum_{i=1}^{t-1} L(a, b_i) + \frac{\lambda}{2} \cdot \|a\|_2^2 \right]$  (regularize)
  Step 2. t ← t+1

Page 59: Game theory for neural networks

FTRL

Pick a_1 at random. Set t = 1.
While t ≤ T:
  Step 1. $a_t \leftarrow a_{t-1} - \beta \cdot \frac{\partial}{\partial a} L(a_{t-1}, b_{t-1})$  (gradient descent)
  Step 2. t ← t+1

Page 60: Game theory for neural networks

FTRL

Pick a_1 at random. Set t = 1.
While t ≤ T:
  Step 1. $a_t \leftarrow a_{t-1} - \beta \cdot \frac{\partial}{\partial a} L(a_{t-1}, b_{t-1})$
  Step 2. t ← t+1

Intuition: β controls memory.

Page 61: Game theory for neural networks

FTRL

Pick a_1 at random. Set t = 1.
While t ≤ T:
  Step 1. $a_t \leftarrow a_{t-1} - \beta \cdot \frac{\partial}{\partial a} L(a_{t-1}, b_{t-1})$
  Step 2. t ← t+1

What is the regret? [ choose β carefully ]

$\operatorname{diam}(K) \cdot \operatorname{Lipschitz}(L) \cdot \sqrt{T}$
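Here is a minimal sketch of this update as online gradient descent. The projection back onto K is implicit on the slide but needed in code; the quadratic loss and interval K below are hypothetical.

```python
import numpy as np

def online_gd(grad_L, b_seq, project, beta, a1):
    """Play a_t, see b_t, then step: a_{t+1} = Proj_K(a_t - beta * dL/da(a_t, b_t))."""
    a, plays = a1, []
    for b in b_seq:
        plays.append(a)                       # Forecaster commits to a_t
        a = project(a - beta * grad_L(a, b))  # gradient step on the revealed loss
    return plays

# Hypothetical instance: K = [-1, 1], L(a, b) = (a - b)^2.
grad = lambda a, b: 2.0 * (a - b)
proj = lambda a: float(np.clip(a, -1.0, 1.0))
print(online_gd(grad, [0.5, -0.3, 0.8, 0.1], proj, beta=0.25, a1=0.0))
```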

Page 62: Game theory for neural networks

Logarithmic regret [ choose β carefully ]

Convergence at rate $\frac{\sqrt{T}}{T}$ is actually quite slow.

• For many common loss functions (e.g. mean-square error, logistic loss)
• it is possible to do much better: regret $\log T$, i.e. convergence at rate $\frac{\log T}{T}$.

Page 63: Game theory for neural networks

Information theory

Two connections:

• Entropic regularization <—> Exponential Weights Algorithm

• Analogy between gradients and Fano information, “the information represented by y about x”:
  $\log \frac{P(x|y)}{P(x)}$

Page 64: Game theory for neural networks

Gradient descent

• Gradient descent is fast, cheap, and guaranteed to converge, even against adversaries

• “a hammer that smashes convex nails”

• GD is well-understood in convex settings

• Convergence rate depends on the structure of the loss, the size of the search space, etc.

Page 65: Game theory for neural networks

Convex optimization

Convex methods:
• Stochastic gradient descent
• Nesterov momentum
• BFGS (Broyden-Fletcher-Goldfarb-Shanno)
• Conjugate gradient methods
• AdaGrad

Page 66: Game theory for neural networks

Online Convex Optimization (deep learning)

Apply gradient descent to nonconvex optimization (neural networks).

• Theorems don’t work (not convex)
• Tons of engineering (decades of tweaking)

Amazing performance.

[Excerpt from LeCun, Bengio & Hinton, “Deep Learning”, Nature 521, 28 May 2015 (p. 440), including Figure 3, “From image to text”: captions generated by a recurrent network from deep-CNN image representations.]

Page 67: Game theory for neural networks

Deep Learning

• Superhuman performance on object recognition (ImageNet).
• Outperformed humans at recognising street-signs (Google Streetview).
• Superhuman performance on Atari games (Google).
• Real-time translation: English voice to Chinese voice (Microsoft).

Page 68: Game theory for neural networks

Game theory

“An equilibrium is not always an optimum; it might not even be good. This may be the most important discovery of game theory.”

— Ivar Ekeland

Page 69: Game theory for neural networks

Game theory

• So far
  • Seen how Forecaster can learn to use “expert” advice.
  • Forecaster modifies the weights of experts such that it converges to optimal-in-hindsight predictions.

• Who are these experts?
  • We will model them as Forecasters in their own right.
  • First, let’s develop tools for modeling populations.

Page 70: Game theory for neural networks

Setup

• Players A, B with actions {a_1, a_2, …, a_m} and {b_1, b_2, …, b_n}
• Two payoff matrices, describing the losses/gains of the players for every combination of moves
• Goal of the players is to maximise gain / minimise loss
• Neither player knows what the other will do

Page 71: Game theory for neural networks

Setup

• Players A, B with actions {a_1, a_2, …, a_m} and {b_1, b_2, …, b_n}
• Two payoff matrices, describing the losses/gains of the players for every combination of moves

Rock-paper-scissors (zero-sum game):

Payoffs to A:            Payoffs to B:
      R   P   S                R   P   S
  R   0  -1   1            R   0   1  -1
  P   1   0  -1            P  -1   0   1
  S  -1   1   0            S   1  -1   0

Page 72: Game theory for neural networks

Minimax theorem

$\inf_{a \in K} \sup_{b \in K} L(a, b) = \sup_{b \in K} \inf_{a \in K} L(a, b)$

Page 73: Game theory for neural networks

Minimax theorem

Forecaster picks a, Nature responds b.

$\inf_{a \in K} \sup_{b \in K} L(a, b) = \sup_{b \in K} \inf_{a \in K} L(a, b)$

Page 74: Game theory for neural networks

Minimax theorem

Forecaster picks a, Nature responds b: $\inf_{a \in K} \sup_{b \in K} L(a, b)$
Nature picks b, Forecaster responds a: $\sup_{b \in K} \inf_{a \in K} L(a, b)$

$\inf_{a \in K} \sup_{b \in K} L(a, b) = \sup_{b \in K} \inf_{a \in K} L(a, b)$

Page 75: Game theory for neural networks

Minimax theorem

Forecaster picks a, Nature responds b: $\inf_{a \in K} \sup_{b \in K} L(a, b)$
Nature picks b, Forecaster responds a: $\sup_{b \in K} \inf_{a \in K} L(a, b)$

$\inf_{a \in K} \sup_{b \in K} L(a, b) = \sup_{b \in K} \inf_{a \in K} L(a, b)$

Rock-paper-scissors payoffs (as above):

Payoffs to A:            Payoffs to B:
      R   P   S                R   P   S
  R   0  -1   1            R   0   1  -1
  P   1   0  -1            P  -1   0   1
  S  -1   1   0            S   1  -1   0

Page 76: Game theory for neural networks

Minimax theorem

Forecaster picks a, Nature responds b: $\inf_{a \in K} \sup_{b \in K} L(a, b)$
Nature picks b, Forecaster responds a: $\sup_{b \in K} \inf_{a \in K} L(a, b)$

Theorem: $\inf_{a \in K} \sup_{b \in K} L(a, b) = \sup_{b \in K} \inf_{a \in K} L(a, b)$

Going first hurts Forecaster, so

$\inf_{a \in K} \sup_{b \in K} L(a, b) \geq \sup_{b \in K} \inf_{a \in K} L(a, b)$

Page 77: Game theory for neural networks

Minimax theorem

Let m* be the best move in hindsight.
regret := loss(Forecaster) − loss(m*)

Proof idea: No-regret algorithm →
• Forecaster can asymptotically match hindsight
• Order of players doesn’t matter asymptotically
• Convert the series of moves into an average via online-to-batch

$\inf_{a \in K} \sup_{b \in K} L(a, b) \leq \sup_{b \in K} \inf_{a \in K} L(a, b)$


Page 80: Game theory for neural networks

Minimax theorem

Let m* be the best move in hindsight.
regret := loss(Forecaster) − loss(m*)

Proof idea: No-regret algorithm →
• Forecaster can asymptotically match hindsight
• Order of players doesn’t matter asymptotically
• Convert the series of moves into an average via online-to-batch:
  $\bar{a} = \frac{1}{T} \sum_{t=1}^{T} a_t$

$\inf_{a \in K} \sup_{b \in K} L(a, b) \leq \sup_{b \in K} \inf_{a \in K} L(a, b)$
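To see the proof idea in action, here is a hypothetical sketch, assuming the Exponential Weights Algorithm as the no-regret learner: two copies play rock-paper-scissors against each other, and their time-averaged strategies approach the uniform minimax strategy (game value 0).

```python
import numpy as np

# Payoffs to the row player A in rock-paper-scissors (B's payoffs are the negatives).
M = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])

def selfplay(M, T=20000, beta=0.05):
    wa, wb = np.ones(3), np.ones(3)          # exponential weights for each player
    avg_a, avg_b = np.zeros(3), np.zeros(3)
    for _ in range(T):
        pa, pb = wa / wa.sum(), wb / wb.sum()
        avg_a += pa
        avg_b += pb
        wa *= np.exp(beta * (M @ pb))        # A's loss = minus A's expected payoff
        wb *= np.exp(-beta * (M.T @ pa))     # B's loss = A's expected payoff
        wa /= wa.sum()                       # renormalize to avoid overflow
        wb /= wb.sum()
    return avg_a / T, avg_b / T

pa, pb = selfplay(M)
print(pa.round(3), pb.round(3))  # both approach the minimax strategy (1/3, 1/3, 1/3)
```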

Page 81: Game theory for neural networks

Nash equilibrium

Nash equilibrium <—> no player benefits by deviating

Page 82: Game theory for neural networks

Nash equilibrium

Nash equilibrium <—> no player benefits by deviating

Prisoner’s dilemma (payoffs A/B):

              B: Deny   B: Confess
  A: Deny      -2/-2       -6/0
  A: Confess    0/-6       -6/-6

Page 83: Game theory for neural networks

Nash equilibrium

Problems:

• Computationally hard (PPAD-complete, Daskalakis 2008)

• Some outcomes are very bad

Page 84: Game theory for neural networks

Nash equilibrium

Blind-intersection dilemma (payoffs A/B):

            B: Stop   B: Go
  A: Stop     0/0      0/10
  A: Go      10/0     -∞/-∞

Page 85: Game theory for neural networks

Nash equilibrium

Blind-intersection dilemma (payoffs A/B):

            B: Stop   B: Go
  A: Stop     0/0      0/10
  A: Go      10/0     -∞/-∞

Equilibrium: both drivers STOP; no-one uses the intersection.

Page 86: Game theory for neural networks

Correlated equilibrium

Blind-intersection dilemma (payoffs A/B):

            B: Stop   B: Go
  A: Stop     0/0      0/10
  A: Go      10/0     -∞/-∞

Traffic light signal:

$P(x, y) = \begin{cases} \frac{1}{2} & \text{if } x = S \text{ and } y = G \\ \frac{1}{2} & \text{if } x = G \text{ and } y = S \\ 0 & \text{else} \end{cases}$
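A small sketch that checks the correlated-equilibrium property of the traffic-light signal numerically (a large negative number stands in for -∞): conditioned on its own recommendation, neither driver gains by disobeying.

```python
NEG = -1e9  # stand-in for -infinity
# Payoffs (A, B) indexed by the pair of moves; 0 = Stop, 1 = Go.
payoff = {(0, 0): (0, 0), (0, 1): (0, 10), (1, 0): (10, 0), (1, 1): (NEG, NEG)}
# The traffic light recommends (Stop, Go) or (Go, Stop), each with probability 1/2.
signal = {(0, 1): 0.5, (1, 0): 0.5}

def gain_from_switching(player, rec):
    """Expected gain from disobeying, conditioned on being recommended `rec`."""
    rounds = {m: p for m, p in signal.items() if m[player] == rec}
    total = sum(rounds.values())
    gain = 0.0
    for (a, b), p in rounds.items():
        switched = (1 - a, b) if player == 0 else (a, 1 - b)
        gain += (p / total) * (payoff[switched][player] - payoff[(a, b)][player])
    return gain

for player in (0, 1):
    for rec in (0, 1):
        print(player, rec, gain_from_switching(player, rec))  # never positive
```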

Page 87: Game theory for neural networks

Correlated equilibrium

Blind-intersection dilemma (payoffs A/B):

            B: Stop   B: Go
  A: Stop     0/0      0/10
  A: Go      10/0     -∞/-∞

• Better outcomes for players: correlated equilibria can be much better than Nash equilibria (depending on the signal).

Page 88: Game theory for neural networks

Convex games

• Players $\{P_i\}_{i=1}^{N}$ with convex action sets $\{K_i\}_{i=1}^{N}$
• Loss functions $\ell_i(w_1, \dots, w_N)$, where $w_i \in K_i$ and each loss is separately convex in each argument
• Players aim to minimize their losses
• Players do not know what the other players will do

Page 89: Game theory for neural networks

No-regret —> Corr. eqm

Sketch of proof:

Since player j applies a no-regret algorithm, it follows that after T rounds its sequence of moves $\{w_j^t\}$ has loss within ε of the optimal $w_j^*$:

$\frac{1}{T} \sum_t \left[ \ell_j(w_j^t, w_{-j}^t) - \ell_j(w_j^*, w_{-j}^t) \right] \leq \epsilon \quad \forall w_j^*$

Interpreting the historical sequence of plays as a signal obtains

$\mathbb{E}\left[ \ell_j(w) \right] \leq \mathbb{E}\left[ \ell_j(w_j^*, w_{-j}) \right] + \epsilon$

The result follows by letting ε —> 0.

Page 90: Game theory for neural networks

Nash —> Aumann

From Nash equilibrium to correlated equilibrium:

• Computationally tractable: no-regret algorithms converge on coarse correlated equilibria.

• Better outcomes: the signal is the past actions of the players, chosen according to an optimal strategy.

Page 91: Game theory for neural networks

• Every game has a Nash equilibrium
• Economists typically assume that the economy is in Nash equilibrium
• But Nash equilibria are hard to find and are often unpleasant

• Correlated equilibria are easier to compute and can be better for the players

• A signal is required to coordinate players:
  • set the signal externally, e.g. a traffic light
  • in repeated games, construct the signal as the average of historical behavior, e.g. no-regret learning

Game theory

Page 92: Game theory for neural networks

• Correlated equilibrium
  • Players converge on mutually agreeable outcomes by adapting to each other’s behavior over repeated rounds.
  • The signal provides a “strong hint” about where to find equilibria.

• Nash equilibrium
  • It’s hard to guess what others will do when playing de novo (intractability); and
  • you haven’t established a track record to build on (undesirability).

Game theory

Page 93: Game theory for neural networks

Gradients and Games

Gradient descent is a no-regret algorithm.

• Insert players using gradient descent into any convex game, and they will converge on a correlated equilibrium.

• Global convergence to the correlated equilibrium is controlled by the convergence of the players to their best-in-hindsight local optima.

Next:
• Gradient descent + chain rule = backprop

Page 94: Game theory for neural networks

Neural networks

“Anything a human can do in a fraction of a second, a deep network can do too” — Ilya Sutskever

Page 95: Game theory for neural networks

• Origins
  • Backpropagation (Werbos 1974, Rumelhart 1986)
  • Convolutional nets (Fukushima 1980, LeCun 1990)

• Mid-1990s to mid-2000s
  • back to convex methods; neural networks ignored

• Unsupervised pre-training
  • restricted Boltzmann machines (Hinton 2006)
  • autoencoders (Bengio 2007)

• Convolutional nets
  • sharp improvement on the ImageNet challenge (Krizhevsky 2012)
  • back to fully-supervised training

History

Page 96: Game theory for neural networks

• Convolutional nets
  • sharp improvement on the ImageNet challenge (Krizhevsky 2012)
  • back to fully-supervised training

History

Page 97: Game theory for neural networks

30 years of trial-and-error

• Architecture
  • Weight-tying (convolutions)
  • Max-pooling
• Nonlinearity
  • Rectified linear units (Jarrett 2009)
• Credit assignment
  • Backpropagation
• Optimization
  • Stochastic Gradient Descent
  • Nesterov momentum, BFGS, AdaGrad, etc.
  • RMSProp (Riedmiller 1993, Hinton & Tieleman 200x)
• Regularization
  • Dropout (Srivastava et al 2014) or
  • Batch-Normalization (Ioffe & Szegedy 2015)

Page 98: Game theory for neural networks

Why (these) NNs?

• Why neural networks?
  • GPUs
  • Good local minima (no-one knows why)
  • “Mirrors the world’s compositional structure”

• Why ReLUs, why max-pooling, why convex methods?
  • “locally” convex optimization (this talk)

Page 99: Game theory for neural networks

Neural networks

[Diagram] Input —> W1 (matrix mult) —> nonlinearity —> Layer 1 —> W2 (matrix mult) —> nonlinearity —> Layer 2 —> W3 (matrix mult) —> Output $f_W(x)$

Page 100: Game theory for neural networks

Neural networks

[Diagram] Input —> W1 (matrix mult) —> nonlinearity —> Layer 1 —> W2 (matrix mult) —> nonlinearity —> Layer 2 —> W3 (matrix mult) —> Output $f_W(x)$

Loss against the label y: $\ell(f_W(x), y)$

Page 101: Game theory for neural networks

Universal Function Approximation

Theorem (Leshno 1993): A two-layer neural network with infinite width and nonpolynomial nonlinearity is dense in the space of continuous functions.

Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$

Tanh: $\tau(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

ReLU: $\rho(x) = \max(0, x)$

[Plots of the three nonlinearities.]

Page 102: Game theory for neural networks

[Plots of $\sigma$, $\tau$, $\rho$ and their derivatives $\frac{d\sigma}{dx}$, $\frac{d\tau}{dx}$, $\frac{d\rho}{dx}$.]

Page 103: Game theory for neural networks

The chain rule

$f \circ g \circ h(x) = f(g(h(x)))$

$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}$
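A quick numerical sanity check of the chain rule, with hypothetical choices f = exp, g = sin, h(x) = x²:

```python
import math

h, dh = (lambda x: x * x), (lambda x: 2 * x)
g, dg = math.sin, math.cos
f, df = math.exp, math.exp   # exp is its own derivative

x = 0.7
analytic = df(g(h(x))) * dg(h(x)) * dh(x)  # df/dg * dg/dh * dh/dx
numeric = (f(g(h(x + 1e-6))) - f(g(h(x - 1e-6)))) / 2e-6
print(analytic, numeric)  # agree to about 6 decimal places
```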

Page 104: Game theory for neural networks

Linear networks

[Diagram] Input —> W1 —> W2 —> W3

Page 105: Game theory for neural networks

Feedforward sweep

[Diagram] Input —> W1 —> W2 —> W3

$x^{\text{out}} = \left( \prod_{i=\#L}^{1} W_i \right) \cdot x^{\text{in}}$
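In code (a sketch with hypothetical layer sizes), the matrix product runs from the last layer down to the first:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = (rng.normal(size=(4, 5)),   # layer 1: 5 inputs -> 4 units
              rng.normal(size=(3, 4)),   # layer 2
              rng.normal(size=(2, 3)))   # layer 3 (output)
x_in = rng.normal(size=5)

x_out = W3 @ (W2 @ (W1 @ x_in))          # feedforward sweep
assert np.allclose(x_out, (W3 @ W2 @ W1) @ x_in)
```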

Page 106: Game theory for neural networks

Weight updates

[Diagram: Input —> W1 —> W2 —> W3, with a highlighted unit j]

Gradient descent:

$W_{ij}^{t+1} \leftarrow W_{ij}^{t} - \eta \cdot \frac{\partial \ell}{\partial W_{ij}}$

Page 107: Game theory for neural networks

Weight updates

[Diagram: Input —> W1 —> W2 —> W3, with a highlighted unit j]

Gradient descent:

$W_{ij}^{t+1} \leftarrow W_{ij}^{t} - \eta \cdot \frac{\partial \ell}{\partial W_{ij}}$

By the chain rule:

$\frac{\partial \ell}{\partial W_{ij}} = \frac{\partial \ell}{\partial x_j} \cdot \frac{\partial x_j}{\partial W_{ij}}$

Page 108: Game theory for neural networks

Weight updates

[Diagram: Input —> W1 —> W2 —> W3, with a highlighted unit j]

Gradient descent:

$W_{ij}^{t+1} \leftarrow W_{ij}^{t} - \eta \cdot \frac{\partial \ell}{\partial W_{ij}}$

By the chain rule, since $\frac{\partial x_j}{\partial W_{ij}} = x_i$:

$\frac{\partial \ell}{\partial W_{ij}} = \frac{\partial \ell}{\partial x_j} \cdot x_i$

Page 109: Game theory for neural networks

Weight updates

[Diagram: Input —> W1 —> W2 —> W3, with a highlighted unit j]

Gradient descent:

$W_{ij}^{t+1} \leftarrow W_{ij}^{t} - \eta \cdot \frac{\partial \ell}{\partial W_{ij}}$

By the chain rule, defining $\delta_j := \frac{\partial \ell}{\partial x_j}$:

$\frac{\partial \ell}{\partial W_{ij}} = \delta_j \cdot x_i$

Page 110: Game theory for neural networks

Weight updates

[Diagram: Input —> W1 —> W2 —> W3, with a highlighted unit j]

Gradient descent:

$W_{ij}^{t+1} \leftarrow W_{ij}^{t} - \eta \cdot \frac{\partial \ell}{\partial W_{ij}}$

$\frac{\partial \ell}{\partial W_{ij}} = \delta_j \cdot x_i$ where $\delta_j := \frac{\partial \ell}{\partial x_j}$

How to compute $\delta_j$?

Page 111: Game theory for neural networks

Backpropagation

Feedforward sweep:
$x^{\text{out}} = \left( \prod_{i=\#L}^{1} W_i \right) \cdot x^{\text{in}}$

Backpropagation, through the transposed weights $W_1^{\top}, W_2^{\top}, W_3^{\top}$:
$\delta_i = \left( \prod_{k=i+1}^{\#L} W_k^{\top} \right) \cdot \left( \nabla_{x^{\text{out}}} \ell \right)$
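A sketch of both sweeps for a linear network with a hypothetical squared-error loss, including a finite-difference check of one gradient entry:

```python
import numpy as np

rng = np.random.default_rng(1)
Ws = [rng.normal(size=(5, 4)),   # W1: 4 inputs -> 5 units
      rng.normal(size=(3, 5)),   # W2
      rng.normal(size=(2, 3))]   # W3 (output layer)
x_in, y = rng.normal(size=4), rng.normal(size=2)

# Feedforward sweep, keeping each layer's activations.
xs = [x_in]
for W in Ws:
    xs.append(W @ xs[-1])

# Loss l = 0.5 * ||x_out - y||^2, so the output gradient is x_out - y.
delta = xs[-1] - y
grads = []
for W, x in zip(reversed(Ws), reversed(xs[:-1])):
    grads.append(np.outer(delta, x))   # dl/dW_ij = delta_j * x_i
    delta = W.T @ delta                # backpropagate through the transposed weights
grads.reverse()

# Finite-difference check on one entry of W1.
eps = 1e-6
loss = lambda Ws_: 0.5 * np.sum((Ws_[2] @ Ws_[1] @ Ws_[0] @ x_in - y) ** 2)
W1p = [Ws[0].copy(), Ws[1], Ws[2]]
W1p[0][0, 0] += eps
assert abs((loss(W1p) - loss(Ws)) / eps - grads[0][0, 0]) < 1e-4
```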

Page 112: Game theory for neural networks

Max-pooling

Pick the element with max output in each block; zero out the rest.
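A 1-D sketch of this operation; note that real max-pooling layers usually also downsample, whereas this version keeps dimensions to match the description above.

```python
import numpy as np

def max_pool_mask(x, block=2):
    """Keep the max in each non-overlapping block; zero out the rest."""
    out = np.zeros_like(x)
    for start in range(0, len(x), block):
        blk = slice(start, start + block)
        out[blk][np.argmax(x[blk])] = np.max(x[blk])
    return out

print(max_pool_mask(np.array([1.0, 3.0, -2.0, 0.5])))  # [0. 3. 0. 0.5]
```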

Page 113: Game theory for neural networks

Dropout

[Diagram] Input —> W1 —> W2 —> W3

During training, randomly zero out units with probability 0.5.
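A minimal sketch; the test-time convention in the comment is standard practice, though the slide doesn’t specify it.

```python
import numpy as np

def dropout(activations, p=0.5, rng=np.random.default_rng()):
    """During training, zero each unit independently with probability p."""
    mask = rng.random(activations.shape) >= p
    return activations * mask
# At test time all units are kept (activations are conventionally rescaled to compensate).
```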

Page 114: Game theory for neural networks

ReLU networks

[Diagram] Input —> W1 —> W2 —> W3

ReLU: $\rho(x) = \max(0, x)$

Page 115: Game theory for neural networks

Path sums

[Diagram: units i and j in different layers]

Consider the internal structure:

$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial x_j} \cdot \frac{\partial x_j}{\partial x_i}$

Page 116: Game theory for neural networks

Path sums

$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial x_j} \cdot \frac{\partial x_j}{\partial x_i}$

$\frac{\partial x_j}{\partial x_i} = \sum_{p \in \{i \to j\}} \prod_{\alpha \in p} w_{\alpha}$

Page 117: Game theory for neural networks

Active path sums

$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial x_j} \cdot \frac{\partial x_j}{\partial x_i}$

$\frac{\partial x_j}{\partial x_i} = \sum_{p \in \{i \to j\}_{\text{active}}} \prod_{\alpha \in p} w_{\alpha}$
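A tiny ReLU network (one input, two hidden units, a linear output; hypothetical weights) confirming that the derivative equals the sum, over active paths only, of the products of the weights along each path:

```python
import numpy as np

w1 = np.array([1.5, -2.0])   # input -> hidden weights
w2 = np.array([0.7, 0.4])    # hidden -> output weights

def forward(x):
    return w2 @ np.maximum(0.0, w1 * x)   # hidden ReLU layer, linear output

x = 1.0
active = w1 * x > 0                        # "awake" hidden units
path_sum = np.sum(w1[active] * w2[active]) # sum over active paths of weight products

numeric = (forward(x + 1e-6) - forward(x - 1e-6)) / 2e-6
print(path_sum, numeric)  # both 1.05: only the path through unit 0 is active
```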

Page 118: Game theory for neural networks

Knowledge distribution

• Units in an NN <—> players in a game
• All units have the same loss
• The loss is a convex function of a unit’s weights when the unit is active
• Units in the output layer compute error-gradients
• Other units distribute signals of the form (error-gradients) × (path-sums)

Page 119: Game theory for neural networks

Circadian games

• Units have convex losses when active
• Neural networks are circadian games, i.e. games that are convex for active (awake) players

Main Theorem. Convergence of a neural network to a correlated equilibrium is upper-bounded by the convergence rates of the individual units.

Corollary. Units perform convex optimization when active, so convex methods and guarantees (on convergence rates) apply to neural networks.

—> Explains why ReLUs work so well.

Page 120: Game theory for neural networks

Philosophical Corollary

• Neural networks are games; units are no-regret learners
• Unlike many games (e.g. in micro-economic theory), the equilibria in NNs are extremely desirable
• Error-backpropagation distributes gradient-information across units
• Claim: There is a theory of distributed reasoning implicit in gradient descent and the chain rule.

Page 121: Game theory for neural networks

The End. Questions?