
Theory and Applications of A Repeated Game Playing Algorithm


Page 1: Theory and Applications of A Repeated Game Playing Algorithm

Theory and Applications of A Repeated Game Playing Algorithm

Rob Schapire

Princeton University [currently visiting Yahoo! Research]

Page 2: Theory and Applications of A Repeated Game Playing Algorithm

Learning Is (Often) Just a Game

• some learning problems:
  • learn from training examples to detect faces in images
  • sequentially predict what decision a person will make next (e.g., what link a user will click on)
  • learn to drive toy car by observing how human does it
• these appear to be very different
• this talk:
  • can all be viewed as game-playing problems
  • can all be handled as applications of single algorithm for playing repeated games
• take-home message:
  • game-theoretic algorithms have wide applications in learning
  • reveals deep connections between seemingly unrelated learning problems

Page 3: Theory and Applications of A Repeated Game Playing Algorithm

Outline

• how to play repeated games: theory and an algorithm
• applications:
  • boosting
  • on-line learning
  • apprenticeship learning

Page 4: Theory and Applications of A Repeated Game Playing Algorithm

How to Play Repeated Games

[with Freund]

Page 5: Theory and Applications of A Repeated Game Playing Algorithm

Games

• game defined by matrix M:

                Rock   Paper   Scissors
      Rock      1/2    1       0
      Paper     0      1/2     1
      Scissors  1      0       1/2

• row player (“Mindy”) chooses row i
• column player (“Max”) chooses column j (simultaneously)
• Mindy’s goal: minimize her loss M(i, j)
• assume (wlog) all entries in [0, 1]

Page 6: Theory and Applications of A Repeated Game Playing Algorithm

Randomized Play

• usually allow randomized play:
  • Mindy chooses distribution P over rows
  • Max chooses distribution Q over columns (simultaneously)
• Mindy’s (expected) loss = ∑_{i,j} P(i) M(i,j) Q(j) = PᵀMQ ≡ M(P,Q)
• i, j = “pure” strategies
• P, Q = “mixed” strategies
• m = # rows of M
• also write M(i, Q) and M(P, j) when one side plays pure and the other plays mixed
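
As a concrete illustration (my own sketch, not from the slides), the expected loss PᵀMQ can be computed directly; here the RPS matrix above is used, and all variable names are mine:

    import numpy as np

    # Mindy's loss matrix for Rock-Paper-Scissors (rows/columns: Rock, Paper, Scissors)
    M = np.array([[0.5, 1.0, 0.0],
                  [0.0, 0.5, 1.0],
                  [1.0, 0.0, 0.5]])

    P = np.array([1/3, 1/3, 1/3])   # Mindy's mixed strategy (uniform)
    Q = np.array([1.0, 0.0, 0.0])   # Max plays the pure strategy Rock

    expected_loss = P @ M @ Q       # M(P, Q) = P^T M Q
    print(expected_loss)            # 0.5 -- uniform play gives loss 1/2 against any Q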

Page 7: Theory and Applications of A Repeated Game Playing Algorithm

Sequential Play

• say Mindy plays before Max
• if Mindy chooses P then Max will pick Q to maximize M(P,Q) ⇒ loss will be

      L(P) ≡ max_Q M(P,Q) = max_j M(P, j)

  [note: maximum realized at pure strategy]
• so Mindy should pick P to minimize L(P) ⇒ loss will be

      min_P L(P) = min_P max_Q M(P,Q)

• similarly, if Max plays first, loss will be

      max_Q min_P M(P,Q)

Page 8: Theory and Applications of A Repeated Game Playing Algorithm

Minmax Theorem

• playing second (with knowledge of other player’s move) cannot be worse than playing first, so:

      min_P max_Q M(P,Q)   ≥   max_Q min_P M(P,Q)
      [Mindy plays first]       [Mindy plays second]

• von Neumann’s minmax theorem:

      min_P max_Q M(P,Q) = max_Q min_P M(P,Q)

• in words: no advantage to playing second

Page 9: Theory and Applications of A Repeated Game Playing Algorithm

Optimal Play

• minmax theorem:

      min_P max_Q M(P,Q) = max_Q min_P M(P,Q) = value v of game

• optimal strategies:
  • P∗ = argmin_P max_Q M(P,Q) = minmax strategy
  • Q∗ = argmax_Q min_P M(P,Q) = maxmin strategy
• in words:
  • Mindy’s minmax strategy P∗ guarantees loss ≤ v (regardless of Max’s play)
  • optimal because Max has maxmin strategy Q∗ that can force loss ≥ v (regardless of Mindy’s play)
• e.g.: in RPS, P∗ = Q∗ = uniform
• solving game = finding minmax/maxmin strategies

Page 10: Theory and Applications of A Repeated Game Playing Algorithm

Weaknesses of Classical Theory

• seems to fully answer how to play games: just compute minmax strategy (e.g., using linear programming)
• weaknesses:
  • game M may be unknown
  • game M may be extremely large
  • opponent may not be fully adversarial
    • may be possible to do better than value v
• e.g.:
  Lisa (thinks): Poor predictable Bart, always takes Rock.
  Bart (thinks): Good old Rock, nothing beats that.
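
When the game is small and fully known, computing the minmax strategy really is routine; as an aside (my own sketch, not from the talk), here is how the RPS minmax strategy can be found with an off-the-shelf LP solver:

    import numpy as np
    from scipy.optimize import linprog

    # minimize v subject to: sum_i P(i) M(i,j) <= v for every column j,
    #                        sum_i P(i) = 1,  P >= 0
    M = np.array([[0.5, 1.0, 0.0],
                  [0.0, 0.5, 1.0],
                  [1.0, 0.0, 0.5]])
    m, n = M.shape
    c = np.r_[np.zeros(m), 1.0]                   # variables (P, v); objective is v
    A_ub = np.c_[M.T, -np.ones(n)]                # M^T P - v <= 0
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)  # P sums to 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    P_star, v = res.x[:m], res.x[-1]              # P* = uniform, v = 1/2 for RPS

The weaknesses listed above are exactly the situations where this kind of direct computation breaks down.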

Page 11: Theory and Applications of A Repeated Game Playing Algorithm

Repeated Play

• if only playing once, hopeless to overcome ignorance of game M or opponent
• but if game played repeatedly, may be possible to learn to play well
• goal: play (almost) as well as if knew game and how opponent would play ahead of time

Page 12: Theory and Applications of A Repeated Game Playing Algorithm

Repeated Play (cont.)

• M unknown
• for t = 1, . . . , T:
  • Mindy chooses Pt
  • Max chooses Qt (possibly depending on Pt)
  • Mindy’s loss = M(Pt, Qt)
  • Mindy observes loss M(i, Qt) of each pure strategy i
• want:

      (1/T) ∑_{t=1}^T M(Pt, Qt)  ≤  min_P (1/T) ∑_{t=1}^T M(P, Qt) + [“small amount”]
      [actual average loss]          [best loss (in hindsight)]

Page 13: Theory and Applications of A Repeated Game Playing Algorithm

Multiplicative-weights Algorithm (MW)

• choose η > 0
• initialize: P1 = uniform
• on round t:

      Pt+1(i) = Pt(i) exp(−η M(i, Qt)) / normalization

• idea: decrease weight of strategies suffering the most loss
• directly generalizes [Littlestone & Warmuth]
• other algorithms:
  • [Hannan ’57]
  • [Blackwell ’56]
  • [Foster & Vohra]
  • [Fudenberg & Levine]
  • ...
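
A minimal sketch of one MW round in code (my own illustration; the names and the choice of η are not from the talk):

    import numpy as np

    def mw_update(P_t, losses, eta):
        # P_t    : current distribution over the m rows (pure strategies)
        # losses : losses[i] = M(i, Qt), each entry in [0, 1]
        # eta    : learning rate, eta > 0
        w = P_t * np.exp(-eta * losses)   # shrink weight of high-loss strategies
        return w / w.sum()                # renormalize to a distribution

    # example: start uniform over m = 4 rows and apply one round of observed losses
    P = np.ones(4) / 4
    P = mw_update(P, losses=np.array([1.0, 0.0, 0.5, 0.25]), eta=0.5)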

Page 14: Theory and Applications of A Repeated Game Playing Algorithm

Analysis

• Theorem: can choose η so that, for any game M with m rows, and any opponent,

      (1/T) ∑_{t=1}^T M(Pt, Qt)  ≤  min_P (1/T) ∑_{t=1}^T M(P, Qt) + ΔT
      [actual average loss]          [best average loss (≤ v)]

  where ΔT = O(√(ln m / T)) → 0

• regret ΔT is:
  • logarithmic in # rows m
  • independent of # columns
• running time (of MW) is:
  • linear in # rows m
  • independent of # columns

• therefore, can use when working with very large games
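
For a rough sense of scale (an illustrative calculation of my own, not from the slides): with m = 10^6 rows and T = 10,000 rounds, √(ln m / T) = √(13.8 / 10,000) ≈ 0.04, so MW’s average loss exceeds that of the best fixed strategy by only a few hundredths even though the game has a million rows.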

Page 15: Theory and Applications of A Repeated Game Playing Algorithm

Can Prove Minmax Theorem as Corollary

• want to prove:

      min_P max_Q M(P,Q) ≤ max_Q min_P M(P,Q)

  (already argued “≥”)
• game M played repeatedly
• Mindy plays using MW
• on round t, Max chooses best response:

      Qt = argmax_Q M(Pt, Q)

  [note: Qt can be pure]
• let

      P̄ = (1/T) ∑_{t=1}^T Pt,    Q̄ = (1/T) ∑_{t=1}^T Qt

Page 16: Theory and Applications of A Repeated Game Playing Algorithm

Proof of Minmax Theorem

      min_P max_Q PᵀMQ  ≤  max_Q P̄ᵀMQ
                         =  max_Q (1/T) ∑_{t=1}^T PtᵀMQ
                         ≤  (1/T) ∑_{t=1}^T max_Q PtᵀMQ
                         =  (1/T) ∑_{t=1}^T PtᵀMQt
                         ≤  min_P (1/T) ∑_{t=1}^T PᵀMQt + ΔT
                         =  min_P PᵀMQ̄ + ΔT
                         ≤  max_Q min_P PᵀMQ + ΔT

Page 17: Theory and Applications of A Repeated Game Playing Algorithm

Solving a Game

• derivation shows that:

      max_Q M(P̄, Q) ≤ v + ΔT    and    min_P M(P, Q̄) ≥ v − ΔT

• so: P̄ and Q̄ are ΔT-approximate minmax and maxmin strategies
• further: can choose Qt’s to be pure
  ⇒ Q̄ = (1/T) ∑_t Qt will be sparse (≤ T non-zero entries)
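
Putting the pieces together, here is a rough sketch (my own code, not from the talk) of approximately solving a game with MW: run MW on the rows, let the column player best-respond each round, and return the averaged strategies P̄ and Q̄:

    import numpy as np

    def solve_game(M, T=1000):
        # approximately solve the game M (all entries in [0, 1]) with MW
        m, n = M.shape
        eta = np.sqrt(np.log(m) / T)        # one reasonable schedule; exact constants vary
        P = np.ones(m) / m
        P_bar, Q_bar = np.zeros(m), np.zeros(n)
        for _ in range(T):
            j = int(np.argmax(P @ M))       # Max's pure best response to Pt
            P_bar += P
            Q_bar[j] += 1
            P *= np.exp(-eta * M[:, j])     # MW update against column j
            P /= P.sum()
        return P_bar / T, Q_bar / T         # approximate minmax / maxmin strategies

    # Rock-Paper-Scissors: both returned strategies come out close to uniform
    M = np.array([[0.5, 1.0, 0.0], [0.0, 0.5, 1.0], [1.0, 0.0, 0.5]])
    P_star, Q_star = solve_game(M)

Note how Q̄ is built up from pure best responses, so it has at most T non-zero entries, as remarked above.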

Page 18: Theory and Applications of A Repeated Game Playing Algorithm

Summary of Game Playing

• presented MW algorithm for repeatedly playing matrix games

• plays almost as well as best fixed strategy in hindsight

• very efficient — independent of # columns

• proved minmax theorem

• MW can also be used to approximately solve a game

Page 19: Theory and Applications of A Repeated Game Playing Algorithm

Application to Boosting

[with Freund]

Page 20: Theory and Applications of A Repeated Game Playing Algorithm

Example: Spam Filtering

• problem: filter out spam (junk email)
• gather large collection of examples of spam and non-spam:

      From: [email protected]   “Rob, can you review a paper...”        → non-spam
      From: [email protected]   “Earn money without working!!!! ...”    → spam
      ...

• goal: have computer learn from examples to distinguish spam from non-spam
• main observation:
  • easy to find “rules of thumb” that are “often” correct
    • e.g.: If ‘viagra’ occurs in message, then predict ‘spam’
  • hard to find single rule that is very highly accurate

Page 21: Theory and Applications of A Repeated Game Playing Algorithm

The Boosting Approach

• given:
  • training examples
  • access to weak learning algorithm for finding weak classifiers (rules of thumb)
• repeat:
  • choose distribution on training examples
    • intuition: put most weight on “hardest” examples
  • train weak learner on distribution to get weak classifier
• combine weak classifiers into final classifier
  • use (weighted) majority vote
• weak learning assumption: weak classifiers slightly but consistently better than random
• want: final classifier to be very accurate

Page 22: Theory and Applications of A Repeated Game Playing Algorithm

Boosting as a Game

• Mindy (row player) ↔ booster
• Max (column player) ↔ weak learner
• matrix M:
  • row ↔ training example
  • column ↔ weak classifier
  • M(i, j) = { 1 if j-th weak classifier correct on i-th training example; 0 else }
  • encodes which weak classifiers are correct on which examples
  • huge # of columns: one for every possible weak classifier

Page 23: Theory and Applications of A Repeated Game Playing Algorithm

Boosting and the Minmax Theorem

• γ-weak learning assumption:
  • for every distribution on examples
  • can find weak classifier with weighted error ≤ 1/2 − γ
• equivalent to:

      (value of game M) ≥ 1/2 + γ

• by minmax theorem, implies that:
  • ∃ some weighted majority classifier that correctly classifies all training examples
  • further, weights are given by maxmin strategy of game M

Page 24: Theory and Applications of A Repeated Game Playing Algorithm

Idea for Boosting

• maxmin strategy of M has perfect (training) accuracy
• find approximately using earlier algorithm for solving a game
  • i.e., apply MW to M

• yields (variant of) AdaBoost
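
To make the correspondence concrete, here is a simplified sketch of my own (a fixed-η variant; real AdaBoost chooses its weights adaptively, and a real weak learner would not see the game matrix explicitly): MW maintains the distribution over training examples, the weak learner best-responds with a classifier, and the chosen classifiers are combined by majority vote.

    import numpy as np

    def boost(correct, T=100, eta=0.5):
        # correct[i, j] = 1 if weak classifier j is right on example i (the game matrix M)
        m, n = correct.shape
        P = np.ones(m) / m                      # distribution over training examples
        votes = np.zeros(n)
        for _ in range(T):
            j = int(np.argmax(P @ correct))     # weak learner: most accurate classifier under P
            votes[j] += 1
            # MW: examples the chosen classifier got right lose weight,
            # so the "hardest" examples get emphasized on later rounds
            P *= np.exp(-eta * correct[:, j])
            P /= P.sum()
        return votes / T                        # weights for the final majority vote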

Page 25: Theory and Applications of A Repeated Game Playing Algorithm

AdaBoost and Game Theory

• AdaBoost can be derived as special case of general game-playing algorithm (MW)
• idea of boosting intimately tied to minmax theorem
• game-theoretic interpretation can further be used to analyze:
  • AdaBoost’s convergence properties
  • AdaBoost as a “large-margin” classifier (margin = measure of confidence, useful for analyzing generalization accuracy)

Page 26: Theory and Applications of A Repeated Game Playing Algorithm

Application: Detecting Faces
[Viola & Jones]

• problem: find faces in photograph or movie

• weak classifiers: detect light/dark rectangles in image

• many clever tricks to make extremely fast and accurate

Page 27: Theory and Applications of A Repeated Game Playing Algorithm

Application to On-line Learning

[with Freund]

Page 28: Theory and Applications of A Repeated Game Playing Algorithm

Example: Predicting Stock Market

• want to predict daily if stock market will go up or down
• every morning:
  • listen to predictions of other forecasters (based on market conditions)
  • formulate own prediction
• every evening:
  • find out if market actually went up or down
• want own predictions to be almost as good as best forecaster

Page 29: Theory and Applications of A Repeated Game Playing Algorithm

On-line Learning

• given access to pool of predictors
  • could be: people, simple fixed functions, other learning algorithms, etc.
  • for our purposes, think of as fixed functions of observable “context”
• on each round:
  • get predictions of all predictors, given current context
  • make own prediction
  • find out correct outcome

• want: # mistakes close to that of best predictor

• training and testing occur simultaneously

• no statistical assumptions — examples chosen by adversary

Page 30: Theory and Applications of A Repeated Game Playing Algorithm

On-line Learning as a Game

• Mindy (row player) ↔ learner
• Max (column player) ↔ adversary
• matrix M:
  • row ↔ predictor
  • column ↔ context
  • M(i, j) = { 1 if i-th predictor wrong on j-th context; 0 else }
• actually, the same game as in boosting, but with roles of players reversed

Page 31: Theory and Applications of A Repeated Game Playing Algorithm

Applying MW

• can apply MW to game M
  → weighted majority algorithm [Littlestone & Warmuth]
• MW analysis gives:

      (# mistakes of MW) ≤ (# mistakes of best predictor) + [“small amount”]

• regret only logarithmic in # predictors
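
A minimal sketch of this style of predictor (my own code; the actual weighted majority algorithm of [Littlestone & Warmuth] differs in details such as the weighting scheme): keep MW weights over the pool, predict by weighted vote, and shrink the weight of every predictor that was wrong.

    import numpy as np

    class WeightedVotePredictor:
        # online binary prediction with multiplicative weights over a pool of predictors
        def __init__(self, n_predictors, eta=0.5):
            self.w = np.ones(n_predictors)
            self.eta = eta

        def predict(self, advice):
            # advice: NumPy array with advice[i] in {0, 1}, the i-th predictor's prediction
            return int(self.w @ advice >= 0.5 * self.w.sum())   # weighted majority vote

        def update(self, advice, outcome):
            mistakes = (advice != outcome).astype(float)         # 0/1 loss per predictor
            self.w *= np.exp(-self.eta * mistakes)               # shrink wrong predictors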

Page 32: Theory and Applications of A Repeated Game Playing Algorithm

Example: Mind-reading Game
[with Freund & Doshi]

• can use to play penny-matching:
  • human and computer each choose a bit
  • if match, computer wins
  • else human wins
• random play wins with 50% probability
  • very hard for humans to play randomly
  • can do better if can predict what opponent will do
• algorithm: MW with many fixed predictors
  • each prediction based on recent history of plays
  • e.g.: if human played 0 on last two rounds, then predict next play will be 1; else predict next play will be 0
• play at: seed.ucsd.edu/∼mindreader

Page 33: Theory and Applications of A Repeated Game Playing Algorithm

Histogram of Scores

[histogram: number of games (vertical axis) vs. final score (horizontal axis, −100 to 100), with regions marked “human wins” and “computer wins”]

• score = (rounds won by human) − (rounds won by computer)

• based on 11,882 games

• computer won 86.6% of games

• average score = −41.0

Page 34: Theory and Applications of A Repeated Game Playing Algorithm

Slower is Better (for People)

[scatter plot: average game duration in seconds (vertical axis) vs. score (horizontal axis)]

• the faster people play, the lower their score

• when play fast, probably fall into rhythmic pattern
  • easy for computer to detect/predict

Page 35: Theory and Applications of A Repeated Game Playing Algorithm

Application to Apprenticeship Learning

[with Syed]

Page 36: Theory and Applications of A Repeated Game Playing Algorithm

Learning How to Behave in Complex Environments

• example: learning to “drive” in toy world
  • states: positions of cars
  • actions: steer left/right; speed up; slow down
  • reward: for going fast without crashing or going off road
• in general, called Markov decision process (MDP)
• goal: find policy that maximizes expected long-term reward
  • policy = mapping from states to actions

Page 37: Theory and Applications of A Repeated Game Playing Algorithm

Unknown Reward Function
[Abbeel & Ng]

• when real-valued reward function known, optimal policy can be found using well-established techniques
• however, often reward function not known
• in driving example:
  • may know reward “features”:
    • faster is better than slower
    • collisions are bad
    • going off road is bad
  • but may not know how to combine these into single real-valued function
• we assume reward is unknown convex combination of known real-valued features

Page 38: Theory and Applications of A Repeated Game Playing Algorithm

Apprenticeship Learning

• also assume get to observe behavior of “expert”
• goal:
  • imitate expert
  • improve on expert, if possible

Page 39: Theory and Applications of A Repeated Game Playing Algorithm

Game-theoretic Approach

• want to find policy such that improvement over expert is as large as possible, for all possible reward functions
• can formulate as game:
  • rows ↔ features
  • columns ↔ policies (note: extremely large number of columns)
• optimal (randomized) policy, under our criterion, is exactly maxmin of game
• can solve (approximately) using MW:
  • to pick “best” column on each round, use methods for standard MDPs with known reward
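
At a very high level the loop looks like the game-solving sketch from earlier, with a planner as the best-response oracle. In the sketch below (my own rough illustration, not the algorithm of [Syed & Schapire], which differs in details such as payoff scaling), mu_expert and best_response_feature_expectations are hypothetical placeholders: the latter stands in for solving the MDP with reward w · features and returning that policy's feature expectations.

    import numpy as np

    def apprenticeship_mw(mu_expert, best_response_feature_expectations, T=100, eta=0.1):
        # rows = reward features, columns = policies;
        # MW runs over feature weights w, the MDP solver supplies the column best response
        k = len(mu_expert)
        w = np.ones(k) / k
        chosen = []                                    # output is a mixture of per-round policies
        for _ in range(T):
            mu_t = best_response_feature_expectations(w)   # hypothetical planner call
            chosen.append(mu_t)
            gains = mu_t - mu_expert                   # per-feature improvement over the expert
            w *= np.exp(-eta * gains)                  # MW: shift weight to features where we lag
            w /= w.sum()
        return chosen                                  # play each round's policy with prob. 1/T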

Page 40: Theory and Applications of A Repeated Game Playing Algorithm

Conclusions

• presented MW algorithm for playing repeated games
  • used to prove minmax theorem
  • can use to approximately solve games efficiently
• many diverse applications and insights
  • boosting
    • key concepts and algorithms are all game-theoretic
  • on-line learning
    • in essence, the same problem as boosting
  • apprenticeship learning
    • new and richer problem, but solvable as a game

Page 41: Theory and Applications of A Repeated Game Playing Algorithm

References

• Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Ninth Annual Conference on Computational Learning Theory, 1996.

• Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.

• Umar Syed and Robert E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20, 2008.

• Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Coming soon (hopefully):

• Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2010(?).