
Theory and Applications of A Repeated Game Playing Algorithm


Page 1: Theory and Applications of A Repeated Game Playing Algorithm

Theory and Applications of A Repeated Game Playing Algorithm

Rob Schapire

Princeton University [currently visiting Yahoo! Research]

Page 2: Theory and Applications of A Repeated Game Playing Algorithm

Learning Is (Often) Just a Game

• some learning problems:
  • learn from training examples to detect faces in images
  • sequentially predict what decision a person will make next (e.g., what link a user will click on)
  • learn to drive toy car by observing how human does it
• these appear to be very different
• this talk:
  • can all be viewed as game-playing problems
  • can all be handled as applications of single algorithm for playing repeated games
• take-home message:
  • game-theoretic algorithms have wide applications in learning
  • reveals deep connections between seemingly unrelated learning problems

Page 3: Theory and Applications of A Repeated Game Playing Algorithm

Outline

• how to play repeated games: theory and an algorithm
• applications:
  • boosting
  • on-line learning
  • apprenticeship learning

Page 4: Theory and Applications of A Repeated Game Playing Algorithm

How to Play Repeated Games

[with Freund]

Page 5: Theory and Applications of A Repeated Game Playing Algorithm

Games

• game defined by matrix M:

                Rock   Paper   Scissors
      Rock      1/2    1       0
      Paper     0      1/2     1
      Scissors  1      0       1/2

• row player (“Mindy”) chooses row i
• column player (“Max”) chooses column j (simultaneously)
• Mindy’s goal: minimize her loss M(i, j)
• assume (wlog) all entries in [0, 1]

Page 6: Theory and Applications of A Repeated Game Playing Algorithm

Randomized Play

• usually allow randomized play:
  • Mindy chooses distribution P over rows
  • Max chooses distribution Q over columns (simultaneously)
• Mindy’s (expected) loss = ∑_{i,j} P(i) M(i,j) Q(j) = PᵀMQ ≡ M(P,Q)
• i, j = “pure” strategies
• P, Q = “mixed” strategies
• m = # rows of M
• also write M(i, Q) and M(P, j) when one side plays pure and the other plays mixed
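
As a concrete illustration (my own sketch, not from the slides), the expected loss PᵀMQ can be computed directly; here the RPS matrix above is used, and all variable names are mine:

    import numpy as np

    # Mindy's loss matrix for Rock-Paper-Scissors (rows/columns: Rock, Paper, Scissors)
    M = np.array([[0.5, 1.0, 0.0],
                  [0.0, 0.5, 1.0],
                  [1.0, 0.0, 0.5]])

    P = np.array([1/3, 1/3, 1/3])   # Mindy's mixed strategy (uniform)
    Q = np.array([1.0, 0.0, 0.0])   # Max plays the pure strategy Rock

    expected_loss = P @ M @ Q       # M(P, Q) = P^T M Q
    print(expected_loss)            # 0.5 -- uniform play gives loss 1/2 against any Q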

Page 7: Theory and Applications of A Repeated Game Playing Algorithm

Sequential Play

• say Mindy plays before Max
• if Mindy chooses P then Max will pick Q to maximize M(P,Q) ⇒ loss will be

      L(P) ≡ max_Q M(P,Q) = max_j M(P, j)

  [note: maximum realized at pure strategy]
• so Mindy should pick P to minimize L(P) ⇒ loss will be

      min_P L(P) = min_P max_Q M(P,Q)

• similarly, if Max plays first, loss will be

      max_Q min_P M(P,Q)

Page 8: Theory and Applications of A Repeated Game Playing Algorithm

Minmax Theorem

• playing second (with knowledge of other player’s move) cannot be worse than playing first, so:

      min_P max_Q M(P,Q)   ≥   max_Q min_P M(P,Q)
      [Mindy plays first]       [Mindy plays second]

• von Neumann’s minmax theorem:

      min_P max_Q M(P,Q) = max_Q min_P M(P,Q)

• in words: no advantage to playing second

Page 9: Theory and Applications of A Repeated Game Playing Algorithm

Optimal Play

• minmax theorem:

      min_P max_Q M(P,Q) = max_Q min_P M(P,Q) = value v of game

• optimal strategies:
  • P∗ = argmin_P max_Q M(P,Q) = minmax strategy
  • Q∗ = argmax_Q min_P M(P,Q) = maxmin strategy
• in words:
  • Mindy’s minmax strategy P∗ guarantees loss ≤ v (regardless of Max’s play)
  • optimal because Max has maxmin strategy Q∗ that can force loss ≥ v (regardless of Mindy’s play)
• e.g.: in RPS, P∗ = Q∗ = uniform
• solving game = finding minmax/maxmin strategies

Page 10: Theory and Applications of A Repeated Game Playing Algorithm

Weaknesses of Classical Theory

• seems to fully answer how to play games: just compute minmax strategy (e.g., using linear programming)
• weaknesses:
  • game M may be unknown
  • game M may be extremely large
  • opponent may not be fully adversarial
    • may be possible to do better than value v
• e.g.:
  Lisa (thinks): Poor predictable Bart, always takes Rock.
  Bart (thinks): Good old Rock, nothing beats that.
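
When the game is small and fully known, computing the minmax strategy really is routine; as an aside (my own sketch, not from the talk), here is how the RPS minmax strategy can be found with an off-the-shelf LP solver:

    import numpy as np
    from scipy.optimize import linprog

    # minimize v subject to: sum_i P(i) M(i,j) <= v for every column j,
    #                        sum_i P(i) = 1,  P >= 0
    M = np.array([[0.5, 1.0, 0.0],
                  [0.0, 0.5, 1.0],
                  [1.0, 0.0, 0.5]])
    m, n = M.shape
    c = np.r_[np.zeros(m), 1.0]                   # variables (P, v); objective is v
    A_ub = np.c_[M.T, -np.ones(n)]                # M^T P - v <= 0
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)  # P sums to 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    P_star, v = res.x[:m], res.x[-1]              # P* = uniform, v = 1/2 for RPS

The weaknesses listed above are exactly the situations where this kind of direct computation breaks down.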

Page 11: Theory and Applications of A Repeated Game Playing Algorithm

Repeated Play

• if only playing once, hopeless to overcome ignorance of game M or opponent
• but if game played repeatedly, may be possible to learn to play well
• goal: play (almost) as well as if knew game and how opponent would play ahead of time

Page 12: Theory and Applications of A Repeated Game Playing Algorithm

Repeated Play (cont.)

• M unknown
• for t = 1, . . . , T:
  • Mindy chooses Pt
  • Max chooses Qt (possibly depending on Pt)
  • Mindy’s loss = M(Pt, Qt)
  • Mindy observes loss M(i, Qt) of each pure strategy i
• want:

      (1/T) ∑_{t=1}^T M(Pt, Qt)  ≤  min_P (1/T) ∑_{t=1}^T M(P, Qt) + [“small amount”]
      [actual average loss]          [best loss (in hindsight)]

Page 13: Theory and Applications of A Repeated Game Playing Algorithm

Multiplicative-weights Algorithm (MW)

• choose η > 0
• initialize: P1 = uniform
• on round t:

      Pt+1(i) = Pt(i) exp(−η M(i, Qt)) / normalization

• idea: decrease weight of strategies suffering the most loss
• directly generalizes [Littlestone & Warmuth]
• other algorithms:
  • [Hannan ’57]
  • [Blackwell ’56]
  • [Foster & Vohra]
  • [Fudenberg & Levine]
  • ...
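
A minimal sketch of one MW round in code (my own illustration; the names and the choice of η are not from the talk):

    import numpy as np

    def mw_update(P_t, losses, eta):
        # P_t    : current distribution over the m rows (pure strategies)
        # losses : losses[i] = M(i, Qt), each entry in [0, 1]
        # eta    : learning rate, eta > 0
        w = P_t * np.exp(-eta * losses)   # shrink weight of high-loss strategies
        return w / w.sum()                # renormalize to a distribution

    # example: start uniform over m = 4 rows and apply one round of observed losses
    P = np.ones(4) / 4
    P = mw_update(P, losses=np.array([1.0, 0.0, 0.5, 0.25]), eta=0.5)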

Page 14: Theory and Applications of A Repeated Game Playing Algorithm

Analysis

• Theorem: can choose η so that, for any game M with m rows, and any opponent,

      (1/T) ∑_{t=1}^T M(Pt, Qt)  ≤  min_P (1/T) ∑_{t=1}^T M(P, Qt) + ΔT
      [actual average loss]          [best average loss (≤ v)]

  where ΔT = O(√(ln m / T)) → 0

• regret ΔT is:
  • logarithmic in # rows m
  • independent of # columns
• running time (of MW) is:
  • linear in # rows m
  • independent of # columns

• therefore, can use when working with very large games
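
For a rough sense of scale (an illustrative calculation of my own, not from the slides): with m = 10^6 rows and T = 10,000 rounds, √(ln m / T) = √(13.8 / 10,000) ≈ 0.04, so MW’s average loss exceeds that of the best fixed strategy by only a few hundredths even though the game has a million rows.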

Page 15: Theory and Applications of A Repeated Game Playing Algorithm

Can Prove Minmax Theorem as Corollary

• want to prove:

      min_P max_Q M(P,Q) ≤ max_Q min_P M(P,Q)

  (already argued “≥”)
• game M played repeatedly
• Mindy plays using MW
• on round t, Max chooses best response:

      Qt = argmax_Q M(Pt, Q)

  [note: Qt can be pure]
• let

      P̄ = (1/T) ∑_{t=1}^T Pt,    Q̄ = (1/T) ∑_{t=1}^T Qt

Page 16: Theory and Applications of A Repeated Game Playing Algorithm

Proof of Minmax Theorem

      min_P max_Q PᵀMQ  ≤  max_Q P̄ᵀMQ
                         =  max_Q (1/T) ∑_{t=1}^T PtᵀMQ
                         ≤  (1/T) ∑_{t=1}^T max_Q PtᵀMQ
                         =  (1/T) ∑_{t=1}^T PtᵀMQt
                         ≤  min_P (1/T) ∑_{t=1}^T PᵀMQt + ΔT
                         =  min_P PᵀMQ̄ + ΔT
                         ≤  max_Q min_P PᵀMQ + ΔT

Page 17: Theory and Applications of A Repeated Game Playing Algorithm

Solving a Game

• derivation shows that:

      max_Q M(P̄, Q) ≤ v + ΔT    and    min_P M(P, Q̄) ≥ v − ΔT

• so: P̄ and Q̄ are ΔT-approximate minmax and maxmin strategies
• further: can choose Qt’s to be pure
  ⇒ Q̄ = (1/T) ∑_t Qt will be sparse (≤ T non-zero entries)
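
Putting the pieces together, here is a rough sketch (my own code, not from the talk) of approximately solving a game with MW: run MW on the rows, let the column player best-respond each round, and return the averaged strategies P̄ and Q̄:

    import numpy as np

    def solve_game(M, T=1000):
        # approximately solve the game M (all entries in [0, 1]) with MW
        m, n = M.shape
        eta = np.sqrt(np.log(m) / T)        # one reasonable schedule; exact constants vary
        P = np.ones(m) / m
        P_bar, Q_bar = np.zeros(m), np.zeros(n)
        for _ in range(T):
            j = int(np.argmax(P @ M))       # Max's pure best response to Pt
            P_bar += P
            Q_bar[j] += 1
            P *= np.exp(-eta * M[:, j])     # MW update against column j
            P /= P.sum()
        return P_bar / T, Q_bar / T         # approximate minmax / maxmin strategies

    # Rock-Paper-Scissors: both returned strategies come out close to uniform
    M = np.array([[0.5, 1.0, 0.0], [0.0, 0.5, 1.0], [1.0, 0.0, 0.5]])
    P_star, Q_star = solve_game(M)

Note how Q̄ is built up from pure best responses, so it has at most T non-zero entries, as remarked above.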

Page 18: Theory and Applications of A Repeated Game Playing Algorithm

Summary of Game Playing

• presented MW algorithm for repeatedly playing matrix games

• plays almost as well as best fixed strategy in hindsight

• very efficient — independent of # columns

• proved minmax theorem

• MW can also be used to approximately solve a game

Page 19: Theory and Applications of A Repeated Game Playing Algorithm

Application to Boosting

[with Freund]

Page 20: Theory and Applications of A Repeated Game Playing Algorithm

Example: Spam Filtering

• problem: filter out spam (junk email)
• gather large collection of examples of spam and non-spam:

      From: [email protected]   “Rob, can you review a paper...”        → non-spam
      From: [email protected]   “Earn money without working!!!! ...”    → spam
      ...

• goal: have computer learn from examples to distinguish spam from non-spam
• main observation:
  • easy to find “rules of thumb” that are “often” correct
    • e.g.: If ‘viagra’ occurs in message, then predict ‘spam’
  • hard to find single rule that is very highly accurate

Page 21: Theory and Applications of A Repeated Game Playing Algorithm

The Boosting Approach

• given:
  • training examples
  • access to weak learning algorithm for finding weak classifiers (rules of thumb)
• repeat:
  • choose distribution on training examples
    • intuition: put most weight on “hardest” examples
  • train weak learner on distribution to get weak classifier
• combine weak classifiers into final classifier
  • use (weighted) majority vote
• weak learning assumption: weak classifiers slightly but consistently better than random
• want: final classifier to be very accurate

Page 22: Theory and Applications of A Repeated Game Playing Algorithm

Boosting as a Game

• Mindy (row player) ↔ booster
• Max (column player) ↔ weak learner
• matrix M:
  • row ↔ training example
  • column ↔ weak classifier
  • M(i, j) = { 1 if j-th weak classifier correct on i-th training example; 0 else }
  • encodes which weak classifiers are correct on which examples
  • huge # of columns: one for every possible weak classifier

Page 23: Theory and Applications of A Repeated Game Playing Algorithm

Boosting and the Minmax Theorem

• γ-weak learning assumption:
  • for every distribution on examples
  • can find weak classifier with weighted error ≤ 1/2 − γ
• equivalent to:

      (value of game M) ≥ 1/2 + γ

• by minmax theorem, implies that:
  • ∃ some weighted majority classifier that correctly classifies all training examples
  • further, weights are given by maxmin strategy of game M

Page 24: Theory and Applications of A Repeated Game Playing Algorithm

Idea for Boosting

• maxmin strategy of M has perfect (training) accuracy
• find approximately using earlier algorithm for solving a game
  • i.e., apply MW to M

• yields (variant of) AdaBoost
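
To make the correspondence concrete, here is a simplified sketch of my own (a fixed-η variant; real AdaBoost chooses its weights adaptively, and a real weak learner would not see the game matrix explicitly): MW maintains the distribution over training examples, the weak learner best-responds with a classifier, and the chosen classifiers are combined by majority vote.

    import numpy as np

    def boost(correct, T=100, eta=0.5):
        # correct[i, j] = 1 if weak classifier j is right on example i (the game matrix M)
        m, n = correct.shape
        P = np.ones(m) / m                      # distribution over training examples
        votes = np.zeros(n)
        for _ in range(T):
            j = int(np.argmax(P @ correct))     # weak learner: most accurate classifier under P
            votes[j] += 1
            # MW: examples the chosen classifier got right lose weight,
            # so the "hardest" examples get emphasized on later rounds
            P *= np.exp(-eta * correct[:, j])
            P /= P.sum()
        return votes / T                        # weights for the final majority vote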

Page 25: Theory and Applications of A Repeated Game Playing Algorithm

AdaBoost and Game Theory

• AdaBoost can be derived as special case of general game-playing algorithm (MW)
• idea of boosting intimately tied to minmax theorem
• game-theoretic interpretation can further be used to analyze:
  • AdaBoost’s convergence properties
  • AdaBoost as a “large-margin” classifier (margin = measure of confidence, useful for analyzing generalization accuracy)

Page 26: Theory and Applications of A Repeated Game Playing Algorithm

Application: Detecting Faces
[Viola & Jones]

• problem: find faces in photograph or movie

• weak classifiers: detect light/dark rectangles in image

• many clever tricks to make extremely fast and accurate

Page 27: Theory and Applications of A Repeated Game Playing Algorithm

Application to On-line Learning

[with Freund]

Page 28: Theory and Applications of A Repeated Game Playing Algorithm

Example: Predicting Stock Market

• want to predict daily if stock market will go up or down
• every morning:
  • listen to predictions of other forecasters (based on market conditions)
  • formulate own prediction
• every evening:
  • find out if market actually went up or down
• want own predictions to be almost as good as best forecaster

Page 29: Theory and Applications of A Repeated Game Playing Algorithm

On-line Learning

• given access to pool of predictors
  • could be: people, simple fixed functions, other learning algorithms, etc.
  • for our purposes, think of as fixed functions of observable “context”
• on each round:
  • get predictions of all predictors, given current context
  • make own prediction
  • find out correct outcome

• want: # mistakes close to that of best predictor

• training and testing occur simultaneously

• no statistical assumptions — examples chosen by adversary

Page 30: Theory and Applications of A Repeated Game Playing Algorithm

On-line Learning as a Game

• Mindy (row player) ↔ learner
• Max (column player) ↔ adversary
• matrix M:
  • row ↔ predictor
  • column ↔ context
  • M(i, j) = { 1 if i-th predictor wrong on j-th context; 0 else }
• actually, the same game as in boosting, but with roles of players reversed

Page 31: Theory and Applications of A Repeated Game Playing Algorithm

Applying MW

• can apply MW to game M
  → weighted majority algorithm [Littlestone & Warmuth]
• MW analysis gives:

      (# mistakes of MW) ≤ (# mistakes of best predictor) + [“small amount”]

• regret only logarithmic in # predictors
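
A minimal sketch of this style of predictor (my own code; the actual weighted majority algorithm of [Littlestone & Warmuth] differs in details such as the weighting scheme): keep MW weights over the pool, predict by weighted vote, and shrink the weight of every predictor that was wrong.

    import numpy as np

    class WeightedVotePredictor:
        # online binary prediction with multiplicative weights over a pool of predictors
        def __init__(self, n_predictors, eta=0.5):
            self.w = np.ones(n_predictors)
            self.eta = eta

        def predict(self, advice):
            # advice: NumPy array with advice[i] in {0, 1}, the i-th predictor's prediction
            return int(self.w @ advice >= 0.5 * self.w.sum())   # weighted majority vote

        def update(self, advice, outcome):
            mistakes = (advice != outcome).astype(float)         # 0/1 loss per predictor
            self.w *= np.exp(-self.eta * mistakes)               # shrink wrong predictors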

Page 32: Theory and Applications of A Repeated Game Playing Algorithm

Example: Mind-reading Game
[with Freund & Doshi]

• can use to play penny-matching:
  • human and computer each choose a bit
  • if match, computer wins
  • else human wins
• random play wins with 50% probability
  • very hard for humans to play randomly
  • can do better if can predict what opponent will do
• algorithm: MW with many fixed predictors
  • each prediction based on recent history of plays
  • e.g.: if human played 0 on last two rounds, then predict next play will be 1; else predict next play will be 0
• play at: seed.ucsd.edu/∼mindreader

Page 33: Theory and Applications of A Repeated Game Playing Algorithm

Histogram of Scores

[histogram: number of games (vertical axis) vs. final score (horizontal axis, −100 to 100), with regions marked “human wins” and “computer wins”]

• score = (rounds won by human) − (rounds won by computer)

• based on 11,882 games

• computer won 86.6% of games

• average score = −41.0

Page 34: Theory and Applications of A Repeated Game Playing Algorithm

Slower is Better (for People)

[scatter plot: average game duration in seconds (vertical axis) vs. score (horizontal axis)]

• the faster people play, the lower their score

• when play fast, probably fall into rhythmic pattern
  • easy for computer to detect/predict

Page 35: Theory and Applications of A Repeated Game Playing Algorithm

Application to Apprenticeship Learning

[with Syed]

Page 36: Theory and Applications of A Repeated Game Playing Algorithm

Learning How to Behave in Complex Environments

• example: learning to “drive” in toy world
  • states: positions of cars
  • actions: steer left/right; speed up; slow down
  • reward: for going fast without crashing or going off road
• in general, called Markov decision process (MDP)
• goal: find policy that maximizes expected long-term reward
  • policy = mapping from states to actions

Page 37: Theory and Applications of A Repeated Game Playing Algorithm

Unknown Reward Function
[Abbeel & Ng]

• when real-valued reward function known, optimal policy can be found using well-established techniques
• however, often reward function not known
• in driving example:
  • may know reward “features”:
    • faster is better than slower
    • collisions are bad
    • going off road is bad
  • but may not know how to combine these into single real-valued function
• we assume reward is unknown convex combination of known real-valued features

Page 38: Theory and Applications of A Repeated Game Playing Algorithm

Apprenticeship Learning

• also assume get to observe behavior of “expert”
• goal:
  • imitate expert
  • improve on expert, if possible

Page 39: Theory and Applications of A Repeated Game Playing Algorithm

Game-theoretic Approach

• want to find policy such that improvement over expert is as large as possible, for all possible reward functions
• can formulate as game:
  • rows ↔ features
  • columns ↔ policies (note: extremely large number of columns)
• optimal (randomized) policy, under our criterion, is exactly maxmin of game
• can solve (approximately) using MW:
  • to pick “best” column on each round, use methods for standard MDPs with known reward
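
At a very high level the loop looks like the game-solving sketch from earlier, with a planner as the best-response oracle. In the sketch below (my own rough illustration, not the algorithm of [Syed & Schapire], which differs in details such as payoff scaling), mu_expert and best_response_feature_expectations are hypothetical placeholders: the latter stands in for solving the MDP with reward w · features and returning that policy's feature expectations.

    import numpy as np

    def apprenticeship_mw(mu_expert, best_response_feature_expectations, T=100, eta=0.1):
        # rows = reward features, columns = policies;
        # MW runs over feature weights w, the MDP solver supplies the column best response
        k = len(mu_expert)
        w = np.ones(k) / k
        chosen = []                                    # output is a mixture of per-round policies
        for _ in range(T):
            mu_t = best_response_feature_expectations(w)   # hypothetical planner call
            chosen.append(mu_t)
            gains = mu_t - mu_expert                   # per-feature improvement over the expert
            w *= np.exp(-eta * gains)                  # MW: shift weight to features where we lag
            w /= w.sum()
        return chosen                                  # play each round's policy with prob. 1/T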

Page 40: Theory and Applications of A Repeated Game Playing Algorithm

Conclusions

• presented MW algorithm for playing repeated games
  • used to prove minmax theorem
  • can use to approximately solve games efficiently
• many diverse applications and insights
  • boosting
    • key concepts and algorithms are all game-theoretic
  • on-line learning
    • in essence, the same problem as boosting
  • apprenticeship learning
    • new and richer problem, but solvable as a game

Page 41: Theory and Applications of A Repeated Game Playing Algorithm

References

• Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Ninth Annual Conference on Computational Learning Theory, 1996.

• Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.

• Umar Syed and Robert E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20, 2008.

• Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Coming soon (hopefully):

• Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2010(?).