Learning an Effective Strategy in a Multi-Agent System with Hidden Information
Richard Mealing
Supervisor: Jon Shapiro
Machine Learning and Optimisation Group
School of Computer Science
University of Manchester
1 / 31
Our Problem: Maximising Reward with an Opponent
We focus on the simplest case with just 2 agents
Each agent is trying to maximise its own rewards
But each agent's actions can affect the other agent's rewards
2 / 31
Our Proposal: Predict and Adapt to the Future
Before maximising our rewards we learn:
What our rewards are for actions - use reinforcement/no-regret learning
How the opponent will act - use sequence prediction methods
To maximise our rewards:
Lookahead - take the actions with the maximum expected reward
Simulate - adapt our strategy to rewards against the opponent model
Hidden information - what did the opponent base their decision on?
Learn the hidden information using online expectation maximisation
3 / 31
Why Games?
Games let you focus on the agent and worry less about the environment
Well-defined rules and clear goals
Can allow easy agent comparisons
Can allow complex strategies
Game theory gives a foundation
4 / 31
Artificial Intelligence Success in Games
Year  Game               Success
1979  Backgammon         BKG 9.8 beat world champion Luigi Villa [1]
1994  Checkers           Chinook beat world champion Marion Tinsley [2]
1995  Scrabble           Quackle beat former champion David Boys [3]
1997  Chess              Deep Blue beat world champion Garry Kasparov [4]
1997  Othello (Reversi)  Logistello beat world champion Takeshi Murakami [5]
2006  Go                 Crazy Stone beat various pros [6]
2008  Poker              Polaris beat various pros in heads-up limit Texas hold'em [7]
2011  Jeopardy!          Watson beat former winners Brad Rutter and Ken Jennings [8]
5 / 31
Perfect and Imperfect Information
Perfect information - players always know the state, e.g.
Tic Tac Toe, Checkers
Imperfect information - at some point a player doesn't know the state, e.g.
Rock Paper Scissors, Poker
6 / 31
First Approach
1 Reinforcement learning (Q-Learning) to learn our own rewards
2 Sequence prediction to learn the opponent’s strategy
3 Exhaustive explicit lookahead (to a limited depth) with 1 and 2 to take the actions with the maximum expected reward
Outperforms state-of-the-art reinforcement learning agents¹ in:
Rock Paper Scissors, Prisoner's Dilemma, Littman's Soccer [10]
¹Richard Mealing and Jonathan L. Shapiro. "Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games". In: 12th International Conference on Artificial Intelligence and Soft Computing. 2013.
7 / 31
Reinforcement Learning
We use Q(uality)-Learning to learn the rewards for action sequences
Comparison agents use Q-Learning or Q-Learning based methods
Q-Learning learns the expected value of taking an action in a state and then following a fixed strategy [11]

$Q(s^t, a^t_{pla}) \leftarrow (1 - \alpha)\,Q(s^t, a^t_{pla}) + \alpha\left[r^t + \gamma \max_{a^{t+1}_{pla}} Q(s^{t+1}, a^{t+1}_{pla})\right]$

$s^t$ = state at time t
$\alpha$ = learning rate
$\gamma$ = discount factor
$a^t_{pla}$ = player's action at time t
$r^t$ = reward at time t

We use $Q(s^t, a^t_{pla})$ with lookahead and some exploration
Comparison agents select $\arg\max_{a^t_{pla}} Q(s^t, a^t_{pla})$ with some exploration
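As a concrete illustration, a tabular Q-Learning update might look like the following minimal Python sketch (the state encoding, the ε-greedy exploration scheme, and the reward signal are illustrative assumptions, not the presentation's implementation):

```python
import random
from collections import defaultdict

class QLearner:
    """Minimal tabular Q-Learning sketch (illustrative only)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)            # Q[(state, action)] -> expected value
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def update(self, state, action, reward, next_state):
        # Q(s,a) <- (1 - alpha) Q(s,a) + alpha [ r + gamma * max_a' Q(s',a') ]
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        self.q[(state, action)] = ((1 - self.alpha) * self.q[(state, action)]
                                   + self.alpha * (reward + self.gamma * best_next))

    def select_action(self, state):
        # epsilon-greedy: act greedily on Q with some exploration
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])
```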
8 / 31
Sequence Prediction
Markov model - the probability of the opponent's action $a^t_{opp}$ depends only on the current state $s^t$

$\Pr(a^t_{opp} \mid s^t)$

Sequence prediction - the probability of the opponent's action depends on a history H

$\Pr(a^t_{opp} \mid H)$ where $H \subseteq \{s^t, a^{t-1}, s^{t-1}, a^{t-2}, s^{t-2}, \ldots, a^1, s^1\}$
9 / 31
Sequence Prediction Methods
Long-term memory L - a set of distributions, each one conditioned on a different history H

$L = \{\Pr(a^t_{opp} \mid H) : H \subseteq \{s^t, a^{t-1}, s^{t-1}, a^{t-2}, s^{t-2}, \ldots, a^1, s^1\}\}$

Short-term memory S - a list of recent observations (states/actions)

$S = (o^t, o^{t-1}, o^{t-2}, \ldots, o^{t-n})$

Observing a symbol $o^t$
1 Generate a set of histories $\mathcal{H} = \{H_1, H_2, \ldots\}$ using S
2 For each $H \in \mathcal{H}$ create/update $\Pr(a^t_{opp} \mid H)$ using $o^t$
3 Add $o^t$ to S (remove the oldest observation if needed)

Predicting an opponent action $a^t_{opp}$
1 Generate a set of histories $\mathcal{H} = \{H_1, H_2, \ldots\}$ using S
2 Predict using $\{\Pr(a^t_{opp} \mid H) : H \in \mathcal{H}\}$
10 / 31
Sequence Prediction Method Example
Entropy Learned Pruned Hypothesis Space [12]:
Inputs: memory size n and entropy threshold $0 \le e \le 1$

Observing a symbol $o^t$
1 Generate the powerset $\mathcal{P}(S) = \mathcal{H}$ of short-term memory S
  $S = (o^t, o^{t-1}, o^{t-2}, \ldots, o^{t-n})$
  $\mathcal{P}(S) = \{\{\}, \{o_1\}, \ldots, \{o_n\}, \{o_1, o_2\}, \ldots, \{o_1, o_n\}, \ldots, \{o_1, o_2, \ldots, o_n\}\}$
2 For each $H \in \mathcal{H}$ create/update $\Pr(a^t_{opp} \mid H)$ using $o^t$
3 For each $H \in \mathcal{H}$ if $\mathrm{Entropy}(\Pr(a^t_{opp} \mid H)) > e$ then discard it
4 Add $o^t$ to S (remove the oldest observation if $|S| > n$)

Predicting an opponent action $a^t_{opp}$
1 Generate the powerset $\mathcal{P}(S) = \mathcal{H}$ of short-term memory S
2 Predict using $\arg\min_{\Pr(a^t_{opp} \mid H)} \mathrm{Entropy}(\Pr(a^t_{opp} \mid H))$ over all $H \in \mathcal{H}$
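To make the observe/predict cycle concrete, here is a minimal ELPH-style predictor sketched in Python (the data structures, plain Shannon entropy, and pruning details are assumptions based on the description above, not the original implementation from [12]):

```python
from collections import Counter, deque
from itertools import combinations
from math import log2

class ELPH:
    """Sketch of Entropy Learned Pruned Hypothesis space prediction."""

    def __init__(self, memory_size, entropy_threshold):
        self.e = entropy_threshold
        self.short_term = deque(maxlen=memory_size)   # recent observations
        self.long_term = {}                           # history (subset) -> Counter of next symbols

    def _hypotheses(self):
        # every non-empty subset of the short-term memory
        items = tuple(self.short_term)
        for r in range(1, len(items) + 1):
            yield from combinations(items, r)

    @staticmethod
    def _entropy(counter):
        total = sum(counter.values())
        return -sum((c / total) * log2(c / total) for c in counter.values())

    def observe(self, symbol):
        # create/update each hypothesis with the observed symbol, prune high-entropy ones
        for h in self._hypotheses():
            dist = self.long_term.setdefault(h, Counter())
            dist[symbol] += 1
            if self._entropy(dist) > self.e:
                del self.long_term[h]
        self.short_term.append(symbol)

    def predict(self):
        # predict from the lowest-entropy hypothesis consistent with short-term memory
        candidates = [self.long_term[h] for h in self._hypotheses() if h in self.long_term]
        if not candidates:
            return None
        return min(candidates, key=self._entropy).most_common(1)[0][0]
```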
11 / 31
Lookahead Example
      D     C
D    1,1   4,0
C    0,4   3,3

(Prisoner's Dilemma payoff matrix: first number is the row player's reward, second is the column player's; D = Defect, C = Cooperate)
12 / 31
Lookahead Example
Defect is the dominant action (highest reward)
Cooperate-Cooperate is socially optimal (highest sum of rewards)
Tit-for-tat (copy opponent’s last move) is good for repeated play
Can we learn to play optimally against tit-for-tat?
13 / 31
Lookahead Example
[Figure: one-step lookahead tree over the payoff matrix above. If the opponent is predicted to play C, playing D yields 4 and C yields 3; if the opponent is predicted to play D, playing D yields 1 and C yields 0.]
14 / 31
Lookahead Example
[Figure: the same one-step lookahead tree as the previous slide.]
With a lookahead of 1, D has the highest reward
With a lookahead of 2, the sequence (D,C,D,C) has the highest total reward (but is unlikely to occur)
Assume the opponent copies the player’s last move (i.e. tit-for-tat)
15 / 31
Lookahead Example
[Figure: two-step lookahead tree against tit-for-tat. If the opponent is predicted to cooperate first: playing D (reward 4) leads to a predicted D, after which D totals 5 and C totals 4; playing C (reward 3) leads to a predicted C, after which D totals 7 and C totals 6. If the opponent is predicted to defect first: playing D (reward 1) leads to a predicted D, after which D totals 2 and C totals 1; playing C (reward 0) leads to a predicted C, after which D totals 4 and C totals 3.]
16 / 31
Lookahead Example
[Figure: the same two-step lookahead tree as the previous slide.]
With a lookahead of 2 against tit-for-tat, C has the highest expected total reward
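A minimal sketch of this kind of depth-limited lookahead, in Python, might look as follows (the opponent-model, reward, and transition interfaces are illustrative assumptions, not the thesis implementation):

```python
def lookahead(state, depth, reward, opponent_model, transition, actions):
    """Return (best_action, best_total_reward) by exhaustive lookahead to `depth`.

    reward(state, our_action, opp_action)     -> immediate reward (e.g. learned Q-values)
    opponent_model(state)                     -> predicted opponent action (sequence prediction)
    transition(state, our_action, opp_action) -> next state
    """
    predicted_opp = opponent_model(state)
    best_action, best_value = None, float("-inf")
    for action in actions:
        value = reward(state, action, predicted_opp)
        if depth > 1:
            next_state = transition(state, action, predicted_opp)
            _, future = lookahead(next_state, depth - 1, reward,
                                  opponent_model, transition, actions)
            value += future
        if value > best_value:
            best_action, best_value = action, value
    return best_action, best_value
```

Against a tit-for-tat opponent model in the prisoner's dilemma, a depth of 2 or more makes the longer-run value of cooperating visible, which a depth of 1 cannot.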
17 / 31
Results of First Approach
Converges to higher average payoffs per game at faster rates than reinforcement learning algorithms, such that in...

Iterated rock-paper-scissors: learns to best-respond against variable-Markov models
Iterated prisoner's dilemma: comes first in a tournament against finite automata
Littman's soccer: wins 70% of games against reinforcement learning algorithms
18 / 31
Summary of First Approach
We associate a sequence predictor with each game state
During a game we update our:
Rewards for action sequences using Q-Learning
Sequence predictors with observed opponent actions
At each decision point we look ahead and take the first action of an action sequence with the maximum expected cumulative reward
19 / 31
Second Approach
1 Sequence prediction to learn the opponent’s strategy
2 Online expectation maximisation [13, 14] to predict the opponent's hidden information (to know H to update our opponent model)
3 No-regret learning algorithm to adjust our strategy
4 Simulate games against our opponent model
Improves no-regret algorithm performance vs itself, a state-of-the-art reinforcement learning agent and a popular bandit algorithm in:
Die-Roll Poker [15], Rhode Island Hold'em [16]
20 / 31
Online Expectation Maximisation
A rational agent will act based on its hidden information
At the end of a game, we have observed the opponent's (public) actions but not necessarily their hidden information (e.g. they folded)
Expectation step:
1 For each possible instance of hidden information the opponent could hold, calculate the probability of their actions
2 Normalise these probabilities
Each normalised probability corresponds to the expected number of opponent visits to the path associated with that hidden information

Maximisation step: update the opponent's action probabilities along each path to account for their expected number of visits
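A minimal sketch of one such online E/M update for a single hand, in Python, could look as follows (the opponent-model representation, the uniform treatment of unseen decision points, and the fractional visit counts are assumptions for illustration, not the thesis code):

```python
def online_em_update(opponent_model, hidden_candidates, observed_actions):
    """One E/M step after a game in which the opponent's hidden information was not revealed.

    opponent_model[(hidden, decision_point)] -> dict mapping action -> visit count
    hidden_candidates -> hidden information (e.g. cards) the opponent could have held
    observed_actions  -> list of (decision_point, action) pairs the opponent actually played
    """
    # Expectation: likelihood of the observed action sequence under each hidden candidate
    likelihoods = {}
    for hidden in hidden_candidates:
        p = 1.0
        for decision_point, action in observed_actions:
            counts = opponent_model.get((hidden, decision_point), {})
            total = sum(counts.values())
            # unseen decision points contribute no evidence (treated as probability 1)
            p *= counts.get(action, 0) / total if total > 0 else 1.0
        likelihoods[hidden] = p

    norm = sum(likelihoods.values())
    responsibilities = {h: (p / norm if norm > 0 else 1.0 / len(likelihoods))
                        for h, p in likelihoods.items()}

    # Maximisation: credit each path with its expected (fractional) number of visits
    for hidden, weight in responsibilities.items():
        for decision_point, action in observed_actions:
            counts = opponent_model.setdefault((hidden, decision_point), {})
            counts[action] = counts.get(action, 0.0) + weight
    return responsibilities
```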
21 / 31
Online Expectation Maximisation
[Figure: game tree for a simplified two-card poker game in which each player is dealt a Jack or a King with probability 0.5, showing information sets I1-I8, the modelled probability of each action at each set, and player 1's payoff at each terminal node.]
J = Jack, K = King, F = Fold, C = Call, R = Raise
22 / 31
Online Expectation Maximisation
[Figure: the same game tree as the previous slide.]
Assume we are P1 and we got a Jack; opponent P2 got either a Jack or a King
23 / 31
Online Expectation Maximisation
[Figure: the same game tree as the previous slides. In it, the modelled opponent folds to a raise with probability 1.0 when holding a Jack (I4) and with probability 0.0 when holding a King (I6).]
$\Pr((J, J, R, F) \mid \sigma_{-i}) = 1$ and $\Pr((J, K, R, F) \mid \sigma_{-i}) = 0$

Update visits to $(J, J, R, F)$ by 1 and to $(J, K, R, F)$ by 0
24 / 31
No-Regret Learning
Our no-regret method is based on counterfactual regret minimisation
State-of-the-art algorithm that provably minimises regret in two-player, zero-sum, imperfect information games [17]
In self-play its average strategy profile approaches a Nash equilibrium
Can handle games with $10^{12}$ states ($10^{10}$ states was the previous limit, using Nesterov's excessive gap technique; limit poker has $10^{18}$ states)
Needs the opponent's strategy; we use an online version that removes this requirement
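At each information set, counterfactual regret minimisation chooses its current strategy by regret matching over the accumulated counterfactual regrets. A minimal sketch of that per-information-set update in Python (a simplified illustration, not the online variant used here):

```python
def regret_matching_strategy(cumulative_regret):
    """Current strategy at an information set, proportional to positive cumulative regret."""
    positive = {a: max(r, 0.0) for a, r in cumulative_regret.items()}
    total = sum(positive.values())
    if total > 0:
        return {a: r / total for a, r in positive.items()}
    return {a: 1.0 / len(cumulative_regret) for a in cumulative_regret}  # uniform fallback

def accumulate_regret(cumulative_regret, counterfactual_values, strategy):
    """Add this iteration's regrets r_i(I,a) = v_i(I|sigma_{I->a}) - v_i(I|sigma)."""
    baseline = sum(strategy[a] * counterfactual_values[a] for a in strategy)
    for a in cumulative_regret:
        cumulative_regret[a] += counterfactual_values[a] - baseline
```

The counterfactual value and regret definitions used here are given in the appendix.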
25 / 31
Results of Second Approach
Has higher average payoffs per game and a higher final performance than the no-regret algorithm on its own, such that in...

Die-roll poker and Rhode Island hold'em: learns to win against all opponents (except near-Nash opponents, against which it draws)

But online expectation maximisation seems less effective in Rhode Island hold'em compared to die-roll poker - investigating why
26 / 31
Summary of Second Approach
We associate a sequence predictor with each game state from the opponent's perspective (opponent information set)

At the end of a game we:

Predict the opponent's hidden information by online expectation maximisation
Update the sequence predictors along the path associated with the predicted hidden information and public actions
Update our strategy with the reward from the actual game as well as the rewards from a number of simulated games
27 / 31
Summary
Maximise our rewards when an opponent’s actions can affect them
Use games to focus on the agent, worry less about the environment
Approaches:
1 Reinforcement learning + sequence prediction + lookahead
2 Sequence prediction + online EM + no-regret + simulation
28 / 31
References I
[1] Backgammon Programming. http://www.bkgm.com/rgb/rgb.cgi?view+782. Accessed: 10/10/2013.
[2] Chinook vs. the Checkers Champ - Top 10 Man-vs.-Machine Moments - TIME. http://content.time.com/time/specials/packages/article/0,28804,2049187_2049195_2049286,00.html. Accessed: 10/10/2013.
[3] Scrabble Showdown: Quackle vs. David Boys - Top 10 Man-vs.-Machine Moments - TIME. http://content.time.com/time/specials/packages/article/0,28804,2049187_2049195_2049083,00.html. Accessed: 10/10/2013.
[4] IBM100 - Deep Blue. http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/deepblue/. Accessed: 10/10/2013.
[5] Othello match of the year. https://skatgame.net/mburo/event.html. Accessed: 10/10/2013.
[6] CrazyStone at Sensei’s Library. http://senseis.xmp.net/?CrazyStone. Accessed: 10/10/2013.
[7] Man vs Machine II - Polaris vs Online Poker's Best. http://www.poker-academy.com/man-machine/2008/. Accessed: 10/10/2013.
[8] IBM Watson. https://www-03.ibm.com/innovation/us/watson/. Accessed: 10/10/2013.
[9] Richard Mealing and Jonathan L. Shapiro. "Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games". In: 12th International Conference on Artificial Intelligence and Soft Computing. 2013.
[10] Michael L. Littman. "Markov games as a framework for multi-agent reinforcement learning". In: Proc. of 11th ICML. Morgan Kaufmann, 1994, pp. 157–163.
[11] C. J. C. H. Watkins. “Learning from delayed rewards”. PhD thesis. Cambridge, 1989.
[12] Jensen et al. "Non-stationary policy learning in 2-player zero sum games". In: Proc. of 20th Int. Conf. on AI. 2005, pp. 789–794.
29 / 31
References II
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin. "Maximum Likelihood from Incomplete Data via the EM Algorithm". In: Journal of the Royal Statistical Society 39 (1977), pp. 1–38.
[14] Olivier Cappe and Eric Moulines. "Online EM Algorithm for Latent Data Models". In: Journal of the Royal Statistical Society 71 (2008), pp. 593–613.
[15] Marc Lanctot et al. "No-Regret Learning in Extensive-Form Games with Imperfect Recall". In: Proceedings of the 29th International Conference on Machine Learning (ICML-12). 2012.
[16] Jiefu Shi and Michael L. Littman. "Abstraction Methods for Game Theoretic Poker". In: Revised Papers from the Second International Conference on Computers and Games. 2000.
[17] Martin Zinkevich et al. "Regret Minimization in Games with Incomplete Information". In: Advances in Neural Information Processing Systems 20. 2008.
[18] G.W. Brown. "Activity Analysis of Production and Allocation". In: ed. by T. J. Koopmans. New York: Wiley, 1951. Chap. Iterative Solutions of Games by Fictitious Play.
[19] Carmel and Markovitch. "Learning Models of Intelligent Agents". In: Proc. of 13th Int. Conf. on AI. AAAI, 1996, pp. 62–67.
[20] John M Butterworth. "Stability of gradient-based learning dynamics in two-agent imperfect-information games". PhD thesis. The University of Manchester, 2010.
[21] Knoll and de Freitas. “A Machine Learning Perspective on Predictive Coding with PAQ”. arXiv:1108.3298. 2011.
30 / 31
Appendix: Future Work
Change detection methods to discard outdated observations
Use the opponent model more when it is more accurate
More challenging domains e.g. n-player, continuous values
Real-world applications e.g. peer-to-peer file sharing
Use implicit as well as explicit opponent modelling
31 / 31
Appendix: Potential Applications
Learning conditional and adaptive strategies
Adapting to user interaction
Adjusting the workload or relocating the system resources
Responding to network traffic (p2p, spam filtering, virus detection)
Overlapping areas: speech recognition/synthesis/tagging, musical score, machine translation, gene prediction, DNA/protein sequence classification/identification, bioinformatics, handwriting, gesture recognition, partial discharges, cryptanalysis, protein folding, metamorphic virus detection, statistical process control, robotic teams, distributed control, resource management, collaborative decision support systems, economics, industrial manufacturing, complex simulations, combinatorial search, etc.
31 / 31
Appendix: What has been tried before?
Fictitious play assumes a Markov model opponent strategy [18]
Unsupervised L* infers deterministic finite automata models [19]
ELPH defeated human and agent players in rock-paper-scissors [12]
Stochastic gradient ascent with the lagging anchor algorithm [20]
PAQ8L defeated human players in rock-paper-scissors [21]
31 / 31
Appendix: Counterfactual Regret Minimisation
Counterfactual Value:
$v_i(I \mid \sigma) = \sum_{n \in I} \Pr(n \mid \sigma_{-i})\, u_i(n)$

$u_i(n) = \frac{1}{\Pr(n \mid \sigma)} \sum_{z \in Z[n]} \Pr(z \mid \sigma)\, u_i(z)$

$v_i(I \mid \sigma)$ = player i's counterfactual value of information set I given strategy profile $\sigma$
$\Pr(n \mid \sigma_{-i})$ = probability of reaching node n from the root given the opponent's strategy
$u_i(n)$ = player i's expected reward at node n
$\Pr(n \mid \sigma)$ = probability of reaching node n from the root given all players' strategies
$Z[n]$ = set of terminal nodes that can be reached from node n
$u_i(z)$ = player i's reward at terminal node z
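For illustration, the counterfactual value above can be computed from per-node quantities as in this short sketch (the node representation is an assumption for illustration):

```python
def counterfactual_value(info_set_nodes):
    """v_i(I|sigma) = sum over nodes n in I of Pr(n|sigma_{-i}) * u_i(n).

    Each node is a dict with:
      'opp_reach' : Pr(n | sigma_{-i}), reach probability due to the opponent (and chance)
      'reach'     : Pr(n | sigma), reach probability due to all players
      'terminals' : list of (Pr(z | sigma), u_i(z)) pairs for terminal nodes z in Z[n]
    """
    value = 0.0
    for node in info_set_nodes:
        if node['reach'] == 0:
            continue  # node cannot be reached under sigma, so it contributes nothing
        # u_i(n): expected reward at n, renormalised by the probability of reaching n
        u_n = sum(p_z * u_z for p_z, u_z in node['terminals']) / node['reach']
        value += node['opp_reach'] * u_n
    return value
```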
31 / 31
Appendix: Counterfactual Regret Minimisation
Counterfactual Regret:
$r_i(I, a) = v_i(I \mid \sigma_{I \to a}) - v_i(I \mid \sigma)$

$r_i(I, a)$ = player i's counterfactual regret of not playing action a at information set I
$\sigma_{I \to a}$ = same as $\sigma$ except a is always played at I
$v_i(I \mid \sigma_{I \to a})$ = player i's counterfactual value of playing action a at information set I
$v_i(I \mid \sigma)$ = player i's counterfactual value of playing their strategy at information set I
31 / 31
Appendix: Counterfactual Regret Minimisation
Sampled Counterfactual Value:
$v_i(I \mid \sigma, Q_j) = \sum_{n \in I} \Pr(n \mid \sigma_{-i})\, u_i(n \mid Q_j)$

$u_i(n \mid Q_j) = \frac{1}{\Pr(n \mid \sigma)} \sum_{z \in Q_j \cap Z[n]} \frac{1}{q(z)} \Pr(z \mid \sigma)\, u_i(z)$

$q(z) = \sum_{j : z \in Q_j} q_j$

$v_i(I \mid \sigma, Q_j)$ = player i's sampled counterfactual value of I given strategy profile $\sigma$ and $Q_j$
$Q_j$ = set of sampled terminal nodes
$\Pr(n \mid \sigma_{-i})$ = probability of reaching node n from the root given the opponent's strategy
$u_i(n \mid Q_j)$ = player i's sampled expected reward at node n given $Q_j$
$\Pr(n \mid \sigma)$ = probability of reaching node n from the root given all players' strategies
$Z[n]$ = set of terminal nodes that can be reached from node n
$u_i(z)$ = player i's reward at terminal node z
$q_j$ = probability of sampling $Q_j$
31 / 31
Appendix: Counterfactual Regret Minimisation
Outcome Sampling ($|Q_j| = 1$ and $q_j = q(z)$):

$v_i(I_x \mid \sigma, Q_j) = \sum_{n \in I_x} \Pr(n \mid \sigma_{-i}) \frac{1}{\Pr(n \mid \sigma)} \sum_{z \in Q_j \cap Z[n]} \frac{1}{q(z)} \Pr(z \mid \sigma)\, u_i(z)$

$= \dfrac{\Pr(n \mid \sigma_{-i}) \Pr(z \mid \sigma)\, u_i(z)}{\Pr(n \mid \sigma)\, q(z)}$

$= \dfrac{\Pr(n \mid \sigma_{-i}) \Pr(z \mid \sigma)\, u_i(z)}{\Pr(n \mid \sigma) \Pr(z \mid \sigma')}$

$= \dfrac{\Pr(n \mid \sigma_{-i}) \Pr(z \mid \sigma_i) \Pr(z \mid \sigma_{-i})\, u_i(z)}{\Pr(n \mid \sigma_i) \Pr(n \mid \sigma_{-i}) \Pr(z \mid \sigma'_i) \Pr(z \mid \sigma'_{-i})}$

$= \dfrac{\Pr(z \mid \sigma_i) \Pr(z \mid \sigma_{-i})\, u_i(z)}{\Pr(n \mid \sigma_i) \Pr(z \mid \sigma'_i) \Pr(z \mid \sigma'_{-i})}$

$= \dfrac{\Pr(z \mid \sigma_i)\, u_i(z)}{\Pr(n \mid \sigma_i) \Pr(z \mid \sigma'_i)}$  assuming $\sigma'_{-i} \approx \sigma_{-i}$
31 / 31
Appendix: Zero-Determinant Strategies
Unilaterally set an opponent's expected payoff in the iterated prisoner's dilemma irrespective of the opponent's strategy
Turns the prisoner’s dilemma into an ultimatum game
Works well against evolutionary players without an opponent model
An opponent model could recognise the unfair offer and refuse
31 / 31