Learning an Effective Strategy in a Multi-Agent System with Hidden Information
Richard Mealing
Supervisor: Jon Shapiro
Machine Learning and Optimisation Group
School of Computer Science
University of Manchester
1 / 31
Our Problem: Maximising Reward with an Opponent
We focus on the simplest case with just 2 agents
Each agent is trying to maximise its own rewards
But each agent's actions can affect the other agent's rewards
2 / 31
Our Proposal: Predict and Adapt to the Future
Before maximising our rewards we learn:
What our rewards are for actions - use reinforcement/no-regret learning
How the opponent will act - use sequence prediction methods
To maximise our rewards:
Lookahead - take the actions with the maximum expected reward
Simulate - adapt our strategy to rewards against the opponent model
Hidden information - what did the opponent base their decision on?
Learn the hidden information using online expectation maximisation
3 / 31
Why Games?
Games let you focus on the agent and worry less about the environment
Well-defined rules and clear goals
Can allow easy agent comparisons
Can allow complex strategies
Game theory gives a foundation
4 / 31
Artificial Intelligence Success in Games
Year  Game               Success
1979  Backgammon         BKG 9.8 beat world champion Luigi Villa [1]
1994  Checkers           Chinook beat world champion Marion Tinsley [2]
1995  Scrabble           Quackle beat former champion David Boys [3]
1997  Chess              Deep Blue beat world champion Garry Kasparov [4]
1997  Othello (Reversi)  Logistello beat world champion Takeshi Murakami [5]
2006  Go                 Crazy Stone beat various pros [6]
2008  Poker              Polaris beat various pros in heads-up limit Texas hold'em [7]
2011  Jeopardy!          Watson beat former winners Brad Rutter and Ken Jennings [8]
5 / 31
Perfect and Imperfect Information
Perfect information - players always know the state, e.g.
Tic Tac Toe, Checkers
Imperfect information - at some point a player doesn't know the state, e.g.
Rock Paper Scissors, Poker
6 / 31
First Approach
1 Reinforcement learning (Q-Learning) to learn our own rewards
2 Sequence prediction to learn the opponent’s strategy
3 Exhaustive explicit lookahead (to a limited depth) with 1 and 2 to take the actions with the maximum expected reward
Outperforms state-of-the-art reinforcement learning agents¹ in:
Rock Paper Scissors, Prisoner's Dilemma, Littman's Soccer [10]
¹Richard Mealing and Jonathan L. Shapiro. "Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games". In: 12th International Conference on Artificial Intelligence and Soft Computing. 2013.
7 / 31
Reinforcement Learning
We use Q(uality)-Learning to learn the rewards for action sequences
Comparison agents use Q-Learning or Q-Learning based methods
Q-Learning learns the expected value of taking an action in a state and then following a fixed strategy [11]

$Q(s^t, a^t_{pla}) \leftarrow (1 - \alpha)\,Q(s^t, a^t_{pla}) + \alpha\left[r^t + \gamma \max_{a^{t+1}_{pla}} Q(s^{t+1}, a^{t+1}_{pla})\right]$

$s^t$ = state at time t
$\alpha$ = learning rate
$\gamma$ = discount factor
$a^t_{pla}$ = player's action at time t
$r^t$ = reward at time t

We use $Q(s^t, a^t_{pla})$ with lookahead and some exploration
Comparison agents select $\arg\max_{a^t_{pla}} Q(s^t, a^t_{pla})$ with some exploration
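As a concrete illustration, a tabular Q-Learning update might look like the following minimal Python sketch (the state encoding, the ε-greedy exploration scheme, and the reward signal are illustrative assumptions, not the presentation's implementation):

```python
import random
from collections import defaultdict

class QLearner:
    """Minimal tabular Q-Learning sketch (illustrative only)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)            # Q[(state, action)] -> expected value
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def update(self, state, action, reward, next_state):
        # Q(s,a) <- (1 - alpha) Q(s,a) + alpha [ r + gamma * max_a' Q(s',a') ]
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        self.q[(state, action)] = ((1 - self.alpha) * self.q[(state, action)]
                                   + self.alpha * (reward + self.gamma * best_next))

    def select_action(self, state):
        # epsilon-greedy: act greedily on Q with some exploration
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])
```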
8 / 31
Sequence Prediction
Markov model - the probability of the opponent's action $a^t_{opp}$ depends only on the current state $s^t$

$\Pr(a^t_{opp} \mid s^t)$

Sequence prediction - the probability of the opponent's action depends on a history H

$\Pr(a^t_{opp} \mid H)$ where $H \subseteq \{s^t, a^{t-1}, s^{t-1}, a^{t-2}, s^{t-2}, \ldots, a^1, s^1\}$
9 / 31
Sequence Prediction Methods
Long-term memory L - a set of distributions, each one conditioned on a different history H

$L = \{\Pr(a^t_{opp} \mid H) : H \subseteq \{s^t, a^{t-1}, s^{t-1}, a^{t-2}, s^{t-2}, \ldots, a^1, s^1\}\}$

Short-term memory S - a list of recent observations (states/actions)

$S = (o^t, o^{t-1}, o^{t-2}, \ldots, o^{t-n})$

Observing a symbol $o^t$
1 Generate a set of histories $\mathcal{H} = \{H_1, H_2, \ldots\}$ using S
2 For each $H \in \mathcal{H}$ create/update $\Pr(a^t_{opp} \mid H)$ using $o^t$
3 Add $o^t$ to S (remove the oldest observation if needed)

Predicting an opponent action $a^t_{opp}$
1 Generate a set of histories $\mathcal{H} = \{H_1, H_2, \ldots\}$ using S
2 Predict using $\{\Pr(a^t_{opp} \mid H) : H \in \mathcal{H}\}$
10 / 31
Sequence Prediction Method Example
Entropy Learned Pruned Hypothesis Space [12]:
Inputs: memory size n and entropy threshold $0 \le e \le 1$

Observing a symbol $o^t$
1 Generate the powerset $\mathcal{P}(S) = \mathcal{H}$ of short-term memory S
  $S = (o^t, o^{t-1}, o^{t-2}, \ldots, o^{t-n})$
  $\mathcal{P}(S) = \{\{\}, \{o_1\}, \ldots, \{o_n\}, \{o_1, o_2\}, \ldots, \{o_1, o_n\}, \ldots, \{o_1, o_2, \ldots, o_n\}\}$
2 For each $H \in \mathcal{H}$ create/update $\Pr(a^t_{opp} \mid H)$ using $o^t$
3 For each $H \in \mathcal{H}$ if $\mathrm{Entropy}(\Pr(a^t_{opp} \mid H)) > e$ then discard it
4 Add $o^t$ to S (remove the oldest observation if $|S| > n$)

Predicting an opponent action $a^t_{opp}$
1 Generate the powerset $\mathcal{P}(S) = \mathcal{H}$ of short-term memory S
2 Predict using $\arg\min_{\Pr(a^t_{opp} \mid H)} \mathrm{Entropy}(\Pr(a^t_{opp} \mid H))$ over all $H \in \mathcal{H}$
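To make the observe/predict cycle concrete, here is a minimal ELPH-style predictor sketched in Python (the data structures, plain Shannon entropy, and pruning details are assumptions based on the description above, not the original implementation from [12]):

```python
from collections import Counter, deque
from itertools import combinations
from math import log2

class ELPH:
    """Sketch of Entropy Learned Pruned Hypothesis space prediction."""

    def __init__(self, memory_size, entropy_threshold):
        self.e = entropy_threshold
        self.short_term = deque(maxlen=memory_size)   # recent observations
        self.long_term = {}                           # history (subset) -> Counter of next symbols

    def _hypotheses(self):
        # every non-empty subset of the short-term memory
        items = tuple(self.short_term)
        for r in range(1, len(items) + 1):
            yield from combinations(items, r)

    @staticmethod
    def _entropy(counter):
        total = sum(counter.values())
        return -sum((c / total) * log2(c / total) for c in counter.values())

    def observe(self, symbol):
        # create/update each hypothesis with the observed symbol, prune high-entropy ones
        for h in self._hypotheses():
            dist = self.long_term.setdefault(h, Counter())
            dist[symbol] += 1
            if self._entropy(dist) > self.e:
                del self.long_term[h]
        self.short_term.append(symbol)

    def predict(self):
        # predict from the lowest-entropy hypothesis consistent with short-term memory
        candidates = [self.long_term[h] for h in self._hypotheses() if h in self.long_term]
        if not candidates:
            return None
        return min(candidates, key=self._entropy).most_common(1)[0][0]
```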
11 / 31
Lookahead Example
      D     C
D    1,1   4,0
C    0,4   3,3

(Prisoner's Dilemma payoff matrix: first number is the row player's reward, second is the column player's; D = Defect, C = Cooperate)
12 / 31
Lookahead Example
Defect is the dominant action (highest reward)
Cooperate-Cooperate is socially optimal (highest sum of rewards)
Tit-for-tat (copy opponent’s last move) is good for repeated play
Can we learn to play optimally against tit-for-tat?
13 / 31
Lookahead Example
[Figure: one-step lookahead tree over the payoff matrix above. If the opponent is predicted to play C, playing D yields 4 and C yields 3; if the opponent is predicted to play D, playing D yields 1 and C yields 0.]
14 / 31
Lookahead Example
[Figure: the same one-step lookahead tree as the previous slide.]
With a lookahead of 1, D has the highest reward
With a lookahead of 2, the sequence (D,C,D,C) has the highest total reward (but is unlikely to occur)
Assume the opponent copies the player’s last move (i.e. tit-for-tat)
15 / 31
Lookahead Example
[Figure: two-step lookahead tree against tit-for-tat. If the opponent is predicted to cooperate first: playing D (reward 4) leads to a predicted D, after which D totals 5 and C totals 4; playing C (reward 3) leads to a predicted C, after which D totals 7 and C totals 6. If the opponent is predicted to defect first: playing D (reward 1) leads to a predicted D, after which D totals 2 and C totals 1; playing C (reward 0) leads to a predicted C, after which D totals 4 and C totals 3.]
16 / 31
Lookahead Example
[Figure: the same two-step lookahead tree as the previous slide.]
With a lookahead of 2 against tit-for-tat, C has the highest expected total reward
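A minimal sketch of this kind of depth-limited lookahead, in Python, might look as follows (the opponent-model, reward, and transition interfaces are illustrative assumptions, not the thesis implementation):

```python
def lookahead(state, depth, reward, opponent_model, transition, actions):
    """Return (best_action, best_total_reward) by exhaustive lookahead to `depth`.

    reward(state, our_action, opp_action)     -> immediate reward (e.g. learned Q-values)
    opponent_model(state)                     -> predicted opponent action (sequence prediction)
    transition(state, our_action, opp_action) -> next state
    """
    predicted_opp = opponent_model(state)
    best_action, best_value = None, float("-inf")
    for action in actions:
        value = reward(state, action, predicted_opp)
        if depth > 1:
            next_state = transition(state, action, predicted_opp)
            _, future = lookahead(next_state, depth - 1, reward,
                                  opponent_model, transition, actions)
            value += future
        if value > best_value:
            best_action, best_value = action, value
    return best_action, best_value
```

Against a tit-for-tat opponent model in the prisoner's dilemma, a depth of 2 or more makes the longer-run value of cooperating visible, which a depth of 1 cannot.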
17 / 31
Results of First Approach
Converges to higher average payoffs per game at faster rates than reinforcement learning algorithms, such that in...

Iterated rock-paper-scissors: learns to best-respond against variable-Markov models
Iterated prisoner's dilemma: comes first in a tournament against finite automata
Littman's soccer: wins 70% of games against reinforcement learning algorithms
18 / 31
Summary of First Approach
We associate a sequence predictor with each game state
During a game we update our:
Rewards for action sequences using Q-Learning
Sequence predictors with observed opponent actions
At each decision point we look ahead and take the first action of an action sequence with the maximum expected cumulative reward
19 / 31
Second Approach
1 Sequence prediction to learn the opponent’s strategy
2 Online expectation maximisation [13, 14] to predict the opponent's hidden information (to know H to update our opponent model)
3 No-regret learning algorithm to adjust our strategy
4 Simulate games against our opponent model
Improves no-regret algorithm performance vs itself, a state-of-the-art reinforcement learning agent and a popular bandit algorithm in:
Die-Roll Poker [15], Rhode Island Hold'em [16]
20 / 31
Online Expectation Maximisation
A rational agent will act based on its hidden information
At the end of a game, we have observed the opponent's (public) actions but not necessarily their hidden information (e.g. they folded)
Expectation step:
1 For each possible instance of hidden information the opponent could hold, calculate the probability of their actions
2 Normalise these probabilities
Each normalised probability corresponds to the expected number of opponent visits to the path associated with that hidden information

Maximisation step: update the opponent's action probabilities along each path to account for their expected number of visits
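A minimal sketch of one such online E/M update for a single hand, in Python, could look as follows (the opponent-model representation, the uniform treatment of unseen decision points, and the fractional visit counts are assumptions for illustration, not the thesis code):

```python
def online_em_update(opponent_model, hidden_candidates, observed_actions):
    """One E/M step after a game in which the opponent's hidden information was not revealed.

    opponent_model[(hidden, decision_point)] -> dict mapping action -> visit count
    hidden_candidates -> hidden information (e.g. cards) the opponent could have held
    observed_actions  -> list of (decision_point, action) pairs the opponent actually played
    """
    # Expectation: likelihood of the observed action sequence under each hidden candidate
    likelihoods = {}
    for hidden in hidden_candidates:
        p = 1.0
        for decision_point, action in observed_actions:
            counts = opponent_model.get((hidden, decision_point), {})
            total = sum(counts.values())
            # unseen decision points contribute no evidence (treated as probability 1)
            p *= counts.get(action, 0) / total if total > 0 else 1.0
        likelihoods[hidden] = p

    norm = sum(likelihoods.values())
    responsibilities = {h: (p / norm if norm > 0 else 1.0 / len(likelihoods))
                        for h, p in likelihoods.items()}

    # Maximisation: credit each path with its expected (fractional) number of visits
    for hidden, weight in responsibilities.items():
        for decision_point, action in observed_actions:
            counts = opponent_model.setdefault((hidden, decision_point), {})
            counts[action] = counts.get(action, 0.0) + weight
    return responsibilities
```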
21 / 31
Online Expectation Maximisation
[Figure: game tree for a simplified two-card poker game in which each player is dealt a Jack or a King with probability 0.5, showing information sets I1-I8, the modelled probability of each action at each set, and player 1's payoff at each terminal node.]
J = Jack, K = King, F = Fold, C = Call, R = Raise
22 / 31
Online Expectation Maximisation
[Figure: the same game tree as the previous slide.]
Assume we are P1 and we got a Jack; opponent P2 got either a Jack or a King
23 / 31
Online Expectation Maximisation
[Figure: the same game tree as the previous slides. In it, the modelled opponent folds to a raise with probability 1.0 when holding a Jack (I4) and with probability 0.0 when holding a King (I6).]
$\Pr((J, J, R, F) \mid \sigma_{-i}) = 1$ and $\Pr((J, K, R, F) \mid \sigma_{-i}) = 0$

Update visits to $(J, J, R, F)$ by 1 and to $(J, K, R, F)$ by 0
24 / 31
No-Regret Learning
Our no-regret method is based on counterfactual regret minimisation
State-of-the-art algorithm that provably minimises regret in two-player, zero-sum, imperfect information games [17]
In self-play its average strategy profile approaches a Nash equilibrium
Can handle games with $10^{12}$ states ($10^{10}$ states was the previous limit, using Nesterov's excessive gap technique; limit poker has $10^{18}$ states)
Needs the opponent's strategy; we use an online version that removes this requirement
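At each information set, counterfactual regret minimisation chooses its current strategy by regret matching over the accumulated counterfactual regrets. A minimal sketch of that per-information-set update in Python (a simplified illustration, not the online variant used here):

```python
def regret_matching_strategy(cumulative_regret):
    """Current strategy at an information set, proportional to positive cumulative regret."""
    positive = {a: max(r, 0.0) for a, r in cumulative_regret.items()}
    total = sum(positive.values())
    if total > 0:
        return {a: r / total for a, r in positive.items()}
    return {a: 1.0 / len(cumulative_regret) for a in cumulative_regret}  # uniform fallback

def accumulate_regret(cumulative_regret, counterfactual_values, strategy):
    """Add this iteration's regrets r_i(I,a) = v_i(I|sigma_{I->a}) - v_i(I|sigma)."""
    baseline = sum(strategy[a] * counterfactual_values[a] for a in strategy)
    for a in cumulative_regret:
        cumulative_regret[a] += counterfactual_values[a] - baseline
```

The counterfactual value and regret definitions used here are given in the appendix.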
25 / 31
Results of Second Approach
Has higher average payoffs per game and a higher final performance than the no-regret algorithm on its own, such that in...

Die-roll poker and Rhode Island hold'em: learns to win against all opponents (except near-Nash opponents, against which it draws)

But online expectation maximisation seems less effective in Rhode Island hold'em compared to die-roll poker - investigating why
26 / 31
Summary of Second Approach
We associate a sequence predictor with each game state from the opponent's perspective (opponent information set)

At the end of a game we:

Predict the opponent's hidden information by online expectation maximisation
Update the sequence predictors along the path associated with the predicted hidden information and public actions
Update our strategy with the reward from the actual game as well as the rewards from a number of simulated games
27 / 31
Summary
Maximise our rewards when an opponent’s actions can affect them
Use games to focus on the agent, worry less about the environment
Approaches:
1 Reinforcement learning + sequence prediction + lookahead
2 Sequence prediction + online EM + no-regret + simulation
28 / 31
References I
[1] Backgammon Programming. http://www.bkgm.com/rgb/rgb.cgi?view+782. Accessed: 10/10/2013.
[2] Chinook vs. the Checkers Champ - Top 10 Man-vs.-Machine Moments - TIME. http://content.time.com/time/specials/packages/article/0,28804,2049187_2049195_2049286,00.html. Accessed: 10/10/2013.
[3] Scrabble Showdown: Quackle vs. David Boys - Top 10 Man-vs.-Machine Moments - TIME. http://content.time.com/time/specials/packages/article/0,28804,2049187_2049195_2049083,00.html. Accessed: 10/10/2013.
[4] IBM100 - Deep Blue. http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/deepblue/. Accessed: 10/10/2013.
[5] Othello match of the year. https://skatgame.net/mburo/event.html. Accessed: 10/10/2013.
[6] CrazyStone at Sensei’s Library. http://senseis.xmp.net/?CrazyStone. Accessed: 10/10/2013.
[7] Man vs Machine II - Polaris vs Online Poker's Best. http://www.poker-academy.com/man-machine/2008/. Accessed: 10/10/2013.
[8] IBM Watson. https://www-03.ibm.com/innovation/us/watson/. Accessed: 10/10/2013.
[9] Richard Mealing and Jonathan L. Shapiro. "Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games". In: 12th International Conference on Artificial Intelligence and Soft Computing. 2013.
[10] Michael L. Littman. "Markov games as a framework for multi-agent reinforcement learning". In: Proc. of 11th ICML. Morgan Kaufmann, 1994, pp. 157–163.
[11] C. J. C. H. Watkins. “Learning from delayed rewards”. PhD thesis. Cambridge, 1989.
[12] Jensen et al. "Non-stationary policy learning in 2-player zero sum games". In: Proc. of 20th Int. Conf. on AI. 2005, pp. 789–794.
29 / 31
References II
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin. "Maximum Likelihood from Incomplete Data via the EM Algorithm". In: Journal of the Royal Statistical Society 39 (1977), pp. 1–38.
[14] Olivier Cappe and Eric Moulines. "Online EM Algorithm for Latent Data Models". In: Journal of the Royal Statistical Society 71 (2008), pp. 593–613.
[15] Marc Lanctot et al. "No-Regret Learning in Extensive-Form Games with Imperfect Recall". In: Proceedings of the 29th International Conference on Machine Learning (ICML-12). 2012.
[16] Jiefu Shi and Michael L. Littman. "Abstraction Methods for Game Theoretic Poker". In: Revised Papers from the Second International Conference on Computers and Games. 2000.
[17] Martin Zinkevich et al. "Regret Minimization in Games with Incomplete Information". In: Advances in Neural Information Processing Systems 20. 2008.
[18] G.W. Brown. "Activity Analysis of Production and Allocation". In: ed. by T. J. Koopmans. New York: Wiley, 1951. Chap. Iterative Solutions of Games by Fictitious Play.
[19] Carmel and Markovitch. "Learning Models of Intelligent Agents". In: Proc. of 13th Int. Conf. on AI. AAAI, 1996, pp. 62–67.
[20] John M Butterworth. "Stability of gradient-based learning dynamics in two-agent imperfect-information games". PhD thesis. The University of Manchester, 2010.
[21] Knoll and de Freitas. “A Machine Learning Perspective on Predictive Coding with PAQ”. arXiv:1108.3298. 2011.
30 / 31
Appendix: Future Work
Change detection methods to discard outdated observations
Use the opponent model more when it is more accurate
More challenging domains e.g. n-player, continuous values
Real-world applications e.g. peer-to-peer file sharing
Use implicit as well as explicit opponent modelling
31 / 31
Appendix: Potential Applications
Learning conditional and adaptive strategies
Adapting to user interaction
Adjusting the workload or relocating the system resources
Responding to network traffic (p2p, spam filtering, virus detection)
Overlapping areas: speech recognition/synthesis/tagging, musical score, machine translation, gene prediction, DNA/protein sequence classification/identification, bioinformatics, handwriting, gesture recognition, partial discharges, cryptanalysis, protein folding, metamorphic virus detection, statistical process control, robotic teams, distributed control, resource management, collaborative decision support systems, economics, industrial manufacturing, complex simulations, combinatorial search, etc.
31 / 31
Appendix: What has been tried before?
Fictitious play assumes a Markov model opponent strategy [18]
Unsupervised L* infers deterministic finite automata models [19]
ELPH defeated human and agent players in rock-paper-scissors [12]
Stochastic gradient ascent with the lagging anchor algorithm [20]
PAQ8L defeated human players in rock-paper-scissors [21]
31 / 31
Appendix: Counterfactual Regret Minimisation
Counterfactual Value:
$v_i(I \mid \sigma) = \sum_{n \in I} \Pr(n \mid \sigma_{-i})\, u_i(n)$

$u_i(n) = \frac{1}{\Pr(n \mid \sigma)} \sum_{z \in Z[n]} \Pr(z \mid \sigma)\, u_i(z)$

$v_i(I \mid \sigma)$ = player i's counterfactual value of information set I given strategy profile $\sigma$
$\Pr(n \mid \sigma_{-i})$ = probability of reaching node n from the root given the opponent's strategy
$u_i(n)$ = player i's expected reward at node n
$\Pr(n \mid \sigma)$ = probability of reaching node n from the root given all players' strategies
$Z[n]$ = set of terminal nodes that can be reached from node n
$u_i(z)$ = player i's reward at terminal node z
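For illustration, the counterfactual value above can be computed from per-node quantities as in this short sketch (the node representation is an assumption for illustration):

```python
def counterfactual_value(info_set_nodes):
    """v_i(I|sigma) = sum over nodes n in I of Pr(n|sigma_{-i}) * u_i(n).

    Each node is a dict with:
      'opp_reach' : Pr(n | sigma_{-i}), reach probability due to the opponent (and chance)
      'reach'     : Pr(n | sigma), reach probability due to all players
      'terminals' : list of (Pr(z | sigma), u_i(z)) pairs for terminal nodes z in Z[n]
    """
    value = 0.0
    for node in info_set_nodes:
        if node['reach'] == 0:
            continue  # node cannot be reached under sigma, so it contributes nothing
        # u_i(n): expected reward at n, renormalised by the probability of reaching n
        u_n = sum(p_z * u_z for p_z, u_z in node['terminals']) / node['reach']
        value += node['opp_reach'] * u_n
    return value
```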
31 / 31
Appendix: Counterfactual Regret Minimisation
Counterfactual Regret:
$r_i(I, a) = v_i(I \mid \sigma_{I \to a}) - v_i(I \mid \sigma)$

$r_i(I, a)$ = player i's counterfactual regret of not playing action a at information set I
$\sigma_{I \to a}$ = same as $\sigma$ except a is always played at I
$v_i(I \mid \sigma_{I \to a})$ = player i's counterfactual value of playing action a at information set I
$v_i(I \mid \sigma)$ = player i's counterfactual value of playing their strategy at information set I
31 / 31
Appendix: Counterfactual Regret Minimisation
Sampled Counterfactual Value:
$v_i(I \mid \sigma, Q_j) = \sum_{n \in I} \Pr(n \mid \sigma_{-i})\, u_i(n \mid Q_j)$

$u_i(n \mid Q_j) = \frac{1}{\Pr(n \mid \sigma)} \sum_{z \in Q_j \cap Z[n]} \frac{1}{q(z)} \Pr(z \mid \sigma)\, u_i(z)$

$q(z) = \sum_{j : z \in Q_j} q_j$

$v_i(I \mid \sigma, Q_j)$ = player i's sampled counterfactual value of I given strategy profile $\sigma$ and $Q_j$
$Q_j$ = set of sampled terminal nodes
$\Pr(n \mid \sigma_{-i})$ = probability of reaching node n from the root given the opponent's strategy
$u_i(n \mid Q_j)$ = player i's sampled expected reward at node n given $Q_j$
$\Pr(n \mid \sigma)$ = probability of reaching node n from the root given all players' strategies
$Z[n]$ = set of terminal nodes that can be reached from node n
$u_i(z)$ = player i's reward at terminal node z
$q_j$ = probability of sampling $Q_j$
31 / 31
Appendix: Counterfactual Regret Minimisation
Outcome Sampling ($|Q_j| = 1$ and $q_j = q(z)$):

$v_i(I_x \mid \sigma, Q_j) = \sum_{n \in I_x} \Pr(n \mid \sigma_{-i}) \frac{1}{\Pr(n \mid \sigma)} \sum_{z \in Q_j \cap Z[n]} \frac{1}{q(z)} \Pr(z \mid \sigma)\, u_i(z)$

$= \dfrac{\Pr(n \mid \sigma_{-i}) \Pr(z \mid \sigma)\, u_i(z)}{\Pr(n \mid \sigma)\, q(z)}$

$= \dfrac{\Pr(n \mid \sigma_{-i}) \Pr(z \mid \sigma)\, u_i(z)}{\Pr(n \mid \sigma) \Pr(z \mid \sigma')}$

$= \dfrac{\Pr(n \mid \sigma_{-i}) \Pr(z \mid \sigma_i) \Pr(z \mid \sigma_{-i})\, u_i(z)}{\Pr(n \mid \sigma_i) \Pr(n \mid \sigma_{-i}) \Pr(z \mid \sigma'_i) \Pr(z \mid \sigma'_{-i})}$

$= \dfrac{\Pr(z \mid \sigma_i) \Pr(z \mid \sigma_{-i})\, u_i(z)}{\Pr(n \mid \sigma_i) \Pr(z \mid \sigma'_i) \Pr(z \mid \sigma'_{-i})}$

$= \dfrac{\Pr(z \mid \sigma_i)\, u_i(z)}{\Pr(n \mid \sigma_i) \Pr(z \mid \sigma'_i)}$  assuming $\sigma'_{-i} \approx \sigma_{-i}$
31 / 31
Appendix: Zero-Determinant Strategies
Unilaterally set an opponent's expected payoff in the iterated prisoner's dilemma irrespective of the opponent's strategy
Turns the prisoner’s dilemma into an ultimatum game
Works well against evolutionary players without an opponent model
An opponent model could recognise the unfair offer and refuse
31 / 31