
A Reinforcement Learning Agent for 1-Card Poker

Matthew R. Wahab
mwahab@cs.mcgill.ca

McGill University
Montreal, Canada

I declare that this work represents my own efforts, and that all text and code have been written by me (except as indicated).

Signature:

Abstract

Modeling and reasoning about an opponent in a competitive environment is a difficult task. This paper uses a reinforcement learning framework to build an adaptable agent for the game of 1-card poker. The resulting agent is evaluated against various opponents and is shown to be very competitive.

1 Introduction and Motivation

Modeling the preferences and biases of users is becoming an important research area. While acquiring a corpus of data on a user may be a simple task, reasoning about it and predicting future patterns is challenging. This type of information would be useful in many real-world domains such as predicting web purchases (e.g. music and movie recommendations), user commands (e.g.


code completion), and page/file requests (e.g. web browsing and operating systems). Gathering and processing user data in a competitive environment is an even more challenging task.

The game of poker has become a popular domain in artificial intelligence research for exploring problems of this nature [3]. This is because poker involves imperfect information (a player does not know the opponent's card(s)), so we do not have complete state observability. Moreover, the opponent may deliberately play deceptively, which makes predicting their behavior even more difficult.

In this paper, we present a reinforcement learning agent for the game of 1-card poker. While 1-card poker is significantly less complex than other variants, such as Texas Hold 'em, it still captures the nature of our problem: modeling and reasoning about an opponent in a competitive environment. The rest of the paper is organized as follows: Section 2 describes related work on poker and imperfect-information games, Section 3 explains the game of 1-card poker, Section 4 details our methodology, Section 5 discusses experimental findings, and Section 6 provides conclusions and future work.

2 Related Work

With the recent surge of interest in poker, more and more work has been done on creating a skilled computer player. Previous endeavors have included rule-based, simulation-based (decisions are determined by simulating the rest of the hand), and game-theoretic agents.

[8] applies the sequence representation of extensive-form games to one-card poker, which allows an optimal solution to be found efficiently. Previously, the game had to be converted to its normal form, which is of exponential size, before it could be solved. In [6] the optimal solution to 1-card poker is computed by solving a linear program generated from the sequence form of the game tree. While this technique generates a strategy that is optimal in the game-theoretic sense, it does not react to the way the opponent is playing and hence may not yield a maximal reward. However, it does generate a strategy that includes bluffing (betting even if we


have a low card) and slow-playing (not betting right away if we hold a high card). This gives us a benchmark with which to measure our reinforcement learning agent.

[2] computes a complete approximation of game-theoretic optimal strategies for full-scale poker. While their agent, PsOpti, performed very well, it still lacked opponent modeling and hence was taken advantage of by human players once its weaknesses were determined. This issue is specifically addressed in [1], where opponent modeling becomes a central component of the agent. This agent, Vexbot, uses stochastic game-tree search, where opponent modeling is used to calculate the expected value of hands, which is then propagated back up through the tree. Vexbot has been very successful, winning the gold medal at the 2003 Computer Olympiad. While we do not attempt to match the scale of their approach, one aim is to have sufficient adaptability to take advantage of opponent weaknesses.

Another partially observable game is hearts. In [5] and [4], a model-based POMDP-RL method with Monte Carlo state estimation is used to create an agent for this multi-agent card game. While all previously mentioned work has dealt only with two-person games, hearts is a four-person game, which makes the learning problem more difficult. They achieve very good results when the other players are stationary (non-learning); however, their success is limited when training multiple agents concurrently.

3 1-Card Poker

While 1-card poker is a simple game, it still has many properties inherent to more complex games, such as chance events and imperfect information. What differentiates it from other poker variants is the notion of hand potential. Instead of using a full deck (52 cards, 4 suits) and having an exponential number of hand possibilities, we use a single-suit deck of 13 cards (2 to Ace).

Here are the rules of 1-card poker for two players [6]: P1 and P2 each get one card and ante $1. P1 bets first, either $0 or $1. Then P2 gets a chance to match if P1 bet $1, or to raise if P1 bet $0. If P1 bet $0 and P2 raises, then P1 has a chance to call. Betting $0 when your opponent has already


Figure 1: Flow diagram for 1-card poker

bet $1 means you fold and lose your ante. If no one folds before the end of betting, we go to a showdown where the player with the highest card wins the pot. The amount won is either $1 or $2: the other player's ante plus their bet of $0 or $1. The possible outcomes are enumerated in Figure 1.
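
To make the payoff structure concrete, the following is a minimal sketch of the hand resolution implied by these rules and by Figure 1. Actions are encoded as 0 for a $0 bet and 1 for a $1 bet, matching the CALL/BET encoding used later in Section 4.3; the function name and signature are illustrative, not part of the original implementation.

def settle_hand(card1, card2, p1_first, p2_move, p1_second=None):
    # Return player 1's net winnings for one hand; player 2 wins the negative.
    showdown = 1 if card1 > card2 else -1   # sign of the showdown outcome
    if p1_first == 1:                       # P1 bet $1
        if p2_move == 0:                    # P2 folds and loses its $1 ante
            return 1
        return 2 * showdown                 # P2 matched: showdown for $2
    if p2_move == 0:                        # both bet $0: showdown for $1
        return showdown
    if p1_second == 0:                      # P2 raised, P1 folds its $1 ante
        return -1
    return 2 * showdown                     # P1 called the raise: showdown for $2

# Example: P1 holds a 12 and checks, P2 holds a 7 and raises, P1 calls and wins $2.
assert settle_hand(12, 7, p1_first=0, p2_move=1, p1_second=1) == 2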

4 Theory and Approach

4.1 Markov Decision Processes

A Markov Decision Process (MDP) [9] is a 4-tuple (S, A, T, R), where S is the set of states, A is the set of actions, T is a transition function S × A × S → [0, 1], and R is a reward function S × A → ℝ. The transition function defines a probability distribution over next states as a function of the current state and the agent's action. The reward function defines the reward received when choosing an action from the given state. Solving an MDP involves finding a policy π : S → A that maps states to actions so as to maximize the discounted future reward, with discount factor γ.
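
For reference, the optimal state-action values that Q-Learning (Section 4.2) estimates satisfy the standard Bellman optimality equation, written here in the notation above (it is not stated explicitly in this paper):

Q*(s, a) = R(s, a) + γ Σ_{s′} T(s, a, s′) max_{a′} Q*(s′, a′)

An optimal policy then simply takes π*(s) = argmax_a Q*(s, a).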


MDPs are the main focus of much of the current work in reinforcement learning [9]. The existence of a deterministic, stationary, and, most importantly, optimal policy is the result that drives this work.

4.2 Q-Learning

Q-Learning [11] (Algorithm 1) is a single-agent learning algorithm devised to find optimal policies in MDPs. It works by estimating the values of state-action pairs. The value Q(s, a) is defined to be the expected discounted sum of future payoffs obtained by taking action a from state s and following an optimal policy thereafter. Once these values have been learned, the optimal action from any state is the one with the highest Q-value.

This algorithm is guaranteed to converge to the correct Q-values with probability one if the environment is stationary and the next state depends only on the current state and the action taken in it. Although originally designed for single-agent environments, Q-Learning has been used successfully in several multi-agent settings [7]. When the other players' strategies are stationary, we are left with an MDP, in which the algorithm will learn an optimal policy. Although Q-Learning is not theoretically guaranteed to converge in self-play, there have been instances of success such as [10].

4.3 Q-Learning for 1-Card Poker

For the game of 1-card poker we created reinforcement learning agents for both player 1 and player 2. Since each player has different representational needs for the state (player 1 can act twice, whereas player 2 only acts once), we decided on separate Q-tables for each player.

Player 1's state is defined by a 5-tuple, {C, R, P1move1, P2move1, P1move2}. C is the player's card, R is the current round (0 or 1), and the remaining three represent the moves that have occurred thus far in the game. Player 1 has one binary action: 0 to CALL or 1 to BET.

For player 2, a state is defined by a 3-tuple, {C, P1move1, P2move1}. C is the player's card, and the remaining two represent the moves that have occurred thus far in the game. Again, player 2 has one binary action, 0 to


CALL or 1 to BET. Notice that player 2's state representation doesn't need to represent which round it is because it only acts once.
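
As a rough illustration, the two Q-tables can be represented as dictionaries keyed by these state tuples. This is a minimal sketch under the representation above; the variable names, the use of None for moves that have not yet occurred, and the lazy random initialization are assumptions of mine, not taken from the original code.

from collections import defaultdict
import random

CALL, BET = 0, 1
ACTIONS = (CALL, BET)

# Player 1 states: (C, R, P1move1, P2move1, P1move2); player 2 states: (C, P1move1, P2move1).
# Entries are created lazily and initialized uniformly in (0, 1), as in Algorithm 1.
q1 = defaultdict(lambda: {a: random.random() for a in ACTIONS})
q2 = defaultdict(lambda: {a: random.random() for a in ACTIONS})

def greedy(q_table, state):
    # The action with the highest Q-value in this state.
    return max(ACTIONS, key=lambda a: q_table[state][a])

# Example: player 1 holds a 10 in round 0 with no moves made yet.
first_action = greedy(q1, (10, 0, None, None, None))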

Algorithm 1 Q-Learning Algorithm

1. Let α ∈ (0, 1] be the learning rate. Initialize,

Q(s, a) ← uniform(0, 1)

2. Repeat,

(a) From state s, select action a = argmax_a Q(s, a) with suitable exploration.

(b) Observing reward r and next state s′,

Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]
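
A minimal Python sketch of Algorithm 1 with ε-greedy exploration is given below. The hyperparameter values and the terminal flag (a poker hand is a short episode, so there is no bootstrap target at the end) are assumptions; q_table follows the dictionary representation sketched in Section 4.3.

import random

def select_action(q_table, state, actions=(0, 1), epsilon=0.1):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table[state][a])

def q_update(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.9, terminal=False):
    # One application of Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)].
    target = reward
    if not terminal:
        target += gamma * max(q_table[next_state].values())
    q_table[state][action] += alpha * (target - q_table[state][action])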

5 Experiments

To evaluate the strength of our RL agents, we conducted a round-robin tournament of computer vs. computer matches. The field of computer opponents consisted of:

1. AlwaysBet: This player always bets and is used as a benchmark.

2. AlwaysCall: This player always calls and is used as a benchmark.

3. DumbRule: This player plays a mixed strategy according to the card it holds. The probability of betting is 1 − (ACE − card)/ACE, where ACE = 13 and card is the numerical value of the card. For example, if the player holds a 10, then the probability of betting is 1 − (13 − 10)/13 = 0.769, or 76.9% (see the sketch after this list).

4. RuleBased: This player plays a mixed strategy, with separate betting probabilities determined by the linear program solving the sequence form of 1-card poker [6], [8].

5. Mixed: This player randomly chooses a player from the list {AlwaysBet, AlwaysCall, DumbRule, RuleBased} and follows its strategy for the current action.
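
For illustration, the DumbRule and Mixed opponents sketched above could look as follows. The clipping of the betting probability to [0, 1] and the omission of RuleBased from the Mixed pool (its probabilities come from the linear-program solution in [6]) are simplifications of my own, not part of the original players.

import random

CALL, BET = 0, 1
ACE = 13

def always_bet(card):
    return BET

def always_call(card):
    return CALL

def dumb_rule(card):
    # Bet with probability 1 - (ACE - card)/ACE, e.g. 0.769 when holding a 10.
    p_bet = min(max(1 - (ACE - card) / ACE, 0.0), 1.0)
    return BET if random.random() < p_bet else CALL

def mixed(card, pool=(always_bet, always_call, dumb_rule)):
    # Pick one strategy from the pool at random and follow it for this action.
    return random.choice(pool)(card)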


5.1 Evaluation of RL Agent

First, we will look at how well the RL agent is able to model its opponent. The two factors of interest are how quickly the agent learns and what its average reward is. Each RL agent was trained for 100,000 games against each opponent; every 100 games, its strategy was fixed and evaluated over 100,000 games. This was done three times and the results were averaged.
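
A minimal sketch of this protocol is shown below; train_one_hand and evaluate_frozen are assumed callbacks (one hand of training with Q-updates, and a fixed-policy evaluation returning the average reward), not functions from the original code.

def learning_curve(train_one_hand, evaluate_frozen,
                   n_train=100_000, eval_interval=100, n_eval=100_000):
    # Train for n_train hands; every eval_interval hands, freeze the greedy
    # policy and measure its average reward over n_eval evaluation hands.
    curve = []
    for game in range(1, n_train + 1):
        train_one_hand()
        if game % eval_interval == 0:
            curve.append((game, evaluate_frozen(n_eval)))
    return curve

# The paper repeats this three times per opponent and averages the curves.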

Figure 2 shows the learning curves for RL Player 1. The first thing we observe is that it approaches the theoretical maximum against the AlwaysCall player. To get a clearer picture of the learning rates we examine Figure 3, where the AlwaysCall results are excluded. From this figure we can see that the average reward against the Mixed opponent is clearly the highest. This is very interesting, since one might expect the randomness to decrease both performance and learning rate. We believe the high average reward is explained by the fact that the AlwaysCall opponent could be chosen. We also note that the learning rate is approximately the same against each opponent.

Figures 4 and 5 show the learning curves for RL Player 2. Again we can see that it approaches the theoretical maximum against the AlwaysCall player. Second, we observe that it does very well against the rule-based player, which is supposed to be optimal (in the game-theoretic sense). This shows that by modeling and adapting to our opponents we can take advantage of fixed strategies. Note that once again the learning rates are approximately the same against each opponent.

Now we look at Figure 6, where the results of two RL agents learning simultaneously are displayed. Since neither agent is stationary, their Q-values are not expected to converge. However, it appears that as more games are played the average reward for player 1 is slightly above 1. We speculate that this is because player 1 has two moves (one more than player 2) and hence has the option of slow-playing.

5.2 Round-Robin Tournament Results

Now we will analyze the round-robin tournament. Tables 1 and 2 in the Appendix show, respectively, the percentage of games won and the average winnings per hand for player 1, averaged over 100,000 games.


The RL player did very well and had the second-highest winning percentage, next to the AlwaysBet opponent. It is somewhat surprising that a deterministic strategy could be so dominant in this setting. However, when we look at the average winnings per hand we see that the RL agent is the only one to have a positive average reward against all opponents. The discrepancy between percentage of games won and average winnings could be attributed to the emergence of strategies such as slow-playing or learning to fold bad hands early on, hence minimizing losses. In the real world, average winnings is the more important performance indicator, and hence we can say with confidence that our RL agent is indeed very competitive against the presented field.

6 Conclusion and Future Work

In this paper, we implemented a reinforcement learning agent for the game of 1-card poker. This agent has demonstrated that opponent modeling and adaptability are key ingredients of a successful player, by identifying opponent tendencies and exploiting them. By not using a purely rule-based or game-theoretic approach, we were able to maximize our reward against our current opponent, whatever their strategy may be.

There are two directions for future work in this area. The first is extending the approach to handle more than two players. The size of the state representation would not be a limiting factor; however, multiple opponents could slow the learning rate significantly. The other direction would be to extend the approach to a more complex variant of poker, such as Texas Hold 'em. This would greatly increase the complexity of the state representation as well as slow the learning rate. With a more complex variant, the hand complexity increases exponentially, which would test the limits of our approach.

References

[1] D. Billings, M. Bowling, N. Burch, A. Davidson, R. Holte, J. Schaeffer, T. Schauenberg, and D. Szafron. Game tree search with adaptation in stochastic imperfect information games. Computers and Games, July 2004.

[2] D. Billings, N. Burch, A. Davidson, R. Holte, J. Schaeffer, T. Schauenberg, and D. Szafron. Approximating game-theoretic optimal strategies for full-scale poker. IJCAI, pages 661–668, 2003.

[3] D. Billings, A. Davidson, J. Schaeffer, and D. Szafron. The challenge of poker. Artificial Intelligence, 134:201–240, 2002.

[4] H. Fujita and S. Ishii. A reinforcement learning scheme for a multi-agent card game with Monte Carlo state estimation. International Conference on Computational Intelligence for Modeling, Control and Automation, pages 799–806, 2004.

[5] H. Fujita, Y. Matsuno, and S. Ishii. A reinforcement learning scheme for a multi-agent card game. IEEE International Conference on Systems, Man and Cybernetics, pages 4071–4078, 2003.

[6] G. Gordon. One-card poker. http://www-2.cs.cmu.edu/~ggordon/poker/.

[7] J. Hu and M.P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. Proceedings of the Fifteenth International Conference on Machine Learning, pages 242–250, 1998.

[8] D. Koller and A. Pfeffer. Generating and solving imperfect information games. IJCAI, 1995.

[9] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.

[10] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 1995.

[11] C.J. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, 1989.


A Tables and Figures

Table 1: Percentage of Games Won by Player 1 (over 100,000 games)

                             Player 2
Player 1      AlwaysBet  AlwaysCall  DumbRule  RuleBased2   RL P2    Mixed
AlwaysBet        50.041     100.0      69.133     65.188    54.437   71.043
AlwaysCall        0.0        50.155    36.975     26.497     1.515   28.418
DumbRule         39.754      62.652    52.402     46.503    42.363   50.334
RuleBased1       42.521      68.109    54.051     47.930    45.120   53.059
RL P1            44.430      95.244    66.285     63.553    54.259   69.701

Table 2: Average Winnings Per Hand by Player 1 (over 100,000 games)

                             Player 2
Player 1      AlwaysBet  AlwaysCall  DumbRule  RuleBased2   RL P2    Mixed
AlwaysBet         0.001      1.00       0.201      0.189    -0.155    0.346
AlwaysCall       -1.00       0.003     -0.260     -0.470    -0.969   -0.431
DumbRule         -0.009      0.253      0.022     -0.054    -0.155    0.053
RuleBased1        0.007      0.362      0.025     -0.052    -0.222    0.082
RL P1             0.138      0.904      0.181      0.177     0.026    0.3209


Figure 2: Learning Curve for RL Player 1

Figure 3: Learning Curve for RL Player 1 (AlwaysCall results excluded)


Figure 4: Learning Curve for RL Player 2

Figure 5: Learning Curve for RL Player 2


Figure 6: Learning Curve for RL Player 1 (while playing against a learning RL Player 2)
