Case-Based Strategies in Computer Poker
Jonathan Rubin and Ian Watson
Department of Computer Science, University of Auckland, Game AI Group
E-mail: jrubin01@gmail.com, ian@cs.auckland.ac.nz

The state-of-the-art within Artificial Intelligence has directly benefited from research conducted within the computer poker domain. One such success has been the advancement of bottom up equilibrium finding algorithms via computational game theory. On the other hand, alternative top down approaches, which attempt to generalise decisions observed within a collection of data, have not received as much attention. In this work we employ a top down approach in order to construct case-based strategies within three computer poker domains. Our analysis begins within the simplest variation of Texas Hold’em poker, i.e. two-player, limit Hold’em. We trace the evolution of our case-based architecture and evaluate the effect that modifications have on strategy performance. The end result of our experimentation is a coherent framework for producing strong case-based strategies based on the observation and generalisation of expert decisions. The lessons learned within this domain offer valuable insights, which we use to apply the framework to the more complicated domains of two-player, no-limit Hold’em and multi-player, limit Hold’em. For each domain we present results obtained from the Annual Computer Poker Competition, where the best poker agents in the world are challenged against each other. We also present results against human opposition.

Keywords: Imperfect Information Games, Game AI, Case-Based Reasoning

AI Communications 25 (2012) 19-48. DOI 10.3233/AIC-2012-0513. ISSN 0921-7126, IOS Press. All rights reserved.

1. Introduction

The state-of-the-art within Artificial Intelligence (AI) research has directly benefited from research conducted within the computer poker domain. Perhaps its most notable achievement has been the advancement of equilibrium finding algorithms via computational game theory. State-of-the-art equilibrium finding algorithms are now able to solve mathematical models that were once prohibitively large. Furthermore, empirical results tend to support the intuition that solving larger models results in better quality strategies¹. However, equilibrium finding algorithms are only one of many approaches available within the computer poker test-bed. Alternative approaches such as imperfect information game tree search [8] and, more recently, Monte-Carlo tree search [36] have also received attention from researchers in order to handle challenges within the computer poker domain that cannot be suitably addressed by equilibrium finding algorithms, such as dynamic adaptation to changing game conditions.

¹ See [38] for a discussion of why this is not always the case.

The algorithms mentioned above take a bottom up approach to constructing sophisticated strategies within the computer poker domain. While the details of each algorithm differ, they roughly achieve their goal by enumerating (or sampling) a state space together with its pay-off values in order to identify a distribution over actions that achieves the greatest expected value. An alternative top down procedure attempts to construct sophisticated strategies by generalising decisions observed within a collection of data. This lazier top down approach offers its own set of problems in the domain of computer poker. In particular, any top down approach is a slave to its data, so quality data is a necessity. While massive amounts of data from online poker sites are available [25], the quality of the decisions contained within this data is usually questionable. The imperfect information world of the poker domain can often mean that valuable information may be missing from this data. Moreover, the stochastic nature of the poker domain ensures that it is not enough to simply rely on outcome information in order to determine decision quality.

Despite the problems described above, top down approaches within the computer poker domain have still managed to produce strong strategies [4,28]. In fact, empirical evidence from international computer poker competitions [1] suggests that, in a few cases, top down approaches have managed to out-perform their bottom up counterparts. In this work we describe one such top down approach that we have used to construct sophisticated strategies within the computer poker domain. Our case-based approach can be used to produce strategies for a range of sub-domains within the computer poker environment, including both limit and no-limit betting structures as well as two-player and multi-player matches. The case-based strategies produced by our approach have achieved 1st place finishes for our agent (Sartre) at the Annual Computer Poker Competition (ACPC) [1]. The ACPC is the premier computer poker event and the agents submitted typically represent the current state-of-the-art in computer poker research.

We have applied and evaluated case-based strategies within the game of Texas Hold’em. Texas Hold’em is currently the most popular poker variation. To achieve strong performance, players must be able to successfully deal with imperfect information, i.e. they cannot see their opponents’ hidden cards. Also, chance events occur in the domain via the random distribution of playing cards. Texas Hold’em can be played as a two-person game or a multi-player game. There are multiple variations on the type of betting structures used that can dramatically alter the dynamics of the game and hence the strategies that must be employed for successful play. For instance, a limit game restricts the size of the bets allowed to predefined values. On the other hand, a no-limit game imposes no such restriction.

In this work we present case-based strategies in three poker domains. Our analysis begins within the simplest variation of Texas Hold’em, i.e. two-player, limit Hold’em. Here we trace the evolution of our case-based architecture and evaluate the effect that modifications have on strategy performance. The end result of our experimentation in the two-player, limit Hold’em domain is a coherent framework for producing strong case-based strategies, based on the observation and generalisation of expert decisions. The lessons learned within this domain offer valuable insights, which we use to apply the framework to the more complicated domains of two-player, no-limit Hold’em and multi-player, limit Hold’em. We describe the difficulties that these more complicated domains impose and how our framework deals with these issues. For each of the three poker sub-domains mentioned above we produce strategies that have been extensively evaluated. In particular, we present results from Annual Computer Poker Competitions for the years 2009 to 2011 and illustrate the performance trajectory of our case-based strategies against the best available opposition.

The remainder of this document proceeds as follows. Section 2 describes the rules of Texas Hold’em poker, highlighting the differences between the different variations available. Section 3 provides the necessary background and details some related work. Section 4 further recaps the benefits of the poker domain as a test-bed for artificial intelligence research and provides the motivation for the use of case-based strategies as opposed to alternative algorithms. Section 5 details the initial evolution of our case-based architecture for computer poker in the two-player, limit Hold’em domain. Experimental results are presented and discussed. Sections 6 and 7 extrapolate the resulting framework to the more complicated domains of two-player, no-limit Hold’em and multi-player limit Hold’em. Once again, results are presented and discussed for each separate domain. Finally, Section 8 concludes the document.

2. Texas Hold’em

Here we briefly describe the game of Texas Hold’em, highlighting some of the common terms which are used throughout this work. For more detailed information on Texas Hold’em consult [33], or for further information on poker in general see [32].

Texas Hold’em can be played either as a two-player game or a multi-player game. When a game consists only of two players it is often referred to as a heads up match. Game play consists of four stages: preflop, flop, turn and river. During each stage a round of betting occurs. The first round of play is the preflop where all players at the table are dealt two hole cards, which only they can see. Before any betting takes place, two forced bets are contributed to the pot, i.e. the small blind and the big blind. The big blind is typically double that of the small blind. In a heads up match, the dealer acts first preflop. In a multi-player match the player to the left of the big blind acts first preflop. In both heads up and multi-player matches, the dealer is the last to act on the post-flop betting rounds (i.e. the flop, turn and river). The legal betting actions are fold, check/call or bet/raise. These possible betting actions are common to all variations of poker and are described in more detail below:

Fold: When a player contributes no further chips to the pot and abandons their hand and any right to contest the chips that have been added to the pot.

Check/Call: When a player commits the minimum amount of chips possible in order to stay in the hand and continues to contest the pot. A check requires a commitment of zero further chips, whereas a call requires an amount greater than zero.

Bet/Raise: When a player commits greater than the minimum amount of chips necessary to stay in the hand. When the player could have checked, but decides to invest further chips in the pot, this is known as a bet. When the player could have called a bet, but decides to invest further chips in the pot, this is known as a raise.

In a limit game all bets are in increments of a certain amount. In a no-limit game a player may bet any amount up to the total value of chips that they possess. For example, assuming a player begins a match with 1000 chips, after paying a forced small blind of one chip they then have the option to either fold, call one more chip or raise by contributing anywhere between 3 and 999 extra chips². In a standard game of heads-up, no-limit poker, both players’ chip stacks would fluctuate between hands, e.g. a win from a previous hand would ensure that one player had a larger chip stack to play with on the next hand. In order to reduce the variance that this structure imposes, a variation known as Doyle’s Game is played where the starting stacks of both players are reset to a specified amount at the beginning of every hand.

² The minimum raise would involve paying 1 more chip to match the big blind and then committing at least another 2 chips as the minimum legal raise.

Once the round of betting is complete, as long as at least two players still remain in the hand, play continues on to the next stage. Each post-flop stage involves the drawing of community cards from the shuffled deck of cards as follows: flop, 3 community cards; turn, 1 community card; river, 1 community card. All players combine their hole cards with the public community cards to form their best five card poker hand. A showdown occurs after the river where the remaining players reveal their hole cards and the player with the best hand wins all the chips in the pot. If both players’ hands are of equal value, the pot is split between them.

3. Background

3.1. Strategy Types

As mentioned in the introduction, many AI researchers working in the computer poker domain have focused their efforts on creating strong strategies via bottom up, equilibrium finding algorithms. When equilibrium finding algorithms are applied to the computer poker domain, they produce ε-Nash equilibria. ε-Nash equilibria are robust, static strategies that limit their exploitability (ε) against worst-case opponents. A pair of strategies is said to be an ε-Nash equilibrium if neither strategy can gain more than ε by deviating. In this context, a strategy refers to a probabilistic distribution over available actions at every decision point. Two state-of-the-art equilibrium finding algorithms are Counterfactual Regret Minimisation (CFRM) [39,18] and Excessive Gap Technique (EGT) [13]. CFRM is an iterative, regret minimising algorithm that was developed by the University of Alberta Computer Poker Research Group (CPRG)³. The EGT algorithm, developed by Andrew Gilpin and Tuomas Sandholm from Carnegie Mellon University, is an adapted version of Nesterov’s excessive gap technique [21], which has been specialised for two-player, zero-sum, imperfect information games.

³ http://poker.cs.ualberta.ca/

The ε-Nash equilibrium strategies produced via CFRM and EGT are solid, unwavering strategies that do not adapt given further observations made by challenging particular opponents. An alternative strategy type is one that attempts to exploit perceived weaknesses in an opponent’s strategy, by dynamically adapting its own play given further observations. This type of strategy is known as an exploitive (or maximal) strategy. Exploitive strategies typically select their actions based on information they have observed about their opponent. Therefore, constructing an exploitive strategy typically involves the added difficulty of generating accurate opponent models.

3.2. Strategy Evaluation and the Annual Computer Poker Competition

Both ε-Nash equilibrium based strategies and exploitive strategies have received attention in the computer poker literature [14,15,7,8,17]. Overall a larger focus has been applied to equilibrium finding approaches. This is especially true regarding agents entered into the Annual Computer Poker Competition. Since 2006, the ACPC has been held every year at conferences such as AAAI and IJCAI. The agents submitted to the competition typically represent the strongest computer poker agents in the world, for that particular year. Since 2009, the ACPC has evaluated agents in the following variations of Texas Hold’em:

1. Two-player, Limit Hold’em.
2. Two-player, No-Limit Hold’em.
3. Three-player, Limit Hold’em.

In this work, we restrict our attention to these three sub-domains. Agents are evaluated by playing many hands against each other in a round-robin tournament structure. The ACPC employs two winner determination procedures:

1. Total Bankroll. As its name implies the total bankroll winner determination simply records the overall profit or loss of each agent and uses this to rank competitors. In this division, agents that are able to achieve larger bankrolls are ranked higher than those with lower profits. This winner determination procedure does not take into account how an agent achieves its overall profit or loss; for instance, it is possible that the winning agent could win a large amount against one competitor, but lose to all other competitors.
2. Bankroll Instant Run-Off. On the other hand, the instant run-off division uses a recursive winner determination algorithm that repeatedly removes the agents that performed the worst against a current pool of players. This way agents that achieve large profits by exploiting weak opponents are not favoured, as they are in the total bankroll division.

As poker is a stochastic game that consists of chance events, the variance can often be large, especially between agents that are close in strength. This requires many hands to be played in order to arrive at statistically significant conclusions. Due to the large variance involved, the ACPC employs a duplicate match structure, whereby all players end up playing the same set of hands. For example, in a two-player match a set of N hands is played. This is then followed by dealing the same set of N hands a second time, but having both players switch seats so that they receive the cards their opponent received previously. As both players are exposed to the same set of hands, this reduces the amount of variance involved in the game by ensuring one player does not receive a larger proportion of higher quality hands than the other. A two-player match involves two seat enumerations, whereas a three-player duplicate match involves six seat enumerations to ensure each player is exposed to the same scenario as their opponents. For three players (ABC) the following seat enumerations need to take place:

ABC  ACB
CAB  CBA
BCA  BAC

4. Research Motivation

This work describes the use of case-based strategies in games. Our approach makes use of the Case-based Reasoning (CBR) methodology [26,19]. The CBR methodology encodes problems, and their solutions, as cases. CBR attempts to solve new problems or scenarios by locating similar past problems and re-using or adapting their solutions for the current situation. Case-based strategies are top down strategies, in that they are constructed by processing and analysing a set of training data. Common game scenarios, together with their playing decisions, are captured as a collection of cases, referred to as the case-base. Each case attempts to capture important game state information that is likely to have an impact on the final playing decision. The training data can be either real-world data, e.g. from online poker casinos, or artificially generated data, for instance from hand history logs generated by the ACPC. Case-based strategies attempt to generalise the game playing decisions recorded within the data via the use of similarity metrics that determine whether two game playing scenarios are sufficiently similar to each other, such that their decisions can be re-used.

Case-based strategies can be created by training on data generated from a range of expert players or by isolating the decisions of a single expert player. Where a case-based strategy is produced by training on and generalising the decisions of a single expert player, we refer to the agent produced as an expert imitator. In this way, case-based strategies can be produced that attempt to imitate different styles of play simply by training on separate datasets generated by observing the decisions of expert players, each with their own style. The lazy learning [2] of case-based reasoning is particularly suited to expert imitation where observations of expert play can be recorded and stored for use at decision time.

Case-based approaches have been applied and evaluated in a variety of gaming environments. CHEBR [24] was a case-based checkers player that acquired experience by simply playing games of checkers in real-time. In the RoboCup soccer domain, [11] used case-based reasoning to construct a team of agents that observes and imitates the behaviour of other agents. Case-based planning [16] has been investigated and evaluated in the domain of real-time strategy games [3,22,23,34]. Case-based tactician (CaT) described in [3] selects tactics based on a state lattice and the outcome of performing the chosen tactic. The CaT system was shown to successfully learn over time. The Darmok architecture described by [22,23] pieces together fragments of plans in order to produce an overall playing strategy. Performance of the strategies produced by the Darmok architecture was improved by first classifying the situation it found itself in and having this affect plan retrieval [20]. Combining CBR with other AI approaches has also produced successful results. In [31] transfer learning was investigated in a real time strategy game environment by merging CBR with reinforcement learning. Also, [6] combined CBR with reinforcement learning to produce an agent that could respond rapidly to changes in conditions of a domination game.

The stochastic, imperfect information world of Texas Hold’em poker is used as a test-bed to evaluate and analyse our case-based strategies. Texas Hold’em offers a rich environment that allows the opportunity to apply an abundance of strategies ranging from basic concepts to sophisticated strategies and counter-strategies. Moreover, the rules of Texas Hold’em poker are incredibly simple. Contrast this with CBR related research into complex environments such as real-time strategy games [3,20,22,23], which offer similar issues to deal with (uncertainty, chance, deception) but don’t encapsulate this within a simple set of rules, boundaries and performance metrics. Successes and failures achieved by applying case-based strategies to the game of poker may provide valuable insights for CBR researchers using complex strategy games as their domain, where immediate success is harder to evaluate. Furthermore, it is hoped that results may also generalise to domains outside the range of games altogether, to complex real world domains where hidden information, chance and deception are commonplace.

One of the major benefits of using case-based strategies within the domain of computer poker is the simplicity of the approach. Top down case-based strategies don’t require the construction of massive, complex mathematical models that some other approaches rely on [13,30,27]. Instead, an autonomous agent can be created simply via the observation of expert play and the encoding of observed actions into cases. Below we outline some further reasons why case-based strategies are suited to the domain of computer poker and hence worthy of investigation. The reasons listed are loosely based on Sycara’s [35] identification of characteristics of a domain where case-based reasoning is most applicable (these were later adjusted by [37]).

1. A case is easily defined in the domain. A case is easily identified as a previous scenario an (expert) player has encountered in the past and the action (solution) associated with that scenario such as whether to fold, call or raise. Each case can also record a final outcome from the hand, i.e. how many chips a player won or lost.
2. Expert human poker players compare current problems to past cases. It makes sense that poker experts make their decisions based on experience. An expert poker player will normally have played many games and encountered many different scenarios; they can then draw on this experience to determine what action to take for a current problem.
3. Cases are available as training data. While many cases are available to train a case-based strategy, the quality of their solutions can vary considerably. The context of the past problem needs to be taken into account and applied to similar contexts in the future. As the system gathers more experience it can also record its own cases, together with their observed outcomes.
4. Case comparisons can be done effectively. Cases are compared by determining the similarity of their local features. There are many features that can be chosen to represent a case. Many of the salient features in the poker domain (e.g. hand strength) are easily comparable via standard metrics. Other features, such as betting history, require more involved similarity metrics, but are still directly comparable.
5. Solutions can be generalised. For case-based strategies to be successful, the re-use or adaptation of similar cases’ solutions should produce a solution that is (reasonably) similar to the actual, known solution (if one exists) of the target case in question. This underpins one of CBR’s main assumptions: that similar cases have similar solutions. We present empirical evidence that suggests the above assumption is reasonable in the computer poker domain.

5. Two-Player, Limit Texas Hold’em

We begin with the application of case-based strategies within the domain of two-player, limit Texas Hold’em. Two-player, limit Hold’em offers a beneficial starting point for the experimentation and evaluation of case-based strategies within computer poker. Play is limited to two players and a restricted betting structure is imposed, whereby all bets and raises are limited to pre-specified amounts. The above restrictions limit the size of the state space, compared to Hold’em variations that allow no-limit betting and multiple opponents. However, while the size of the domain is reduced, compared to more complex poker domains, the two-player limit Hold’em domain is still very large. The game tree consists of approximately 10^18 game states and, given the standards of current hardware, it is intractable to derive a true Nash equilibrium for the game. In fact, it proves impossible to reasonably store this strategy by today’s hardware standards [18]. For these reasons alternative approaches, such as case-based strategies, can prove useful given their ability for generalisation.

Over the years we have conducted an extensive amount of experimentation on the use of case-based strategies, using two-player, limit Hold’em as our test-bed. In particular we have investigated and measured the effect that changes have on areas such as feature and solution representation, similarity metrics, system training and the use of different decision making policies. Modifications have ranged from the very minor, e.g. training on different sets of data, to the more dramatic, e.g. the development of custom betting sequence similarity metrics. For each modification and addition to the architecture we have extensively evaluated the strategies produced via self-play experiments, as well as by challenging a range of third-party, artificial agents and human opposition. Due to space limitations we restrict our attention to the changes that had the greatest effect on the system architecture and its performance. We have named our system Sartre (Similarity Assessment Reasoning for Texas hold’em via Recall of Experience) and we trace the evolution of its architecture below.

5.1. Overview

In order to generalise betting decisions from a set of (artificial or real-world) training data, it is first required to construct and store a collection of cases. A case’s feature and solution representation must be decided upon, such as the identification of salient attribute-value pairs that describe the environment at the time a case was recorded. Each case should attempt to capture important information about the current environment that is likely to have an impact on the final solution. After a collection of cases has been established, decisions can be made by searching the case-base and locating similar scenarios for which solutions have been recorded in the past. This requires the use of local similarity metrics for each feature.

Given a target case, t, that describes the immediate game environment, a source case, s ∈ S, where S is the entire collection of previously recorded cases, and a set of features, F, global similarity is computed by summing each feature’s local similarity contribution, sim_f, and dividing by the total number of features:

\[ G(t, s) = \sum_{f \in F} \frac{\mathit{sim}_f(t_f, s_f)}{|F|} \tag{1} \]

Fig. 1 provides a pictorial representation of the architecture we have used to produce case-based strategies. The six areas that have been labelled in Fig. 1 identify six key areas within the architecture where maintenance has had the most impact and led to positive effects on system performance. They are:

1. Feature Representation
2. Similarity Metrics
3. Solution Representation
4. Case Retrieval
5. Solution Re-Use Policies, and
6. System Training

Fig. 1. Overview of the architecture used to produce case-based strategies. The numbers identify the six key areas within the architecture where the effects of maintenance have been evaluated.

5.2. Architecture Evolution

Here we describe some of the changes that have taken place within the six key areas of our case-based architecture, identified above. Where possible, we provide a comparative evaluation for the maintenance performed, in order to measure the impact that changes had on the performance of the case-based strategies produced.

5.2.1. Feature Representation

The first area of the system architecture that we discuss is the feature representation used within a case (see Fig. 1, Point 1). We highlight results that have influenced changes to the representation over time. In order to construct a case-based strategy a case representation is required that establishes the type of information that will be recorded for each game scenario. Our case-based strategies use a simple attribute-value representation to describe a set of case features. Table 1 lists the features used within our case representation. A separate representation is used for preflop and postflop cases, given the differences between these two stages of the game. The features listed in Table 1 were chosen by the authors as they concisely capture all the necessary public game information, as well as the player’s personal, hidden information.

Table 1. Preflop and postflop case feature representation.

      Preflop             Postflop
  1.  Hole Cards          Hand Strength
  2.  Betting Sequence    Betting Sequence
  3.                      Board Texture

Each feature is explained in more detail below:

Preflop

1. Hole Cards: the personal hidden cards of the player, represented by 1 out of 169 equivalence classes.
2. Betting Sequence: a sequence of characters that represent the betting actions witnessed until the current decision point, where actions can be selected from the set A_limit = {f, c, r}.

Postflop

1. Hand Strength: a description of the player’s hand strength given a combination of their personal cards and the public community cards.
2. Betting Sequence: identical to the preflop sequence, however with the addition of round delimiters to distinguish betting from previous rounds, A_limit ∪ {−}.
3. Board Texture: a description of the public community cards that are revealed during the postflop rounds.

While the case features themselves have remained relatively unchanged throughout the architecture's evolution, the actual values that each feature records have been experimented with to determine the effect on final performance. For example, we have compared and evaluated the use of different metrics for the hand strength feature from Table 1. Fig. 2 depicts the result of a comparison between three hand strength feature values. In this experiment, the feature values for betting sequence and board texture were held constant, while the hand strength value was varied. The values used to represent hand strength were as follows:

CATEGORIES: Uses expert defined categories to classify hand strength. Hands are assigned into categories by mapping a player's personal cards and the available board cards into one of a number of predefined categories. Each category represents the type of hand the player currently has, together with information about the drawing potential of the hand, i.e. whether the hand has the ability to improve with future community cards. In total 284 categories were defined (see footnote 4).

E[HS]: Expected hand strength is a one-dimensional, numerical metric. The E[HS] metric computes the probability of winning at showdown against a random hand. This is given by enumerating all possible combinations of community cards and determining the proportion of the time the player's hand wins against the set of all possible opponent holdings. Given the large variety of values that can be produced by the E[HS] metric, bucketing takes place where similar values are mapped into a discrete set of buckets that contain hands of similar strength. Here we use a total of 20 buckets for each postflop round.

E[HS²]: The final metric evaluated involves squaring the expected hand strength. Johanson [18] points out that squaring the expected hand strength (E[HS²]) typically gives better results, as this assigns higher hand strength values to hands with greater potential. Typically in poker, hands with similar strength values, but differences in potential, are required to be played in strategically different ways [33]. Once again bucketing is used where the derived E[HS²] values are mapped into 1 of 20 unique buckets for each postflop round.

Footnote 4: A listing of all 284 categories can be found at the following website: http://www.cs.auckland.ac.nz/research/gameai/sartreinfo.html

The resulting case-based strategies were evaluated by challenging the computerised opponent Fell Omen 2 [10]. Fell Omen 2 is a solid two-player limit Hold'em agent that plays an ε-Nash equilibrium type strategy. Fell Omen 2 was made publicly available by its creator Ian Fellows and has become widely used as an agent for strategy evaluation [12]. The results depicted in Fig. 2 are measured in small bets per hand (sb/h), i.e. the total number of small bets won or lost divided by the total number of hands played. Each data point records the outcome of three matches, where 3000 duplicate hands were played. The 95% confidence intervals for each data point are also shown. Results were recorded for various levels of case-base usage to get an idea of how well the system is able to generalise decisions.

The results in Fig. 2 show that (when using a full case-base) the use of E[HS²] for the hand strength feature produces the strongest strategies, followed by the use of CATEGORIES and finally E[HS]. The poor performance of E[HS] is likely due to the fact that this metric does not fully capture the importance of a hand's future potential. When only a portion of the case-base is used it becomes more important for the system to be able to recognise similar attribute values in order to make appropriate decisions. Both E[HS] and E[HS²] are able to generalise well. However, the results show that decision generalisation begins to break down when using CATEGORIES. This has to do with the similarity metrics used. In particular, the CATEGORIES strategy in Fig. 2 is actually a baseline strategy that used overly simplified similarity metrics for each of its feature values. Next we discuss the area of similarity assessment within the system architecture, which is intimately tied to the particular values chosen within the feature representation.

5.2.2. Similarity Assessment
For each feature that is used to represent a case, a corresponding local similarity metric, sim_f(f1, f2), is required that determines how similar two feature values, f1 and f2, are to each other.
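To make the notion of a local similarity metric concrete, the sketch below shows the simplest possible instance: an exact-match metric that scores 1 for identical feature values and 0 otherwise. The function name and the example feature values are hypothetical and chosen only for illustration; this is not the authors' code.

```python
def all_or_nothing(f1, f2):
    """Exact-match local similarity: 1.0 for equal feature values, else 0.0."""
    return 1.0 if f1 == f2 else 0.0


# Hypothetical feature values, for illustration only.
identical = all_or_nothing("rc-c", "rc-c")   # same betting sequence
near_miss = all_or_nothing("rc-c", "cr-c")   # close, but not identical
print(identical, near_miss)
```

A metric of this kind cannot distinguish a near miss from a completely unrelated value, which is why it offers no basis for generalising between similar, non-identical cases.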
Fig. 2. The performance of three separate case-based strategies produced by altering the value used to represent hand strength. Results are measured in sb/h and were obtained by challenging Fell Omen 2.

The use of different representations for the hand strength feature in Fig. 2 also requires the use of separate similarity metrics. The CATEGORIES strategy in Fig. 2 employs a trivial all-or-nothing similarity metric for each of its features. If the value of one feature is the same as the value of another feature, a similarity score of 1 is assigned. On the other hand, if the two feature values differ at all, a similarity value of 0 is assigned. This was done to get an initial idea of how the system performed using the most basic of similarity retrieval measures. The performance of this baseline system could then be used to determine how improvements to local similarity metrics affected overall performance.

The degradation of performance observed in Fig. 2 for the CATEGORIES strategy (as the proportion of case-base usage decreases) is due to the use of all-or-nothing similarity assessment. The use of the overly simplified all-or-nothing similarity metric meant that the system's ability to retrieve similar cases could often fail, leaving the system without a solution for the current game state. When this occurred a default-policy was relied upon to provide the system with an action. The default-policy used by the system was an always-call policy, whereby the system would first attempt to check if possible, otherwise it would call an opponent's bet. This default-policy was selected by the authors as it was believed to be preferable to other trivial default policies, such as always-fold, which would always result in a loss for the system.

The other two strategies in Fig. 2 (E[HS] and E[HS²]) do not use trivial all-or-nothing similarity. Instead the hand strength features use a similarity metric based on Euclidean distance. Both the E[HS] and E[HS²] strategies also employ informed similarity metrics for their betting sequence and board texture features. Recall that the betting sequence feature is represented as a sequence of characters that lists the playing decisions that have been witnessed so far for the current hand. This requires the use of a non-trivial metric to determine similarity between two non-identical sequences. Here we developed a custom similarity metric that involves the identification of stepped levels of similarity, based on the number of bets/raises made by each player. The exact details of this metric are presented in Section 5.3.2. Finally, for completeness, we determine similarity between different board texture classes via the use of hand picked similarity values.

Fig. 2 shows that, compared to the CATEGORIES strategy, the E[HS] and E[HS²] strategies
do a much better job of decision generalisation as the usable portion of the case-base is reduced. The eventual strategies produced do not suffer the dramatic performance degradation that occurs with the use of all-or-nothing similarity.

5.2.3. Solution Representation
As well as recording feature values, each case also needs to specify a solution. The most obvious solution representation is a single betting action, a ∈ A_limit. As well as a betting action, the solution can also record the actual outcome, i.e. the numerical result, o ∈

Table 2
Total cases stored for each playing round using single value solution representation compared to vector valued solutions

Round | Total Cases - Single | Total Cases - Vector
Preflop | 201,335 | 857
Flop | 300,577 | 6,743
Turn | 281,529 | 35,464
River | 216,597 | 52,088
Total | 1,000,038 | 95,152

to decrease the number of cases required to be
tion from a single action solution representation to a vector valued solution representation (as described in Section 5.2.3). Initially, a variable value of k was allowed, whereby the total number of similar cases retrieved varied with each search of the case-base. Recall that a case representation that encodes solutions as single actions results in a redundant case-base that contains multiple cases with the exact same feature values. The solutions of those cases may or may not advocate different playing decisions. Given this representation, a final probability vector was required to be created on-the-fly at runtime by retrieving all identical cases and merging their solutions. Hence, the number of retrieved cases, k, could vary between 0 and N. When k > 0, the normalised entries of the probability vector were used to make a final playing decision. However, if k = 0, the always-call default-policy was used.

Once the solution representation was updated to record action vectors (instead of single decisions) a variable k value was no longer required. Instead, the algorithm was updated to simply always retrieve the nearest neighbour, i.e. k = 1. Given further improvements to the similarity metrics used, the use of a default-policy was no longer required as it was no longer possible to encounter scenarios where no similar cases could be retrieved. Instead, the most similar neighbour was always returned, no matter what the similarity value. This has resulted in a much more robust system that is actually capable of generalising decisions recorded in the training data, as opposed to the initial prototype system which offered no ability for graceful degradation, given dissimilar case retrieval.

5.2.5. Solution Re-use Policies
The fifth area of the architecture that we discuss (Fig. 1, Point 5) concerns the choice of a final playing decision via the use of separate policies, given a retrieved case and its solution. Consider the probabilistic action vector, A = (a1, a2, ..., an), and a corresponding outcome vector, O = (o1, o2, ..., on). There are various ways to use the information contained in the vectors to make a final playing decision. We have identified and empirically evaluated several different policies for re-using decision information, which we label solution re-use policies. Below we outline three solution re-use policies, which have been used for making final decisions by our case-based strategies.

1. Probabilistic: The first solution re-use policy simply selects a betting action probabilistically, given the proportions specified within the action vector, P(ai) = ai, for i = 1 ... n. Betting decisions that have greater proportions within the vector will be made more often than those with lower proportions. In a game-theoretic sense, this policy corresponds to a mixed strategy.
2. Max-frequency: Given an action vector A = (a1, a2, ..., an), the max-frequency solution re-use policy selects the action that corresponds to arg max_i ai, i.e. it selects the action that was made most often and ignores all other actions. In a game-theoretic sense, this policy corresponds to a pure strategy.
3. Best-outcome: Instead of using the values contained within the action vector, the best-outcome solution re-use policy selects an action, given the values contained within the outcome vector, O = (o1, o2, ..., on). The final playing decision is given by the action, ai, that corresponds to arg max_i oi, i.e. the action that corresponds to the maximum entry in the outcome vector.

Given the three solution re-use policies described above, it is desirable to know which policies produce the strongest strategies. Table 3 presents the results of self-play experiments where the three solution re-use policies were challenged against each other. A round robin tournament structure was used, where each policy challenged every other policy. The figures presented are from the row player's perspective and are in small bets per hand. Each match consists of 3 separate duplicate matches of 3000 hands. Hence, in total 18,000 hands of poker were played between each pair of competitors. All results are statistically significant with a 95% level of confidence.

Table 3 shows that the max-frequency policy outperforms its probabilistic and best-outcome counterparts. Of the three, best-outcome fares the worst, losing all of its matches. The results indicate that simply re-using the most commonly made decision results in better performance than mixing from a probability vector and that choosing the decision that resulted in the best outcome was the worst solution re-use policy. Moreover, these results are representative of further experiments involving other third-party computerised agents.
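The three solution re-use policies can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the (fold, call, raise) ordering of the vector entries, and the tie-breaking rule (the first maximum wins) are assumptions. The example vectors are the first action and outcome triples given as examples in Table 4, where -∞ marks an unknown outcome.

```python
import random

# Assumed ordering of vector entries: fold, call, raise (the set A_limit).
ACTIONS = ("f", "c", "r")


def probabilistic(action_vector, rng=random):
    """Sample an action with probability equal to its entry, P(a_i) = a_i."""
    return rng.choices(ACTIONS, weights=action_vector, k=1)[0]


def max_frequency(action_vector):
    """Select the action with the largest proportion (arg max_i a_i)."""
    best = max(range(len(action_vector)), key=lambda i: action_vector[i])
    return ACTIONS[best]


def best_outcome(outcome_vector):
    """Select the action with the largest recorded outcome (arg max_i o_i)."""
    best = max(range(len(outcome_vector)), key=lambda i: outcome_vector[i])
    return ACTIONS[best]


UNKNOWN = float("-inf")  # outcome for an action that was never observed
action_vector = (0.0, 0.5, 0.5)
outcome_vector = (UNKNOWN, 4.3, 15.6)

print(probabilistic(action_vector))   # "c" or "r", each with probability 0.5
print(max_frequency(action_vector))   # tie between call and raise; first maximum is returned
print(best_outcome(outcome_vector))   # "r", the action with the highest recorded outcome
```

Note that max-frequency and best-outcome are deterministic given the retrieved case, whereas the probabilistic policy can return a different action each time the same case is retrieved.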
Table 3
Results of experiments between solution re-use policies. The values shown are in sb/h with 95% confidence intervals.

 | Max-frequency | Probabilistic | Best-outcome | Average
Max-frequency | | 0.011 ± 0.005 | 0.076 ± 0.008 | 0.044 ± 0.006
Probabilistic | −0.011 ± 0.005 | | 0.036 ± 0.009 | 0.012 ± 0.004
Best-outcome | −0.076 ± 0.008 | −0.036 ± 0.009 | | −0.056 ± 0.005

One of the reasons for the poor performance of best-outcome is likely that good outcomes don't necessarily represent good betting decisions and vice-versa. The reason for the success of the max-frequency policy is less obvious. In our opinion, this has to do with the type of opponent being challenged, i.e. agents that play a static, non-exploitive strategy, such as those listed in Table 3, as well as strategies that attempt to approximate a Nash equilibrium. As an equilibrium-based strategy does not attempt to exploit any bias in its opponent's strategy, it will only gain when the opponent ends up making a mistake by selecting an inappropriate action. The action that was made most often is unlikely to be an inappropriate action, therefore sticking to this decision avoids any exploration errors made by choosing other (possibly inappropriate) actions. Moreover, biasing playing decisions towards this action is likely to go unpunished when challenging a non-exploitive agent. On the other hand, against an exploitive opponent the bias imposed by choosing only one action is likely to be detrimental to performance in the long run and therefore it would become more important to mix up decisions.

5.2.6. System Training
How the system is trained is the final key area of the architecture that we discuss, in regard to system maintenance. One of the major benefits of producing case-based strategies via expert imitation is that different types of strategies can be produced by simply modifying the data that is used to train the system. Decisions that were made by an expert player can be extracted from hand history logs and used to train a case-based strategy. Experts can be either human or other artificial agents.

In order to train a case-based strategy, perfect information is required, i.e. the data needs to record the hidden card information of the expert player. Typically, data collected from online poker sites only contains this information when the original expert played a hand that resulted in a showdown. For hands that were folded before a showdown, this information is lost. It is difficult to train a strategy on data where this information is missing. More importantly, any attempt to train a system on only the data where showdowns occurred would result in biased actions, as the decision to fold would never be encountered.

It is for these reasons that our case-based strategies have been trained on data made publicly available from the Annual Computer Poker Competition [1]. This data records hand history logs for all matches played between computerised agents at a particular year's competition. The data contains perfect information for every hand played and therefore can easily be used to train an imitation-based system. Furthermore, the computerised agents that participate at the ACPC each year are expected to improve in playing strength over the years and hence re-training the system on updated data should have a follow-on effect on performance for any imitation strategies produced from the data. Our case-based strategies have typically selected subsets of data to train on, based on the decisions made by the agents that have performed the best in either of the two winner determination methods used by the ACPC.

There are both advantages and disadvantages to producing strategies that rely on generalising decisions from training data. While this provides a convenient mechanism for easily upgrading a system's play, there is an inherent reliance on the quality of the underlying data in order to produce reasonable strategies. Furthermore, it is reasonable to assume that strategies produced in this way are typically only expected to do as well as the original expert(s) they are trained on.

5.3. A Framework for Producing Case-Based Strategies in Two-Player, Limit Texas Hold'em

For the six key areas of our architecture (described above) maintenance was guided via comparative evaluation and overall impact on performance. The outcome of this intensive, systematic maintenance is the establishment of a final framework for producing case-based strategies in the domain of two-player, limit Hold'em.

Here we present the details of the final framework we have established for producing case-based strategies. The following sections illustrate the details of our framework by specifying the following sufficient components:

1. A representation for encoding cases and game state information.
2. The corresponding similarity metrics required for decision generalisation.

Table 4
A case is made up of three attribute-value pairs, which describe the current state of the game. A solution consists of an action and outcome triple, which records the average numerical value of applying the action (-∞ refers to an unknown outcome).

Attribute | Type | Example
1. Hand Strength | Integer | 1 – 50
2. Betting Sequence | String | rc-c, crrc-crrc-cc-, r, ...
3. Board Texture | Class | No-Salient, Flush-Possible, Straight-Possible, Flush-Highly-Possible, ...
Action | Triple | (0.0, 0.5, 0.5), (1.0, 0.0, 0.0), ...
Outcome | Triple | (-∞, 4.3, 15.6), (-2.0, -∞, -∞), ...

5.3.1. Case Representation
Table 4 depicts the final post-flop case representation used to capture game state information. A single case is represented by a collection of attribute-value pairs. Separate case-bases are constructed for the separate rounds of play by processing a collection of hand histories and recording values for each of the three attributes listed in Table 4. The attributes have been selected by the authors as they capture all the necessary information required to make a betting decision. Each of the post-flop attribute-value pairs is now described in more detail:

1. Hand Strength: The quality of a player's hand is represented in our framework by calculating the E[HS²] of the player's cards and then mapping these values into 1 out of 50 evenly divided buckets, i.e. uniform bucketing.
2. Betting Sequence: The betting sequence is represented as a string. It records all observed actions that have taken place in the current round, as well as previous rounds. Characters in the string are selected from the set of allowable actions, A_limit = {f, c, r}, and rounds are delimited by a hyphen.
3. Board Texture: The board texture refers to important information available, given the combination of the publicly available community cards. In total, nine board texture categories were selected by the authors. These categories are displayed in Table 5 and are believed to represent salient information that any human player would notice. Specifically, the categories focus on whether it is possible that an opponent has made a flush (five cards of the same suit) or a straight (five cards of sequential rank), or a combination of both. The categories are broken up into possible and highly-possible distinctions. A category labelled possible refers to the situation where the opponent requires two of their personal cards in order to make their flush or straight. On the other hand, a highly-possible category only requires the opponent to use one of their personal cards to make their hand, making it more likely they have a straight or flush.

5.3.2. Similarity Metrics
Each feature requires a corresponding local similarity metric in order to generalise decisions contained in a set of data. Here we present the metrics specified by our framework.

1. Hand Strength: Equation 2 specifies the metric used to determine similarity between two hand strength buckets (f1, f2).
$$\mathrm{sim}(f_1, f_2) = \max\left\{ 1 - k \cdot \frac{|f_1 - f_2|}{T},\ 0 \right\} \qquad (2)$$

Here, T refers to the total number of buckets that have been defined, where f1, f2 ∈ [1, T], and k is a scalar parameter used to adjust the rate at which similarity should decrease.

2. Betting Sequence: To determine similarity between two betting sequences we developed a custom similarity metric that involves the identification of stepped levels of similarity, based on the number of bets/raises made by each player. The first level of similarity (level0) refers to the situation when one betting sequence exactly matches that of another. If the sequences do not exactly match, the next level of similarity (level1) is evaluated. If two distinct betting sequences exactly match for the active betting round and, for all previous betting rounds, the total number of bets/raises made by each player are equal, then level1 similarity is satisfied and a value of 0.9 is assigned. Consider the following example where the active betting round is the turn and the two betting sequences are:

1. crrc-crrrrc-cr
2. rrc-rrrrc-cr

Here, level0 is clearly not satisfied as the sequences do not match exactly. However, for the active betting round (cr) the sequences do match. Furthermore, during the preflop (1. crrc and 2. rrc) both players made 1 raise each, albeit in a different order. During the flop (1. crrrrc and 2. rrrrc) both players now make 2 raises each. Given that the number of bets/raises in the previous rounds (preflop and flop) match, these two betting sequences would be assigned a similarity value of 0.9.

If level1 similarity was not satisfied the next level (level2) would be evaluated. Level2 similarity is less strict than level1 similarity as the previous betting rounds are no longer differentiated. Consider the river betting sequences:

1. rrc-cc-cc-rrr
2. cc-rc-crc-rrr

Once again the sequences for the active round (rrr) match exactly. This time, the number of bets/raises in the preflop round are not equal (the same applies for the flop and the turn). Therefore, level1 similarity is not satisfied. However, the number of raises encountered for all the previous betting rounds combined (1. rrc-cc-cc and 2. cc-rc-crc) are the same for each player, namely 1 raise by each player. Hence, level2 similarity is satisfied and a similarity value of 0.8 would be assigned. Finally, if level0, level1 and level2 are not satisfied, level3 is reached where a similarity value of 0 is assigned.

3. Board Texture: To determine similarity between board texture categories a similarity matrix was derived. Matrix rows and columns in Fig. 3 represent the different categories defined in Table 5. Diagonal entries refer to two sets of community cards that map to the same category, in which case similarity is always 1. Non-diagonal entries refer to similarity values between two dissimilar categories. These values were hand picked by the authors. The matrix given in Fig. 3 is symmetric.

    A    B    C    D    E    F    G    H    I
A   1    0    0    0    0    0    0    0    0
B   0    1    0.8  0.7  0    0    0    0    0
C   0    0.8  1    0.7  0    0    0    0    0
D   0    0.7  0.7  1    0    0    0    0    0
E   0    0    0    0    1    0.8  0.7  0    0.6
F   0    0    0    0    0.8  1    0.7  0    0.5
G   0    0    0    0    0.7  0.7  1    0.8  0.8
H   0    0    0    0    0    0    0.8  1    0.8
I   0    0    0    0    0.6  0.5  0.8  0.8  1

Fig. 3. Board texture similarity matrix.

Table 5
Board Texture Key

A | No salient
B | Flush possible
C | Straight possible
D | Flush possible, straight possible
E | Straight highly possible
F | Flush possible, straight highly possible
G | Flush highly possible
H | Flush highly possible, straight possible
I | Flush highly possible, straight highly possible
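The three local similarity metrics of this section can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. Assumed details not given in the text: the function names, the default k = 1.0 (the value of k is not stated), a similarity of 1.0 for level0, and the heads-up acting order used to attribute raises to players (one player acts first preflop, the other acts first in every later round).

```python
def bucket(value, total_buckets=50):
    """Uniformly map a value in [0, 1] to a bucket index in [1, total_buckets]."""
    return min(int(value * total_buckets) + 1, total_buckets)


def hand_strength_sim(f1, f2, total_buckets=50, k=1.0):
    """Equation 2: max{1 - k * |f1 - f2| / T, 0}; k = 1.0 is a placeholder."""
    return max(1.0 - k * abs(f1 - f2) / total_buckets, 0.0)


def raises_per_round(rounds):
    """Count raises made by each of the two players in every listed round.

    Assumed acting order: player 0 acts first preflop (round 0), player 1
    acts first in every later round, and players strictly alternate.
    """
    counts = []
    for round_index, actions in enumerate(rounds):
        first = 0 if round_index == 0 else 1
        tally = [0, 0]
        for i, action in enumerate(actions):
            if action == "r":
                tally[(first + i) % 2] += 1
        counts.append(tuple(tally))
    return counts


def betting_sequence_sim(s1, s2):
    """Stepped similarity between two hyphen-delimited betting sequences."""
    if s1 == s2:
        return 1.0  # level0: exact match (a value of 1 is assumed)
    rounds1, rounds2 = s1.split("-"), s2.split("-")
    # Levels 1 and 2 both require an identical active (last) round.
    if len(rounds1) != len(rounds2) or rounds1[-1] != rounds2[-1]:
        return 0.0  # level3
    prev1 = raises_per_round(rounds1[:-1])
    prev2 = raises_per_round(rounds2[:-1])
    if prev1 == prev2:
        return 0.9  # level1: raise counts match in every previous round
    if tuple(map(sum, zip(*prev1))) == tuple(map(sum, zip(*prev2))):
        return 0.8  # level2: raise counts match over all previous rounds combined
    return 0.0  # level3


CATEGORIES = "ABCDEFGHI"  # board texture keys from Table 5
BOARD_SIM = [             # similarity matrix from Fig. 3
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0.8, 0.7, 0, 0, 0, 0, 0],
    [0, 0.8, 1, 0.7, 0, 0, 0, 0, 0],
    [0, 0.7, 0.7, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0.8, 0.7, 0, 0.6],
    [0, 0, 0, 0, 0.8, 1, 0.7, 0, 0.5],
    [0, 0, 0, 0, 0.7, 0.7, 1, 0.8, 0.8],
    [0, 0, 0, 0, 0, 0, 0.8, 1, 0.8],
    [0, 0, 0, 0, 0.6, 0.5, 0.8, 0.8, 1],
]


def board_texture_sim(c1, c2):
    """Look up the hand-picked similarity between two board texture categories."""
    return BOARD_SIM[CATEGORIES.index(c1)][CATEGORIES.index(c2)]


# The two worked examples from the text.
print(betting_sequence_sim("crrc-crrrrc-cr", "rrc-rrrrc-cr"))   # 0.9 (level1)
print(betting_sequence_sim("rrc-cc-cc-rrr", "cc-rc-crc-rrr"))   # 0.8 (level2)
```

Under the assumed acting order, the sketch reproduces the level1 and level2 values of both worked examples; a different convention for attributing raises to players could change the result for other sequences.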
5.4. Experimental Results

We now present a series of experimental results collected in the domain of two-player, limit Texas Hold'em. The results presented are obtained from annual computer poker competitions and data collected by challenging human opposition. For each evaluated case-based strategy, we provide an architecture snapshot that captures the relevant details of the parameters used for each of the six key architecture areas that were previously discussed.

5.4.1. 2009 IJCAI Computer Poker Competition
We begin with the results of the 2009 ACPC, held at the International Joint Conference on Artificial Intelligence. Here, we submitted our case-based agent, Sartre, for the first time, to challenge other computerised agents submitted from all over the world. The following architecture snapshot depicts the details of the submitted agent:

1. Feature Representation
   (a) Hand Strength – categories
   (b) Betting Sequence – string
   (c) Board Texture – categories
2. Similarity Assessment – all-or-nothing
3. Solution Representation – single
4. Case Retrieval – variable k
5. Re-Use Policy – max-frequency
6. System Training – Hyperborean-08

The architecture snapshot above represents a baseline strategy where maintenance had yet to be performed. Each of the entries listed above corresponds to one of the six key architecture areas introduced in Section 5.2. Notice that trivial all-or-nothing similarity was employed along with a single action solution representation, which resulted in a redundant case-base. The value for system training refers to the original expert whose decisions were used to train the system.

The final results are displayed in Table 6. The competition consisted of two winner determination methods: bankroll instant run-off and total bankroll. Each agent played between 75 and 120 duplicate matches against every other agent in order to obtain the average values displayed. Each match consisted of 3000 duplicate hands. The values presented are the number of small bets per hand won or lost. Our case-based agent, Sartre, achieved a 7th place finish in the instant run-off division and a 6th place finish in the total bankroll division.

5.4.2. 2010 AAAI Computer Poker Competition
Following the maintenance experiments presented in Section 5.2, an updated case-based strategy was submitted to the 2010 ACPC, held at the Twenty-Fourth AAAI Conference on Artificial Intelligence. Our entry, once again named Sartre, used the following architecture snapshot:

1. Feature Representation
   (a) Hand Strength – 50 buckets E[HS²]
   (b) Betting Sequence – string
   (c) Board Texture – categories
2. Similarity Assessment
   (a) Hand Strength – Euclidean
   (b) Betting Sequence – custom
   (c) Board Texture – matrix
3. Solution Representation – vector
4. Case Retrieval – k = 1
5. Re-Use Policy – probabilistic
6. System Training – MANZANA

Here a vector valued solution representation was used together with improved similarity assessment. Given the updated solution representation, a single nearest neighbour, k = 1, was retrieved via the k-NN algorithm. A probabilistic solution re-use policy was employed and the system was trained on the decisions of the winner of the 2009 total bankroll division. The final results are presented in Table 7. Once again two winner determination divisions are presented and the values are depicted in small bets per hand with 95% confidence intervals. Given the improvements, Sartre was able to achieve a 6th place finish in the runoff division and a 3rd place finish in the total bankroll division.

5.4.3. 2011 AAAI Computer Poker Competition
The 2011 ACPC was held at the Twenty-Fifth AAAI Conference on Artificial Intelligence. Our entry to the competition is represented by the following architecture snapshot:

1. Feature Representation
   (a) Hand Strength – 50 buckets E[HS²]
   (b) Betting Sequence – string
   (c) Board Texture – categories
2. Similarity Assessment
   (a) Hand Strength – Euclidean
   (b) Betting Sequence – custom
   (c) Board Texture – matrix