Particle Filtering for Dynamic Agent Modelling in Simplified Poker
Nolan Bard and Michael Bowling
Department of Computing Science, University of Alberta
Edmonton, Alberta, Canada T6G 2E8
{nolan,bowling}@cs.ualberta.ca

Abstract

Agent modelling is a challenging problem in many modern artificial intelligence applications. The agent modelling task is especially difficult when handling stochastic choices, deliberately hidden information, dynamic agents, and the need for fast learning. State estimation techniques, such as Kalman filtering and particle filtering, have addressed many of these challenges, but have received little attention in the agent modelling literature. This paper looks at the use of particle filtering for modelling a dynamic opponent in Kuhn poker, a simplified version of Texas Hold'em poker. We demonstrate effective modelling both against static opponents as well as dynamic opponents, when the dynamics are known. We then examine an application of Rao-Blackwellized particle filtering for doing dual estimation, inferring both the opponent's state as well as a model of its dynamics. Finally, we examine the robustness of the approach to incorrect beliefs about the opponent and compare it to previous work on opponent modelling in Kuhn poker.

Introduction

Agent modelling is the problem of building a predictive model of another agent's future decisions from, possibly incomplete, observations of past behavior. The rise of applications requiring interaction between independent agents — human, robotic, and software — has made coping with the presence of other decision makers a key challenge for artificial intelligence. Agent modelling is one approach to this challenge. With accurate predictive models of the other agents, an effective response can be planned and executed. Applications as diverse as assistive technologies, autonomous driving, electronic commerce, and interactive entertainment all require or would benefit significantly from accurate models of other agents, artificial or human.

There are many factors that complicate the agent modelling task. Models often need to be constructed from very few observations. The environment may involve stochastic events or hidden information only known to the agent being modelled. The agent itself may, in fact, make stochastic choices. And finally, the agent may be dynamic, changing their behavior throughout the interaction. For example, a long-term customer's product preferences can drift over time, a web visitor's goals can change while browsing a site, or a game opponent may switch strategies during play.

In this work, we approach the agent modelling problem as a specific instance of the general problem of state estimation. State estimation involves tracking a stochastic process's hidden state variables by observing noisy functions of these variables. Such techniques represent uncertainty in the hidden state as a probability distribution and use a recursive Bayes' formula to update the belief with each new observation. Although these techniques require a complete probabilistic model of the process dynamics and observations, techniques for dual estimation have relaxed this requirement by inferring these models from data. State estimation seems to be well-suited to the agent modelling task, and yet it has received little attention in the agent modelling literature.

Poker is a challenging testbed for AI research that embodies all of the agent modelling challenges. The environment and often the players are stochastic. Each player has their own private information which may never be revealed. In addition, modelling is expected to play a key role in the eventual development of world champion poker-playing programs.[1] Dynamic behavior is also an integral part of expert poker. Exploiting one's "table image" by "switching gears" is a common tactic, which involves drastically changing play style. This makes poker an ideal domain for examining the problem of dynamic agent modelling. In particular, we will focus on a simplified version called Kuhn poker.

[1] Many of the competitive poker playing programs today employ a game-theoretic approach without any opponent modelling (Billings et al. 2003; Gilpin & Sandholm 2006). These approaches, though, by failing to identify and exploit their opponents' weaknesses, are unlikely to best the world's top human players. This was also the observation of a world class player after evaluating one of these game theoretic programs: "You have a very strong program. Once you add opponent modelling to it, it will kill everyone." (Billings et al. 2004)

This paper examines the use of the state estimation technique, particle filtering, along with an extension to dual estimation, for the problem of agent modelling in poker. We begin with background on agent modelling and state estimation. Next, we introduce Kuhn poker, and describe our application of particle filtering to this domain. We then show results of our approach on a broad range of scenarios, including when the opponent's dynamics are known, inferred, and even wrong. These results illustrate both the effectiveness and robustness of the technique. Finally, we conclude.
Background

This work brings ideas from state estimation to the problem of agent modelling. We briefly review past work in agent modelling, before describing the necessary foundational concepts from state estimation.

Agent Modelling

Agent modelling has been explored in the literature under a number of different guises. Early work, which continues today, focused on plan recognition (e.g., Kautz 1991) as the inverse of the traditional planning problem: recovering a goal concept from a sequence of actions in a symbolic planning language. If the agent's goals can be deduced, one can then predict the agent's actions in novel situations. This inversion of planning has also been attempted for stochastic domains as the problem of inverse reinforcement learning (Ng & Russell 2000), where a reward function is identified from samples of the agent's policy. These techniques, though, generally require complete knowledge of the rest of the system, restricting them to situations of modelling a single agent making deterministic choices in a fully observable environment.

Behavior classification (Han & Veloso 1999) and policy recognition (Bui, Venkatesh, & West 2002) try to identify behavior rather than goals. They were proposed for partially observable, stochastic domains and take a more probabilistic approach that can even account for an agent making stochastic decisions. However, they only recognize or classify observed behavior into a small discrete set of target behaviors that must be specified in advance. Also, these approaches (along with plan recognition and inverse reinforcement learning) assume the agent's behavior is static, unchanging throughout the interaction.

Agent modelling has also been investigated in the realm of game theory, particularly in the context of small strategic form games such as prisoner's dilemma. For example, Carmel and Markovitch (1996) proposed a technique for modelling an "opponent" as a deterministic finite automaton, which can then be used for planning an optimal response. Although the approach can learn models of dynamic agents, it is not obvious how it can be extended to stochastic choices, private information, or larger games.

Previous work on agent modelling in Texas Hold'em poker has focused on techniques for modelling static agents. Simple frequentist opponent modelling systems have been used to modify the distribution over the opponent's hidden cards based on their past behaviour (Billings et al. 1998). Artificial neural networks were used to learn opponent models offline from previous game data. The models were then used for choosing salient features to augment previous frequentist modelling (Davidson et al. 2000). Vexbot, the foremost opponent modelling poker program, incorporates frequentist statistics into an imperfect information game-tree search (Billings et al. 2004). These techniques, however, require considerable data to be effective. Recent work by Hoehn and colleagues (Hoehn et al. 2005) focused on short-term modelling in Kuhn poker. We will discuss this work in more detail below, along with a comparison of results. All of this previous work in poker opponent modelling, however, is focused on modelling static opponents, whereas our work explicitly models dynamic behavior.

State estimation has been wildly successful in a broad range of applications. Although seemingly well-suited to the challenges of agent modelling, it has received very little attention in this area. Bererton (2004) proposed the use of particle filtering in commercial computer games, but this was to imbue computer players with more realistic beliefs about enemy positions. We propose to use state estimation techniques to infer an agent's subjective state, i.e., its behavior, rather than any objective quantities.

State Estimation

State estimation is the problem of determining the current state of a system given a sequence of observations. This is done by representing the uncertainty in the current state as a probability distribution and using Bayes' rule to update the belief after every observation.

Formally, let x_t be the state vector, and z_t be the observation vector at time t. Define x_{1:t} to be the state sequence x_1, ..., x_t and similarly for the observation sequence z_{1:t}. Then the state estimation problem is concerned with estimating x_T | z_{1:T}, or rather the complete distribution Pr(x_T | z_{1:T}). By applying Bayes' rule, simple arithmetic, and assuming that x_t is a sufficient statistic for the events up to time t (i.e., the Markov assumption), we arrive at the standard recursive Bayesian filtering equation,

    Pr(x_t | z_{1:t}) = η Pr(z_t | x_t) ∫ Pr(x_t | x_{t-1}) Pr(x_{t-1} | z_{1:(t-1)}) dx_{t-1},

where η is a normalization constant. Given our previous belief Pr(x_{t-1} | z_{1:(t-1)}), we can use this equation to find our new belief after the latest observation. The equation requires an observation model Pr(z_t | x_t) and a motion model[2] Pr(x_t | x_{t-1}). Lastly, for a practical implementation the form of the belief distribution Pr(x_t | z_{1:t}) needs to allow the integral in the Bayesian filtering equation to be computed easily. For example, a Gaussian form results in a Kalman filter and its assorted variants. A Monte Carlo approximation results in a particle filter, which is the approach taken in this paper.

[2] Often the motion model will include other terms such as a control input signal u_t. Since our application does not involve any additional signals we exclude them from our presentation.

Particle Filters. Particle filters are a Monte Carlo approach to Bayesian filtering. Particle filters approximate the probability distribution over the state using a set of samples called particles. Each particle is a state vector, which we denote as x_t^(i). Particle filters are flexible and powerful. They can handle non-linear dynamics while representing arbitrary belief distributions over the state variables. The accuracy and computational cost of a particle filter scales with the number of particles used.
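The filtering equation is easy to exercise directly on a small discretized state space. The sketch below is our own illustration, not code from the paper; the motion and observation models are arbitrary placeholders, and the final normalization plays the role of η.

```python
import numpy as np

def bayes_filter_step(belief, transition, likelihood):
    """One recursive Bayesian filtering update on a discretized state space.

    belief:     Pr(x_{t-1} | z_{1:t-1}) as a vector over discrete states
    transition: transition[i, j] = Pr(x_t = j | x_{t-1} = i), the motion model
    likelihood: likelihood[j] = Pr(z_t | x_t = j), the observation model
    """
    predicted = belief @ transition           # the integral over x_{t-1}
    unnormalized = likelihood * predicted     # multiply by Pr(z_t | x_t)
    return unnormalized / unnormalized.sum()  # normalization constant eta

# Toy usage: three states, sticky dynamics, and one observation.
belief = np.array([0.4, 0.3, 0.3])
transition = np.array([[0.8, 0.1, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.1, 0.1, 0.8]])
likelihood = np.array([0.2, 0.5, 0.3])
print(bayes_filter_step(belief, transition, likelihood))
```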
The particle filter update is based on importance sampling. Given a particle x_{t-1}^(i) approximately sampled from Pr(x_{t-1} | z_{1:(t-1)}), we want a new particle x_t^(i) approximately sampled from Pr(x_t | z_{1:t}). We do this by sampling x̃_t^(i) from the candidate distribution Pr(x_t | z_{1:(t-1)}) and weighting it by the importance sampling correction Pr(z_t | x̃_t^(i)). Sampling from the candidate distribution involves sampling from the motion model Pr(x_t | x_{t-1}^(i)). The weighting comes from the observation model. In order to turn the samples from the candidate distribution into samples from our desired distribution we then select n particles, with replacement, randomly in proportion to their weights.

Dual Estimation. Here we consider a simple case of dual estimation where the system's dynamics are parameterized by some unknown value θ, which we need to simultaneously infer along with the system state (Doucet, de Freitas, & Gordon 2000; Storvik 2002). We can do this through an application of Rao-Blackwellized particle filters (RBPFs), a hybrid state estimation technique that allows a portion of the state variables to be modelled in a Monte Carlo fashion while others are modelled in a parametric form. For each particle, we additionally store a sufficient statistic s_t^(i) for computing Pr(θ | x_{1:t}^(i)). Sampling from the candidate distribution now involves sampling θ̃ from Pr(θ | s_{t-1}^(i)) and then sampling x̃_t^(i) from Pr(x_t | x_{t-1}^(i), θ̃). The sufficient statistic for each candidate particle is updated for the new transition, s̃_t^(i) = UPDATE(s_{t-1}^(i), x_{t-1}^(i) → x̃_t^(i)). The weight from the observation model and resampling is performed in the usual fashion. As long as we choose Pr(θ) carefully with an appropriate sufficient statistic this adds very little additional running time to the basic particle filtering algorithm.
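The following sketch of one such update step is our own illustration rather than code from the paper. The motion sampler, observation likelihood, sufficient-statistic update, and per-particle θ sampler are supplied by the caller, so the same loop covers both the plain particle filter (fix θ and make the statistic update a no-op) and the Rao-Blackwellized dual-estimation variant just described.

```python
import numpy as np

def rbpf_step(particles, stats, z,
              sample_theta, sample_motion, obs_likelihood, update_stat, rng):
    """One Rao-Blackwellized particle filter update.

    particles: list of state vectors x_{t-1}^(i)
    stats:     per-particle sufficient statistics s_{t-1}^(i) for Pr(theta | x_{1:t-1}^(i))
    z:         the latest observation z_t
    """
    candidates, new_stats, weights = [], [], []
    for x_prev, s_prev in zip(particles, stats):
        theta = sample_theta(s_prev, rng)          # theta ~ Pr(theta | s_{t-1}^(i))
        x_new = sample_motion(x_prev, theta, rng)  # x~_t ~ Pr(x_t | x_{t-1}^(i), theta)
        candidates.append(x_new)
        new_stats.append(update_stat(s_prev, x_prev, x_new))
        weights.append(obs_likelihood(z, x_new))   # importance weight Pr(z_t | x~_t)

    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    # Resample n particles, with replacement, in proportion to their weights.
    idx = rng.choice(len(candidates), size=len(candidates), p=weights)
    return [candidates[i] for i in idx], [new_stats[i] for i in idx]
```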
Kuhn Poker

Kuhn poker is a tiny, toy variant of poker for which a complete game theoretic analysis exists (Kuhn 1950). The game involves two players; two actions, bet and pass; and a three card deck, containing a Jack (J), Queen (Q) and King (K). Each player is dealt one card privately. The first player may then either bet or pass. If the first player bets, the second player may either bet, causing a showdown, or pass, to fold. If the first player passes, the second player can also pass, causing a showdown, or bet, forcing the first player to make a final decision of either bet, for a showdown, or pass, for a fold. In the case of a fold, the non-folding player wins one dollar from the folding player. In the case of a showdown, the player with the higher card (King is high, Jack is low) wins one dollar if neither bet, or two dollars if both players bet.

With its introduction, Kuhn also presented a complete game theoretic analysis of the game. Although there are 64 different pure strategies for each player, many of these are dominated, i.e., another strategy has a higher expected value against every possible strategy of the opponent. After eliminating dominated strategies, the undominated strategy space of player one can be parameterized by three parameters (α, β, γ), and player two by two parameters (η, ξ). These parameters are all in the range [0, 1] and specify the probability of betting in certain information sets. For example, η is the probability the second player bets when facing a bet while holding the Queen, and ξ is the probability the second player bets after a pass when holding the Jack. The game has a whole continuum of Nash equilibria, which can be written in this parameterization as α = γ/3, β = (1 + γ)/3, and η = ξ = 1/3. The value of the equilibrium is −1/18 dollars per game. In other words, if either player plays an equilibrium strategy then player one will lose 5.5 cents per game on average.

Although the equilibrium strategy guarantees a minimum payoff regardless of the opponent's strategy, greater payoffs may still be possible. For example, suppose player two is choosing randomly among undominated actions (i.e., η = ξ = 0.5). If player one persists with an equilibrium strategy, the loss of −1/18 per game is unchanged. If player one instead responds by passing in the first round and betting with a King or Queen when bet to (i.e., the best response), it is no longer a losing game, with the expected payoff being zero. Other deviations from the equilibrium strategy can be exploited by even more. It is clear that an accurate model of the opponent's strategy can be used to great advantage.

Kuhn poker is an ideal domain for our investigation. It has most of the strategic properties found in the real world variations of poker, yet a game theoretic analysis of the game is tractable and exists. For instance, Kuhn preserves the ability to trap (i.e., acting as though your hand is weaker than it truly is) and bluff (i.e., acting as though your hand is stronger than it truly is). In Kuhn we bluff by betting with a Jack and trap by passing with a King. In addition, players' strategies have a natural parameterization, and best-response is a straightforward computation. Since these features are all active research directions in full versions of poker, Kuhn poker offers a clean domain for evaluating ideas. It has also already been a testbed for modelling research. Hoehn and colleagues (Hoehn et al. 2005) examined two modelling techniques for exploitation of static opponents. We will return to their results later.

Application to Kuhn Poker

We now describe how we adapt the particle filtering technique to Kuhn poker. To simplify things, we will restrict ourselves to the situation of player one modelling player two. To use particle filtering, we need to define five components: the state variables, the observations, an observation model, a motion model, and an initial belief.

For the state variables, one natural choice is simply Kuhn's own parameterization of the strategy space. This choice involves two assumptions: (i) our opponent will not play outside the parameterization (i.e., does not play dominated strategies), and (ii) our opponent's future strategies are conditionally independent of past strategies given their current strategy (i.e., the Markov property). For our situation of modelling player two, this means the state vector x_t is two-dimensional with the values η_t and ξ_t.
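To make the parameterization concrete, the sketch below gives player two's betting probability at each of its information sets as a function of the state (η, ξ). It is our own illustration: the forced choices at the remaining information sets reflect our reading of the dominated-action eliminations and are not quoted from the paper.

```python
def p2_bet_probability(card, p1_action, eta, xi):
    """Probability that player two bets, under the (eta, xi) parameterization.

    card:      player two's card, one of 'J', 'Q', 'K'
    p1_action: player one's first action, 'bet' or 'pass'
    Assumed dominated-action eliminations (our assumption, not stated in the text):
    always call a bet with the King, always fold the Jack to a bet,
    always bet the King after a pass, never bet the Queen after a pass.
    """
    if p1_action == 'bet':
        return {'K': 1.0, 'Q': eta, 'J': 0.0}[card]
    else:  # player one passed
        return {'K': 1.0, 'Q': 0.0, 'J': xi}[card]

# The equilibrium strategies for player two set eta = xi = 1/3.
assert abs(p2_bet_probability('Q', 'bet', 1/3, 1/3) - 1/3) < 1e-12
```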
As the particle filter is being updated after each hand, the observation z_t naturally consists of any known cards and the betting sequence for that hand. The observation model Pr(z_t | η_t, ξ_t) comes from the definition of the game itself and is a straightforward computation. With folded hands we are still able to compute the likelihood by marginalizing over the unknown cards, which with our small deck is a simple operation. An encoding of these probabilities can be easily worked out by hand due to the small size of Kuhn and its simple strategy parameterization.

The motion model must encode our belief about how our opponents will change their strategies over time. In this paper we will explore two naive types of motion models. The first model assumes that players will change with probability ρ to a random strategy after every hand. With probability (1 − ρ) they continue using their previous strategy. We call this a switching model. The second model assumes that players' strategies drift after each hand. The player generates their next strategy by sampling from a spherical Gaussian distribution with a mean of the current strategy and variance σ², restricting it to the unit square, and re-normalizing appropriately. We also refer to a combined motion model which involves both the ρ and σ parameters. In this model, the player switches to a uniform random strategy at the end of the hand with probability ρ and with probability (1 − ρ) it drifts with variance σ².

Finally, we use a uniform random distribution over the state space as our initial belief Pr(x_0).
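A small sketch of these motion models for the two-dimensional state (η_t, ξ_t) follows; this is our own illustration, and the rejection loop is one plausible way to realize the paper's truncation of the Gaussian to the unit square.

```python
import numpy as np

def switching_step(x, rho, rng):
    """With probability rho jump to a uniformly random strategy, else stay put."""
    if rng.random() < rho:
        return rng.uniform(0.0, 1.0, size=x.shape)
    return x.copy()

def drift_step(x, sigma, rng):
    """Spherical Gaussian drift, truncated to the unit square by rejection."""
    while True:
        x_new = x + rng.normal(0.0, sigma, size=x.shape)
        if np.all((x_new >= 0.0) & (x_new <= 1.0)):
            return x_new

def combined_step(x, rho, sigma, rng):
    """Switch with probability rho, otherwise drift."""
    if rng.random() < rho:
        return rng.uniform(0.0, 1.0, size=x.shape)
    return drift_step(x, sigma, rng)

rng = np.random.default_rng(0)
x = np.array([0.5, 0.5])  # current (eta_t, xi_t)
print(combined_step(x, rho=0.01, sigma=0.02, rng=rng))
```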
Dual Estimation. As we have chosen motion models that are parameterized, we can employ an RBPF to infer the motion model parameter, either ρ or σ. The only detail is specifying a prior on ρ and σ such that the posterior can be summarized with a sufficient statistic. For ρ we can model the prior and posterior as a beta distribution Beta(α, β) where α and β are the sufficient statistics. These are updated easily on observing a transition: if x_{t-1} = x_t increment β, otherwise increment α. For σ² we model the prior and posterior as an inverse-gamma distribution Inv-Gamma(v, w) where v and w are the sufficient statistics. Because we use a spherical Gaussian, transitions are treated as one observation of σ² per dimension. On observing a transition the update adds ||x_t − x_{t-1}||²/2 to w and d/2 to v, where d is the number of dimensions in the state vector (two, in our case).

Finally, we need to choose a prior belief for ρ or σ, depending on the motion model in use. For the experiments in the next section we fairly arbitrarily chose Pr(ρ) = Beta(1, 30) and Pr(σ²) = Inv-Gamma(0.6, 0.00005).

Using Our Model. One final consideration is what to do with our belief about the opponent's strategy. The correct Bayesian approach is to select the action that maximizes the sum of all future expected utility given our belief. This is, in general, an intractable computation. Instead, we opt for the simpler greedy response to maximize expected utility for the current hand given our belief. Because of the linearity of the expected value in our chosen state variables, the best-response to our particle posterior is just the best-response to the mean of the particles, which can be computed quickly. We will simply play this best-response to our posterior distribution on every hand.
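The sufficient-statistic bookkeeping described above is compact enough to show directly. This is a sketch of one plausible implementation (ours, not the authors' code), with the inverse-gamma draw obtained by inverting a gamma sample.

```python
import numpy as np

def update_switch_stat(alpha, beta, x_prev, x_new):
    """Beta(alpha, beta) posterior over rho: count stays versus switches."""
    if np.array_equal(x_prev, x_new):
        return alpha, beta + 1  # no switch observed
    return alpha + 1, beta      # switch observed

def update_drift_stat(v, w, x_prev, x_new, d=2):
    """Inv-Gamma(v, w) posterior over sigma^2 for a spherical Gaussian drift."""
    return v + d / 2.0, w + np.sum((x_new - x_prev) ** 2) / 2.0

def sample_rho(alpha, beta, rng):
    return rng.beta(alpha, beta)

def sample_sigma2(v, w, rng):
    # If X ~ Gamma(shape=v, rate=w) then 1/X ~ Inv-Gamma(v, w).
    return 1.0 / rng.gamma(v, 1.0 / w)

rng = np.random.default_rng(0)
alpha, beta = 1.0, 30.0   # the paper's prior Pr(rho) = Beta(1, 30)
v, w = 0.6, 0.00005       # the paper's prior Pr(sigma^2) = Inv-Gamma(0.6, 0.00005)
print(sample_rho(alpha, beta, rng), sample_sigma2(v, w, rng))
```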
Results

We now evaluate our various modelling approaches against a variety of opponents, both static and dynamic, and when our prior beliefs are both correct and incorrect. This will help us to understand both the effectiveness and robustness of the approach. We begin by briefly examining static opponents and then move on to dynamic opponents.

Static Opponents

Hoehn and colleagues' previous work on opponent modelling in Kuhn poker focused on static opponents (Hoehn et al. 2005; Hoehn 2006). We begin by replicating one of their experiments with our particle filtering approach. They examined two modelling approaches: one using explicit parameter estimation with similar Bayesian roots as our approach and an implicit modelling technique similar to an "experts" algorithm. For 200 hand matches, they found the parameter estimation approach with a fixed 50 hand exploration period worked the best. We played a particle filter modeller using a stationary motion model (since all of the opponents were stationary) against their six static opponents. Table 1 shows the results. The particle filter approach is competitive (performing better against some opponents and worse against others) with Hoehn's parameter estimation approach. However, our approach was specifically designed to handle dynamic agents, so we now move on to those opponents.

Table 1: Total expected winnings of Hoehn's parameter estimation algorithm compared to our particle filter with a stationary motion model over 200 hands.

  Opponent             Particle Filter   Parameter Estimation
  O1 = (0.8, 0.29)           4.3                 -9.0
  O2 = (0.75, 0.8)          18.7                  9.0
  O3 = (0.67, 0.4)          -2.7                 -9.0
  O4 = (0.17, 0.2)           2.5                  3.0
  O5 = (0.25, 0.17)         -1.3                 -2.5
  O6 = (0.25, 0.67)         10.6                 16.0

Dynamic Opponents

We compared various incarnations of the particle filter against several simple dynamic agents. The simple agents employ one of our two motion strategies described above: either switching randomly after each hand or moving according to the truncated spherical Gaussian drift after each hand. The opponents use the same parameterization as the model, with ρ specifying the switching probability for the switching opponents and σ the drift standard deviation for the drift opponents. Nine specific opponents (4 switching, 4 drifting, and 1 stationary) were chosen, with the specific values of ρ and σ given in Table 2. These values were chosen to cover opponents that moved both slowly and quickly.

Table 2: Opponent Legend

  Opponent   Switch (ρ)   Drift (σ)
  A            0.0          0.0
  B            0.002        0.005
  C            0.01         0.02
  D            0.05         0.05
  E            0.1          0.1

We played various incarnations of particle filtering against each of our nine opponents for 1000 hand matches. These matches were repeated 5000 times for statistical confidence. Recall that since these experiments involve player one modelling player two, this is actually a "losing" situation for the modelling player. If the modelling player instead played any equilibrium strategy it would lose over $55 on average over 1000 hands. A less naive alternative would be to exploit the fact that all of the described opponents are effectively random. Without considering the knowledge of past play, they will make uniformly random decisions between any undominated choices on every round. The best-response to such random play is an alternative static strategy that gains and loses nothing on average against any of our opponents over 1000 hands. This makes zero a good baseline that effective opponent modelling should rise above.

To provide further context to our results, we display two exploitability lines for each opponent. The top line in each graph is the expected exploitability if we knew our opponent's strategy on every hand. This is the upper bound on any agent's expected performance. The bottom line is the expected exploitability if we knew our opponent's strategy on their previous hand and their motion model. This baseline is a more practical goal than the first exploitability line since it is more sensitive to how quickly the particular opponent's strategy is changing. Both lines are, however, only loose upper bounds on possible performance, as it takes many hands of Kuhn poker to estimate even a static opponent's strategy parameters.

[Figure 1 (bar charts omitted): Total winnings of vanilla particle filter modelling against different switching opponents (left) and drifting opponents (right). Each bar represents a different setting of ρ (left) or σ (right) in the filter's motion model. The y-axis is total winnings ($ over 1000 hands); the x-axis is opponents A-E from Table 2, plus their average.]

Figure 1 gives a summary of the results of applying various incarnations of particle filtering to these opponents. The left graph shows the results for the switching opponents and the right graph shows the results for the drift opponents. In both graphs, the x-axis corresponds to the different opponents from Table 2. For each opponent the set of bars shows the average total winnings of particle filtering when using the correct form of the motion model, but varying the parameter in that model (i.e., ρ or σ). The parameters took on the same set of values as the opponents and so each bar is labeled with the corresponding opponent's identifier from Table 2. We dissect this figure further below.

Known Motion Model. Consider the case when the opponent's motion model is known and correct. Specifically, we know whether the opponent is a switching or drifting player as well as the exact probability of switching or the average amount of drift. This corresponds to the bars of Figure 1 that match the opponent being played on the x-axis. Although the winnings vary depending on the opponent, for all nine opponents the correct model outperformed the baseline static strategy which nets zero.
Realistically, knowing the exact parameterization of another agent's motion is unlikely. It is interesting to consider how the approach performs if its model is wrong. This case corresponds to the remaining bars in Figure 1. Naturally, performance is higher when the opponent's motion parameter is close to the model's parameter. The fall off for an incorrect model, though, can be quite drastic, sometimes dropping lower than our simple static baseline performance. For example, when playing switching opponent D, if we incorrectly believed the opponent was stationary (model A) we would actually lose money on the match.[3]

[3] Although not shown, the loss is almost $22 for 1000 hands. This is our worst case result, which is still considerably higher than the equilibrium value of the game of -$55.

It is interesting to note that when playing against a static opponent (opponent A) it can be beneficial to model them as if they moved with low probability or a slight drift (model B). This phenomenon is common for particle filtering and is a result of the bias that comes from using a finite number of particles (Liu & West 2000).

Unknown Motion Parameters. A more realistic setting is if the opponent's motion parameters are not known. One simple approach is to choose a fixed motion parameter that works well even when incorrect. Consulting Figure 1, model C for switching and drifting opponents has the highest average performance across the five opponents (see the bars on the far right of the graph).

Alternatively, we could use an RBPF to infer this parameter of the motion model, as described in the previous section. The results of this approach for both switching and drifting opponents are shown in Figure 2. For each opponent we show a bar for the best static model from the previous results (model C) as well as three RBPF modellers. Rs corresponds to using our prior over switching probabilities (assuming no drift), Rd corresponds to using our prior over drifting amounts (assuming no switching), and Rc corresponds to an independent prior over both, forming a combined model.

[Figure 2 (bar charts omitted): Total winnings of RBPF modelling against different switching opponents (left) and drifting opponents (right). The bars labelled "C" come from Figure 1. The y-axis is total winnings ($ over 1000 hands); the x-axis is opponents A-E from Table 2, plus their average, with bars for models C, Rs, Rd, and Rc.]

Examining model C and Rs in the left-hand graph of Figure 2, we see that against most opponents Rs performs better than the static model and never performs significantly worse. Against highly switching opponent E it actually performs quite a bit better than the static model as it can infer that its opponent is switching rapidly. On average, although the improvement is not dramatic, it still suggests that the inference is providing some benefit. We see a similar result against drifting opponents (comparing model C and Rd).

Taking matters one step further, we are not likely to have even the form of our opponent's motion correct. We can observe the effect of this error in Figure 2. Consider model Rd in the left-hand graph and model Rs in the right-hand graph. This shows the result of using a completely different form to model our opponent's motion. Although the model performs worse than the correct model,[4] the loss is never more than $9 over 1000 hands. This is relatively small considering the incorrect beliefs. More importantly, the average performance using our dual estimation with the wrong form is very similar to model C — the best average static model with the correct form from Figure 1. Hence, dual estimation may even compensate for using the wrong form of the motion model.

[4] Note that Rd actually does worse than Rs at modelling drift opponents with a high drift standard deviation. This is likely due to the fact that for our choice of prior, the switching model actually assigns a higher likelihood than the drift model to large strategy changes.

Unknown Model. If even the form of the model is unknown, but a small set of possible forms can be enumerated, we can still employ dual estimation. Rc in Figure 2 shows the results of a combined prior. The performance is reasonably competitive with the correct model against all opponents, suggesting that even this lack of knowledge can be accounted for with dual estimation.
Conclusion

In this paper we advocated the use of state estimation techniques as a promising approach for modelling agents with dynamic behavior. We described the application of particle filtering to the domain of Kuhn poker, and showed that it could effectively model and exploit both static and dynamic opponents. We also explored the use of a dual estimation technique for estimating both the opponent's strategy and dynamics model, demonstrating significant additional robustness to incorrect beliefs.

Although this work has focused on the small domain of Kuhn poker, we believe the techniques can be effectively applied to much larger domains. Particle filters can be used effectively in a large domain as long as agent behavior can be represented with a relatively small number of parameters and the motion and observation models can be computed efficiently. Although the size of the Texas Hold'em game tree is intractably large, we are investigating parameterizations of the full game that might be sufficient for describing and exploiting a substantial range of human play. In addition, it has already been shown that observation models can be computed even for the full game of Texas Hold'em in real time (Southey et al. 2005).

In addition to applying these ideas to the full game, we are also planning on investigating more interesting forms of agent dynamics. Strong human players do not adapt their play in a random fashion, nor are they oblivious to recent play. In the future we hope to investigate the effectiveness and robustness of such dynamics models that may more accurately capture likely human dynamics.

Acknowledgments

We would like to thank Rob Holte and Bret Hoehn for their valuable insights, along with all of the members of the University of Alberta Computer Poker Research Group. This research was supported by NSERC, iCore, and Alberta Ingenuity through the Alberta Ingenuity Centre for Machine Learning.

References

Bererton, C. 2004. State estimation for game AI using particle filters. In AAAI Workshop on Challenges in Game AI.

Billings, D.; Papp, D.; Schaeffer, J.; and Szafron, D. 1998. Opponent modeling in poker. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 493–498. Madison, WI: AAAI Press.

Billings, D.; Burch, N.; Davidson, A.; Holte, R.; Schaeffer, J.; Schauenberg, T.; and Szafron, D. 2003. Approximating game-theoretic optimal strategies for full-scale poker. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence.

Billings, D.; Davidson, A.; Schauenberg, T.; Burch, N.; Bowling, M.; Holte, R.; Schaeffer, J.; and Szafron, D. 2004. Game tree search with adaptation in stochastic imperfect information games. In Computers and Games.

Bui, H.; Venkatesh, S.; and West, G. 2002. Policy recognition in the abstract hidden Markov model. Journal of Artificial Intelligence Research 17:451–499.

Carmel, D., and Markovitch, S. 1996. Learning models of intelligent agents. In Proceedings of the Thirteenth National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press.

Davidson, A.; Billings, D.; Schaeffer, J.; and Szafron, D. 2000. Improved opponent modeling in poker. In Proceedings of the 2000 International Conference on Artificial Intelligence.

Doucet, A.; de Freitas, N.; and Gordon, N., eds. 2000. Sequential Monte Carlo Methods in Practice. New York: Springer-Verlag.

Gilpin, A., and Sandholm, T. 2006. A competitive Texas Hold'em poker player via automated abstraction and real-time equilibrium computation. In Proceedings of the Twenty-First National Conference on Artificial Intelligence.

Han, K., and Veloso, M. 1999. Automated robot behavior recognition applied to robotic soccer. In Proceedings of the International Symposium of Robotics Research, 199–204.

Hoehn, B.; Southey, F.; Holte, R. C.; and Bulitko, V. 2005. Effective short-term opponent exploitation in simplified poker. In Proceedings of the Twentieth National Conference on Artificial Intelligence, 783–788.

Hoehn, B. 2006. The Effectiveness of Opponent Modelling in a Small Imperfect Information Game. Ph.D. Dissertation, University of Alberta.

Kautz, H. A. 1991. A formal theory of plan recognition and its implementation. In Allen, J. F.; Kautz, H. A.; Pelavin, R.; and Tenenberg, J., eds., Reasoning About Plans. Morgan Kaufmann Publishers. 69–125.

Kuhn, H. W. 1950. A simplified two-person poker. Contributions to the Theory of Games 1:97–103.

Liu, J., and West, M. 2000. Combined parameter and state estimation in simulation-based filtering. In Doucet, A.; de Freitas, N.; and Gordon, N., eds., Sequential Monte Carlo Methods in Practice. New York: Springer-Verlag.

Ng, A. Y., and Russell, S. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, 663–670.

Southey, F.; Bowling, M.; Larson, B.; Piccione, C.; Burch, N.; Billings, D.; and Rayner, C. 2005. Bayes' bluff: Opponent modelling in poker. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, 550–558.

Storvik, G. 2002. Particle filters for state-space models with the presence of unknown static parameters. IEEE Transactions on Signal Processing 50:281–289.