Winning Isn't Everything: Enhancing Game Development with Intelligent Agents
Yunqi Zhao,* Igor Borovikov,* Fernando de Mesentier Silva,* Ahmad Beirami,* Jason Rupert, Caedmon Somers, Jesse Harder, John Kolen, Jervis Pinto, Reza Pourabolghasem, James Pestrak, Harold Chaput, Mohsen Sardari, Long Lin, Sundeep Narravula, Navid Aghdaie, and Kazi Zaman

arXiv:1903.10545v2 [cs.AI] 20 Aug 2019

Abstract—Recently, there have been several high-profile achievements of agents learning to play games against humans and beat them. In this paper, we study the problem of training intelligent agents in service of game development. Unlike the agents built to "beat the game", our agents aim to produce human-like behavior to help with game evaluation and balancing. We discuss two fundamental metrics based on which we measure the human-likeness of agents, namely skill and style, which are multi-faceted concepts with practical implications outlined in this paper. We discuss how this framework applies to multiple games under development at Electronic Arts, followed by some of the lessons learned.

Index Terms—Artificial intelligence; playtesting; non-player character (NPC); multi-agent learning; reinforcement learning; imitation learning; deep learning; A* search.

This paper was presented in part at the 14th AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE 18) [1], the NeurIPS 2018 Workshop on Reinforcement Learning under Partial Observability [2], [3], the AAAI 2019 Workshop on Reinforcement Learning in Games (Honolulu, HI) [4], the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering [5], the ICML 2019 Workshop on Human in the Loop Learning [6], the ICML 2019 Workshop on Imitation, Intent, and Interaction [7], the 23rd Annual Signal and Image Sciences Workshop at Lawrence Livermore National Laboratory, 2019 [8], and on GitHub, 2019 [9].
J. Rupert and C. Somers are with EA Sports, Electronic Arts, 4330 Sanderson Way, Burnaby, BC V5G 4X1, Canada. The rest of the authors are with EA Digital Platform – Data & AI, Electronic Arts, Redwood City, CA 94065 USA.
*Y. Zhao, I. Borovikov, F. de Mesentier Silva, and A. Beirami contributed equally to this paper. E-mails: {yuzhao, iborovikov, fdemesentiersilva, abeirami}@ea.com.
(1) We refer to game-playing AI as any AI solution that powers an agent in the game. This can range from scripted AI solutions all the way to state-of-the-art deep reinforcement learning agents.

I. INTRODUCTION

Fig. 1. A depiction of the possible ranges of AI agents and the tradeoff/balance between skill and style. In this tradeoff, there is a region that captures human-like skill and style. AI agents may not necessarily land in the human-like region: high-skill AI agents land in the green region, while their style may fall outside the human-like region.

The history of artificial intelligence (AI) can be mapped by its achievements playing and winning various games. From the early days of chess-playing machines to the most recent accomplishments of Deep Blue [10], AlphaGo [11], and AlphaStar [12], game-playing AI(1) has advanced from competent, to competitive, to champion in even the most complex games. Games have been instrumental in advancing AI, most notably in recent times through tree search and deep reinforcement learning (RL).

Complementary to these efforts on training high-skill game-playing agents, our primary goal at Electronic Arts is to train agents that assist in the game design process, which is iterative and laborious. The complexity of modern games steadily increases, making the corresponding design tasks even more challenging. To support designers in this context, we train game-playing AI agents to perform tasks ranging from automated playtesting to interaction with human players, tailored to enhance game-play immersion.

To approach the challenge of creating agents that generate meaningful interactions that inform game developers, we propose techniques to model different behaviors. Each of these has to strike a different balance between style and skill. We define skill as how efficient the agent is at completing the task it is designed for. Style is more loosely defined as how the player engages with the game and what makes the player enjoy their game-play. Defining and gauging skill is usually much easier than doing so for style. Nevertheless, in this paper we attempt to evaluate the style of an artificial agent using statistical properties of the underlying model.

One of the most crucial tasks in game design is the process of playtesting. Game designers usually rely on playtesting sessions and the feedback they receive from playtesters to make design choices in the game development process. Playtesting is performed to guarantee quality game-play that is free of game-breaking exceptions (e.g., bugs and glitches) and delivers the experience intended by the designers. Since games are complex entities with many moving parts, solving this multi-faceted optimization problem is even more challenging. An iterative loop, where data is gathered from the game by one or more playtesters and followed by designer analysis, is repeated
2 many times throughout the game development process. Another advantage in creating specialized AI is the cost of To mitigate this expensive process, one of our major efforts implementation and training. The agents needed for these tasks is to implement agents that can help automate aspects of are, commonly, of smaller complexity than their optimal play playtesting. These agents are meant to play through the game, alternative, making it easier to implement as well as faster to or a slice of it, trying to explore behaviors that can generate train. data to assist is answering questions that designers pose. These To summarize, we mainly pursue two use-cases for having can range from exhaustively exploring certain sequences of AI agents enhance the game development process. actions, to trying to play a scenario from start to finish in 1) playtesting AI agents to provide feedback during game the least amount of actions possible. We showcase use-cases design, particularly when a large number of concurrent focused on creating AI agents to playtest games at Electronic players are to interact in huge game worlds. Arts and discuss the related challenges. 2) game-playing AI agents to interact with real human Another key task in game development is the creation of in- players to shape their game-play experience. game characters that are human-facing and interact with real The rest of the paper is organized as follows. In Section II, human players. Agents must be trained and delicate tuning we review the related work on training agents for playtesting has to be performed to guarantee quality experience. An AI and NPCs. In Section III, we describe our training pipeline. adversary that reacts in a small amount of frames can be In Sections IV and V, we provide four case studies that cover deemed unfair rather than challenging. On the other hand, a playtesting and game-playing, respectively. These studies are pushover agent might be an appropriate introductory opponent performed to help with the development process of multiple for novice players, while it fails to retain player interest after a games at Electronic Arts. These games vary considerably in handful of matches. While traditional AI solutions are already many aspects, such as the game-play platform, the target providing excellent experiences for the players, it is becoming audience, and the engagement duration. The solutions in increasingly more difficult to scale those traditional solutions these case studies were created in constant collaboration with up as the game worlds are becoming larger and the content is the game designers. The first case study in Section IV-A, becoming dynamic. which covers game balancing and playtesting was done in As Fig. 1 shows, there is a range of style/skill pairs that are conjunction with the development of The Sims Mobile. The achievable by human players, and hence called human-like. other case studies are performed on games that are still under On the other hand, high-skill game-playing agents may have development, at the moment this paper was written. Hence, an unrealistic style rating, if they rely on high computational we had to omit specific details regarding them purposely to power and memory size, unachievable by humans. Efforts to comply with company confidentiality. Finally, the concluding evaluate techniques to emulate human-like behavior have been remarks are provided in Section VI. presented [13], but measuring non-objective metrics such as fun and immersion is an open research question [14], [15]. 
II. R ELATED W ORK Further, we cannot evaluate player engagement prior to the game launch, so we rely on our best approximation: designer A. Playtesting AI agents feedback. Through an iterative process, the designers evaluate To validate their design, game designers conduct playtesting the game-play experience by interacting with the AI agents sessions. Playtesting consists of having a group of players to measure whether the intended game-play experience is interact with the game in the development cycle to not only provided. gauge the engagement of players, but also to discover elements These challenges each require a unique equilibrium be- and states that result in undesirable outcomes. As a game goes tween style and skill. Certain agents could take advantage of through the various stages of development, it is essential to superhuman computation to perform exploratory tasks, most continuously iterate and improve the relevant aspects of the likely relying more heavily on skill. Others need to interact game-play and its balance. Relying exclusively on playtesting with human players, requiring a style that won’t break player conducted by humans can be costly and inefficient. Artificial immersion. Then there are agents that need to play the game agents could perform much faster play sessions, allowing the with players cooperatively, which makes them rely on a much exploration of much more of the game space in much shorter more delicate balance that is required to pursue a human-like time. This becomes even more valuable as game worlds grow play style. Each of these these individual problems call for large enough to hold tens of thousands of simultaneously different approaches and have significant challenges. Pursuing interacting players. Games at this scale render traditional human-like style and skill can be as challenging (if not more) human playtesting infeasible. than achieving high performance agents. Recent advances in the field of RL, when applied to playing Finally, training an agent to satisfy a specific need is often computer games assume that the goal of a trained agent is to more efficient than trying to reach such solution through high- achieve the best possible performance with respect to clearly skill AI agents. This is the case, for example, when using defined rewards while the game itself remains fixed for the game-playing AI to automatically run multiple playthroughs of foreseen future. In contrast, during game development the a specific in-game scenario to trace the origin of an unintended objectives and the settings are quite different and vary over game-play behavior. In this scenario, an agent that would time. The agents can play a variety of roles with the rewards explore the game space would potentially be a better option that are not obvious to define formally, e.g., an objective of than one that reaches the goal state of the level more quickly. an agent exploring a game level is different from foraging,
3 defeating all adversaries, or solving a puzzle. Also, the game Agent environment Gameplay environment environment changes frequently between the game builds. In such settings, it is desirable to quickly train agents that help loop Game with automated testing, data generation for the game balance Code evaluation and wider coverage of the game-play features. It Reusable Agents is also desirable that the agent be mostly re-usable as the game build is updated with new appearance and game-play Fig. 2. The AI agent training pipeline, which is consisted of two main features. Following the direct path of throwing computational components: 1) game-play environment; and 2) agent environment. The agent resources combined with substantial engineering efforts at submits actions to the game-play environment and receives back the next state. training agents in such conditions is far from practical and calls for a different approach. The idea of using artificial agents for playtesting is not new. playing the game [36]. MCTS has also been applied to the Algorithmic approaches have been proposed to address the game of 7 Wonders [37] and Ticket to Ride [38]. Furthermore, issue of game balance, in board games [16], [17] and card Baier et al. biased MCTS with a player model, extracted from games [18], [19], [20]. More recently, Holmgard et al. [21], game-play data, to have an agent that was competitive while as well as, Mugrai et al. [22] built variants of MCTS to create approximating human-like play [39]. Tesauro [40], on the a player model for AI Agent based playtesting. Guerrero- other hand, used TD-Lambda which is a temporal difference Romero et al. created different goals for general game-playing RL algorithm to train Backgammon agents at a superhuman agents in order to playtest games emulating players of dif- level. The impressive recent progress on RL to solve video ferent profiles [23]. These techniques are relevant to creating games is partly due to the advancements in processing power rewarding mechanisms for mimicking player behavior. AI and and AI computing technology.2 More recently, following the machine learning can also play the role of a co-designer, success stories in deep learning, deep Q networks (DQNs) making suggestions during development process [24]. Tools use deep neural networks as function approximators within for creating game maps [25] and level design [26], [27] are Q-learning [42]. DQNs can use convolutional function ap- also proposed. See [28], [29] for a survey of these techniques proximators as a general representation learning framework in game design. from the pixels in a frame buffer without need for task-specific In this paper, we describe our framework that supports feature engineering. game designers with automated playtesting. This also entail DeepMind researchers remarried the two approaches by a training pipeline that universally applies this framework to a showing that DQNs combined with MCTS would lead to AI variety of games. We then provide two case studies that entail agents that play Go at a superhuman level [11], and solely different solution techniques. via self-play [43], [44]. Subsequently, OpenAI researchers showed that a policy optimization approach with function ap- proximation, called Proximal Policy Optimization (PPO) [45], B. Game-playing AI agents would lead to training agents at a superhuman level in Dota Game-playing AI has been a main constituent of games 2 [46]. Cuccu et al. proposed learning policies and state since the dawn of video gaming. 
Analogously, games, representations individually, but at the same time, and did so given their challenging nature, have been a target for AI using two novel algorithms [47]. With such approach they research [30]. Over the years, AI agents have become more were able to play Atari games with neural networks of 18 sophisticated and have been providing excellent experiences neurons or less. Recently, highly publicized progress was to millions of players as games have grown in complexity. reported by DeepMind on StarCraft II, where AlphaStar was Scaling traditional AI solutions in ever growing worlds with unveiled to play the game at a superhuman level by combining thousands of agents and dynamic content is a challenging a variety of techniques including attention networks [12]. problem calling for alternative approaches. The idea of using machine learning for game-playing AI dates back to Arthur Samuel [31], who applied some form III. T RAINING P IPELINE of tree search combined with basic reinforcement learning to To train AI agents efficiently, we have developed a unified the game of checkers. His success motivated researchers to training pipeline that is applicable to all of EA games, regard- target other games using machine learning, and particularly less of the platform and the genre of the game. In this section, reinforcement learning. we present our training pipeline that is used for solving the IBM Deep Blue followed the tree search path and was the case studies presented in the section that follows. first artificial game agent who beat the chess world champion, Gary Kasparov [10]. A decade later, Monte Carlo Tree Search (MCTS) [32], [33] was a big leap in AI to train game agents. A. Gameplay and Agent Environments MCTS agents for playing Settlers of Catan were reported The AI agent training pipeline, which is depicted in Fig. 2, in [34], [35] and shown to beat previous heuristics. Other consists of two key components: work compares multiple approaches of agents to one another in the game Carcassonne on the two-player variant of the game 2 The amount of AI compute has been doubling every 3-4 months in the and discusses variations of MCTS and Minimax search for past few years [41].
4 • Gameplay environment refers to the simulated game The presence of frame buffers as the representative of world that executes the game logic with actions submitted game state would significantly increase this communica- by the agent every timestep and produces the next state.3 tion cost whereas derived game state features enable more • Agent environment refers to the medium where the agent compact encodings. interacts with the game world. The agent observes the (d) Obtaining an artificial agent in a reasonable time (a few game state and produces an action. This is where training hours at most) usually requires that the game be clocked occurs. Note that in case of reinforcement learning, the at a rate much higher than the usual game-play speed. As reward computation and shaping also happens in the agent rendering each frame takes a significant portion of every environment. frame’s time, overclocking with rendering enabled is In practice, the game architecture can be complex and it might not practical. Additionally, moving large amount of data be too costly for the game to directly communicate the com- from GPU to main memory drastically slows down the plete state space information to the agent at every timestep. To game execution and can potentially introduce simulation train artificial agents, we create a universal interface between artifacts, by interfering with the target timestep rate. the game-play environment and the learning environment.4 (e) Last but not least, we can leverage the advantage of The interface extends OpenAI Gym [48] and supports actions having privileged access to the game code to let the game that take arguments, which is necessary to encode action engine distill a compact state representation that could be functions and is consistent with PySC2 [49], [50]. In addition, inferred by a human player from the game and pass it to our training pipeline enables creating new players on the game the agent environment. By doing so we also have a better server, logging in/out an existing player, and gathering data hope of learning in environments where the pixel frames from expert demonstrations. We also adapt Dopamine [51] only contain partial information about the the state space. to this pipeline to make DQN [42] and Rainbow [52] agents The compact state representation could include the inven- available for training in the game. Additionally, we add support tory, resources, buildings, the state of neighboring players, for more complex preprocessing other than the usual frame and the distance to target. In an open-world shooter game buffer stacking, which we explicitly exclude following the the features may include the distance to the adversary, angle motivation presented in the next section. at which the agent approaches the adversary, presence of line of sight to the adversary, direction to the nearest waypoint generated by the game navigation system, and other features. B. State Abstraction The feature selection may require some engineering efforts but The use of frame buffer as an observation of the game it is logically straightforward after the initial familiarization state has proved advantageous in eliminating the need for with the game-play mechanics, and often similar to that of manual feature-engineering in Atari games [42]. However, to traditional game-playing AI, which will be informed by the achieve the objectives of RL in a fast-paced game development game designer. 
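The following is an illustrative sketch only of an agent-environment wrapper in the spirit of this pipeline: it extends the OpenAI Gym interface and exposes engineered state features rather than frame buffers. The game-server client, the feature names, and the reward placeholder are hypothetical, and parameterized actions (action functions that take arguments) are omitted for brevity.

```python
import gym
import numpy as np
from gym import spaces

class AbstractStateEnv(gym.Env):
    """Gym-style agent environment that talks to a (headless) game server and
    observes a compact, engineered state instead of rendered frames."""

    def __init__(self, game_client, feature_names, num_actions):
        self.game = game_client              # hypothetical game-server connection
        self.feature_names = feature_names   # e.g., resources, distance to target, ...
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(len(feature_names),), dtype=np.float32)
        self.action_space = spaces.Discrete(num_actions)

    def _featurize(self, raw_state):
        # Distill the game state into features a human player could infer.
        return np.array([raw_state[name] for name in self.feature_names],
                        dtype=np.float32)

    def reset(self):
        return self._featurize(self.game.reset())

    def step(self, action):
        # The game-play environment executes the action and returns the next
        # state; reward computation and shaping happen here, on the agent side.
        raw_state, done, info = self.game.apply(action)
        reward = 0.0  # placeholder: task-specific reward shaping goes here
        return self._featurize(raw_state), reward, done, info
```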
We remind the reader that our goal is not to process, the drawbacks of using frame buffer outweigh its ad- train agents that win but to simulate human-like behavior, so vantages. The main considerations which we take into account we train on information that would be accessible to a human when deciding in favor of a lower-dimensional engineered player. representation of game state are: (a) During almost all stages of the game development, the IV. P LAYTESTING AI AGENTS game parameters are evolving on a daily basis. In par- A. Measuring player experience for different player styles ticular, the art may change at any moment and the look of already learned environments can change overnight. In this section, we consider the early development of The Hence, it is desirable to train agents using features that are Sims Mobile, whose game-play is about “emulating life”: more stable to minimize the need for retraining agents. players create avatars, called Sims, and conduct them through (b) Another important advantage of state abstraction is that a variety of everyday activities. In this game, there is no single it allows us to train much smaller models (networks) predetermined goal to achieve. Instead, players craft their because of the smaller input size and use of carefully own experiences, and the designer’s objective is to evaluate engineered features. This is critical for deployment for different aspects of that experience. In particular, each player real time applications in console and consumer PC en- can pursue different careers, and as a result will have a vironments where rendering, animation and physics are different experience and trajectory in the game. In this specific occupying much of the GPU and CPU power. case study, the designer’s goal is to evaluate if the current (c) In playtesting, the game-play environment and the learn- tuning of the game achieves the intended balanced game- ing environment may reside in physically separate nodes. play experience across different careers. For example, different Naturally, closing the RL state-action-reward loop in such careers should prove similarly difficult to complete. We refer environments requires a lot of network communication. the interested reader to [1] for a more comprehensive study of this problem. 3 Note that the reward is usually defined by the user as a function of the The game is single-player, deterministic, real-time, fully state and action outside of the game-play environment. observable and the dynamics are fully known. We also have 4 These environments may be physically separated, and hence, we prefer a thin (i.e., headless) client that supports fast cloud execution, and is not tied access to the complete game state, which is composed mostly to frame rendering. of character and on-going action attributes. This simplified
case allows for the extraction of a lightweight model of the game (i.e., state transition probabilities). While this requires some additional development effort, we can achieve a dramatic speedup in training agents by avoiding (reinforcement) learning and resorting to planning techniques instead.

In particular, we use the A* algorithm for the simplicity of proposing a heuristic that can be tailored to the specific designer need by exploring the state transition graph, instead of more expensive iterative processes such as dynamic programming,(5) and even more expensive Monte Carlo search based algorithms. The customizable heuristics and the target states corresponding to different game-play objectives, which represent the style we are trying to achieve, provide sufficient control to conduct various experiments and explore multiple aspects of the game.

(5) Unfortunately, in dynamic programming every node participates in the computation, while it is often true that most of the nodes are not relevant to the shortest path problem, in the sense that they are unlikely candidates for inclusion in a shortest path [53].

Our heuristic for the A* algorithm is the weighted sum of the 3 main parameters that contribute to career progression: career level, current career experience points, and the amount of completed career events. These parameters are directly related: to gain career levels, players have to accumulate career experience points, and to obtain experience, players have to complete career events. The weights are attributed based on the order of magnitude each parameter has. Since levels are the most important, they receive the highest weight. The amount of completed career events has the lowest weight because it is already partially factored into the amount of career points received so far.

We also compare the A* results to the results from an optimization problem over a subspace of utility-based policies, approximately solved with an evolution strategy (ES) [54]. Our goal, in this case, is to achieve a high environment reward against a selected objective, e.g., reach the end of a career track while maximizing earned career event points. We design the ES objective accordingly. The agent performs an action a based on a probabilistic policy by taking a softmax over the utility measure U(a, s) of the actions in a game state s. Utility here serves as an action selection mechanism to compactly represent a policy. In a sense, it is a proxy to a state-action value Q-function Q(a, s) in RL. However, we do not attempt to derive the utility from Bellman's equation and the actual environment reward R. Instead, we learn parameters that define U(a, s) to optimize the environment rewards using the black-box ES optimization technique. In that sense, optimizing R by learning the parameters of U is similar to Proximal Policy Optimization (PPO), however, in much more constrained settings. To this end, we design the utility of an action as a weighted sum of the immediate action rewards r(a) and costs c(a). These are vector-valued quantities and are explicitly present in the game tuning, describing the outcome of executing such actions. The parameters evolved by the ES are the linear weights for the utility function U explained below and the temperature of the softmax function. An additional advantage of the proposed linear design of the utility function is a certain level of interpretability of the weights, which correspond to the utilities, as perceived by the agent, of the individual components of the resources or the immediate rewards. Such interpretability can guide changes to the tuning data.

Concretely, given the game state s, we design the utility U of an action a as

U(s, a) = r(a) · v(s) + c(a) · w(s).

The immediate reward r(a) here is a vector that can include quantities like the amount of experience, the amount of career points earned for the action, and the events triggered by it. The cost c(a) is a vector defined similarly. The action costs specify the amounts of resources required to execute such an action, e.g., how much time, energy, hunger, etc. a player needs to spend to successfully trigger and complete the action. The design of the tuning data makes both quantities r and c depend only on the action itself. Since both the immediate reward r(a) and the cost c(a) are vector-valued, the products in the definition of U above are dot products. The vectors v(s) and w(s) introduce the dependence of the utility on the current game state and are the weights defining the relative contribution of the immediate resource costs and immediate rewards towards the current goals of the agent.

The inferred utilities of the actions depend on the state, since some actions are more beneficial in certain states than in others. E.g., triggering a career event while not having enough resources to complete it successfully may be wasteful, and an optimal policy should avoid it. The relevant state components s = (s_1, ..., s_k) include available commodities like energy and hunger, and a categorical event indicator (0 if outside of the event and 1 otherwise), wrapped into a vector. The total number of relevant dimensions here is k. We design the weights v_a(s) and w_a(s) as bi-linear functions with the coefficients p = (p_1, ..., p_k) and q = (q_1, ..., q_k) that we are learning:

v_a(s) = sum_{i=1,...,k} p_i s_i  and  w_a(s) = sum_{i=1,...,k} q_i s_i.

To define the optimization objective J, we construct it as a function of the number of successfully completed events N and the number of attempted events M. We aim to maximize the ratio of successful to attempted events times the total number of successful events in the episode:

J(N, M) = N (N + ε) / (M + ε),

where ε is a small number less than 1, eliminating division by zero when the policy fails to attempt any events. The overall optimization problem looks like:

max_{p,q} J(N, M),

subject to the policy defined by actions selected with a softmax over their utilities parameterized with the (learned) parameters p and q.
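As a concrete illustration of the construction above, the following minimal Python sketch evaluates such a utility-based softmax policy and the episode objective J. It is not the paper's implementation: packing the learned coefficients into matrices P and Q (one weight per reward/cost component), and the helper names, are assumptions made here for clarity, and the ES loop that actually evolves these parameters and the softmax temperature is omitted.

```python
import numpy as np

def utility(P, Q, s, r, c):
    # U(s, a) = r(a) . v(s) + c(a) . w(s), where v(s) and w(s) are linear in
    # the state s with learned coefficients (here packed into matrices P, Q).
    v = P @ s
    w = Q @ s
    return float(r @ v + c @ w)

def softmax_policy(P, Q, s, actions, temperature, rng=np.random.default_rng()):
    # actions: list of (r(a), c(a)) vectors taken from the game tuning data.
    u = np.array([utility(P, Q, s, r, c) for r, c in actions])
    z = (u - u.max()) / temperature            # stabilized softmax logits
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(actions), p=probs)   # index of the sampled action

def episode_objective(num_success, num_attempted, eps=1e-3):
    # J(N, M) = N (N + eps) / (M + eps); eps < 1 guards against division by
    # zero when the policy never attempts an event.
    return num_success * (num_success + eps) / (num_attempted + eps)
```

Any black-box evolution strategy can then perturb P, Q, and the temperature and retain the perturbations that increase the average value of episode_objective over simulated episodes.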
The utility-based ES, as we describe it here, captures the design intention of driving career progression in the game-play by successful completion of career events. Due to the emphasis on event completion, our evolution strategy setup does not necessarily result in an optimization problem equivalent to the one we solve with A*. However, as we discuss below, it has a similar optimum most of the time, supporting the design view on the progression. A similar approach works for evaluating relationship progression, which is another important element of the game-play.

Fig. 3. Comparison of the average amount of career actions (appointments) taken to complete the career using A* search and the evolution strategy, adapted from [1].

We compare the number of actions that it takes to reach the goal for each career in Fig. 3, as computed by the two approaches. We emphasize that our goal is to show that a simple planning method, such as A*, can sufficiently satisfy the designer's goal in this case. We can see that the more expensive optimization-based evolution strategy performs similarly to the much simpler A* search.

The largest discrepancy arises for the Barista career, which might be explained by the fact that this career has an action that does not reward experience by itself, but rather enables another action that does. This action can be repeated often and can explain the high numbers despite having half the number of levels. Also, we observe that in the case of the medical career, the 2,000-node A* cutoff was potentially responsible for the under-performance of that solution.

When running the two approaches, another point of comparison can be made: how many sample runs are required to obtain statistically significant results? We performed 2,000 runs for the evolution strategy, while it is notable that the A* agent learns a deterministic playstyle, which has no variance. On the other hand, the agent trained using an evolution strategy has a high variance and requires a sufficiently high number of runs of the simulation to approach a final reasonable strategy [1].

In this use case, we were able to use a planning algorithm, A*, to explore the game space and gather data for the game designers to evaluate the current tuning of the game. This was possible because the goal was straightforward: to evaluate progression in the different careers. As such, the requirements of skill and style for the agent were achievable and simple to model. Over the next use cases, we analyze scenarios that call for different approaches as a consequence of having more complex requirements and subjective agent goals.

B. Measuring competent player progression

In the next case study, we consider a real-time multi-player mobile game with a stochastic environment and sequential actions. The game is designed to engage a large number of players for months. The game dynamics are governed by a complex physics engine, which makes it impractical to apply planning methods. This game is much more complex than The Sims Mobile in the sense that it requires the players to exhibit strategic decision making in order to progress in the game. When the game dynamics are unknown or complex, most recent success stories are based on model-free RL (and particularly variants of DQN and PPO). In this section, we show how such model-free control techniques fit into the paradigm of playtesting modern games.

In this game, the goal of the player (and subsequently the agent) is to level up and reach a particular milestone in the game. To this end, the player needs to make smart decisions in terms of resource mining and resource management for different tasks. In the process, the agent also needs to upgrade some buildings. Each upgrade requires a certain level of resources. If the player's resources are insufficient, the upgrade is not possible. A human player will be able to visually discern the validity of such an action by clicking on the particular building for upgrade. The designer's primary concern in this case study is to measure how a competent player would progress in the early stages of this game. In particular, the competent player is required to balance resources and make other strategic choices that the agent needs to discern as well.

We consider a simplified version of the state space that contains information about this early stage of the game, ignoring the full state space. The relevant part of the state space consists of ∼50 continuous and ∼100 discrete state variables. The set of possible actions α is a subset of a space A, which consists of ∼25 action classes, some of which are from a continuous range of possible action values, and some are from a discrete set of action choices. The agent has the ability to generate actions a ∈ A, but not all of them are valid at every game state, since α = α(s, t), i.e., α depends on the timestep and the game state. Moreover, the subset α(s, t) of valid actions may only partially be known to the agent. If the agent attempts to take an unavailable action, such as a building upgrade without sufficient resources, the action will be deemed invalid and no actual action will be taken by the game server.

While the problems of a huge state space [55], [56], [57], a continuous action space [58], and a parametric action space [59] could be dealt with, these techniques are not directly applicable to our problem. This is because, as we shall see, some actions will be invalid at times and inferring that information may not be fully possible from the observation space. Finally, the game is designed to last tens of millions of timesteps, taking the problem of training a functional agent in such an environment outside of the domain of previously explored problems.

We study game progression while taking only valid actions. As we already mentioned, the set of valid actions α may not be fully determined by the current observation, and hence, we deal with a partially observable Markov decision process (POMDP). Given the practical constraints outlined above, it is infeasible to apply deep reinforcement learning to train agents in the game in its entirety. In this game, we show progress toward training an artificial agent that takes valid actions and progresses in the game like a competent human player. To this end, we wrap this game in the game environment and connect it to our training pipeline with DQN and Rainbow agents. In
7 Fig. 4. This plot belongs to Section IV-B. Average cumulative reward (return) in training and evaluation for the agents as a function of the number of iterations. Each iteration is worth ∼60 minutes of game-play. The trained agents are: (1) a DQN agent with complete state space, (2) a Rainbow agent with complete state space, (3) a DQN agent with augmented observation space, and (4) a Rainbow agent with augmented observation space. the agent environment, we use a feedforward neural network We trained four types of agents as shown in Fig. 4, where we with two fully connected hidden layers, each with 256 neurons are plotting the average undiscounted return in each training followed by ReLU activation. episode. By design, this quantity is upper bounded by ‘+1’, As a first step in measuring game progression, we define which is achieved if the agent keeps taking valid actions until an episode by setting an early goal state in the game that takes reaching the final goal state. In reality, this may not always an expert human player ∼5 minutes to reach. We let the agent be achievable as there are periods of time where no action is submit actions to the game server every second. We may have available and the agent has to choose the “do nothing” action to revisit this assumption for longer episodes where the human and be rewarded with ‘-0.1’. Hence, the best a competent player is expected to interact with the game more periodically. human player would achieve on these episodes would be We use a simple rewarding mechanism, where we reward the around zero. agent with ‘+1’ when they reach the goal state, ‘-1’ when they We see that after a few iterations, both the Rainbow and submit an invalid action, ‘0’ when they take a valid action, and DQN agents converge to their asymptotic performance values. ‘-0.1’ when they choose the “do nothing” action. The game is The Rainbow agent converges to a better asymptotic perfor- such that at times the agent has no other valid action to choose, mance level as compared to the DQN agent. However, in the and hence they should choose “do nothing”, but such periods the transient behavior we observe that the DQN agent achieves do not last more than a few seconds in the early stages of the the asymptotic behavior faster than the Rainbow agent. We game, which is the focus of this case study. believe this might be due to the fact that we did not tune hyperparameters of prioritized experience replay [60], and We consider two different versions of the observation space, distributional RL [61].6 We used the default values that worked both extracted from the game engine (state abstraction). The best on Atari games with frame buffer as state space. Extra first is what we call the “complete” state space. The complete hyperparameter tuning would have been costly in terms of state space contains information that is not straightforward to cloud infrastructure for this particular problem as the game infer from the real observation in the game and is only used as server does not allow speedup and training once already takes a baseline for the agent. In particular, the complete state space a few hours. also includes the list of available actions at each state. The As expected, we see in Fig. 4 that the augmented obser- polar opposite of this state space could be called the “naive” vation space makes the training slower and also results in state space, which only contains straightforward information a worse performance on the final strategy. 
In addition, at that is always shown on the screen of the player. The second evaluation time, the agent keeps attempting invalid actions in state space we consider is what we call the “augmented” ob- some cases as the state remains mostly unchanged after each servation space, which contains information from the “naive” attempt and the policy is (almost) deterministic. These results state space and information the agent would reasonably infer in accumulating large negative returns in such episodes which and retain from current and previous game observations. For account for the dips in the right-hand-side panel in Fig. 4 at example, this includes the amount of resources needed for evaluation time. an upgrade after the agent has checked a particular building The observed behavior drew our attention to the question of for an upgrade. The augmented observation space does not whether it is too difficult to discern and keep track of the set of include the set of all available actions, and hence, we rely valid actions for a human player as well. In fact, after seeking on the game server to validate whether a submitted action is more extensive human feedback the game designers concluded available because it is not possible to encode and pass the set that better visual cues were needed for a human player on α of available actions. Hence, if an invalid action is chosen by the agent, the game server will ignore the action and will 6 This is consistent with the results of Section V-C, where Rainbow with flag the action so that we can provide a ‘-1’ reward. default hyperparameters does not outperform DQN either.
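Purely as an illustrative summary of the setup in Section IV-B (not the paper's Dopamine-based pipeline; the helper names below are hypothetical), the reward scheme and the two-hidden-layer Q-network described above could be sketched as follows.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Feedforward Q-network over engineered state features: two fully
    connected hidden layers of 256 units with ReLU activations."""

    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, state):
        return self.net(state)

def step_reward(reached_goal: bool, action_was_valid: bool, did_nothing: bool) -> float:
    # '+1' on reaching the goal state, '-1' for an invalid action,
    # '-0.1' for the "do nothing" action, and '0' for any other valid action.
    if reached_goal:
        return 1.0
    if not action_was_valid:
        return -1.0
    return -0.1 if did_nothing else 0.0
```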
8 information about valid actions at each state so that the human are also other cases where the preferred solution for training players could progress more smoothly without being blocked agents would utilize a few relatively short demonstration by invalid actions. As next steps, we intend to experiment episodes played by the software developers or designers at with shaping the reward function for achieving different play the end of the current development cycle [67]. styles to be able to better model different player clusters. In this application, we consider training artificial agents We also intend to investigate augmenting the replay buffer in an open-world video game, where the game designer is with expert demonstrations for faster training and also for interested in training non-player characters that exhibit certain generative adversarial imitation learning [62] once the game behavioral styles. The game we are exploring is a shooter with is released and human play data is available. contextual game-play and destructible environment. While the We remark that without state abstraction (streamlined access game can run in multi-player mode, we focus on single-player, to the game state), the neural network function approximator which provides us with an environment tractable yet rich used for Q-learning would have needed to discern all such enough to test the approach we discuss in this section. An information from the pixels in the frame buffer, and hence we agent in such a game would have its state composed of a would not have been able to get away with such a simple two- 3D-vector location, velocity, animation pose, health, weapon layer feedforward function approximator to solve this problem. type, ammunition, scope on-off, cone of sight, collision state, However, we observe that the training within the paradigm distance to the obstacles in the principle directions, etc. Overall of model-free RL remains costly. Specifically, even using the the dimensionality of the agent state can grow to several complete state space, it takes several hours to train an agent dozens of variables with some of them continuous and the that achieves a level of performance expected of a competent other categorical. We construct similar vectors for NPCs with human player on this relatively short episode of ∼5 minutes. which the agent needs to engage. This calls for the exploration of complementary approaches The NPC state variables account for partial observability to augment the training process. In particular, we also would and line-of-site constraints imposed by the level layout, its like to streamline this process by training reusable agents and destruction state, and the current location of the agent rel- capitalizing on existing human data through imitation learning. ative to the NPCs. The NPCs in this environment represent adversarial entities, trying to eliminate the agent by attack- V. G AME - PLAYING AI ing it until the agent runs out of health. Additionally, the environment can contain objects of interest, like health and We have shown the value of simulated agents in a fully ammo boxes, dropped weapons, etc. The environment itself modeled game, and the potential of training agents in a com- is non-deterministic stochastic, i.e., there is no single random plex game to model player progression for game balancing. seed which we can set to control all random choices in the We can take these techniques a step further and make use of environment. 
Additionally, frequent saving and reloading game agent training to help build the game itself. Instead of applying state is not practical due to relatively long loading times. RL to capture player behaviors, we consider an approach The main objective for us in this application of the agents to game-play design where the player agents learn behavior training is to provide designer with a tool to playtest the game policies from the game designers. The primary motivation by interacting with the game in a number of particular styles of that is to give direct control into the designer hands and to emulate different players. The styles can include: enable easy interactive creation of various behavior types, • Aggressive, when an agent tries to find, approach and e.g., aggressive, exploratory, stealthy, etc. At the same time, defeat any adversarial NPC, we aim to complement organic demonstrations with bootstrap • Sniper, when an agent finds a good sniping spot and waits and heuristics to eliminate the need for a human to train an for adversarial NPCs to appear in the cone of sight to agent on the states normally not encountered by humans, e.g., shoot them, unblocking an agent using obstacle avoidance. • Exploratory, when an agent attempts to visit as many locations and uncover as many objects of interest in the A. Human-Like Exploration in an Open-World Game level as possible without getting engaged into combat To bridge the gap between the agent and the designer, unless encountering an adversarial NPC, we introduce imitation learning (IL) to our system [62], • Sneaky, when an agent actively avoids combat while [63], [64], [65]. In the present application, IL allows us to trying to achieve its objectives like reaching a particular translate the intentions of the game designer into a primer points on the map. and a target for our agent learning system. Learning from Additionally, combat style can vary with deploying different expert demonstrations has traditionally proved very helpful in weapons, e.g., long-range or hand-to-hand. training agents, including in games [66]. In particular, the Obviously, manual test and evaluation of its results while original Alpha Go [11] used expert demonstrations in training following outlined styles is very time consuming and tedious. a deep Q network. While it is argued in subsequent work Having an agent that can replace a designer in this process that learning via self-play could achieve a better asymptotic would be a great time saver. Conceivably, an agent trained return compared to relying on expert demonstrations, the as the design helper can also be playing as a stand-in or an better performance comes with significantly higher cost in “avatar” of an actual human player to replace the player when terms of training computational resources and the superhuman she is not online or the connection drops out for short period performance is not what we are seeking in this work. There of time. An agent trained in a specific style can also fill a
vacant spot in a squad, or help with the cold start problem for multi-player games.

Fig. 5. Model performance measures the probability of the event that the Markov agent finds at least one previous action from human-played demonstration episodes in the current game state. The goal of interactive learning is to add support for new game features to the already trained model or improve its performance in underexplored game states. Plotted is the model performance during interactive training from demonstrations in a proprietary open-world game as a function of time measured in milliseconds (with the total duration around 10 minutes). The corresponding section of the paper offers additional details for the presented experiment.

In terms of the skill/style tradeoff laid out earlier in the paper, these agents are not designed to have any specific level of performance (e.g., a certain kill-death ratio) and they may not necessarily follow any long-term goals. These agents are intended to explore the game and also be able to interact with human players at a relatively shallow level of engagement. Hence, the problem boils down to efficiently training an agent using demonstrations capturing only certain elements of the game-play. The training process has to be computationally inexpensive, and the agent has to imitate the behavior of the teacher(s) by mimicking their relevant style (in a statistical sense) as an implicit representation of the teacher's objectives.

Casting this problem directly into the RL framework is complicated by two issues. First, it is not straightforward to design a rewarding mechanism for imitating the style of the expert. While inverse RL aims at solving this problem, its applicability is not obvious given the reduced representation of the huge state-action space that we deal with and the ill-posed nature of the inverse RL problem [68], [69]. Second, the RL training loop often requires thousands of episodes to learn useful policies, directly translating to a high cost of training in terms of time and computational resources. Hence, rather than using more complex solutions such as generative adversarial imitation learning [62], which uses an RL network as its generator, we propose a solution to the stated problem based on an ensemble of multi-resolution Markov models. One of the major benefits of the proposed model is the ability to perform interactive training within the same episode. As a useful byproduct of our formulation, we can also sketch a mechanism for numerical evaluation of the style associated with the agents we train. We outline the main elements of the approach next and, for additional details, point at [6], [2], [5], [8].

1) Markov Decision Process with Extended State: Following the established treatment of training artificial agents, we place the problem into the standard MDP framework and augment it as follows. Firstly, we ignore the difference between the observation and the actual state s of the environment. The actual state may be observable by the teacher but may be impractical to expose to the agent. To mitigate the partial observability, we extend observations with a short history of previously taken actions. In addition to implicitly encoding the intent of a teacher and her reactions to potentially richer observations, this also helps to preserve the stylistic elements of human demonstrations.

Concretely, we assume the following. The interaction of the agent and the environment takes place at discrete moments t = 1, ..., T, with the value of t trivially observable by the agent. The agent, after receiving an observation s_t at time t, can take an action a_t ∈ A(s, t) from the set of allowed actions A(s, t) using a policy π : s → a. Executing an action in the environment results in a new state s_{t+1}. Since we focus on the stylistic elements of the agent behavior, the rewards are inconsequential for the model we build, and we drop them from the discussion. Next, we consider the episode-based environment, i.e., after reaching a certain condition, the execution of the described state-action loop ends. A complete episode is a sequence E = {(s_t, a_t)}_{t = 1, ..., T}. The fundamental assumption regarding the described decision process is that it has the Markov property.

Besides the most recent action taken before time t, i.e., action a_{t-1}, we also consider a recent history of the past n actions, where 1 ≤ n < T: α_{t,n} := (a_{t-n}, ..., a_{t-1}), whenever it is defined in an episode E. For n = 0, we define α_{t,0} as the empty sequence. We augment the directly observed state s_t with the action history α_{t,n} to obtain an extended state S_{t,n} = (s_t, α_{t,n}).
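As a small, self-contained illustration (the data layout is assumed here, not taken from the paper), the extended state just defined can be materialized from a recorded episode as follows.

```python
def extended_state(episode, t, n):
    """Build S_{t,n} = (s_t, alpha_{t,n}) from an episode of (state, action) pairs.

    episode: list of (s_t, a_t) pairs, 0-indexed here for convenience;
    alpha_{t,n} is the tuple of the n actions taken before time t
    (the empty tuple when n = 0), defined only when t >= n.
    """
    if t < n:
        raise ValueError("alpha_{t,n} is undefined this early in the episode")
    state = episode[t][0]
    history = tuple(action for _, action in episode[t - n:t])
    return (state, history)

# Example: the order-2 extended state at t = 3
episode = [("s0", "a0"), ("s1", "a1"), ("s2", "a2"), ("s3", "a3")]
assert extended_state(episode, 3, 2) == ("s3", ("a1", "a2"))
```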
The purpose of including the action history is to capture additional information (i.e., stylistic features and the elements of the objective-driven behavior of the teacher) from the human controlling the input during interactive demonstrations. An extended policy π_n, which operates on the extended states, π_n : S_{t,n} → a_t, is useful for modeling human actions in a manner similar to n-gram text models in natural language processing (NLP) (e.g., [70], [71], [72]). Of course, the analogy with n-gram models in NLP works only if both state and action spaces are discrete. We will address this restriction in the next subsection using multi-resolution quantization.

For a discrete state-action space and various n, we can compute the probabilities P{a_t | S_{t,n}} of transitions S_{t,n} → a_t occurring in (human) demonstrations and use them as a Markov model M_n of order n of (human) actions. We say that the model M_n is defined on an extended state S_{·,n} if the demonstrations contain at least one occurrence of S_{·,n}. When a model M_n is defined on S, we can use P{a_t | S_{t,n}} to sample the next action from all ever observed next actions in state S_{·,n}. Hence, M_n defines a partial stochastic mapping M_n : S_{·,n} → A from extended states to the action space A.

2) Stacked Markov models: We call a sequence of Markov models M_n = {M_i}_{i = 0, ..., n} a stack of models. A (partial) policy defined by M_n computes the next action at a state s_t; see [6] for the pseudo-code of the corresponding algorithm. Such a policy performs simple behavior cloning. The policy is partial since it may not be defined on all possible extended states and needs a fallback policy π_* to provide a functional agent acting in the environment.

Note that it is possible to implement sampling from a Markov model using an O(1) complexity operation with hash tables, making the inference very efficient and suitable for real-time execution in a video game or another interactive application where the expected inference time has to be on the scale of 1 ms or less.(7)

(7) A modern video game runs at least at 30 frames per second, with many computations happening during the roughly 33 ms allowed per frame, drastically limiting the "budget" allocated for inference.

3) Quantization: Quantization (aka discretization) allows us to work around the limitation of a discrete state-action space, enabling the application of the Markov Ensemble approach to environments with continuous dimensions. Quantization is commonly used in solving MDPs [73] and has been extensively studied in the signal processing literature [74], [75]. Using quantization schemes that have been optimized for specific objectives can lead to significant gains in model performance, improving various metrics vs. ad-hoc quantization schemes, e.g., [73], [76].

Instead of trying to pose and solve the problem of optimal quantization, we use a set of quantizers covering a range of schemes from coarse to fine. At the conceptual level, such an approach is similar to multi-resolution methods in image processing, and to mip-mapping and Level-of-Detail (LoD) representations in computer graphics [77]. The simplest quantization is a uniform one with step σ:

Q_σ(x) = σ ⌊x/σ⌋.

For illustration purposes, it is sufficient to consider only the uniform quantization Q_σ. In practice, most variables have naturally defined limits which are at least approximately known. Knowing the environment scale gives an estimate of the smallest step size σ_0 at which we have complete information loss, i.e., all observed values map to a single bin. For each continuous variable in the state-action space, we consider a sequence of quantizers with decreasing step size, Q = {Q_{σ_j}}_{j = 0, ..., K}, σ_j > σ_{j+1}, which naturally gives a quantization sequence Q̄ for the entire state-action space, provided K is fixed across the continuous dimensions. To simplify notation, we collapse the sub-index and write Q_j to stand for Q_{σ_j}. For more general quantization schemes, the main requirement is a decreasingly smaller reconstruction error for Q_{j+1} in comparison to Q_j.

For an episode E, we compute its quantized representation in an obvious component-wise manner:

E_j = Q̄_j(E) = {(Q̄_j(s_t), Q̄_j(a_t))}_{t = 1, ..., T},   (1)

which defines a multi-resolution representation of the episode as a corresponding ordered set {E_j}_{j ∈ {0, ..., K}} of quantized episodes, where Q̄ is the vector version of the quantization Q.

In the quantized Markov model M_{n,j} = Q̄_j(M_n), which we construct from the episode E_j, we compute extended states using the corresponding quantized values. Hence, the extended state is Q̄_j(S_{t,n}) = (Q̄(s_t), Q̄(α_{t,n})). Further, we define the model Q̄_j(M_n) to contain the probabilities P{a_t | Q̄_j(S_{t,n})} for the original action values. In other words, we do not rely on the reconstruction mapping Q̄_j^{-1} to recover an action but store the original actions explicitly. In practice, continuous action values tend to be unique, and the model samples from the set of values observed after the occurrences of the corresponding extended state. Our experiments show that replaying the original actions instead of their quantized representation provides better continuity and a natural, true-to-the-demonstration look of the cloned behavior.

4) Markov Ensemble: Combining together the stacking and multi-resolution quantization of Markov models, we obtain a Markov Ensemble E as an array of Markov models parameterized by the model order n and the quantization schema Q_j:

E_{N,K} = {M_{i,j}}, i = 0, ..., N, j = 0, ..., K.   (2)

The policy defined by the ensemble (2) computes each next action in an obvious manner. The Markov Ensemble technique, together with the policy defined by it, are our primary tools for cloning behavior from demonstrations. Note that with the coarsest quantization σ_0 present in the multi-resolution schema, the policy should always return an action sampled using one of the quantized models, which at level 0 always finds a match. Hence, such models always "generalize" by resorting to simple sampling of actions when no better match is found in the observations. Excluding too coarse quantizers and Markov order 0 will result in executing some "default policy" π_*, which we discuss in the next section. The agent execution with the outlined ensemble of quantized stacked Markov models is easy to express as an algorithm, which in essence boils down to look-up tables [6].
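To make the ensemble concrete, the following self-contained Python sketch builds quantized, order-n Markov models from demonstration episodes and samples actions with a fallback over model order and resolution. It is an illustrative reading of the description above, not the paper's code: the data layout is assumed, and for brevity the action history is stored unquantized, whereas the text quantizes it as well.

```python
import math
import random
from collections import defaultdict

def quantize(value, step):
    # Uniform quantizer Q_sigma(x) = sigma * floor(x / sigma).
    return step * math.floor(value / step)

def quantize_state(state, step):
    # Component-wise quantization of a tuple of continuous features;
    # categorical features would simply be passed through unchanged.
    return tuple(quantize(x, step) for x in state)

class MarkovModel:
    """Order-n model: maps a quantized extended state to the original
    (unquantized) actions observed there in the demonstrations."""

    def __init__(self, order, step):
        self.order, self.step = order, step
        self.table = defaultdict(list)

    def _key(self, state, history):
        past = tuple(history[-self.order:]) if self.order else ()
        return (quantize_state(state, self.step), past)

    def fit(self, episodes):
        for episode in episodes:            # episode: list of (state, action) pairs
            history = []
            for state, action in episode:
                if len(history) >= self.order:
                    self.table[self._key(state, history)].append(action)
                history.append(action)
        return self

    def sample(self, state, history):
        actions = self.table.get(self._key(state, history))
        return random.choice(actions) if actions else None

def build_ensemble(episodes, max_order, steps):
    # steps are ordered from the coarsest sigma_0 to the finest sigma_K.
    return {(n, j): MarkovModel(n, step).fit(episodes)
            for n in range(max_order + 1) for j, step in enumerate(steps)}

def ensemble_policy(ensemble, state, history, default_policy):
    # Prefer higher order and finer resolution; order 0 at the coarsest step
    # always finds a match, so default_policy is only needed when such
    # models are deliberately excluded from the ensemble.
    for n, j in sorted(ensemble, reverse=True):
        action = ensemble[(n, j)].sample(state, history)
        if action is not None:
            return action
    return default_policy(state)
```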