Winning Isn't Everything: Enhancing Game Development with Intelligent Agents
Yunqi Zhao,* Igor Borovikov,* Fernando de Mesentier Silva,* Ahmad Beirami,* Jason Rupert, Caedmon Somers, Jesse Harder, John Kolen, Jervis Pinto, Reza Pourabolghasem, James Pestrak, Harold Chaput, Mohsen Sardari, Long Lin, Sundeep Narravula, Navid Aghdaie, and Kazi Zaman

arXiv:1903.10545v2 [cs.AI] 20 Aug 2019

Abstract—Recently, there have been several high-profile achievements of agents learning to play games against humans and beat them. In this paper, we study the problem of training intelligent agents in service of game development. Unlike the agents built to "beat the game", our agents aim to produce human-like behavior to help with game evaluation and balancing. We discuss two fundamental metrics based on which we measure the human-likeness of agents, namely skill and style, which are multi-faceted concepts with practical implications outlined in this paper. We discuss how this framework applies to multiple games under development at Electronic Arts, followed by some of the lessons learned.

Index Terms—Artificial intelligence; playtesting; non-player character (NPC); multi-agent learning; reinforcement learning; imitation learning; deep learning; A* search.

This paper was presented in part at the 14th AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE 18) [1], the NeurIPS 2018 Workshop on Reinforcement Learning under Partial Observability [2], [3], the AAAI 2019 Workshop on Reinforcement Learning in Games (Honolulu, HI) [4], the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering [5], the ICML 2019 Workshop on Human in the Loop Learning [6], the ICML 2019 Workshop on Imitation, Intent, and Interaction [7], the 23rd Annual Signal and Image Sciences Workshop at Lawrence Livermore National Laboratory, 2019 [8], and on GitHub, 2019 [9].
J. Rupert and C. Somers are with EA Sports, Electronic Arts, 4330 Sanderson Way, Burnaby, BC V5G 4X1, Canada. The rest of the authors are with EA Digital Platform – Data & AI, Electronic Arts, Redwood City, CA 94065 USA.
*Y. Zhao, I. Borovikov, F. de Mesentier Silva, and A. Beirami contributed equally to this paper. E-mails: {yuzhao, iborovikov, fdemesentiersilva, abeirami}@ea.com.
(1) We refer to game-playing AI as any AI solution that powers an agent in the game. This can range from scripted AI solutions all the way to state-of-the-art deep reinforcement learning agents.

I. INTRODUCTION

Fig. 1. A depiction of the possible ranges of AI agents and the tradeoff/balance between skill and style. In this tradeoff, there is a region that captures human-like skill and style. AI agents may not necessarily land in the human-like region: high-skill AI agents land in the green region, while their style may fall outside the human-like region.

The history of artificial intelligence (AI) can be mapped by its achievements playing and winning various games. From the early days of chess-playing machines to the most recent accomplishments of Deep Blue [10], AlphaGo [11], and AlphaStar [12], game-playing AI(1) has advanced from competent, to competitive, to champion in even the most complex games. Games have been instrumental in advancing AI, most notably in recent times through tree search and deep reinforcement learning (RL).

Complementary to these efforts on training high-skill game-playing agents, our primary goal at Electronic Arts is to train agents that assist in the game design process, which is iterative and laborious. The complexity of modern games steadily increases, making the corresponding design tasks even more challenging. To support designers in this context, we train game-playing AI agents to perform tasks ranging from automated playtesting to interaction with human players, tailored to enhance game-play immersion.

To approach the challenge of creating agents that generate meaningful interactions that inform game developers, we propose techniques to model different behaviors. Each of these has to strike a different balance between style and skill. We define skill as how efficient the agent is at completing the task it is designed for. Style is more loosely defined as how the player engages with the game and what makes the player enjoy their game-play. Defining and gauging skill is usually much easier than doing so for style. Nevertheless, in this paper we attempt to evaluate the style of an artificial agent using statistical properties of the underlying model.

One of the most crucial tasks in game design is the process of playtesting. Game designers usually rely on playtesting sessions and the feedback they receive from playtesters to make design choices in the game development process. Playtesting is performed to guarantee quality game-play that is free of game-breaking exceptions (e.g., bugs and glitches) and delivers the experience intended by the designers. Since games are complex entities with many moving parts, solving this multi-faceted optimization problem is even more challenging. An iterative loop, where data is gathered from the game by one or more playtesters and followed by designer analysis, is repeated
2 many times throughout the game development process. Another advantage in creating specialized AI is the cost of To mitigate this expensive process, one of our major efforts implementation and training. The agents needed for these tasks is to implement agents that can help automate aspects of are, commonly, of smaller complexity than their optimal play playtesting. These agents are meant to play through the game, alternative, making it easier to implement as well as faster to or a slice of it, trying to explore behaviors that can generate train. data to assist is answering questions that designers pose. These To summarize, we mainly pursue two use-cases for having can range from exhaustively exploring certain sequences of AI agents enhance the game development process. actions, to trying to play a scenario from start to finish in 1) playtesting AI agents to provide feedback during game the least amount of actions possible. We showcase use-cases design, particularly when a large number of concurrent focused on creating AI agents to playtest games at Electronic players are to interact in huge game worlds. Arts and discuss the related challenges. 2) game-playing AI agents to interact with real human Another key task in game development is the creation of in- players to shape their game-play experience. game characters that are human-facing and interact with real The rest of the paper is organized as follows. In Section II, human players. Agents must be trained and delicate tuning we review the related work on training agents for playtesting has to be performed to guarantee quality experience. An AI and NPCs. In Section III, we describe our training pipeline. adversary that reacts in a small amount of frames can be In Sections IV and V, we provide four case studies that cover deemed unfair rather than challenging. On the other hand, a playtesting and game-playing, respectively. These studies are pushover agent might be an appropriate introductory opponent performed to help with the development process of multiple for novice players, while it fails to retain player interest after a games at Electronic Arts. These games vary considerably in handful of matches. While traditional AI solutions are already many aspects, such as the game-play platform, the target providing excellent experiences for the players, it is becoming audience, and the engagement duration. The solutions in increasingly more difficult to scale those traditional solutions these case studies were created in constant collaboration with up as the game worlds are becoming larger and the content is the game designers. The first case study in Section IV-A, becoming dynamic. which covers game balancing and playtesting was done in As Fig. 1 shows, there is a range of style/skill pairs that are conjunction with the development of The Sims Mobile. The achievable by human players, and hence called human-like. other case studies are performed on games that are still under On the other hand, high-skill game-playing agents may have development, at the moment this paper was written. Hence, an unrealistic style rating, if they rely on high computational we had to omit specific details regarding them purposely to power and memory size, unachievable by humans. Efforts to comply with company confidentiality. Finally, the concluding evaluate techniques to emulate human-like behavior have been remarks are provided in Section VI. presented [13], but measuring non-objective metrics such as fun and immersion is an open research question [14], [15]. 
II. R ELATED W ORK Further, we cannot evaluate player engagement prior to the game launch, so we rely on our best approximation: designer A. Playtesting AI agents feedback. Through an iterative process, the designers evaluate To validate their design, game designers conduct playtesting the game-play experience by interacting with the AI agents sessions. Playtesting consists of having a group of players to measure whether the intended game-play experience is interact with the game in the development cycle to not only provided. gauge the engagement of players, but also to discover elements These challenges each require a unique equilibrium be- and states that result in undesirable outcomes. As a game goes tween style and skill. Certain agents could take advantage of through the various stages of development, it is essential to superhuman computation to perform exploratory tasks, most continuously iterate and improve the relevant aspects of the likely relying more heavily on skill. Others need to interact game-play and its balance. Relying exclusively on playtesting with human players, requiring a style that won’t break player conducted by humans can be costly and inefficient. Artificial immersion. Then there are agents that need to play the game agents could perform much faster play sessions, allowing the with players cooperatively, which makes them rely on a much exploration of much more of the game space in much shorter more delicate balance that is required to pursue a human-like time. This becomes even more valuable as game worlds grow play style. Each of these these individual problems call for large enough to hold tens of thousands of simultaneously different approaches and have significant challenges. Pursuing interacting players. Games at this scale render traditional human-like style and skill can be as challenging (if not more) human playtesting infeasible. than achieving high performance agents. Recent advances in the field of RL, when applied to playing Finally, training an agent to satisfy a specific need is often computer games assume that the goal of a trained agent is to more efficient than trying to reach such solution through high- achieve the best possible performance with respect to clearly skill AI agents. This is the case, for example, when using defined rewards while the game itself remains fixed for the game-playing AI to automatically run multiple playthroughs of foreseen future. In contrast, during game development the a specific in-game scenario to trace the origin of an unintended objectives and the settings are quite different and vary over game-play behavior. In this scenario, an agent that would time. The agents can play a variety of roles with the rewards explore the game space would potentially be a better option that are not obvious to define formally, e.g., an objective of than one that reaches the goal state of the level more quickly. an agent exploring a game level is different from foraging,
3 defeating all adversaries, or solving a puzzle. Also, the game Agent environment Gameplay environment environment changes frequently between the game builds. In such settings, it is desirable to quickly train agents that help loop Game with automated testing, data generation for the game balance Code evaluation and wider coverage of the game-play features. It Reusable Agents is also desirable that the agent be mostly re-usable as the game build is updated with new appearance and game-play Fig. 2. The AI agent training pipeline, which is consisted of two main features. Following the direct path of throwing computational components: 1) game-play environment; and 2) agent environment. The agent resources combined with substantial engineering efforts at submits actions to the game-play environment and receives back the next state. training agents in such conditions is far from practical and calls for a different approach. The idea of using artificial agents for playtesting is not new. playing the game [36]. MCTS has also been applied to the Algorithmic approaches have been proposed to address the game of 7 Wonders [37] and Ticket to Ride [38]. Furthermore, issue of game balance, in board games [16], [17] and card Baier et al. biased MCTS with a player model, extracted from games [18], [19], [20]. More recently, Holmgard et al. [21], game-play data, to have an agent that was competitive while as well as, Mugrai et al. [22] built variants of MCTS to create approximating human-like play [39]. Tesauro [40], on the a player model for AI Agent based playtesting. Guerrero- other hand, used TD-Lambda which is a temporal difference Romero et al. created different goals for general game-playing RL algorithm to train Backgammon agents at a superhuman agents in order to playtest games emulating players of dif- level. The impressive recent progress on RL to solve video ferent profiles [23]. These techniques are relevant to creating games is partly due to the advancements in processing power rewarding mechanisms for mimicking player behavior. AI and and AI computing technology.2 More recently, following the machine learning can also play the role of a co-designer, success stories in deep learning, deep Q networks (DQNs) making suggestions during development process [24]. Tools use deep neural networks as function approximators within for creating game maps [25] and level design [26], [27] are Q-learning [42]. DQNs can use convolutional function ap- also proposed. See [28], [29] for a survey of these techniques proximators as a general representation learning framework in game design. from the pixels in a frame buffer without need for task-specific In this paper, we describe our framework that supports feature engineering. game designers with automated playtesting. This also entail DeepMind researchers remarried the two approaches by a training pipeline that universally applies this framework to a showing that DQNs combined with MCTS would lead to AI variety of games. We then provide two case studies that entail agents that play Go at a superhuman level [11], and solely different solution techniques. via self-play [43], [44]. Subsequently, OpenAI researchers showed that a policy optimization approach with function ap- proximation, called Proximal Policy Optimization (PPO) [45], B. Game-playing AI agents would lead to training agents at a superhuman level in Dota Game-playing AI has been a main constituent of games 2 [46]. Cuccu et al. proposed learning policies and state since the dawn of video gaming. 
Analogously, games, representations individually, but at the same time, and did so given their challenging nature, have been a target for AI using two novel algorithms [47]. With such approach they research [30]. Over the years, AI agents have become more were able to play Atari games with neural networks of 18 sophisticated and have been providing excellent experiences neurons or less. Recently, highly publicized progress was to millions of players as games have grown in complexity. reported by DeepMind on StarCraft II, where AlphaStar was Scaling traditional AI solutions in ever growing worlds with unveiled to play the game at a superhuman level by combining thousands of agents and dynamic content is a challenging a variety of techniques including attention networks [12]. problem calling for alternative approaches. The idea of using machine learning for game-playing AI dates back to Arthur Samuel [31], who applied some form III. T RAINING P IPELINE of tree search combined with basic reinforcement learning to To train AI agents efficiently, we have developed a unified the game of checkers. His success motivated researchers to training pipeline that is applicable to all of EA games, regard- target other games using machine learning, and particularly less of the platform and the genre of the game. In this section, reinforcement learning. we present our training pipeline that is used for solving the IBM Deep Blue followed the tree search path and was the case studies presented in the section that follows. first artificial game agent who beat the chess world champion, Gary Kasparov [10]. A decade later, Monte Carlo Tree Search (MCTS) [32], [33] was a big leap in AI to train game agents. A. Gameplay and Agent Environments MCTS agents for playing Settlers of Catan were reported The AI agent training pipeline, which is depicted in Fig. 2, in [34], [35] and shown to beat previous heuristics. Other consists of two key components: work compares multiple approaches of agents to one another in the game Carcassonne on the two-player variant of the game 2 The amount of AI compute has been doubling every 3-4 months in the and discusses variations of MCTS and Minimax search for past few years [41].
4 • Gameplay environment refers to the simulated game The presence of frame buffers as the representative of world that executes the game logic with actions submitted game state would significantly increase this communica- by the agent every timestep and produces the next state.3 tion cost whereas derived game state features enable more • Agent environment refers to the medium where the agent compact encodings. interacts with the game world. The agent observes the (d) Obtaining an artificial agent in a reasonable time (a few game state and produces an action. This is where training hours at most) usually requires that the game be clocked occurs. Note that in case of reinforcement learning, the at a rate much higher than the usual game-play speed. As reward computation and shaping also happens in the agent rendering each frame takes a significant portion of every environment. frame’s time, overclocking with rendering enabled is In practice, the game architecture can be complex and it might not practical. Additionally, moving large amount of data be too costly for the game to directly communicate the com- from GPU to main memory drastically slows down the plete state space information to the agent at every timestep. To game execution and can potentially introduce simulation train artificial agents, we create a universal interface between artifacts, by interfering with the target timestep rate. the game-play environment and the learning environment.4 (e) Last but not least, we can leverage the advantage of The interface extends OpenAI Gym [48] and supports actions having privileged access to the game code to let the game that take arguments, which is necessary to encode action engine distill a compact state representation that could be functions and is consistent with PySC2 [49], [50]. In addition, inferred by a human player from the game and pass it to our training pipeline enables creating new players on the game the agent environment. By doing so we also have a better server, logging in/out an existing player, and gathering data hope of learning in environments where the pixel frames from expert demonstrations. We also adapt Dopamine [51] only contain partial information about the the state space. to this pipeline to make DQN [42] and Rainbow [52] agents The compact state representation could include the inven- available for training in the game. Additionally, we add support tory, resources, buildings, the state of neighboring players, for more complex preprocessing other than the usual frame and the distance to target. In an open-world shooter game buffer stacking, which we explicitly exclude following the the features may include the distance to the adversary, angle motivation presented in the next section. at which the agent approaches the adversary, presence of line of sight to the adversary, direction to the nearest waypoint generated by the game navigation system, and other features. B. State Abstraction The feature selection may require some engineering efforts but The use of frame buffer as an observation of the game it is logically straightforward after the initial familiarization state has proved advantageous in eliminating the need for with the game-play mechanics, and often similar to that of manual feature-engineering in Atari games [42]. However, to traditional game-playing AI, which will be informed by the achieve the objectives of RL in a fast-paced game development game designer. 
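The following is an illustrative sketch only of an agent-environment wrapper in the spirit of this pipeline: it extends the OpenAI Gym interface and exposes engineered state features rather than frame buffers. The game-server client, the feature names, and the reward placeholder are hypothetical, and parameterized actions (action functions that take arguments) are omitted for brevity.

```python
import gym
import numpy as np
from gym import spaces

class AbstractStateEnv(gym.Env):
    """Gym-style agent environment that talks to a (headless) game server and
    observes a compact, engineered state instead of rendered frames."""

    def __init__(self, game_client, feature_names, num_actions):
        self.game = game_client              # hypothetical game-server connection
        self.feature_names = feature_names   # e.g., resources, distance to target, ...
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(len(feature_names),), dtype=np.float32)
        self.action_space = spaces.Discrete(num_actions)

    def _featurize(self, raw_state):
        # Distill the game state into features a human player could infer.
        return np.array([raw_state[name] for name in self.feature_names],
                        dtype=np.float32)

    def reset(self):
        return self._featurize(self.game.reset())

    def step(self, action):
        # The game-play environment executes the action and returns the next
        # state; reward computation and shaping happen here, on the agent side.
        raw_state, done, info = self.game.apply(action)
        reward = 0.0  # placeholder: task-specific reward shaping goes here
        return self._featurize(raw_state), reward, done, info
```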
We remind the reader that our goal is not to process, the drawbacks of using frame buffer outweigh its ad- train agents that win but to simulate human-like behavior, so vantages. The main considerations which we take into account we train on information that would be accessible to a human when deciding in favor of a lower-dimensional engineered player. representation of game state are: (a) During almost all stages of the game development, the IV. P LAYTESTING AI AGENTS game parameters are evolving on a daily basis. In par- A. Measuring player experience for different player styles ticular, the art may change at any moment and the look of already learned environments can change overnight. In this section, we consider the early development of The Hence, it is desirable to train agents using features that are Sims Mobile, whose game-play is about “emulating life”: more stable to minimize the need for retraining agents. players create avatars, called Sims, and conduct them through (b) Another important advantage of state abstraction is that a variety of everyday activities. In this game, there is no single it allows us to train much smaller models (networks) predetermined goal to achieve. Instead, players craft their because of the smaller input size and use of carefully own experiences, and the designer’s objective is to evaluate engineered features. This is critical for deployment for different aspects of that experience. In particular, each player real time applications in console and consumer PC en- can pursue different careers, and as a result will have a vironments where rendering, animation and physics are different experience and trajectory in the game. In this specific occupying much of the GPU and CPU power. case study, the designer’s goal is to evaluate if the current (c) In playtesting, the game-play environment and the learn- tuning of the game achieves the intended balanced game- ing environment may reside in physically separate nodes. play experience across different careers. For example, different Naturally, closing the RL state-action-reward loop in such careers should prove similarly difficult to complete. We refer environments requires a lot of network communication. the interested reader to [1] for a more comprehensive study of this problem. 3 Note that the reward is usually defined by the user as a function of the The game is single-player, deterministic, real-time, fully state and action outside of the game-play environment. observable and the dynamics are fully known. We also have 4 These environments may be physically separated, and hence, we prefer a thin (i.e., headless) client that supports fast cloud execution, and is not tied access to the complete game state, which is composed mostly to frame rendering. of character and on-going action attributes. This simplified
case allows for the extraction of a lightweight model of the game (i.e., state transition probabilities). While this requires some additional development effort, we can achieve a dramatic speedup in training agents by avoiding (reinforcement) learning and resorting to planning techniques instead.

In particular, we use the A* algorithm for the simplicity of proposing a heuristic that can be tailored to the specific designer need by exploring the state transition graph, instead of more expensive iterative processes such as dynamic programming,(5) and even more expensive Monte Carlo search based algorithms. The customizable heuristics and the target states corresponding to different game-play objectives, which represent the style we are trying to achieve, provide sufficient control to conduct various experiments and explore multiple aspects of the game.

(5) Unfortunately, in dynamic programming every node participates in the computation, while it is often true that most of the nodes are not relevant to the shortest path problem, in the sense that they are unlikely candidates for inclusion in a shortest path [53].

Our heuristic for the A* algorithm is the weighted sum of the 3 main parameters that contribute to career progression: career level, current career experience points, and the amount of completed career events. These parameters are directly related: to gain career levels, players have to accumulate career experience points, and to obtain experience, players have to complete career events. The weights are attributed based on the order of magnitude each parameter has. Since levels are the most important, they receive the highest weight. The amount of completed career events has the lowest weight because it is already partially factored into the amount of career points received so far.

We also compare the A* results to the results from an optimization problem over a subspace of utility-based policies, approximately solved with an evolution strategy (ES) [54]. Our goal, in this case, is to achieve a high environment reward against a selected objective, e.g., reach the end of a career track while maximizing earned career event points. We design the ES objective accordingly. The agent performs an action a based on a probabilistic policy by taking a softmax over the utility measure U(a, s) of the actions in a game state s. Utility here serves as an action selection mechanism to compactly represent a policy. In a sense, it is a proxy to a state-action value Q-function Q(a, s) in RL. However, we do not attempt to derive the utility from Bellman's equation and the actual environment reward R. Instead, we learn parameters that define U(a, s) to optimize the environment rewards using the black-box ES optimization technique. In that sense, optimizing R by learning the parameters of U is similar to Proximal Policy Optimization (PPO), however, in much more constrained settings. To this end, we design the utility of an action as a weighted sum of the immediate action rewards r(a) and costs c(a). These are vector-valued quantities and are explicitly present in the game tuning, describing the outcome of executing such actions. The parameters evolved by the ES are the linear weights for the utility function U explained below and the temperature of the softmax function. An additional advantage of the proposed linear design of the utility function is a certain level of interpretability of the weights, which correspond to the utilities, as perceived by the agent, of the individual components of the resources or the immediate rewards. Such interpretability can guide changes to the tuning data.

Concretely, given the game state s, we design the utility U of an action a as

U(s, a) = r(a) · v(s) + c(a) · w(s).

The immediate reward r(a) here is a vector that can include quantities like the amount of experience, the amount of career points earned for the action, and the events triggered by it. The cost c(a) is a vector defined similarly. The action costs specify the amounts of resources required to execute such an action, e.g., how much time, energy, hunger, etc. a player needs to spend to successfully trigger and complete the action. The design of the tuning data makes both quantities r and c depend only on the action itself. Since both the immediate reward r(a) and the cost c(a) are vector-valued, the products in the definition of U above are dot products. The vectors v(s) and w(s) introduce the dependence of the utility on the current game state and are the weights defining the relative contribution of the immediate resource costs and immediate rewards towards the current goals of the agent.

The inferred utilities of the actions depend on the state, since some actions are more beneficial in certain states than in others. E.g., triggering a career event while not having enough resources to complete it successfully may be wasteful, and an optimal policy should avoid it. The relevant state components s = (s_1, ..., s_k) include available commodities like energy and hunger, and a categorical event indicator (0 if outside of the event and 1 otherwise), wrapped into a vector. The total number of relevant dimensions here is k. We design the weights v_a(s) and w_a(s) as bi-linear functions with the coefficients p = (p_1, ..., p_k) and q = (q_1, ..., q_k) that we are learning:

v_a(s) = sum_{i=1,...,k} p_i s_i  and  w_a(s) = sum_{i=1,...,k} q_i s_i.

To define the optimization objective J, we construct it as a function of the number of successfully completed events N and the number of attempted events M. We aim to maximize the ratio of successful to attempted events times the total number of successful events in the episode:

J(N, M) = N (N + ε) / (M + ε),

where ε is a small number less than 1, eliminating division by zero when the policy fails to attempt any events. The overall optimization problem looks like:

max_{p,q} J(N, M),

subject to the policy defined by actions selected with a softmax over their utilities parameterized with the (learned) parameters p and q.
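As a concrete illustration of the construction above, the following minimal Python sketch evaluates such a utility-based softmax policy and the episode objective J. It is not the paper's implementation: packing the learned coefficients into matrices P and Q (one weight per reward/cost component), and the helper names, are assumptions made here for clarity, and the ES loop that actually evolves these parameters and the softmax temperature is omitted.

```python
import numpy as np

def utility(P, Q, s, r, c):
    # U(s, a) = r(a) . v(s) + c(a) . w(s), where v(s) and w(s) are linear in
    # the state s with learned coefficients (here packed into matrices P, Q).
    v = P @ s
    w = Q @ s
    return float(r @ v + c @ w)

def softmax_policy(P, Q, s, actions, temperature, rng=np.random.default_rng()):
    # actions: list of (r(a), c(a)) vectors taken from the game tuning data.
    u = np.array([utility(P, Q, s, r, c) for r, c in actions])
    z = (u - u.max()) / temperature            # stabilized softmax logits
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(actions), p=probs)   # index of the sampled action

def episode_objective(num_success, num_attempted, eps=1e-3):
    # J(N, M) = N (N + eps) / (M + eps); eps < 1 guards against division by
    # zero when the policy never attempts an event.
    return num_success * (num_success + eps) / (num_attempted + eps)
```

Any black-box evolution strategy can then perturb P, Q, and the temperature and retain the perturbations that increase the average value of episode_objective over simulated episodes.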
The utility-based ES, as we describe it here, captures the design intention of driving career progression in the game-play by successful completion of career events. Due to the emphasis on event completion, our evolution strategy setup does not necessarily result in an optimization problem equivalent to the one we solve with A*. However, as we discuss below, it has a similar optimum most of the time, supporting the design view on the progression. A similar approach works for evaluating relationship progression, which is another important element of the game-play.

Fig. 3. Comparison of the average amount of career actions (appointments) taken to complete the career using A* search and the evolution strategy, adapted from [1].

We compare the number of actions that it takes to reach the goal for each career in Fig. 3, as computed by the two approaches. We emphasize that our goal is to show that a simple planning method, such as A*, can sufficiently satisfy the designer's goal in this case. We can see that the more expensive optimization-based evolution strategy performs similarly to the much simpler A* search.

The largest discrepancy arises for the Barista career, which might be explained by the fact that this career has an action that does not reward experience by itself, but rather enables another action that does. This action can be repeated often and can explain the high numbers despite having half the number of levels. Also, we observe that in the case of the medical career, the 2,000-node A* cutoff was potentially responsible for the under-performance of that solution.

When running the two approaches, another point of comparison can be made: how many sample runs are required to obtain statistically significant results? We performed 2,000 runs for the evolution strategy, while it is notable that the A* agent learns a deterministic playstyle, which has no variance. On the other hand, the agent trained using an evolution strategy has a high variance and requires a sufficiently high number of runs of the simulation to approach a final reasonable strategy [1].

In this use case, we were able to use a planning algorithm, A*, to explore the game space and gather data for the game designers to evaluate the current tuning of the game. This was possible because the goal was straightforward: to evaluate progression in the different careers. As such, the requirements of skill and style for the agent were achievable and simple to model. Over the next use cases, we analyze scenarios that call for different approaches as a consequence of having more complex requirements and subjective agent goals.

B. Measuring competent player progression

In the next case study, we consider a real-time multi-player mobile game with a stochastic environment and sequential actions. The game is designed to engage a large number of players for months. The game dynamics are governed by a complex physics engine, which makes it impractical to apply planning methods. This game is much more complex than The Sims Mobile in the sense that it requires the players to exhibit strategic decision making in order to progress in the game. When the game dynamics are unknown or complex, most recent success stories are based on model-free RL (and particularly variants of DQN and PPO). In this section, we show how such model-free control techniques fit into the paradigm of playtesting modern games.

In this game, the goal of the player (and subsequently the agent) is to level up and reach a particular milestone in the game. To this end, the player needs to make smart decisions in terms of resource mining and resource management for different tasks. In the process, the agent also needs to upgrade some buildings. Each upgrade requires a certain level of resources. If the player's resources are insufficient, the upgrade is not possible. A human player will be able to visually discern the validity of such an action by clicking on the particular building for upgrade. The designer's primary concern in this case study is to measure how a competent player would progress in the early stages of this game. In particular, the competent player is required to balance resources and make other strategic choices that the agent needs to discern as well.

We consider a simplified version of the state space that contains information about this early stage of the game, ignoring the full state space. The relevant part of the state space consists of ∼50 continuous and ∼100 discrete state variables. The set of possible actions α is a subset of a space A, which consists of ∼25 action classes, some of which are from a continuous range of possible action values, and some are from a discrete set of action choices. The agent has the ability to generate actions a ∈ A, but not all of them are valid at every game state, since α = α(s, t), i.e., α depends on the timestep and the game state. Moreover, the subset α(s, t) of valid actions may only partially be known to the agent. If the agent attempts to take an unavailable action, such as a building upgrade without sufficient resources, the action will be deemed invalid and no actual action will be taken by the game server.

While the problems of a huge state space [55], [56], [57], a continuous action space [58], and a parametric action space [59] could be dealt with, these techniques are not directly applicable to our problem. This is because, as we shall see, some actions will be invalid at times and inferring that information may not be fully possible from the observation space. Finally, the game is designed to last tens of millions of timesteps, taking the problem of training a functional agent in such an environment outside of the domain of previously explored problems.

We study game progression while taking only valid actions. As we already mentioned, the set of valid actions α may not be fully determined by the current observation, and hence, we deal with a partially observable Markov decision process (POMDP). Given the practical constraints outlined above, it is infeasible to apply deep reinforcement learning to train agents in the game in its entirety. In this game, we show progress toward training an artificial agent that takes valid actions and progresses in the game like a competent human player. To this end, we wrap this game in the game environment and connect it to our training pipeline with DQN and Rainbow agents. In
7 Fig. 4. This plot belongs to Section IV-B. Average cumulative reward (return) in training and evaluation for the agents as a function of the number of iterations. Each iteration is worth ∼60 minutes of game-play. The trained agents are: (1) a DQN agent with complete state space, (2) a Rainbow agent with complete state space, (3) a DQN agent with augmented observation space, and (4) a Rainbow agent with augmented observation space. the agent environment, we use a feedforward neural network We trained four types of agents as shown in Fig. 4, where we with two fully connected hidden layers, each with 256 neurons are plotting the average undiscounted return in each training followed by ReLU activation. episode. By design, this quantity is upper bounded by ‘+1’, As a first step in measuring game progression, we define which is achieved if the agent keeps taking valid actions until an episode by setting an early goal state in the game that takes reaching the final goal state. In reality, this may not always an expert human player ∼5 minutes to reach. We let the agent be achievable as there are periods of time where no action is submit actions to the game server every second. We may have available and the agent has to choose the “do nothing” action to revisit this assumption for longer episodes where the human and be rewarded with ‘-0.1’. Hence, the best a competent player is expected to interact with the game more periodically. human player would achieve on these episodes would be We use a simple rewarding mechanism, where we reward the around zero. agent with ‘+1’ when they reach the goal state, ‘-1’ when they We see that after a few iterations, both the Rainbow and submit an invalid action, ‘0’ when they take a valid action, and DQN agents converge to their asymptotic performance values. ‘-0.1’ when they choose the “do nothing” action. The game is The Rainbow agent converges to a better asymptotic perfor- such that at times the agent has no other valid action to choose, mance level as compared to the DQN agent. However, in the and hence they should choose “do nothing”, but such periods the transient behavior we observe that the DQN agent achieves do not last more than a few seconds in the early stages of the the asymptotic behavior faster than the Rainbow agent. We game, which is the focus of this case study. believe this might be due to the fact that we did not tune hyperparameters of prioritized experience replay [60], and We consider two different versions of the observation space, distributional RL [61].6 We used the default values that worked both extracted from the game engine (state abstraction). The best on Atari games with frame buffer as state space. Extra first is what we call the “complete” state space. The complete hyperparameter tuning would have been costly in terms of state space contains information that is not straightforward to cloud infrastructure for this particular problem as the game infer from the real observation in the game and is only used as server does not allow speedup and training once already takes a baseline for the agent. In particular, the complete state space a few hours. also includes the list of available actions at each state. The As expected, we see in Fig. 4 that the augmented obser- polar opposite of this state space could be called the “naive” vation space makes the training slower and also results in state space, which only contains straightforward information a worse performance on the final strategy. 
In addition, at that is always shown on the screen of the player. The second evaluation time, the agent keeps attempting invalid actions in state space we consider is what we call the “augmented” ob- some cases as the state remains mostly unchanged after each servation space, which contains information from the “naive” attempt and the policy is (almost) deterministic. These results state space and information the agent would reasonably infer in accumulating large negative returns in such episodes which and retain from current and previous game observations. For account for the dips in the right-hand-side panel in Fig. 4 at example, this includes the amount of resources needed for evaluation time. an upgrade after the agent has checked a particular building The observed behavior drew our attention to the question of for an upgrade. The augmented observation space does not whether it is too difficult to discern and keep track of the set of include the set of all available actions, and hence, we rely valid actions for a human player as well. In fact, after seeking on the game server to validate whether a submitted action is more extensive human feedback the game designers concluded available because it is not possible to encode and pass the set that better visual cues were needed for a human player on α of available actions. Hence, if an invalid action is chosen by the agent, the game server will ignore the action and will 6 This is consistent with the results of Section V-C, where Rainbow with flag the action so that we can provide a ‘-1’ reward. default hyperparameters does not outperform DQN either.
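Purely as an illustrative summary of the setup in Section IV-B (not the paper's Dopamine-based pipeline; the helper names below are hypothetical), the reward scheme and the two-hidden-layer Q-network described above could be sketched as follows.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Feedforward Q-network over engineered state features: two fully
    connected hidden layers of 256 units with ReLU activations."""

    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, state):
        return self.net(state)

def step_reward(reached_goal: bool, action_was_valid: bool, did_nothing: bool) -> float:
    # '+1' on reaching the goal state, '-1' for an invalid action,
    # '-0.1' for the "do nothing" action, and '0' for any other valid action.
    if reached_goal:
        return 1.0
    if not action_was_valid:
        return -1.0
    return -0.1 if did_nothing else 0.0
```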
8 information about valid actions at each state so that the human are also other cases where the preferred solution for training players could progress more smoothly without being blocked agents would utilize a few relatively short demonstration by invalid actions. As next steps, we intend to experiment episodes played by the software developers or designers at with shaping the reward function for achieving different play the end of the current development cycle [67]. styles to be able to better model different player clusters. In this application, we consider training artificial agents We also intend to investigate augmenting the replay buffer in an open-world video game, where the game designer is with expert demonstrations for faster training and also for interested in training non-player characters that exhibit certain generative adversarial imitation learning [62] once the game behavioral styles. The game we are exploring is a shooter with is released and human play data is available. contextual game-play and destructible environment. While the We remark that without state abstraction (streamlined access game can run in multi-player mode, we focus on single-player, to the game state), the neural network function approximator which provides us with an environment tractable yet rich used for Q-learning would have needed to discern all such enough to test the approach we discuss in this section. An information from the pixels in the frame buffer, and hence we agent in such a game would have its state composed of a would not have been able to get away with such a simple two- 3D-vector location, velocity, animation pose, health, weapon layer feedforward function approximator to solve this problem. type, ammunition, scope on-off, cone of sight, collision state, However, we observe that the training within the paradigm distance to the obstacles in the principle directions, etc. Overall of model-free RL remains costly. Specifically, even using the the dimensionality of the agent state can grow to several complete state space, it takes several hours to train an agent dozens of variables with some of them continuous and the that achieves a level of performance expected of a competent other categorical. We construct similar vectors for NPCs with human player on this relatively short episode of ∼5 minutes. which the agent needs to engage. This calls for the exploration of complementary approaches The NPC state variables account for partial observability to augment the training process. In particular, we also would and line-of-site constraints imposed by the level layout, its like to streamline this process by training reusable agents and destruction state, and the current location of the agent rel- capitalizing on existing human data through imitation learning. ative to the NPCs. The NPCs in this environment represent adversarial entities, trying to eliminate the agent by attack- V. G AME - PLAYING AI ing it until the agent runs out of health. Additionally, the environment can contain objects of interest, like health and We have shown the value of simulated agents in a fully ammo boxes, dropped weapons, etc. The environment itself modeled game, and the potential of training agents in a com- is non-deterministic stochastic, i.e., there is no single random plex game to model player progression for game balancing. seed which we can set to control all random choices in the We can take these techniques a step further and make use of environment. 
Additionally, frequent saving and reloading game agent training to help build the game itself. Instead of applying state is not practical due to relatively long loading times. RL to capture player behaviors, we consider an approach The main objective for us in this application of the agents to game-play design where the player agents learn behavior training is to provide designer with a tool to playtest the game policies from the game designers. The primary motivation by interacting with the game in a number of particular styles of that is to give direct control into the designer hands and to emulate different players. The styles can include: enable easy interactive creation of various behavior types, • Aggressive, when an agent tries to find, approach and e.g., aggressive, exploratory, stealthy, etc. At the same time, defeat any adversarial NPC, we aim to complement organic demonstrations with bootstrap • Sniper, when an agent finds a good sniping spot and waits and heuristics to eliminate the need for a human to train an for adversarial NPCs to appear in the cone of sight to agent on the states normally not encountered by humans, e.g., shoot them, unblocking an agent using obstacle avoidance. • Exploratory, when an agent attempts to visit as many locations and uncover as many objects of interest in the A. Human-Like Exploration in an Open-World Game level as possible without getting engaged into combat To bridge the gap between the agent and the designer, unless encountering an adversarial NPC, we introduce imitation learning (IL) to our system [62], • Sneaky, when an agent actively avoids combat while [63], [64], [65]. In the present application, IL allows us to trying to achieve its objectives like reaching a particular translate the intentions of the game designer into a primer points on the map. and a target for our agent learning system. Learning from Additionally, combat style can vary with deploying different expert demonstrations has traditionally proved very helpful in weapons, e.g., long-range or hand-to-hand. training agents, including in games [66]. In particular, the Obviously, manual test and evaluation of its results while original Alpha Go [11] used expert demonstrations in training following outlined styles is very time consuming and tedious. a deep Q network. While it is argued in subsequent work Having an agent that can replace a designer in this process that learning via self-play could achieve a better asymptotic would be a great time saver. Conceivably, an agent trained return compared to relying on expert demonstrations, the as the design helper can also be playing as a stand-in or an better performance comes with significantly higher cost in “avatar” of an actual human player to replace the player when terms of training computational resources and the superhuman she is not online or the connection drops out for short period performance is not what we are seeking in this work. There of time. An agent trained in a specific style can also fill a
vacant spot in a squad, or help with the cold start problem for multi-player games.

Fig. 5. Model performance measures the probability of the event that the Markov agent finds at least one previous action from human-played demonstration episodes in the current game state. The goal of interactive learning is to add support for new game features to the already trained model or improve its performance in underexplored game states. Plotted is the model performance during interactive training from demonstrations in a proprietary open-world game as a function of time measured in milliseconds (with the total duration around 10 minutes). The corresponding section of the paper offers additional details for the presented experiment.

In terms of the skill/style tradeoff laid out earlier in the paper, these agents are not designed to have any specific level of performance (e.g., a certain kill-death ratio) and they may not necessarily follow any long-term goals. These agents are intended to explore the game and also be able to interact with human players at a relatively shallow level of engagement. Hence, the problem boils down to efficiently training an agent using demonstrations capturing only certain elements of the game-play. The training process has to be computationally inexpensive, and the agent has to imitate the behavior of the teacher(s) by mimicking their relevant style (in a statistical sense) as an implicit representation of the teacher's objectives.

Casting this problem directly into the RL framework is complicated by two issues. First, it is not straightforward to design a rewarding mechanism for imitating the style of the expert. While inverse RL aims at solving this problem, its applicability is not obvious given the reduced representation of the huge state-action space that we deal with and the ill-posed nature of the inverse RL problem [68], [69]. Second, the RL training loop often requires thousands of episodes to learn useful policies, directly translating to a high cost of training in terms of time and computational resources. Hence, rather than using more complex solutions such as generative adversarial imitation learning [62], which uses an RL network as its generator, we propose a solution to the stated problem based on an ensemble of multi-resolution Markov models. One of the major benefits of the proposed model is the ability to perform interactive training within the same episode. As a useful byproduct of our formulation, we can also sketch a mechanism for numerical evaluation of the style associated with the agents we train. We outline the main elements of the approach next and, for additional details, point at [6], [2], [5], [8].

1) Markov Decision Process with Extended State: Following the established treatment of training artificial agents, we place the problem into the standard MDP framework and augment it as follows. Firstly, we ignore the difference between the observation and the actual state s of the environment. The actual state may be observable by the teacher but may be impractical to expose to the agent. To mitigate the partial observability, we extend observations with a short history of previously taken actions. In addition to implicitly encoding the intent of a teacher and her reactions to potentially richer observations, this also helps to preserve the stylistic elements of human demonstrations.

Concretely, we assume the following. The interaction of the agent and the environment takes place at discrete moments t = 1, ..., T, with the value of t trivially observable by the agent. The agent, after receiving an observation s_t at time t, can take an action a_t ∈ A(s, t) from the set of allowed actions A(s, t) using a policy π : s → a. Executing an action in the environment results in a new state s_{t+1}. Since we focus on the stylistic elements of the agent behavior, the rewards are inconsequential for the model we build, and we drop them from the discussion. Next, we consider the episode-based environment, i.e., after reaching a certain condition, the execution of the described state-action loop ends. A complete episode is a sequence E = {(s_t, a_t)}_{t = 1, ..., T}. The fundamental assumption regarding the described decision process is that it has the Markov property.

Besides the most recent action taken before time t, i.e., action a_{t-1}, we also consider a recent history of the past n actions, where 1 ≤ n < T: α_{t,n} := (a_{t-n}, ..., a_{t-1}), whenever it is defined in an episode E. For n = 0, we define α_{t,0} as the empty sequence. We augment the directly observed state s_t with the action history α_{t,n} to obtain an extended state S_{t,n} = (s_t, α_{t,n}).
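As a small, self-contained illustration (the data layout is assumed here, not taken from the paper), the extended state just defined can be materialized from a recorded episode as follows.

```python
def extended_state(episode, t, n):
    """Build S_{t,n} = (s_t, alpha_{t,n}) from an episode of (state, action) pairs.

    episode: list of (s_t, a_t) pairs, 0-indexed here for convenience;
    alpha_{t,n} is the tuple of the n actions taken before time t
    (the empty tuple when n = 0), defined only when t >= n.
    """
    if t < n:
        raise ValueError("alpha_{t,n} is undefined this early in the episode")
    state = episode[t][0]
    history = tuple(action for _, action in episode[t - n:t])
    return (state, history)

# Example: the order-2 extended state at t = 3
episode = [("s0", "a0"), ("s1", "a1"), ("s2", "a2"), ("s3", "a3")]
assert extended_state(episode, 3, 2) == ("s3", ("a1", "a2"))
```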
The purpose of including the action history is to capture additional information (i.e., stylistic features and the elements of the objective-driven behavior of the teacher) from the human controlling the input during interactive demonstrations. An extended policy π_n, which operates on the extended states, π_n : S_{t,n} → a_t, is useful for modeling human actions in a manner similar to n-gram text models in natural language processing (NLP) (e.g., [70], [71], [72]). Of course, the analogy with n-gram models in NLP works only if both state and action spaces are discrete. We will address this restriction in the next subsection using multi-resolution quantization.

For a discrete state-action space and various n, we can compute the probabilities P{a_t | S_{t,n}} of transitions S_{t,n} → a_t occurring in (human) demonstrations and use them as a Markov model M_n of order n of (human) actions. We say that the model M_n is defined on an extended state S_{·,n} if the demonstrations contain at least one occurrence of S_{·,n}. When a model M_n is defined on S, we can use P{a_t | S_{t,n}} to sample the next action from all ever observed next actions in state S_{·,n}. Hence, M_n defines a partial stochastic mapping M_n : S_{·,n} → A from extended states to the action space A.

2) Stacked Markov models: We call a sequence of Markov models M_n = {M_i}_{i = 0, ..., n} a stack of models. A (partial) policy defined by M_n computes the next action at a state s_t; see [6] for the pseudo-code of the corresponding algorithm. Such a policy performs simple behavior cloning. The policy is partial since it may not be defined on all possible extended states and needs a fallback policy π_* to provide a functional agent acting in the environment.

Note that it is possible to implement sampling from a Markov model using an O(1) complexity operation with hash tables, making the inference very efficient and suitable for real-time execution in a video game or another interactive application where the expected inference time has to be on the scale of 1 ms or less.(7)

(7) A modern video game runs at least at 30 frames per second, with many computations happening during the roughly 33 ms allowed per frame, drastically limiting the "budget" allocated for inference.

3) Quantization: Quantization (aka discretization) allows us to work around the limitation of a discrete state-action space, enabling the application of the Markov Ensemble approach to environments with continuous dimensions. Quantization is commonly used in solving MDPs [73] and has been extensively studied in the signal processing literature [74], [75]. Using quantization schemes that have been optimized for specific objectives can lead to significant gains in model performance, improving various metrics vs. ad-hoc quantization schemes, e.g., [73], [76].

Instead of trying to pose and solve the problem of optimal quantization, we use a set of quantizers covering a range of schemes from coarse to fine. At the conceptual level, such an approach is similar to multi-resolution methods in image processing, and to mip-mapping and Level-of-Detail (LoD) representations in computer graphics [77]. The simplest quantization is a uniform one with step σ:

Q_σ(x) = σ ⌊x/σ⌋.

For illustration purposes, it is sufficient to consider only the uniform quantization Q_σ. In practice, most variables have naturally defined limits which are at least approximately known. Knowing the environment scale gives an estimate of the smallest step size σ_0 at which we have complete information loss, i.e., all observed values map to a single bin. For each continuous variable in the state-action space, we consider a sequence of quantizers with decreasing step size, Q = {Q_{σ_j}}_{j = 0, ..., K}, σ_j > σ_{j+1}, which naturally gives a quantization sequence Q̄ for the entire state-action space, provided K is fixed across the continuous dimensions. To simplify notation, we collapse the sub-index and write Q_j to stand for Q_{σ_j}. For more general quantization schemes, the main requirement is a decreasingly smaller reconstruction error for Q_{j+1} in comparison to Q_j.

For an episode E, we compute its quantized representation in an obvious component-wise manner:

E_j = Q̄_j(E) = {(Q̄_j(s_t), Q̄_j(a_t))}_{t = 1, ..., T},   (1)

which defines a multi-resolution representation of the episode as a corresponding ordered set {E_j}_{j ∈ {0, ..., K}} of quantized episodes, where Q̄ is the vector version of the quantization Q.

In the quantized Markov model M_{n,j} = Q̄_j(M_n), which we construct from the episode E_j, we compute extended states using the corresponding quantized values. Hence, the extended state is Q̄_j(S_{t,n}) = (Q̄(s_t), Q̄(α_{t,n})). Further, we define the model Q̄_j(M_n) to contain the probabilities P{a_t | Q̄_j(S_{t,n})} for the original action values. In other words, we do not rely on the reconstruction mapping Q̄_j^{-1} to recover an action but store the original actions explicitly. In practice, continuous action values tend to be unique, and the model samples from the set of values observed after the occurrences of the corresponding extended state. Our experiments show that replaying the original actions instead of their quantized representation provides better continuity and a natural, true-to-the-demonstration look of the cloned behavior.

4) Markov Ensemble: Combining together the stacking and multi-resolution quantization of Markov models, we obtain a Markov Ensemble E as an array of Markov models parameterized by the model order n and the quantization schema Q_j:

E_{N,K} = {M_{i,j}}, i = 0, ..., N, j = 0, ..., K.   (2)

The policy defined by the ensemble (2) computes each next action in an obvious manner. The Markov Ensemble technique, together with the policy defined by it, are our primary tools for cloning behavior from demonstrations. Note that with the coarsest quantization σ_0 present in the multi-resolution schema, the policy should always return an action sampled using one of the quantized models, which at level 0 always finds a match. Hence, such models always "generalize" by resorting to simple sampling of actions when no better match is found in the observations. Excluding too coarse quantizers and Markov order 0 will result in executing some "default policy" π_*, which we discuss in the next section. The agent execution with the outlined ensemble of quantized stacked Markov models is easy to express as an algorithm, which in essence boils down to look-up tables [6].
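To make the ensemble concrete, the following self-contained Python sketch builds quantized, order-n Markov models from demonstration episodes and samples actions with a fallback over model order and resolution. It is an illustrative reading of the description above, not the paper's code: the data layout is assumed, and for brevity the action history is stored unquantized, whereas the text quantizes it as well.

```python
import math
import random
from collections import defaultdict

def quantize(value, step):
    # Uniform quantizer Q_sigma(x) = sigma * floor(x / sigma).
    return step * math.floor(value / step)

def quantize_state(state, step):
    # Component-wise quantization of a tuple of continuous features;
    # categorical features would simply be passed through unchanged.
    return tuple(quantize(x, step) for x in state)

class MarkovModel:
    """Order-n model: maps a quantized extended state to the original
    (unquantized) actions observed there in the demonstrations."""

    def __init__(self, order, step):
        self.order, self.step = order, step
        self.table = defaultdict(list)

    def _key(self, state, history):
        past = tuple(history[-self.order:]) if self.order else ()
        return (quantize_state(state, self.step), past)

    def fit(self, episodes):
        for episode in episodes:            # episode: list of (state, action) pairs
            history = []
            for state, action in episode:
                if len(history) >= self.order:
                    self.table[self._key(state, history)].append(action)
                history.append(action)
        return self

    def sample(self, state, history):
        actions = self.table.get(self._key(state, history))
        return random.choice(actions) if actions else None

def build_ensemble(episodes, max_order, steps):
    # steps are ordered from the coarsest sigma_0 to the finest sigma_K.
    return {(n, j): MarkovModel(n, step).fit(episodes)
            for n in range(max_order + 1) for j, step in enumerate(steps)}

def ensemble_policy(ensemble, state, history, default_policy):
    # Prefer higher order and finer resolution; order 0 at the coarsest step
    # always finds a match, so default_policy is only needed when such
    # models are deliberately excluded from the ensemble.
    for n, j in sorted(ensemble, reverse=True):
        action = ensemble[(n, j)].sample(state, history)
        if action is not None:
            return action
    return default_policy(state)
```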