Encoding Human Domain Knowledge to Warm Start Reinforcement Learning
Encoding Human Domain Knowledge to Warm Start Reinforcement Learning

Andrew Silva, Matthew Gombolay
Institute for Robotics and Intelligent Machines, Georgia Institute of Technology
andrew.silva@gatech.edu, matthew.gombolay@cc.gatech.edu

arXiv:1902.06007v4 [cs.LG] 23 Sep 2020

Abstract

Deep reinforcement learning has been successful in a variety of tasks, such as game playing and robotic manipulation. However, attempting to learn tabula rasa disregards the logical structure of many domains as well as the wealth of readily available knowledge from domain experts that could help "warm start" the learning process. We present a novel reinforcement learning technique that allows for intelligent initialization of a neural network's weights and architecture. Our approach permits encoding domain knowledge directly into a neural decision tree, and improves upon that knowledge with policy gradient updates. We empirically validate our approach on two OpenAI Gym tasks and two modified StarCraft 2 tasks, showing that our novel architecture outperforms multilayer-perceptron and recurrent architectures. Our knowledge-based framework finds superior policies compared to imitation learning-based and prior knowledge-based approaches. Importantly, we demonstrate that our approach can be used by untrained humans to initially provide a > 80% increase in expected reward relative to baselines prior to training (p < 0.001), which results in a > 60% increase in expected reward after policy optimization (p = 0.011).

1 Introduction

As reinforcement learning (RL) is applied to increasingly complex domains, such as real-time strategy games or robotic manipulation, RL and imitation learning (IL) approaches fail to quickly capture the wealth of expert knowledge that already exists for many domains. Existing approaches to using IL as a warm start require large datasets or tedious human labeling as the agent learns everything, from vision to control to policy, all at once. Unfortunately, these large datasets often do not exist, as collecting these data is impractical or expensive, and humans will not patiently label data for IL-based agents (Amershi et al. 2014). While humans may not label enough state-action pairs to train IL-based agents, there is an opportunity to improve warm starts by soliciting expertise from a human once, and then leveraging this expertise to initialize an RL agent's neural network architecture and policy. With this approach, we circumvent the need for IL and instead directly imbue human expertise into an RL agent.

To achieve this blending of human domain knowledge with the strengths of RL, we propose Propositional Logic Nets (ProLoNets), a new approach to directly encode domain knowledge as a set of propositional rules into a neural network, as depicted in Figure 1. Our approach leverages decision tree policies from humans to directly initialize a neural network (Figure 2). We use decision trees to allow humans to specify behaviors to guide the agent through a given domain, such as high-level instructions for keeping a pole balanced on the cart pole problem. Importantly, this policy specification does not require the human to demonstrate the balancing act in all possible states, nor does it require the human to label actions as being "good" or "bad."

By directly imbuing logical propositions from the tree into neural network weights, an RL agent can immediately begin learning productive strategies. This approach leverages readily available domain knowledge while still retaining the ability to learn and improve over time, eventually outperforming the expertise with which it was initialized. By exploiting the structural and logical rules inherent to many tasks to which RL is applied, we can bypass early random exploration and expedite an agent's learning in a new domain.

We demonstrate that our approach can outperform standard deep RL across two OpenAI Gym domains (Brockman et al. 2016) and two modified StarCraft II domains (Vinyals et al. 2017), and that our framework is superior to state-of-the-art, IL-based RL, even with observation of that same domain expert knowledge. Finally, in a wildfire simulation domain, we show that our framework can work with untrained human participants. Our three primary contributions include:

1. We formulate a novel approach for capturing human domain expertise in a trainable RL framework via our architecture, ProLoNets, which we show outperforms baseline RL approaches, including IL-based (Cheng et al. 2018) and knowledge-based techniques (Humbird, Peterson, and McClarren 2018), obtaining > 100% more average reward on a StarCraft 2 mini-game.

2. We introduce dynamic growth to ProLoNets, enabling greater expressivity over time to surpass original initializations and yielding twice as much average reward in the lunar lander domain.

3. We conduct a user study in which non-expert humans leveraged ProLoNets to specify policies that resulted in higher cumulative rewards, both before and after training, relative to all baselines (p < 0.05).

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Figure 1: A visualization of our approach as it applies to our user study. Participants interact with a UI of state-checks and actions to construct a decision tree policy that is then used to directly initialize a ProLoNet's architecture and parameters. The ProLoNet can then begin reinforcement learning in the given domain, outgrowing its original specification.

2 Related work

Warm starts have been used for RL (Cheng et al. 2018; Zhang and Sridharan 2019; Zhu and Liao 2017) as well as in supervised learning for many tasks (Garcez, Broda, and Gabbay 2012; Hu et al. 2016; Kontschieder et al. 2015; Wang et al. 2017). While these warm start or knowledge-based systems have provided an interesting insight into the efficacy of warm starts or human-in-the-loop learning in various domains, these systems typically involve either large labeled datasets with tedious human labeling and feedback, or they require some automated oracle to label actions as "good" or "bad." In highly challenging domains or problems, building such an oracle is rarely feasible. Moreover, it is not always possible to acquire a large labeled dataset for new domains. However, it is often possible to solicit a policy from a human in the form of a high-level series of if-then checks in critical states. These decisions can be collected as a decision tree. Our research seeks to convert such a decision tree into a neural network for RL.

Researchers have previously sought to bridge the gap between decision trees and deep networks (Humbird, Peterson, and McClarren 2018; Kontschieder et al. 2015; Laptev and Buhmann 2014). This work has focused on partitioning a subspace of the data for more efficient inference (Tanno et al. 2018), enabling more explicit interpretability by visualizing a network's classification policy (Frosst and Hinton 2017; Silva et al. 2020), or warm starting through supervised pre-training on labeled data. As discussed, this data may not be available, thus creating a need for methods which can solicit this initialization tree directly from a human.

Most closely related to our work is deep jointly-informed neural networks (DJINN) (Humbird, Peterson, and McClarren 2018), which is the latest in a long line of knowledge-based neural network research (França, Zaverucha, and Garcez 2014; Garcez, Broda, and Gabbay 2012; Maclin and Shavlik 1996; Richardson and Domingos 2006; Towell and Shavlik 1994). DJINN uses a decision tree learned over a training set in order to initialize the structure of a network's hidden layers and to route input data appropriately. However, DJINN does not explicitly initialize rules, nor does it leverage rules solicited from humans. This distinction means that DJINN creates an architecture for routing information appropriately, but the decision-criteria in each layer must be learned from scratch. Our work, on the other hand, directly initializes both the structure and the rules of a neural network, meaning that the human's expertise is more completely leveraged for a more useful warm start in RL domains. We build on decades of research demonstrating the value of human-in-the-loop learning (Towell and Shavlik 1994; Zhang et al. 2019) to leverage logical rules solicited from humans in the form of a decision tree to intelligently initialize the structure and rules of a deep network.

Our work is related to IL and to knowledge-based or human-in-the-loop RL frameworks (Zhang et al. 2019; Zhang and Sridharan 2019; MacGlashan et al. 2017) and apprenticeship learning and IRL (Abbeel and Ng 2004; Knox and Stone 2009). Importantly, however, our approach does not require demonstrations or datasets to mimic human behavior. While our approach directly initializes with a human-specified policy, IL methods require large labeled datasets (Edwards et al. 2018) or an oracle to label data before transitioning to RL, as in the LOKI (Cheng et al. 2018) framework. Our approach translates human expertise directly into an RL agent's policy and begins learning immediately, sidestepping the IL and labeling phase.

3 Preliminaries

Within RL, we consider problems presented as a Markov decision process (MDP), which is a 5-tuple ⟨S, A, T, R, λ⟩, where s ∈ S are states drawn from the state space or domain, a ∈ A are possible actions drawn from the action space, T(s′, a, s) is the transition function representing the likelihood of reaching a next state s′ by taking some action a in a given state s, R(s) is the reward function which determines the reward for each state, and λ is a discount factor. In this work, we examine discrete action spaces and semantically meaningful state spaces; intelligent initialization for continuous outputs and unstructured inputs is left to future work. The goal of our RL agent is to find a policy, π(a|s), that selects actions in states to maximize the agent's expected long-term cumulative reward. IL approaches, such as ILPO (Edwards et al. 2018), operate under a similar framework, though they do not make use of the reward signal and instead perform supervised learning according to oracle data.
Figure 2: A traditional decision tree and a ProLoNet. Decision nodes become linear layers, leaves become action weights, and the final output is a sum of the leaves weighted by path probabilities.

4 Approach

We provide a visual overview of the ProLoNet architecture in Figure 2. To intelligently initialize a ProLoNet, a human user first provides a policy in the form of some hierarchical set of decisions. These policies are solicited through simple user interactions for specifying instructions, as in Section 6. The user's decision-making process is then translated into a set of weights w_n ∈ W and comparator values c_n ∈ C representing each rule, shown in Algorithm 1. Each weight w_n determines which input features to consider and, optionally, how to weight them, as there is a unique weight value for each input feature (i.e., |w_n| = |S| for an input space S). The comparator c_n is used as a threshold for the weighted features.

Algorithm 1 Intelligent Initialization
1: Input: Expert Propositional Rules R_d
2: Input: Input Size I_S, Output Size O_S
3: W, C, L = {}
4: for r ∈ R_d do
5:   if r is a state check then
6:     s = feature index in r
7:     w = zero vector of length I_S, w[s] = 1
8:     c = comparison value in r
9:     W = W ∪ w, C = C ∪ c
10:  end if
11:  if r is an action then
12:    a = action index in r
13:    l = zero vector of length O_S, l[a] = 1
14:    L = L ∪ l
15:  end if
16: end for
17: Return: W, C, L

Each decision node D_n throughout the network is represented as D_n = σ[α(w_n^T ∗ X − c_n)], where X is the input data, σ is the sigmoid function, and α serves to throttle the confidence of decision nodes. Less confidence in the tree allows for more uncertainty in decision making (Yuan and Shaw 1995), leading to more exploration, even from an expert initialization. High values of α emphasize the difference between the comparator and the weighted input, thus pushing the tree to be more boolean. Lower values of α encourage a smoother tree, with α = 0 producing uniformly random decisions. We allow α to be a learned parameter.

After all decision nodes are processed, the value of D_n from each node represents the likelihood of that condition being TRUE. In contrast, (1 − D_n) represents the likelihood of the condition being FALSE. With these likelihoods, the network then multiplies out the probabilities for different paths to all leaf nodes. Every leaf l ∈ L contains a path z ∈ Z, a set of decision nodes which should be TRUE or FALSE in order to reach l, as well as a prior set of weights for each output action a. For example, in Figure 2, z_1 = D_1 ∗ D_2, and z_3 = (1 − D_1) ∗ D_3. The likelihood of each action a in leaf l_i is determined by multiplying the probability of reaching leaf l_i by the prior weight of the outputs within leaf l_i. After calculating the outputs for every leaf, the leaves are summed and passed through a softmax function to provide the final output distribution.

Example 1 (ProLoNet Initialization). Assume we are in the cart pole domain (Barto, Sutton, and Anderson 1983) and have solicited the following from a human: "If the cart's x position is right of center, move left; otherwise, move right," and that the user indicates x position is the first input feature of four and that the center is at 0. We therefore initialize our primary node D_0 with w_0 = [1, 0, 0, 0] and c_0 = 0, following lines 5-8 in Alg. 1. Following lines 11-13, we create a new leaf l_0 = [1, 0] (Move Left) and a new leaf l_1 = [0, 1] (Move Right). Finally, we set the paths Z(l_0) = D_0 and Z(l_1) = (¬D_0). The resulting probability distribution over the agent's actions is a softmax over (D_0 ∗ l_0 + (1 − D_0) ∗ l_1).
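To make the initialization and inference above concrete, the following is a minimal PyTorch sketch of the two-leaf ProLoNet from Example 1. It is illustrative only (class and variable names are ours, not the released implementation), but it reproduces the node computation D_0 = σ[α(w_0^T ∗ X − c_0)] and the path-weighted softmax over leaves described above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLeafProLoNet(nn.Module):
    # Decision node D0 checks "cart x-position > 0"; leaves encode Move Left / Move Right.
    def __init__(self):
        super().__init__()
        self.w0 = nn.Parameter(torch.tensor([1.0, 0.0, 0.0, 0.0]))  # selects feature 0
        self.c0 = nn.Parameter(torch.tensor(0.0))                   # comparator: center at 0
        self.alpha = nn.Parameter(torch.tensor(1.0))                # confidence throttle
        self.l0 = nn.Parameter(torch.tensor([1.0, 0.0]))            # leaf prior: Move Left
        self.l1 = nn.Parameter(torch.tensor([0.0, 1.0]))            # leaf prior: Move Right

    def forward(self, x):
        d0 = torch.sigmoid(self.alpha * (self.w0 @ x - self.c0))    # P(condition is TRUE)
        leaf_sum = d0 * self.l0 + (1.0 - d0) * self.l1              # path-weighted leaves
        return F.softmax(leaf_sum, dim=-1)                          # action distribution

net = TwoLeafProLoNet()
probs = net(torch.tensor([2.0, 1.0, 0.0, 3.0]))
# Here d0 = sigmoid(2.0), roughly 0.88, so the path-weighted leaf sum is roughly [0.88, 0.12],
# matching the quantities worked out by hand in Example 2 below; the final softmax then
# smooths this sum into the output distribution.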
Example 2 (ProLoNet Inference). Consider an example cart pole state, X = [2, 1, 0, 3], passed to the ProLoNet from Example 1. Following D_n = σ[α(w_n^T ∗ X − c_n)], the network arrives at σ([1, 0, 0, 0] ∗ [2, 1, 0, 3] − 0) = 0.88 for D_0, meaning "mostly true." This probability propagates to the two leaf nodes using their respective paths, making the output of the network a probability given by (0.88 ∗ [1, 0] + (1 − 0.88) ∗ [0, 1]) = [0.88, 0.12]. Accordingly, the agent selects the first action with probability 0.88 and the second action otherwise. An algorithmic expression of the forward pass is provided in the supplementary material.

Dynamic Growth – ProLoNets are able to follow expert strategies immediately, but they may lack the expressive capacity to learn more optimal policies once they are deployed into a domain. If an expert policy involves a small number of decisions, the network will have a small number of weight vectors and comparators to use for its entire existence. To enable the ProLoNet architecture to continue to grow beyond its initial definition, we introduce a dynamic growth procedure, which is outlined in Algorithm 2 and Figure 3.

Algorithm 2 Dynamic Growth
1: Input: ProLoNet P_d
2: Input: Deeper ProLoNet P_{d+1}
3: Input: ε = minimum confidence
4: H(l_i) = Entropy of leaf l_i
5: for l_i ∈ L ∈ P_d do
6:   Calculate H(l_i)
7:   Calculate H(l_d1), H(l_d2) for leaves under l_i in P_{d+1}
8:   if H(l_i) > (H(l_d1) + H(l_d2) + ε) then
9:     Deepen P_d at l_i using l_d1 and l_d2
10:    Deepen P_{d+1} at l_d1 and l_d2 randomly
11:  end if
12: end for

Upon initialization, a ProLoNet agent maintains two copies of its actor. The first is the shallower, unaltered initialized version, and the second is a deeper version in which each leaf is transformed into a randomly initialized decision node with two new randomly initialized leaves (line 1 of Alg. 2). This deeper agent has more parameters to potentially learn more complex policies, but at the cost of added randomness and uncertainty, reducing the utility of the intelligent initialization.

As the agent interacts with its environment, it relies on the shallower network to generate actions, as the shallow network represents the human's domain knowledge. After each episode, the off-policy update is run over the shallower and deeper networks. Finally, after the off-policy updates, the agent compares the entropy of the shallower actor's leaves to the entropy of the deeper actor's leaves and selectively deepens when the leaves of the deeper actor are less uniform than those of the shallower actor (lines 3-7). We find that this dynamic growth mechanism improves stability and average cumulative reward.

Figure 3: The dynamic growth process with a deeper ProLoNet shown in paler colors and dashed lines. When H(L_3) + H(L_4) < H(L_1), the agent replaces L_1 with D_2, L_3, and L_4 and adds a new level to the deeper actor.

Example 3 (ProLoNet Dynamic Growth). Assume the cart pole agent's shallower actor has found a local minimum with l_1 = [0.5, 0.5], while the deeper actor has l_3 = [0.9, 0.1] and l_4 = [0.1, 0.9]. Seeing that l_1 is offering little benefit to the current policy, and D_2 in the deeper actor is able to make a decision about which action offers the most reward, the agent would dynamically deepen at l_1, copying over the deeper actor's parameters and becoming more decisive in that area of its policy. The deeper actor would also grow with a random set of new parameters, as shown in Figure 3.
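The entropy comparison on lines 5-11 of Algorithm 2 can be sketched as follows. This is an illustrative fragment under our own naming; normalizing leaf weights with a softmax before computing entropy is an assumption rather than a documented detail of the released implementation.

import torch
import torch.nn.functional as F

def leaf_entropy(leaf_weights):
    # Shannon entropy of the action distribution implied by one leaf's weights.
    probs = F.softmax(leaf_weights, dim=-1)
    return -(probs * torch.log(probs + 1e-8)).sum()

def should_deepen(shallow_leaf, deep_leaf_1, deep_leaf_2, epsilon=0.1):
    # Deepen when the pair of candidate leaves in the deeper actor is jointly
    # less uniform (lower entropy) than the shallow leaf, by at least epsilon.
    h_shallow = leaf_entropy(shallow_leaf)
    h_deeper = leaf_entropy(deep_leaf_1) + leaf_entropy(deep_leaf_2)
    return bool(h_shallow > h_deeper + epsilon)

# An indecisive shallow leaf versus two decisive candidate leaves: the test fires.
print(should_deepen(torch.tensor([0.5, 0.5]),
                    torch.tensor([2.0, -2.0]),
                    torch.tensor([-2.0, 2.0])))  # True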
5 Experimental evaluation

We conduct two complementary evaluations of the ProLoNet as a framework for RL with human initialization. The first is a controlled investigation with expert initialization in which an author designs heuristics for a set of domains with varying complexity; this allows us to confirm that our architecture is competitive with baseline learning frameworks. We also perform an ablation of intelligent initialization and dynamic growth in this set of experiments. The second evaluation is a user study to support our claim that untrained users can specify policies that serve to improve RL.

In our first evaluation, we assess our algorithm in StarCraft II (SC2) for macro and micro battles as well as the OpenAI Gym (Brockman et al. 2016) lunar lander and cart pole environments. Optimization details, hyperparameters, and code are all provided in the supplementary material.

To evaluate the impact of dynamic growth and intelligent initialization, we perform an ablation study and include results from these experiments in Table 1. For each N-mistake agent, weights, comparators, and leaves are randomly negated according to N, up to a maximum of 2N for each category.

5.1 Agent formulations

We compare several agents across our experimental domains. The first is a ProLoNet agent as described above and with expert initialization. We also evaluate a multi-layer perceptron (MLP) agent and a long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) agent, both using ReLU activations (Nair and Hinton 2010). We include comparisons to a ProLoNet with random initialization (Random ProLoNet) as well as the Heuristic used to initialize our agents. We compare to an IL agent trained with the LOKI framework, in which the agent imitates for the first N episodes (Cheng et al. 2018), where N is a tuned hyperparameter, and then transitions to RL. The LOKI agent supervises with the same heuristic that is used to initialize the ProLoNet agent. Finally, although the original DJINN framework (Humbird, Peterson, and McClarren 2018) requires a decision tree learned over a labeled dataset, we extend the DJINN architecture to allow for initialization with a hand-crafted decision tree in order to compare to a DJINN agent that is initialized using the same heuristic as LOKI and ProLoNet, but built with the DJINN architecture.

5.2 Environments

We consider four environments to empirically evaluate ProLoNets: cart pole, lunar lander, the FindAndDefeatZerglings minigame from the SC2LE (Vinyals et al. 2017), and a full game of SC2 against the in-game artificial intelligence (AI). These environments provide us with a steady increase in difficulty, from a toy problem to the challenging game of full SC2. These evaluations also showcase the ability of the ProLoNet framework to compete with state-of-the-art approaches in simple domains and excel in more complex domains. For the SC2 and SC2LE problems, we use the SC2 API (https://github.com/Blizzard/s2client-api) to manufacture 193D and 37D state spaces, respectively, and 44D and 10D action spaces, respectively. In the full SC2 domain, making the right parameter update is a significant challenge for RL agents. As such, we verify that the agent's parameter updates increase its probability of victory, and if a new update has decreased the agent's chances of success, then the update is rolled back, and the agent gathers experience for a different step, similar to the checkpointing approach in Hosu and Rebedea (2016).

Figure 4: A comparison of architectures on (a) cart pole, (b) lunar lander, and (c) FindAndDefeatZerglings (Vinyals et al. 2017). As the domain complexity increases, we see that intelligent initialization is increasingly important, and ProLoNets are the most effective method for leveraging domain expertise, and perform well even when domain expertise is unnecessary, as in cart pole.

OpenAI Gym – As depicted in Figures 4a and 4b, ProLoNets are able to either match or exceed the performance of standard reinforcement and imitation learning based RL architectures. Furthermore, we find that the ProLoNet architecture, even without intelligent initialization, is competitive with baseline architectures in the OpenAI Gym. Running reward in these domains is averaged across five runs, as recommended by Henderson et al. (2018). MLP and LSTM agents use 1-layer architectures which maintain input dimension until the output layer. We find success with intelligent initializations using as few as three nodes for the cart pole domain and as few as 10 nodes for the lunar lander. These results show that ProLoNets can leverage user knowledge to achieve superior results, and our ablation study results (Table 1) show that the architecture is robust to sub-optimal initialization in these domains.

Even where intelligent initialization is not always necessary or where high-level instruction is difficult to provide, as in cart pole, it does not hinder RL from finding solutions to the problem. Further, while baselines appear unstable in these domains, potentially owing to missing implementation hacks and tricks (Engstrom et al. 2019), we observe that the ProLoNet approaches are able to succeed with the same PPO implementation and learning environment.

StarCraft II: FindAndDefeatZerglings – For this problem, we assign an agent to each individual allied unit. The best-performing initialization in this domain has 6 decision nodes and 7 leaves. Running reward is depicted in Figure 4c, again averaged over 5 runs. Intelligent initialization is crucial in this more complex domain, and the Random ProLoNet fails to find much success despite having the same architecture as the ProLoNet. LOKI performs on par with the Heuristic used to supervise actions, but LOKI is unable to generalize beyond the Heuristic. MLP and LSTM agents use a 7-layer architecture after a hyperparameter search, and we extend this to the full game of SC2. Importantly, this result (Figure 4c) shows user-initialized ProLoNets can outperform our baselines and that this initialization is key to efficient exploration and learning. The importance of the initialization policy is again shown in Table 1, where even negating 10% of the agent's parameters results in a significantly lower average reward.

StarCraft II: Full Game – After 5,000 episodes, no agent other than the ProLoNet is able to win a single game against the in-game AI at the easiest setting. Even the LOKI and DJINN agents, which have access to the same heuristics used by the ProLoNet, are unable to win one game. The ProLoNet, on the other hand, is able to progress to the "hard" in-game AI, achieving 100% win rates against easier opponents as it progresses. Even against the "hard" in-game AI, the ProLoNet agent is able to double its win rate from initialization. This result demonstrates the importance of an intelligent initialization in complex domains, where only a very narrow and specific set of actions yield successful results. Access to oracle labeling (LOKI) or a knowledge-based architecture (DJINN) does not suffice; the agent requires the actual warm start of having intelligent rules built in. Thus, we believe these results demonstrate that our novel formulation is singularly capable of harnessing domain knowledge.
Table 1: ProLoNet ablation study of average cumulative reward. Units are in thousands.

Domain     | ProLoNet   | Random ProLoNet | Shallow ProLoNet | N = 0.05   | N = 0.1    | N = 0.15
Cart Pole  | 449 ± 15   | 401 ± 26        | 415 ± 27         | 426 ± 30   | 369 ± 28   | 424 ± 29
Lunar      | 86 ± 33    | 55 ± 19         | 49 ± 20          | 50 ± 22    | 45 ± 22    | 45 ± 22
Zerglings  | 8.9 ± 1.5  | -1.3 ± 0.6      | 8.8 ± 1.5        | 5.1 ± 1.1  | 5.9 ± 1.2  | 4.1 ± 1.1

Table 2: Win rates against the StarCraft II in-game AI. "All Others" includes all agents in Section 5.1.

AI Difficulty | ProLoNet (Ours) | ProLoNet at Initialization | All Others
Very Easy     | 100%            | 14.1%                      | 0%
Easy          | 100%            | 10.9%                      | 0%
Medium        | 82.2%           | 11.3%                      | 0%
Hard          | 26%             | 10.7%                      | 0%

6 User study with non-experts

Our second evaluation investigates the utility of our framework with untrained humans providing the expert initialization for ProLoNets. As presented in Section 6.2, our user study shows that untrained users can leverage ProLoNets to train RL policies with superior performance. These results provide evidence that our approach can help democratize RL.

Hypotheses – We seek to investigate whether an untrained user can provide a useful initial policy for ProLoNets. Hypothesis 1 (H1): Expert initializations may be solicited from average users, requiring no particular training of the user, and these initializations are superior to random initializations. Hypothesis 2 (H2): RL can improve significantly upon these initializations, yielding superior policies after training.

Metrics – To test H1, we measure the reward over time for our best participant, all participants, and baseline methods. Testing H2, we measure the average reward for the first 50 and final 50 episodes for all agents specified by participants and our strongest baseline. Our metrics allow us to effectively examine our hypotheses in the context of expert initialization in our study domain.

Domain: Wildfire Tracking – We develop a Python simulator for a domain that is both suited to RL and of relevance to a wider audience: wildfire tracking. We randomly instantiate two fires and two drones in a 500x500 grid. The drones receive a 6D state as input, containing distances to fire centroids and Boolean flags for which drone is the closest to each fire. The action space for drones is a 4D discrete decision of which cardinal direction to move in. Pre-made state checks include statements such as "If I am the closest drone to Fire 2" and "If Fire 1 is to my west." The two drones are controlled by separate agents without communication, and network weights are shared. The reward function is the negative distance between drones and fire centroids, encouraging drones to follow the fire as closely as possible.
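To make this observation and reward design concrete, the sketch below assembles the 6D drone state and the distance-based reward. The function names and sign conventions are illustrative assumptions, not the study's released simulator.

import numpy as np

def drone_observation(drone_xy, other_drone_xy, fire1_xy, fire2_xy):
    # 6D state: north/west offsets to each fire centroid plus two "closest drone" flags.
    d_north_1, d_west_1 = fire1_xy[1] - drone_xy[1], drone_xy[0] - fire1_xy[0]
    d_north_2, d_west_2 = fire2_xy[1] - drone_xy[1], drone_xy[0] - fire2_xy[0]
    closer_to_f1 = float(np.linalg.norm(drone_xy - fire1_xy) < np.linalg.norm(other_drone_xy - fire1_xy))
    closer_to_f2 = float(np.linalg.norm(drone_xy - fire2_xy) < np.linalg.norm(other_drone_xy - fire2_xy))
    return np.array([d_north_1, d_west_1, d_north_2, d_west_2, closer_to_f1, closer_to_f2])

def drone_reward(drone_xy, tracked_fire_xy):
    # Negative distance to the tracked fire centroid: closer tracking yields higher reward.
    return -float(np.linalg.norm(drone_xy - tracked_fire_xy))

ACTIONS = ["north", "east", "south", "west"]  # 4D discrete action space

obs = drone_observation(np.array([100.0, 250.0]), np.array([400.0, 50.0]),
                        np.array([120.0, 300.0]), np.array([380.0, 80.0]))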
6.1 Study details

To solicit policy specifications from users, we designed a user interface that enabled participants to select from a set of pre-made state checks and actions. Participants were first briefed on the domain and shown a visualization, and then asked to talk through a strategy for monitoring the fires with two independent drones. After describing a solution and seeing the domain, participants were presented with the UI to build out their policies. As the participant selected options, those rules were composed into a decision tree. Once participants completed the study, we leveraged their policy specifications to initialize the structure and parameters of a ProLoNet. The ProLoNet was then deployed to the wildfire domain, where it further improved through RL. Our results are presented in Figure 6 and described below. We present both the highest performing participant ("Best") as well as the median over all participants ("Median"), and compare against the agents presented in Section 5.1. LOKI and DJINN agents use the "Best" participant policy specification as a heuristic.

6.2 Study results

Our IRB-approved study involved 15 participants (nine male, six female) between 21 and 29 years old (M = 24, SD = 2). The study took approximately 45 minutes, and participants were compensated for their time. Our pre-study survey revealed varying degrees of experience with robots and games, though we note that our participants were mostly computer science students. Importantly, we found that their prior experience with robots, learning from demonstration, or strategy games did not impact their ability to specify useful policies for our agents.

Nearly all participants provided policy specifications that were superior to random exploration. After performing RL over participant specifications, we can see in Figure 6 that intelligent initialization yields the most successful RL agents, even from non-experts. We compare to the best performing baseline, Random ProLoNet, in Figure 5. We can again see that the participants' initializations are not only better than random initialization, but are also better than the trained RL agent. A Wilcoxon signed-rank test shows that our participants' initializations (Median = -23, IQR = 19) were significantly better than a baseline initialization (Median = -87, IQR = 26), W(15) = 1.0, p < .001. Our participants' agents (Median = -7.9, IQR = 29) were also significantly better than a baseline (Median = -52, IQR = 7.9) after training, W(15) = 15.0, p = 0.011. These results are significant after applying a Bonferroni correction to test the relative performance both before and after training. This result supports hypothesis H1, showing that average users can specify useful policies for RL agents to explore more efficiently than random search and significantly outperform baselines.

Furthermore, our participants' agents are significantly better post-training than at initialization, as shown by a Wilcoxon signed-rank test (W(15) = 4.0, p < 0.01). This finding supports hypothesis H2, showing that RL improves on human specifications, not merely repeating what the humans have demonstrated. By combining human intuition and expertise with computation and optimization for long-term expected reward, we are able to produce agents that outperform both humans and traditional RL approaches.

Finally, we qualitatively demonstrate the utility of intelligent initialization and the ProLoNet architecture by deploying the top performing agents from each method to two drones with simulated fires to track. Videos of the top four agents are included as supplementary material.

Figure 5: Initial and final distance between drones and wildfire centroids in our user study domain, where lower distance is better. Participant initializations are significantly better at tracking fires than random, showing that untrained users can leverage our approach to provide useful warm starts.

Figure 6: Wildfire tracking results, again demonstrating the importance of direct intelligent initialization (ProLoNet) rather than IL or random initialization.

7 Discussion

We conducted two complementary evaluations of our proposed architecture, demonstrating the significance of our contribution. Through our first set of experiments on an array of RL benchmarks with a domain expert building heuristics, we empirically validated that ProLoNets are competitive with baseline methods when initialized randomly and, with a human initialization, outperform state-of-the-art imitation and RL baselines. As we see in Figure 4, ProLoNets are as fast as or faster than baseline methods to learn an optimal policy over the same environments and optimization frameworks. In our more complex domains, we identify the importance of an intelligent initialization. While the IL baseline performs well in the FindAndDefeatZerglings minigame, LOKI cannot improve on the imitated policy. In the full game of SC2, no approach apart from our intelligently-initialized ProLoNet wins even a single game. The ability to leverage domain knowledge to initialize rules as well as structure, rather than simply the architecture and routing of information, as in DJINN, is a key difference that enables the success of our approach.

Through our user study, we demonstrated the practicality of our approach and showed that average participants, even those with no prior experience in the given domain, can produce policy specifications which significantly exceed random initialization (p < 0.05). Furthermore, we have demonstrated that RL can significantly improve upon these policies, learning to refine "good enough" solutions into optimal ones for a given domain. This result shows us that our participants did not simply provide our agents with optimal solutions that were then iterated upon needlessly. Instead, our participants provided good but sub-optimal starting points for optimization. These starting policies were then refined into a solution that was more robust than either the human's solution or the best baseline solution. Our study confirms that our approach can leverage readily available human initializations for success in deep RL, and moreover, that the combination of human initialization and RL yields the best of both worlds.

8 Conclusion

We present a new architecture for deep RL agents, ProLoNets, which permits intelligent initialization of agents. ProLoNets grant agents the ability to grow their network capacity as necessary, and are surprisingly capable even with random initialization. We show that ProLoNets permit initialization from average users and achieve a high-performing policy as a result of the blend of human instruction and RL. We demonstrate, first, that our approach is superior to imitation and reinforcement learning on traditional architectures and, second, that intelligent initialization allows deep RL agents to explore and learn in environments that are too complex for randomly initialized agents. Further, we have confirmed that we can solicit these useful warm starts from average participants and still develop policies superior to baseline approaches in the given domains, paving the way for reinforcement learning to become a more collaborative enterprise across a variety of complex domains.
Ethical Considerations

Our work is a contribution targeted at democratizing reinforcement learning in complex domains. The current state of the art in reinforcement learning in complex domains requires compute time and power beyond the capacity of many labs, hand-engineering which is rarely explained publicly, or large labeled datasets which are not always shared. By providing a means for intelligent initialization by practitioners and improved exploration in many domains, we attempt to lower the barrier to entry for research in reinforcement learning and to broaden the number of potential applications of reinforcement learning to more grounded, real-world problems. While there are risks with any technology being misused, we believe the benefits of democratizing RL outweigh the risks. We posit that giving everyone the ability to use RL, rather than just large corporations and select universities, is a positive contribution to society.

Beneficiaries – Our work seeks to improve and simplify reinforcement learning research for all labs and to take steps toward democratizing reinforcement learning for non-experts. We feel that the computational and dataset savings of our work stand to benefit all researchers within reinforcement learning.

Negatively affected parties – We do not feel that any group of people or research direction is negatively impacted by this work. Our work is complementary to other explorations within reinforcement learning, and insights from imitation learning translate naturally into insights on the qualities of useful or harmful intelligent initializations.

Implications of failure – While our method seeks to simplify reinforcement learning, in the worst case the initialization falls back to random and the learning agent is again faced with an intractable random exploration problem. Adversarial agents using our approach would be able to instantiate a worse-than-random agent, though our results imply that it is possible to overcome such an initialization in simple domains.

Bias and fairness – Our work does rely on the "bias" of its initialization; that is, it is biased towards the actions which a human has pre-specified. While this biased exploration may fail to accurately explore or understand the intricacies of a complex domain, the alternative (years of compute with random exploration) is simply unavailable to many researchers. This bias may be overcome through diversification of intelligent initializations, which may lead to a diversity of final strategies. However, the unification of such diverse policies into a single agent and the thorough study of diverse initializations is left to future work.

References

Abbeel, P.; and Ng, A. Y. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, 1.

Amershi, S.; Cakmak, M.; Knox, W. B.; and Kulesza, T. 2014. Power to the people: The role of humans in interactive machine learning. AI Magazine 35(4): 105–120.

Barto, A. G.; Sutton, R. S.; and Anderson, C. W. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics SMC-13(5): 834–846.

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym.

Cheng, C.-A.; Yan, X.; Wagener, N.; and Boots, B. 2018. Fast Policy Learning through Imitation and Reinforcement. arXiv preprint arXiv:1805.10413.

Edwards, A. D.; Sahni, H.; Schroeker, Y.; and Isbell, C. L. 2018. Imitating Latent Policies from Observation. arXiv preprint arXiv:1805.07914.

Engstrom, L.; Ilyas, A.; Santurkar, S.; Tsipras, D.; Janoos, F.; Rudolph, L.; and Madry, A. 2019. Implementation Matters in Deep RL: A Case Study on PPO and TRPO. In International Conference on Learning Representations.

França, M. V.; Zaverucha, G.; and Garcez, A. S. d. 2014. Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine Learning 94(1): 81–104.

Frosst, N.; and Hinton, G. 2017. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784.

Garcez, A. S. d.; Broda, K. B.; and Gabbay, D. M. 2012. Neural-symbolic learning systems: foundations and applications. Springer Science & Business Media.

Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; and Meger, D. 2018. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence.

Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8): 1735–1780.

Hosu, I.-A.; and Rebedea, T. 2016. Playing Atari games with deep reinforcement learning and human checkpoint replay. arXiv preprint arXiv:1607.05077.

Hu, Z.; Ma, X.; Liu, Z.; Hovy, E.; and Xing, E. 2016. Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318.

Humbird, K. D.; Peterson, J. L.; and McClarren, R. G. 2018. Deep Neural Network Initialization With Decision Trees. IEEE Transactions on Neural Networks and Learning Systems.

Knox, W. B.; and Stone, P. 2009. Interactively shaping agents via human reinforcement: The TAMER framework. In Proceedings of the Fifth International Conference on Knowledge Capture, 9–16.

Kontschieder, P.; Fiterau, M.; Criminisi, A.; and Rota Bulo, S. 2015. Deep neural decision forests. In Proceedings of the IEEE International Conference on Computer Vision, 1467–1475.
Laptev, D.; and Buhmann, J. M. 2014. Convolutional decision trees for feature learning and segmentation. In German Conference on Pattern Recognition, 95–106. Springer.

MacGlashan, J.; Ho, M. K.; Loftin, R.; Peng, B.; Wang, G.; Roberts, D. L.; Taylor, M. E.; and Littman, M. L. 2017. Interactive learning from policy-dependent human feedback. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 2285–2294. JMLR.org.

Maclin, R.; and Shavlik, J. W. 1996. Creating advice-taking reinforcement learners. Machine Learning 22(1-3): 251–281.

Nair, V.; and Hinton, G. E. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 807–814.

Richardson, M.; and Domingos, P. 2006. Markov logic networks. Machine Learning 62(1-2): 107–136.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Silva, A.; Killian, T.; Rodriguez, I. D. J.; Son, S.-H.; and Gombolay, M. 2020. Optimization Methods for Interpretable Differentiable Decision Trees in Reinforcement Learning. In International Conference on Artificial Intelligence and Statistics.

Tanno, R.; Arulkumaran, K.; Alexander, D. C.; Criminisi, A.; and Nori, A. V. 2018. Adaptive Neural Trees. arXiv preprint arXiv:1807.06699.

Tieleman, T.; and Hinton, G. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4(2): 26–31.

Towell, G. G.; and Shavlik, J. W. 1994. Knowledge-based artificial neural networks. Artificial Intelligence 70(1-2): 119–165.

Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A. S.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J.; Schrittwieser, J.; et al. 2017. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782.

Wang, J.; Wang, Z.; Zhang, D.; and Yan, J. 2017. Combining knowledge with deep convolutional neural networks for short text classification. In Proceedings of IJCAI, volume 350.

Yuan, Y.; and Shaw, M. J. 1995. Induction of fuzzy decision trees. Fuzzy Sets and Systems 69(2): 125–139.

Zhang, R.; Torabi, F.; Guan, L.; Ballard, D. H.; and Stone, P. 2019. Leveraging human guidance for deep reinforcement learning tasks. arXiv preprint arXiv:1909.09906.

Zhang, S.; and Sridharan, M. 2019. AAAI Tutorial: Knowledge-based Sequential Decision-Making under Uncertainty. AAAI Workshop: Knowledge-based Sequential Decision-Making under Uncertainty.

Zhu, F.; and Liao, P. 2017. Effective warm start for the online actor-critic reinforcement learning based mHealth intervention. arXiv preprint arXiv:1704.04866.
A ProLoNet Forward Pass

An algorithmic step-through of the forward pass for the ProLoNet is provided in Algorithm 3. The example from the main paper is included here:

Example 4 (ProLoNet Inference). Consider an example cart pole state, X = [2, 1, 0, 3]. Following the equation in Line 3 of Algorithm 3, the network arrives at σ([1, 0, 0, 0] ∗ [2, 1, 0, 3] − 0) = 0.88 for D_0, meaning "mostly true." This decision probability propagates to the two leaf nodes using their respective paths (Lines 9-15 in Algorithm 3), making the output of the network a probability given by (0.88 ∗ [1, 0] + (1 − 0.88) ∗ [0, 1]) = [0.88, 0.12]. Accordingly, the agent selects the first action with probability 0.88 and the second action otherwise.

Algorithm 3 ProLoNet Forward Pass
1: Input: Input Data X, ProLoNet P
2: for d_n ∈ D ∈ P do
3:   σ_n = σ[α(w_n^T ∗ X − c_n)]
4: end for
5: A_OUT = Output Actions
6: for l_i ∈ L do
7:   Path to l_i = Z(L)
8:   z = 1
9:   for σ_i ∈ Z(L) do
10:    if σ_i should be TRUE ∈ Z(L) then
11:      z = z ∗ σ_i
12:    else
13:      z = z ∗ (1 − σ_i)
14:    end if
15:  end for
16:  A_OUT = A_OUT + l_i ∗ z
17: end for
18: Return: A_OUT

B Hyperparameters and Optimization Details

All actors are updated with proximal policy optimization (PPO) (Schulman et al. 2017). Notably, for the two SC2 domains, we find that multiplying the PPO update by the Kullback-Leibler divergence between old and new policies yields superior performance. The critic's loss function is the mean-squared error between the output of the critic and the reward from the state-action pair. All approaches are trained with RMSProp (Tieleman and Hinton 2012). We set our reward discount factor to 0.99, learning rates to 1e-2 for Gym environments, and 1e-4 for the SC2 domains, following a hyperparameter search between 1e-2 and 1e-5. Update batch sizes dynamically grow as more replay experience is available. In all domains, the ProLoNet α parameter is initialized to 1. Our agents utilize two separate networks: one for the actor and one for the critic. For our approach, the critic network is initialized as a copy of the actor as we do not solicit intelligent value predictions, only policies. Our dynamic growth hyperparameter is set to ε = 0.1 based upon experimental observation.
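The actor update used for the SC2 domains is written out as Equation 1 in Appendix C.3. The sketch below follows that equation literally for a single state-action pair; the negation for gradient-descent minimization and the stabilizing epsilon are our additions, and this is not the released training code.

import torch
import torch.nn.functional as F

def kl_modulated_policy_loss(new_logits, old_logits, action, advantage, eps=1e-6):
    # Action distributions under the new and old policies for the visited state.
    new_log_probs = F.log_softmax(new_logits, dim=-1)
    old_probs = F.softmax(old_logits, dim=-1).detach()
    # KL(P(a | pi_new, s) || P(a | pi_old, s)) over the discrete action set.
    kl = torch.sum(torch.exp(new_log_probs) * (new_log_probs - torch.log(old_probs + eps)))
    # Equation 1: L = (A * log(a | pi_new)) / KL(...); negated so a minimizer ascends it.
    return -(advantage * new_log_probs[action]) / (kl + eps)

new_logits = torch.tensor([1.2, -0.3, 0.1], requires_grad=True)
old_logits = torch.tensor([0.9, -0.1, 0.2])
loss = kl_modulated_policy_loss(new_logits, old_logits, action=0, advantage=0.5)
loss.backward()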
C Experimental Domain Details

C.1 Cart Pole

Cart pole is an RL domain (Barto, Sutton, and Anderson 1983) where the object is to balance an inverted pendulum on a cart that moves left or right. The state space is a 4D vector representing {cart position, cart velocity, pole angle, pole velocity}, and the action space is {left, right}. We use the cart pole domain from the OpenAI Gym (Brockman et al. 2016).

For the cart pole domain, we set all agents' learning rates to 0.01, the batch size is set to dynamically grow as more replay experience becomes available, we initialized α = 1, and each agent trains on all data gathered after each episode, then empties its replay buffer. All agents train on 2 simulations concurrently, pooling replay experience after each episode and updating their policy parameters. For the LOKI agent, we set N = 200. All agents are updated according to the standard PPO loss function. We selected all parameters empirically to produce the best results for each method.

C.2 Lunar Lander

Lunar lander is the second domain we use from the OpenAI Gym (Brockman et al. 2016), and is based on the classic Atari game of the same name. Lunar lander is a game where the player attempts to land a small ship (the lander) safely on the ground, keeping the lander upright and touching down slowly. The 8D state consists of the lander's {x, y} position and velocity, the lander's angle and angular velocity, and two binary flags which are true when the left or right legs have touched down.

We use the discrete lunar lander domain, and so the 4D action space contains {do nothing, left engine, main engine, right engine}. For the lunar lander domain, we set most hyperparameters to the same values as in the cart pole domain. The two exceptions are the number of concurrent processes, which we set to 4, and the LOKI agent's N, which is set to 300. All agents use the standard PPO loss function.

C.3 FindAndDefeatZerglings

FindAndDefeatZerglings is a minigame from the SC2LE designed to challenge RL agents to learn how to effectively micromanage their individual attacking units in SC2. The agent controls three attacking units on a small, partially-observable map, and must explore the map while killing enemy units. The agent receives +1 reward for each enemy unit that is killed, and -1 for each allied unit that is killed. Enemy units respawn in random locations, and so the best agents are ones that continuously explore and kill enemy units until the three minute timer has elapsed.

We leverage the SC2 API (https://github.com/Blizzard/s2client-api) to manufacture a 37D state which contains {x position, y position, health, weapon cooldown} for three allied units, and {x position, y position, health, weapon cooldown, is baneling} for the five nearest visible enemy units. Missing information is filled with -1. Our action space is 10D, containing move commands for north, east, south, and west, attack commands for each of the five nearest visible enemies, and a "do nothing" command. For this problem, we assign an agent to each individual allied unit, which generates actions for only that unit. Experience from each agent stops accumulating when the unit dies. All experience is pooled for policy updates after each episode, and parameters are shared between agents.

For the SC2LE minigame, we set all agents' learning rates to 0.001, we again initialized α = 1, and the batch size to 4. Each agent trains on replay data for 50 update iterations per episode, and pools experience from 2 concurrent processes. The LOKI agent's N is set to 500. The agents in this domain update according to the loss function in Equation 1:

L(a, s, π_new, π_old) = (A ∗ log(a | π_new)) / KL(P(~a | π_new, s), P(~a | π_old, s))    (1)

where A is the advantage gained by taking action a in state s, π_new is the current set of model parameters, and π_old is the set of model parameters used during the episode which generated this state-action pair. ~a is the probability distribution over all actions that a policy π yields given state s. As in prior work, the advantage A is calculated by subtracting the reward (obtained by taking action a in state s) from the value prediction for taking action a in state s, given by a critic network.

C.4 SC2 Full Game

Our simplified StarCraft 2 state contains:

• Allied Unit Counts: A 36x1 vector in which each index corresponds to a type of allied unit, and the value corresponds to how many of those units exist.
• Pending Unit Counts: As above, but for units that are currently in production and do not exist yet.
• Enemy Unit Counts: A 112x1 vector in which each index corresponds to a type of unit, and the value corresponds to how many of those types are visible.
• Player State: A 9x1 vector of specific player state information, including minerals, vespene gas, supply, etc.

The disparity between allied unit counts and enemy unit counts is due to the fact that we only play as the Protoss race, but we can play against any of the three races.

The number of actions in SC2 can be well into the thousands if one considers every individual unit's abilities. As we seek to encode a high-level strategy, rather than rules for moving every individual unit, we restrict the action space for our agent. Rather than using exact mouse and camera commands for individual units, we abstract actions out to simply "Build Pylon." As such, our agents have 44 available actions, including 35 building and unit production commands, 4 research commands, and 5 commands for attack, defend, harvest resources, scout, and do nothing.

For the full SC2 game, we set all agents' learning rates to 0.0001, we again initialized α = 1, set the batch size to 4, and updates per episode to 8. We run 4 episodes between updates, and set the LOKI N = 1000. Agents train for as long as necessary to achieve a > 80% win-rate against the easiest AI, then move up to successive levels of difficulty as they achieve > 80% win-rates. The agents in this domain update according to the loss function in Equation 1.

C.5 User Study Domain: Wildfire Tracking

The objective in the wildfire tracking domain is to keep two drones on top of two fire centroids as they progress through the map. The task is complicated by the fact that the two drones do not communicate, and do not have complete access to the state of the world. Instead, they have access to a 6D vector containing {DN(F1), DW(F1), DN(F2), DW(F2), C(F1), C(F2)}, where DN is the "distance to the north" function and C(F1) is the "closer to fire 1" Boolean flag. The actions available to the drones include move commands in four directions: north, east, south, and west.

D Initialization Heuristics in Experimental Evaluation

D.1 Cart Pole Heuristics

We use a simple set of heuristics for the cart pole problem, visualized in Figure 7. If the cart is close enough to the center, we move in the direction opposite to the lean of the pole, as long as that motion will not push us too far from the center. If the cart is close to an edge, the agent attempts to account for the cart's velocity and recenter the cart, though this is often an unrecoverable situation for the heuristic. The longest run we saw for a ProLoNet with no training was about 80 timesteps.
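For illustration only, a simplified fragment of this heuristic can be written directly over the 4D cart pole state, with actions 0 = left and 1 = right as in Example 1. The thresholds and sign conventions below are assumptions, not the exact tree shown in Figure 7.

def cart_pole_heuristic(state, center_margin=1.5):
    # state = [cart position, cart velocity, pole angle, pole velocity]
    x, x_dot, theta, theta_dot = state
    if abs(x) < center_margin:
        # Near the center: move in the direction opposite to the pole's lean.
        return 0 if theta > 0.0 else 1
    # Near an edge: account for the cart's velocity and try to recenter.
    drifting_right = (x + 0.5 * x_dot) > 0.0
    return 0 if drifting_right else 1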
D.2 Lunar Lander Heuristics

For the lunar lander problem, the heuristic rules are split into two primary phases. The first phase is engaged at the beginning of an episode while the lander is still high above the surface. In this phase, the lander focuses on keeping the lander's angle as close to 0 as possible. Phase two occurs when the lander gets closer to the surface, and the agent then focuses on keeping the y velocity lower than 0.2. As is depicted in Figure 8, there are many checks for both lander legs being down. We found that both LOKI and ProLoNets were prone to landing successfully but continuing to fire their left or right boosters. In an attempt to ameliorate this problem, we added the extra "legs down" checks.

D.3 FindAndDefeatZerglings Heuristics

For the SC2LE minigame, the overall strategy of our heuristic is to stay grouped up and fight or explore as a group. As such, the first four checks are all in place to ensure that the marines are all close to each other. After they pass the proximity checks, they attack whatever is nearest. If nothing is nearby, they will move in a counter-clockwise sweep around the periphery of the map, searching for more zerglings. Our heuristic is shown in Figure 9.

D.4 SC2 Full Game Heuristics

The SC2 full game heuristic first checks for important actions that should always be high priority, such as attacking, defending, harvesting resources, and scouting. Once initial checks for these are all passed, the heuristic descends into the build order, where it simply uses building or unit count checks to determine when certain units should be built or trained. After enough attacking units are trained, the heuristic indicates that it is time to attack. The SC2 full game heuristic is depicted in Figure 10.

E Architectures for Algorithms in Experimental Evaluation

In this section we briefly overview the MLP and LSTM action network information. The LOKI agent maintained the same architecture as the MLP agent.

E.1 Cart Pole

The cart pole MLP network is a 3-layer network following the sequence: 4x4 – 4x4 – 4x2. We experimented with sizes ranging from 4-64 and numbers of hidden layers from 1 to 10, and found that the small network performed the best.

The LSTM network for cart pole is the same as the MLP network, though with an LSTM unit inserted between the first and second layers. The LSTM unit's hidden size is 4, so the final sequence is: 4x4 – LSTM(4x4) – 4x4 – 4x2. We experimented with hidden sizes for the LSTM unit from 4 to 64, though none were overwhelmingly successful, and we varied the number of layers after the LSTM unit from 1-10.

The ProLoNet agent for this task used 9 decision nodes and 11 leaves. For the deepening experiment, we tested an agent with only a single node and 2 leaves, and found that it still solved the task very quickly. We tested randomly initialized architectures from 1 to 9 nodes and from 2 to 11 leaves, and we found that all combinations successfully solved the task.

E.2 Lunar Lander

The lunar lander MLP network is a 3-layer network, following the sequence: 8x8 – 8x8 – 8x4. We again experimented with sizes from 8-64 and numbers of hidden layers from 2 to 11.

The LSTM network for lunar lander mimics the architecture from cart pole. The LSTM unit's hidden size is 8, so the final sequence is: 8x8 – LSTM(8x8) – 8x4. We experimented with hidden sizes for the LSTM unit from 8 to 64, and again we varied the number of layers succeeding the LSTM unit from 1 to 10.

The ProLoNet agent for this task featured 14 decision nodes and 15 leaves. We experimented with intelligent initialization architectures ranging from 10 to 14 nodes and from 10 to 15 leaves, and found little difference between their performances. The additional nodes were an attempt to encourage the agent to "do nothing" once successfully landing, as the agent had a tendency to continue shuffling left-right after successfully touching down.

E.3 FindAndDefeatZerglings

We failed to find an MLP architecture that succeeded in this task, and so we chose one that compromised between the depth of the ProLoNet and the simplicity that MLP agents seemed to prefer for toy domains. The final network is a 7-layer network with the following sequence: 37x37 – 37x37 – 37x37 – 37x37 – 37x37 – 37x37 – 37x10. We chose to keep the size at 37 after testing 37 and 64 as sizes, and deciding that trying to get as close to the ProLoNet architecture as possible was the best bet.

The LSTM network for FindAndDefeatZerglings features more hidden layers than the LSTM for lunar lander and cart pole. The hidden size is set to 37, and the LSTM unit is followed by 5 layers. The final sequence is: 37x37 – LSTM(37x37) – 37x37 – 37x37 – 37x37 – 37x37 – 37x10. We experimented with hidden sizes for the LSTM unit from 37 to 64 and varied the number of successive layers from 4-10.

The ProLoNet agent for FindAndDefeatZerglings featured 10 nodes and 11 leaves. We tested architectures from 6 to 15 nodes and from 7 to 13 leaves, and found that the initialized policy and architecture had more of an immediate impact for this task. The 7-node policy allowed agents to spread out too much, and they died quickly, whereas the 15-node policy had agents moving more than shooting, and they would walk around while being overrun.

E.4 SC2 Full Game

We again failed to find an MLP architecture that succeeded in this task, and so used a similar architecture to that of the FindAndDefeatZerglings task. The 7-layer network is of the sequence: 193x193 – 193x193 – 193x193 – 193x193 – 193x193 – 193x193 – 193x44. We again experimented with a variety of shapes and numbers of layers, though none succeeded.

Again, the LSTM network shadows the MLP network for this task. As in the FindAndDefeatZerglings task, we experimented with a variety of LSTM hidden unit sizes, hidden layer sizes, and hidden layer numbers. The final architecture reflects the FindAndDefeatZerglings sequence: 193x193 – LSTM(193x193) – 193x193 – 193x193 – 193x193 – 193x193 – 193x44.

The ProLoNet agent for the SC2 full game featured 10 nodes and 11 leaves. We tested architectures from 10 to 16 nodes and from 1 to 17 leaves, and found that the initialized policy and architecture was not as important for this task as it was for the FindAndDefeatZerglings task. As long as we included a basic build order and the "attack" command, the agent would manage to defeat the VeryEasy in-game AI at least 10% of the time. We found that constraining the policy to fewer nodes and leaves provided less noise as updates progressed, and kept the policy close to initialization while also providing improvements. An initialization with too many parameters often seemed to degrade quickly, presumably due to small changes over many parameters having a larger impact than small changes over few parameters.