Leveraging class abstraction for commonsense reinforcement learning via residual policy gradient methods
Leveraging class abstraction for commonsense reinforcement learning via residual policy gradient methods

Niklas Höpner1*, Ilaria Tiddi2 and Herke van Hoof1
1University of Amsterdam
2Vrije Universiteit Amsterdam
{n.r.hopner, h.c.vanhoof}@uva.nl, i.tiddi@vu.nl

arXiv:2201.12126v2 [cs.AI] 1 May 2022

Abstract

Enabling reinforcement learning (RL) agents to leverage a knowledge base while learning from experience promises to advance RL in knowledge-intensive domains. However, it has proven difficult to leverage knowledge that is not manually tailored to the environment. We propose to use the subclass relationships present in open-source knowledge graphs to abstract away from specific objects. We develop a residual policy gradient method that is able to integrate knowledge across different abstraction levels in the class hierarchy. Our method results in improved sample efficiency and generalisation to unseen objects in commonsense games, but we also investigate failure modes, such as excessive noise in the extracted class knowledge or environments with little class structure.1

*Contact Author
1Code can be found at https://github.com/NikeHop/CSRL

1 Introduction

Deep reinforcement learning (DRL) has enabled us to optimise control policies in MDPs with high-dimensional state and action spaces such as in game-play [Silver et al., 2016] and robotics [Lillicrap et al., 2016]. Two main hindrances in bringing deep reinforcement learning to the real world are the sample inefficiency and poor generalisation performance of current methods [Kirk et al., 2021]. Amongst other approaches, including prior knowledge in the learning process of the agent promises to alleviate these obstacles and move reinforcement learning (RL) from a tabula rasa method towards more human-like learning. Depending on the area of research, prior knowledge representations can vary from pretrained embeddings or weights [Devlin et al., 2019] to symbolic knowledge representations such as logics [Vaezipoor et al., 2021] and knowledge graphs (KGs) [Zhang et al., 2020b]. While the former are easier to integrate into deep neural network-based algorithms, they lack specificity, abstractness, robustness and interpretability [van Harmelen and ten Teije, 2019].

One type of prior knowledge that is hard to obtain for purely data-driven methods is commonsense knowledge. Equipping reinforcement learning agents with commonsense or world knowledge is an important step towards improved human-machine interaction [Akata et al., 2020], as interesting interactions demand machines to access prior knowledge not learnable from experience. Commonsense games [Jiang et al., 2020; Murugesan et al., 2021] have emerged as a testbed for methods that aim at integrating commonsense knowledge into an RL agent. Prior work has focused on augmenting the state by extracting subparts of ConceptNet [Murugesan et al., 2021]. Performance only improved when the extracted knowledge was tailored to the environment. Here, we focus on knowledge that is automatically extracted and should be useful across a range of commonsense games.
Humans abstract away from specific objects using classes, which allows them to learn behaviour at class level and generalise to unseen objects [Yee, 2019]. Since commonsense games deal with real-world entities, we look at the problem of leveraging subclass knowledge from open-source KGs to improve the sample efficiency and generalisation of an agent in commonsense games. We use subclass knowledge to formulate a state abstraction that aggregates states depending on which classes are present in a given state. This state abstraction might not preserve all information necessary to act optimally in a state. Therefore, a method is needed that learns to integrate useful knowledge over a sequence of more and more fine-grained state representations. We show how a naive ensemble approach can fail to correctly integrate information from imperfectly abstracted states, and design a residual learning approach that is forced to learn the difference between policies over adjacent abstraction levels. The properties of both approaches are first studied in a toy setting where the effectiveness of class-based abstractions can be controlled. We then show that if a commonsense game is governed by class structure, the agent is more sample efficient and generalises better to unseen objects, outperforming embedding approaches and methods augmenting the state with subparts of ConceptNet. However, learning might be hampered if the extracted class knowledge aggregates objects incorrectly. To summarise, our key contributions are:

• we use the subclass relationship from open-source KGs to formulate a state abstraction for commonsense games;
• we propose a residual learning approach that can be integrated with policy gradient algorithms to leverage imperfect state abstractions;
• we show that in environments with class structure our method leads to more sample efficient learning and better generalisation to unseen objects.

Figure 1: (Left) Visualisation of a state and its abstractions in Wordcraft based on classes extracted from WordNet. Colours indicate that an object is mapped to a class, where the same colour refers to the same class within an abstraction. (Right) Subgraph of the class tree that is used to determine the state abstractions.

2 Related Work

We introduce the resources available to include commonsense knowledge and the attempts that have been made by prior work to leverage these resources. Since the class knowledge is extracted from knowledge graphs, work on including KGs in deep neural network-based architectures is discussed. The setting considered here also offers a new perspective on state abstraction in reinforcement learning.

Commonsense KGs. KGs store facts in the form of entity-relation-entity triplets. Often a KG is constructed to capture either a general area of knowledge such as commonsense [Ilievski et al., 2021], or more domain-specific knowledge like the medical domain [Huang et al., 2017]. While ConceptNet [Speer et al., 2017] tries to represent all commonsense knowledge, others focus on specific parts such as cause and effect relations [Sap et al., 2019]. Manually designed KGs [Miller, 1995] are less error-prone, but provide less coverage and are more costly to design, making hybrid approaches popular [Speer et al., 2017]. Here, we focus on WordNet [Miller, 1995], ConceptNet and DBpedia [Lehmann et al., 2015] and study how the quality of their class knowledge affects our method. Representing KGs in vector form can be achieved via knowledge graph embedding techniques [Nickel and Kiela, 2017], where embeddings can be trained from scratch or word embeddings are finetuned [Speer and Lowry-Duda, 2017]. Hyperbolic embeddings [Nickel and Kiela, 2017] capture the hierarchical structure of WordNet given by the hypernym relation between two nouns and are investigated as an alternative method to include class prior knowledge.
Commonsense Games. To study the problem of integrating commonsense knowledge into an RL agent, commonsense games have recently been introduced [Jiang et al., 2020; Murugesan et al., 2021]. Prior methods have focused on leveraging knowledge graph embeddings [Jiang et al., 2020] or augmenting the state representation via an extracted subpart of ConceptNet [Murugesan et al., 2021]. While knowledge graph embeddings improve performance more than GloVe word embeddings [Pennington et al., 2014], the knowledge graph they are based on is privileged game information. Extracting task-relevant knowledge from ConceptNet automatically is challenging. If the knowledge is manually specified, sample efficiency improves, but heuristic extraction rules hamper learning. No method that learns to extract useful knowledge exists yet. The class knowledge we use here is not tailored to the environment, and therefore should hold across a range of environments.

Integrating KGs into deep learning architectures. The problem of integrating knowledge present in a knowledge graph into a learning algorithm based on deep neural networks has mostly been studied by the natural language community [Xie and Pu, 2021]. Use cases include, but are not limited to, open dialogue [Moon et al., 2019], task-oriented dialogue [Gou et al., 2021] and story completion [Zhang et al., 2020b]. Most methods are based on an attention mechanism over parts of the knowledge base [Moon et al., 2019; Gou et al., 2021]. Two key differences are that the knowledge graphs used are curated for the task, and therefore contain little noise. Additionally, most tasks are framed as supervised learning problems where annotations about correct reasoning patterns are given. Here, we extract knowledge from open-source knowledge graphs and have to deal with errors in the class structure due to problems with entity reconciliation and incompleteness of knowledge.

State abstraction in RL. State abstraction aims to partition the state space of a base Markov decision process (MDP) into abstract states to reduce the complexity of the state space on which a policy is learnt [Li et al., 2006]. Different criteria for aggregating states have been proposed [Givan et al., 2003]. They guarantee that an optimal policy learnt for the abstract MDP remains optimal for the base MDP. To leverage state abstraction, an aggregation function has to be learned [Zhang et al., 2020a], which either needs additional samples or is performed on-policy, leading to a potential collapse of the aggregation function [Kemertas and Aumentado-Armstrong, 2021]. The case in which an approximate state abstraction is given as prior knowledge has not been looked at yet. The abstraction given here need not satisfy any consistency criteria and can consist of multiple abstraction levels. A method that is capable of integrating useful knowledge from each abstraction level is needed.

3 Problem Setting

Reinforcement learning enables us to learn optimal behaviour in an MDP M = (S, A, R, T, γ) with state space S, action space A, reward function R : S × A → ℝ, discount factor γ and transition function T : S × A → ∆S, where ∆S represents the set of probability distributions over the space S. The goal is to learn from experience a policy π : S → ∆A that optimises the objective

J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_t\right] = \mathbb{E}_\pi[G],    (1)

where G is the discounted return. A state abstraction function φ : S → S' aggregates states into abstract states with the goal of reducing the complexity of the state space.
Given an arbitrary weighting function w : S → [0, 1] such that for all s' ∈ S', Σ_{s ∈ φ⁻¹(s')} w(s) = 1, one can define an abstract reward function R' and transition function T' on the abstract state space S':

R'(s', a) = \sum_{s \in \phi^{-1}(s')} w(s)\, R(s, a),    (2)

T'(\bar{s}' \mid s', a) = \sum_{\bar{s} \in \phi^{-1}(\bar{s}')} \sum_{s \in \phi^{-1}(s')} w(s)\, T(\bar{s} \mid s, a),    (3)

to obtain an abstract MDP M' = (S', A, R', T', γ). If the abstraction φ satisfies consistency criteria (see Section 2, State abstraction in RL), a policy learned over M' allows for optimal behaviour in M, which we refer to from now on as the base MDP [Li et al., 2006]. Here, we assume that we are given abstraction functions φ1, ..., φn with φi : S_{i−1} → S_i, where S_0 corresponds to the state space of the base MDP. Since the φi need not satisfy any consistency criteria, learning a policy over one of the corresponding abstract MDPs M_i can result in a non-optimal policy. The goal is to learn a policy π or action value function Q that takes as input a hierarchy of state abstractions s = (s_1, ..., s_n). Here, we want to make use of the more abstract states s_2, ..., s_n for more sample efficient learning and better generalisation.

4 Methodology

The method can be separated into two components: (i) constructing the abstraction functions φ1, ..., φn from the subclass relationships in open-source knowledge graphs; (ii) learning a policy over the hierarchy of abstract states s = (s_1, ..., s_n) given the abstraction functions.

Constructing the abstraction functions φi. A state in a commonsense game features real-world entities and their relations, which can be modelled as a set, sequence or graph of entities. The idea is to replace each entity with its superclass, so that states that contain objects with the same superclass are aggregated into the same abstract state. Let E be the vocabulary of symbols that can appear in any of the abstract states s_i, i.e. s_i = {e_1, ..., e_k | e_l ∈ E}. The symbols that refer to real-world objects are denoted by O ⊆ E, and C_tree represents their class tree. A class tree is a rooted tree in which the leaves are objects and the parent of each node is its superclass. The root is a generic entity class of which every object/class is a subclass (see Appendix A for an example). To help define φi, we introduce an entity-based abstraction φi^E : E → E. Let C_k represent the objects/classes with depth k in C_tree and L be the depth of C_tree; then we can define φi^E and φi as

\phi_i^E(e) = \begin{cases} \mathrm{Pa}(e), & \text{if } e \in C_{L+1-i} \\ e, & \text{else,} \end{cases}    (4)

\phi_i(s) = \{\phi_i^E(e) \mid e \in s\},    (5)

where Pa(e) are the parents of entity e in the class tree C_tree. This abstraction process is visualised in Figure 1.
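The following is a minimal sketch (not the released code) of the entity-level abstraction in Equations (4)–(5), assuming the class tree is given as a child-to-parent dictionary; all names and the example tree are illustrative.

```python
# Minimal sketch of Eqs. (4)-(5): the class tree is assumed to be a
# child -> parent dictionary; entities at the deepest remaining level are
# replaced by their parent class, everything else is left untouched.

def node_depth(node, parent):
    """Number of nodes on the path from `node` up to the root."""
    depth = 1
    while node in parent:
        node = parent[node]
        depth += 1
    return depth

def make_phi(parent, level):
    """Return phi_level; level 1 replaces the leaves, level 2 their parents, ..."""
    tree_depth = max(node_depth(n, parent) for n in parent)  # depth L of C_tree
    def phi(state):
        abstract_state = set()
        for e in state:
            # Eq. (4): entities at depth L + 1 - i are mapped to their parent class.
            if e in parent and node_depth(e, parent) == tree_depth + 1 - level:
                abstract_state.add(parent[e])
            else:
                abstract_state.add(e)
        return abstract_state                                # Eq. (5)
    return phi

# Toy class tree: goat/sheep -> ungulate -> mammal -> vertebrate -> animal -> entity
parent = {"goat": "ungulate", "sheep": "ungulate", "ungulate": "mammal",
          "mammal": "vertebrate", "vertebrate": "animal", "animal": "entity"}
phi_1 = make_phi(parent, level=1)
print(phi_1({"goat", "sheep"}))  # {'ungulate'}
```

Because each φi only touches the currently deepest layer, applying φ1, φ2, ... in sequence walks a state up the class tree one level at a time, matching the composition φi : S_{i−1} → S_i.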
In practice, we need to be able to extract the set of relevant objects from the game state and construct a class tree from open-source KGs. If the game state is not a set of entities but rather text, we use spaCy (https://spacy.io/) to extract all nouns as the set of objects. The class tree is extracted from either DBpedia, ConceptNet or WordNet. For the detailed algorithms of each KG extraction, we refer to Appendix A. Here, we discuss some of the caveats that arise when extracting class trees from open-source KGs, and how to tackle them. Class trees can become imbalanced, i.e. the depths of the leaves, representing the objects, differ (Figure 2). As each additional layer with a small number of classes adds computational overhead but provides little abstraction, we collapse layers depending on their contribution towards abstraction (Figure 2). While in DBpedia or WordNet the found entities are mapped to a unique superclass, entities in ConceptNet are associated with multiple superclasses. To handle the case of multiple superclasses, each entity is mapped to the set of all i-step superclasses for i = 1, ..., n. To obtain representations for these class sets, the embeddings of each element of the set are averaged.

Figure 2: (Left) The number of different objects/classes that can appear for each state abstraction level in the Wordcraft environment with the superclass relation extracted from WordNet. (Right) Visualisation of collapsing two layers in a class tree.

Learning policies over a hierarchy of abstract states. Since prior methods in commonsense games are policy gradient-based, we will focus on this class of algorithms, while providing a similar analysis for value-based methods in Appendix B. First, we look at a naive method to learn a policy in our setting, discuss its potential weaknesses and then propose a novel gradient update to overcome these weaknesses. A simple approach to learning a policy π over s is to take an ensemble approach by having a network with separate parameters for each abstraction level predict logits, which are then summed up and converted via the softmax operator into a final policy π. Let s_{i,t} denote the abstract state on the i-th level at timestep t; then π is computed via

\pi(a_t \mid \mathbf{s}_t) = \mathrm{Softmax}\left(\sum_{i=1}^{n} \mathrm{NN}_{\theta_i}(s_{i,t})\right),    (6)

where NN_{θi} is a neural network parameterised by θi that processes the abstract state s_i. This policy can then be trained via any policy gradient algorithm [Schulman et al., 2015; Mnih et al., 2016; Espeholt et al., 2018]. From here on, we will refer to this approach as the sum-method.

There is no mechanism that forces the sum approach to learn on the most abstract level possible, potentially leading to worse generalisation to unseen objects. At train time, making all predictions solely based on the lowest level (ignoring all higher levels) can be a solution that maximises discounted return, though it will not generalise well to unseen objects. To circumvent this problem, we adapt the policy gradient so that the parameters θi at each abstraction level are optimised to approximate an optimal policy for the i-th abstraction level given the computed logits of the more abstract levels. Let s^n_{i,t} = (s_{i,t}, ..., s_{n,t}) denote the hierarchy of abstract states at timestep t down to the i-th level. Define the policy on the i-th level as

\pi_i(a \mid s^n_{i,t}) = \mathrm{Softmax}\left(\sum_{k=i}^{n} \mathrm{NN}_{\theta_k}(s_{k,t})\right),    (7)

and note that π = π1.
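As a concrete illustration, the snippet below sketches how the sum-method of Equation (6) and the level policies of Equation (7) could be implemented in PyTorch; the per-level encoders and their sizes are placeholder assumptions, not the architecture used in the paper.

```python
# Sketch of the sum-method (Eq. 6) and the per-level policies (Eq. 7).
import torch
import torch.nn as nn

class SumPolicy(nn.Module):
    def __init__(self, state_dim, num_actions, num_levels, hidden=64):
        super().__init__()
        # One network with its own parameters theta_i per abstraction level.
        self.level_nets = nn.ModuleList([
            nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, num_actions))
            for _ in range(num_levels)])

    def level_logits(self, states):
        # states: list of per-level encodings s_1, ..., s_n, each of shape [batch, state_dim]
        return [net(s) for net, s in zip(self.level_nets, states)]

    def forward(self, states):
        # Eq. (6): sum the logits of all abstraction levels, then apply the softmax.
        logits = torch.stack(self.level_logits(states)).sum(dim=0)
        return torch.distributions.Categorical(logits=logits)

    def level_policy(self, states, i):
        # Eq. (7): pi_i only uses the logits of levels k >= i (levels are 1-indexed).
        logits = torch.stack(self.level_logits(states)[i - 1:]).sum(dim=0)
        return torch.distributions.Categorical(logits=logits)
```

Sampling actions from `forward` and training the summed logits with any policy gradient method gives the sum-method; `level_policy` is what the residual update derived next operates on.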
To obtain a policy gradient expression that contains the abstract policies πi, we write π as a product of abstract policies:

\pi(a \mid \mathbf{s}_t) = \left(\prod_{i=1}^{n-1} \frac{\pi_i(a \mid s^n_{i,t})}{\pi_{i+1}(a \mid s^n_{i+1,t})}\right) \pi_n(a \mid s_{n,t}),    (8)

and plug it into the policy gradient expression for an episodic task with discounted return G:

\nabla_\theta J(\theta) = \sum_{i=1}^{n} \mathbb{E}_\pi\left[\sum_{t=1}^{T} \nabla_\theta \log\left(\frac{\pi_i(a \mid s^n_{i,t})}{\pi_{i+1}(a \mid s^n_{i+1,t})}\right) G\right],    (9)

where π_{n+1} ≡ 1. Notice that in Equation 9, the gradient with respect to the parameters θi depends on the values of all policies of level equal to or lower than i. The idea is to take the gradient for each abstraction level i only with respect to θi and not the full set of parameters θ. This optimises the parameters θi not with respect to their effect on the overall policy, but with respect to their effect on the abstract policy on level i. The residual policy gradient is given by

\nabla_\theta J_{\mathrm{res}}(\theta) = \sum_{i=1}^{n} \mathbb{E}_\pi\left[\sum_{t=1}^{T} \nabla_{\theta_i} \log\big(\pi_i(a \mid s^n_{i,t})\big)\, G\right].    (10)

Each element of the outer sum resembles the policy gradient loss of a policy over the abstract state s_i. However, the sampled trajectories come from the overall policy π and not the policy πi, and the policy πi inherits a bias from the more abstract layers in the form of their logits. We refer to the method based on the update in Equation 10 as the residual approach. An advantage of both the residual and the sum approach is that the computation of the logits for each layer can be done in parallel. Any sequential processing of levels would have a prohibitively large computational overhead.
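To make the difference to the standard update concrete, here is a hedged REINFORCE-style sketch of a surrogate loss whose gradient matches Equation (10): in the i-th term only the logits of level i keep their gradient, so each θi is updated only through its own abstract policy πi. The helper reuses the `SumPolicy` sketch above; `G` stands for the (optionally baseline-subtracted) discounted return.

```python
# Sketch of a surrogate loss for the residual policy gradient of Eq. (10).
import torch
import torch.nn.functional as F

def residual_pg_loss(policy, states, actions, G):
    """states: list of per-level encodings; actions, G: tensors of shape [batch]."""
    logits = policy.level_logits(states)          # list of [batch, num_actions]
    n = len(logits)
    loss = 0.0
    for i in range(n):
        # The logits of the more abstract levels enter pi_i but are detached, so
        # the i-th term only produces gradients for theta_i, as in Eq. (10).
        summed = logits[i]
        for higher in logits[i + 1:]:
            summed = summed + higher.detach()
        log_pi_i = F.log_softmax(summed, dim=-1)
        chosen = log_pi_i.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
        loss = loss - (G * chosen).mean()
    return loss
```

Minimising this loss with any stochastic optimiser performs the residual update while still sampling trajectories from the overall policy π.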
5 Experimental Evaluation

Our methodology is based on the assumption that abstraction via class knowledge extracted from open-source KGs is useful in commonsense game environments. This need not be true. It is therefore sensible to first study the workings of our methodology in an idealised setting, where we control whether and how abstraction is useful for generalisation and sample efficiency. Then, we evaluate the method on two commonsense games, namely a variant of Textworld Commonsense [Murugesan et al., 2021] and Wordcraft.

Toy environment. We start with a rooted tree, where each node is represented via a random embedding. The leaves of the tree represent the states of a base MDP. Each depth level of the tree represents one abstraction level. Every inner node of the tree represents an abstract state that aggregates the states of its children. The optimal action for each leaf (base state) is determined by first fixing an abstraction level l and randomly sampling one of five possible actions for each abstract state on that level. Then, the optimal action for a leaf is given by the sampled action for its corresponding abstract state on level l. The time horizon is one step, i.e. for each episode a leaf state is sampled, and if the agent chooses the correct action, it receives a reward of one. We test on new leaves with unseen random embeddings, but keep the same abstraction hierarchy and the same pattern of optimal actions. The sum and residual approach are compared to a policy trained only on the base states and a policy given only the optimal abstract states (oracle). To study the effect of noise in the abstraction, we replace the optimal action as determined by the abstract state with a random one with noise probability σ (noise setting), or ensure that some abstraction steps aggregate only a single state, so that not every abstract state has multiple substates (ambiguity setting). All policies are trained via REINFORCE [Williams, 1992] with a value function as baseline and entropy regularisation. More details on the chosen trees and the policy network architecture can be found in Appendix C.
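For concreteness, the following sketch shows one way such a tree-structured toy environment could be generated; the branching factors, embedding size and helper names are illustrative assumptions and not taken from the released code.

```python
# Sketch: build a rooted tree with random node embeddings and assign each leaf
# the action sampled for its ancestor on a fixed abstraction level.
import numpy as np

rng = np.random.default_rng(0)

def build_tree(branching, embed_dim=16):
    """Return the node ids per depth, a child -> parent map and node embeddings."""
    levels, parent, embed = [[0]], {}, {0: rng.normal(size=embed_dim)}
    next_id = 1
    for width in branching:                       # e.g. [7, 10, 8, 8]
        new_level = []
        for node in levels[-1]:
            for _ in range(width):
                parent[next_id] = node
                embed[next_id] = rng.normal(size=embed_dim)
                new_level.append(next_id)
                next_id += 1
        levels.append(new_level)
    return levels, parent, embed

def assign_optimal_actions(levels, parent, level_l, num_actions=5):
    """The optimal action of a leaf is the action sampled for its ancestor on depth level_l."""
    action_of = {node: int(rng.integers(num_actions)) for node in levels[level_l]}
    optimal = {}
    for leaf in levels[-1]:
        node = leaf
        while node not in action_of:
            node = parent[node]
        optimal[leaf] = action_of[node]
    return optimal

levels, parent, embed = build_tree([7, 10, 8, 8])
optimal_action = assign_optimal_actions(levels, parent, level_l=2)
```

An episode then samples a leaf, presents its embedding (plus the embeddings of its ancestors when abstraction is available) and returns a reward of one if the chosen action equals `optimal_action[leaf]`.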
Textworld Commonsense. In text-based games, the state and action space are given as text. In Textworld Commonsense (TWC), an agent is located in a household and has to put objects in their correct commonsense location, i.e. the location one would expect these objects to be in a normal household. The agent receives a reward of one for each object that is put into the correct location. The agent is evaluated by the achieved normalised reward and the number of steps needed to solve the environment, where 50 steps is the maximum number of steps allowed. To make abstraction necessary, we use a large number of objects per class and increase the number of games the agent sees during training from 5 to 90. This raises the exposure to different objects at training time. The agent is evaluated on a validation set where it encounters previously seen objects and a test set where it does not. The difficulty level of a game is determined by the number of rooms, the number of objects to move and the number of distractors (objects already in the correct location), which are here two, two and three respectively. To study the effect of inaccuracies in the extracted class trees from WordNet, ConceptNet and DBpedia, we compare them to a manual aggregation of objects based on their reward and transition behaviour. Murugesan et al. [2021] benchmark different architectures that have been proposed to solve text-based games [He et al., 2016]. Here, we focus on their proposed method that makes use of a recurrent neural network architecture and Numberbatch embeddings [Speer and Lowry-Duda, 2017], which performed best in terms of normalised reward and number of steps needed. For more details on the architecture and the learning algorithm, we refer to the original paper. As baselines, we choose: (i) adding class information via hyperbolic embeddings trained on WordNet [Nickel and Kiela, 2017] by concatenating them to the Numberbatch embeddings of objects; (ii) extracting the LocatedAt relation from ConceptNet for game objects, encoding it via graph attention and combining it with the textual embedding [Murugesan et al., 2021].

Wordcraft. An agent is given a goal entity and ingredient entities, and has to combine the ingredient entities to obtain the goal entity. A set of recipes determines which combination of input objects leads to which output entity. One can create different settings depending on the number of distractors (irrelevant entities in the set of recipe entities) and the set of recipes available at training time. Generalisation to unseen recipes guarantees that during training not all recipes are available, and generalisation to unseen goals ensures that at test time the goal entity has not been encountered before as a goal entity. Jiang et al. [2020] trained a policy using the IMPALA algorithm [Espeholt et al., 2018]. We retain the training algorithm and the multi-head attention architecture [Vaswani et al., 2017] used for the policy network to train our policy components NN_{θ1}, ..., NN_{θn}.

6 Results

After checking whether the results in the ideal environment are as expected, we discuss results in TWC and Wordcraft.

6.1 Toy Environment

Figure 3: Training and generalisation results in the toy environment. The abstract state perfectly determines the action to take in the base state (top). Noise is added to this relation, so that 50% of the time the optimal action is determined randomly beforehand (middle). In the ambiguous setting, not every abstract state has multiple substates (bottom). Experiments are run over five seeds.

From Figure 3 (top), one can see that the sum and residual approach are both more sample efficient than the baseline without abstraction, on par with the sample efficiency of the oracle. The same holds for the generalisation performance. As the base method faces unseen random embeddings at test time, there is no possibility for generalisation. In the noise setting (Figure 3, middle), the policy trained on the abstract state reaches its performance limit at the percentage of states whose action is determined via their abstract state; the base, residual and sum method instead reach optimal performance. The achieved reward at generalisation time decreases for both the sum and the residual method, but the sum approach suffers more from overfitting to the noise at training time when compared to the residual approach. In the case of ambiguous abstraction, we see that the residual approach outperforms the sum approach. This can be explained by the residual gradient update, which forces the residual method to learn everything on the most abstract level, while the sum approach distributes the contribution to the final policy over the abstraction layers. At test time, the sum approach puts too much emphasis on the uninformative base state, causing the policy to take wrong actions.
6.2 Textworld Commonsense

The results suggest that both the sum and residual approach are able to leverage the abstract states. It remains an open question whether this transfers to more complex environments with a longer time horizon, where the abstraction is derived by extracting knowledge from an open-source KG.

Figure 4: Training performance of the baseline (base), the sum and residual abstraction with knowledge extracted from WordNet in the difficult environment. Experiments are run over ten seeds.

Sample efficiency. From Figure 4, one can see that the residual and sum approach learn with fewer samples compared to the base approach. The peak mean difference is 20 episodes out of a total of 100 training episodes. While the sum approach learns as efficiently towards the beginning of training, it collapses for some random seeds towards the end of training, resulting in higher volatility of the achieved normalised reward. This is similar to what was previously observed in the noisy setting of the ideal environment. That the performance of the sum approach collapses for certain seeds suggests that the learned behaviour for more abstract states depends on the random initialisation of weights. This stability issue is not present for the residual approach. Figure 5 (right) shows that adding hyperbolic embeddings does not improve sample efficiency and even decreases performance towards the end of training. Adding the LocatedAt relations improves sample efficiency to a similar degree as adding class abstractions.

Figure 5: Training performance in the difficult environment using the residual approach when the class knowledge comes from different knowledge graphs (left) and when the baseline is augmented with hyperbolic embeddings or the LocatedAt relation from ConceptNet (right). Experiments are run over ten seeds.
Generalisation. Table 1 shows that, without any additional knowledge, the performance of the baseline, measured via mean normalised reward and mean number of steps, drops when faced with unseen objects at test time. Adding hyperbolic embeddings does not alleviate that problem but rather hampers generalisation performance on the validation and test set. When the LocatedAt relation from ConceptNet is added to the state representation, the drop in performance is reduced marginally. Given a manually defined class abstraction, generalisation to unseen objects is possible without any drop in performance for the residual and the sum approach. This remains true for the residual approach when the manually defined class knowledge is exchanged for the WordNet class knowledge. The sum approach based on WordNet performs worse on the validation and test set due to the collapse of performance at training time.

Table 1: Generalisation results for the easy and difficult level w.r.t. the validation set and the test set, in terms of mean normalised reward and mean number of steps taken (standard deviation in brackets), for each type of knowledge graph (M = manual, W = WordNet) and each approach (R = residual, S = sum). Base refers to the baseline with no abstraction, Base-H to the baseline with hyperbolic embeddings and Base-L to the baseline with added LocatedAt relations. In bold, the best performing method without manually specified knowledge; highlighted in red, the conditions in which the manual class graph outperforms the other methods. Experiments are run over ten seeds.

Method | Reward (Valid.) | Reward (Test) | Steps (Valid.) | Steps (Test)
Base   | 0.91 (0.04) | 0.85 (0.05) | 25.98 (3.14) | 29.36 (3.42)
Base-H | 0.83 (0.06) | 0.75 (0.14) | 30.59 (5.39) | 33.92 (6.08)
Base-L | 0.90 (0.04) | 0.86 (0.06) | 24.83 (2.31) | 28.71 (3.46)
M-R    | 0.96 (0.03) | 0.96 (0.02) | 21.26 (2.65) | 20.00 (1.57)
M-S    | 0.97 (0.02) | 0.96 (0.02) | 20.95 (1.38) | 19.55 (1.69)
W-R    | 0.93 (0.02) | 0.94 (0.02) | 23.25 (1.46) | 24.04 (1.75)
W-S    | 0.88 (0.10) | 0.87 (0.17) | 25.69 (4.78) | 26.21 (6.81)

KG ablation. From Figure 5 and Table 1 it is evident that, for the residual approach, the noise and additional abstraction layers introduced by moving from the manually defined classes to the classes given by WordNet only cause a small drop in generalisation performance and no additional sample inefficiency. The abstractions derived from DBpedia and ConceptNet hamper learning, particularly for the residual approach (Figure 5). By investigating the DBpedia abstraction, we notice that many entities are not resolved correctly. Therefore, objects with completely different semantics get aggregated. The poor performance of the ConceptNet abstraction hints at a problem with averaging class embeddings over multiple superclasses. Although two objects may have an overlap in their set of superclasses, the resulting embeddings could still differ heavily due to the non-overlapping classes.

6.3 Wordcraft

Figure 6: Generalisation results for the Wordcraft environment with respect to unseen goal entities. Experiments are run over ten seeds.

Figure 6 shows the generalisation performance in the Wordcraft environment. For the case of random embeddings, abstraction can help to improve generalisation results. Replacing random embeddings with GloVe embeddings improves generalisation beyond the class abstraction, and combining class abstraction with GloVe embeddings does not result in any additional benefit. Looking at the recipe structure in Wordcraft, objects are combined based on their semantic similarity, which can be better captured via word embeddings than through classes. Although no generalisation gains can be identified, adding abstraction in the presence of GloVe embeddings leads to improved sample efficiency. A more detailed analysis discussing the difference in generalisation between class abstraction and pretrained embeddings in the Wordcraft environment can be found in Appendix D.
7 Conclusion

We show how class knowledge, extracted from open-source KGs, can be leveraged to learn behaviour for classes instead of individual objects in commonsense games. To force the RL agent to make use of class knowledge even in the presence of noise, we propose a novel residual policy gradient update based on an ensemble learning approach. If the extracted class structure approximates relevant classes in the environment, the sample efficiency and generalisation performance to unseen objects are improved. Future work could look at other settings where imperfect prior knowledge is given and whether it can be leveraged for learning. Finally, KGs contain more semantic information than classes, which future work could try to leverage for an RL agent in commonsense games.
Acknowledgements

This research was (partially) funded by the Hybrid Intelligence Center, a 10-year programme funded by the Dutch Ministry of Education, Culture and Science through the Netherlands Organisation for Scientific Research, https://hybrid-intelligence-centre.nl.

References

[Akata et al., 2020] Z. Akata, D. Balliet, M. de Rijke, F. Dignum, and V. Dignum et al. A research agenda for hybrid intelligence: Augmenting human intellect with collaborative, adaptive, responsible, and explainable artificial intelligence. 2020.
[Côté et al., 2018] M. Côté, A. Kádár, X. Yuan, B. Kybartas, and T. Barnes et al. TextWorld: A learning environment for text-based games. In Workshop on Computer Games, 2018.
[Devlin et al., 2019] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[Espeholt et al., 2018] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, and V. Mnih et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In ICML, 2018.
[Givan et al., 2003] R. Givan, T. Dean, and M. Greig. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 2003.
[Gou et al., 2021] Y. Gou, Y. Lei, L. Liu, Y. Dai, and C. Shen. Contextualize knowledge bases with transformer for end-to-end task-oriented dialogue systems. In EMNLP, 2021.
[He et al., 2016] J. He, J. Chen, X. He, J. Gao, and L. Li et al. Deep reinforcement learning with a natural language action space. In ACL, 2016.
[Huang et al., 2017] Z. Huang, J. Yang, F. van Harmelen, and Q. Hu. Constructing knowledge graphs of depression. In International Conference on Health Information Science. Springer, 2017.
[Ilievski et al., 2021] F. Ilievski, P. A. Szekely, and B. Zhang. CSKG: The commonsense knowledge graph. In ESWC, 2021.
[Jiang et al., 2020] M. Jiang, J. Luketina, N. Nardelli, P. Minervini, and P. H. S. Torr et al. Wordcraft: An environment for benchmarking commonsense agents. arXiv preprint arXiv:2007.09185, 2020.
[Kemertas and Aumentado-Armstrong, 2021] M. Kemertas and T. Aumentado-Armstrong. Towards robust bisimulation metric learning. CoRR, 2021.
[Kirk et al., 2021] R. Kirk, A. Zhang, E. Grefenstette, and T. Rocktäschel. A survey of generalisation in deep reinforcement learning. CoRR, abs/2111.09794, 2021.
[Lehmann et al., 2015] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, and D. Kontokostas et al. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 2015.
[Li et al., 2006] L. Li, T. J. Walsh, and M. L. Littman. Towards a unified theory of state abstraction for MDPs. ISAIM, 4:5, 2006.
[Lillicrap et al., 2016] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, et al. Continuous control with deep reinforcement learning. In ICLR, 2016.
[Miller, 1995] G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 1995.
[Mnih et al., 2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, and J. Veness et al. Human-level control through deep reinforcement learning. Nature, 2015.
[Mnih et al., 2016] V. Mnih, A. P. Badia, M. Mirza, A. Graves, and T. P. Lillicrap et al. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
[Moon et al., 2019] S. Moon, P. Shah, A. Kumar, and R. Subba. OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs. In ACL, 2019.
[Murugesan et al., 2021] K. Murugesan, M. Atzeni, P. Kapanipathi, P. Shukla, and S. Kumaravel et al. Text-based RL agents with commonsense knowledge: New challenges, environments and baselines. In AAAI, 2021.
[Nickel and Kiela, 2017] M. Nickel and D. Kiela. Poincaré embeddings for learning hierarchical representations. In NeurIPS, 2017.
[Pennington et al., 2014] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[Sap et al., 2019] M. Sap, R. Le Bras, E. Allaway, C. Bhagavatula, and N. Lourie et al. ATOMIC: An atlas of machine commonsense for if-then reasoning. In AAAI, 2019.
[Schulman et al., 2015] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In ICML, 2015.
[Silver et al., 2016] D. Silver, A. Huang, C. J. Maddison, A. Guez, and L. Sifre et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
[Speer and Lowry-Duda, 2017] R. Speer and J. Lowry-Duda. ConceptNet at SemEval-2017 Task 2: Extending word embeddings with multilingual relational knowledge. In SemEval@ACL, 2017.
[Speer et al., 2017] R. Speer, J. Chin, and C. Havasi. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI, 2017.
[Vaezipoor et al., 2021] P. Vaezipoor, A. C. Li, R. T. Icarte, and S. A. McIlraith. LTL2Action: Generalizing LTL instructions for multi-task RL. In ICML, 2021.
[van Harmelen and ten Teije, 2019] F. van Harmelen and A. ten Teije. A boxology of design patterns for hybrid learning and reasoning systems. In BNAIC, 2019.
[Vaswani et al., 2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, and L. Jones et al. Attention is all you need. In NeurIPS, 2017.
[Watkins and Dayan, 1992] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 1992.
[Williams, 1992] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
[Xie and Pu, 2021] Y. Xie and P. Pu. How commonsense knowledge helps with natural language tasks: A survey of recent resources and methodologies. CoRR, 2021.
[Yee, 2019] E. Yee. Abstraction and concepts: when, how, where, what and why?, 2019.
[Zhang et al., 2020a] A. Zhang, C. Lyle, S. Sodhani, A. Filos, and M. Kwiatkowska et al. Invariant causal prediction for block MDPs. In ICML, 2020.
[Zhang et al., 2020b] M. Zhang, K. Ye, R. Hwa, and A. Kovashka. Story completion with explicit modeling of commonsense knowledge. In CVPR, 2020.
A Extracting class trees from commonsense KGs

To construct the abstraction functions φi presented in Section 4 from open-source knowledge graphs, we need a list of objects that can appear in the game. We collect trajectories via random play and extract the objects out of each state appearing in these trajectories. The extraction method depends on the representation of the state. In Wordcraft [Jiang et al., 2020] (see Section 5), the state is a set of objects, so no extraction is needed. In Textworld [Côté et al., 2018] the state consists of text. We use spaCy (https://spacy.io/) for POS-tagging and take all nouns to be the relevant objects in the game. Next, we look at how to extract class knowledge from WordNet [Miller, 1995], DBpedia [Lehmann et al., 2015] and ConceptNet [Speer et al., 2017] given a set of relevant objects. An overview of the characteristics of the extracted class trees for each knowledge graph and each environment is presented in Table 2.

Table 2: Characteristics of the extracted class trees for each KG in the difficult TWC games and Wordcraft. Missing entities (Miss. Ent.) is the number of entities that could not be matched to an entity in the KG, relative to the total number of game entities. The number of layers (#Layers) does not include the base level.

Env. | KG | Miss. Ent. | #Layers
TWC | WordNet | 2/148 | 9
TWC | DBpedia | 2/148 | 6
TWC | ConceptNet | 6/148 | 2
Wordcraft | WordNet | 13/700 | 16

A.1 WordNet

We use WordNet's hypernym relation between two nouns as the superclass relation. To query WordNet's synsets and the hypernym relation between them, we use NLTK (https://www.nltk.org/). For each object, we query the superclass relation recursively until the entity synset is reached. Having obtained this chain of superclasses for each object, we merge the chains into a class tree by aggregating superclasses appearing in multiple superclass chains of different objects into one node. An example of the resulting class tree can be found in Figure 7. One problem that arises is entity resolution, i.e. picking the semantically correct synset for a given word when multiple senses are available. Here, we simply take the first synset/hypernym returned. More complex resolution methods are possible. Objects that have no corresponding synset in WordNet have the noun "entity" as a parent in the class tree.

A.2 DBpedia

We use the SPARQL language (https://www.w3.org/TR/rdf-sparql-query/) to first query all URIs that have the object's name as a label (using the rdfs:label property). If a URI that is part of DBpedia's ontology can be found, we query the rdfs:subClassOf relation recursively until the URI owl:Thing is reached. If the URI corresponding to the object name is not part of the ontology, we query the rdf:type relation recursively until the URI owl:Thing is reached. This again gives us a chain of classes connected via the superclass relation for each object, which can be merged into a class tree as described in Section A.1. Any entity that cannot be mapped to a URI in DBpedia is directly mapped to the owl:Thing node of the class tree.

A.3 ConceptNet

We use ConceptNet's API (https://conceptnet.io/) to query the IsA relation for a list of objects. In ConceptNet, it can happen that there are multiple nodes with different descriptions for the same semantic concept connected to an object via the IsA relation. For example, the object airplane is connected, among other things, via the IsA relation to heavier-than-air craft, aircraft and fixed-wing aircraft. This causes two problems: (i) sampling only one superclass could lead to a class tree where objects of the same superclass are not aggregated due to minor differences in the class description; (ii) recursively querying the superclass relation until a general entity class is reached leads to a prohibitively large class tree, and is not guaranteed to terminate due to possible cycles in the IsA relation. To tackle the first problem, we maintain a set of superclasses for each object per abstraction level, instead of a single superclass per level. The representative embedding of the superclass of an object is then calculated by averaging the embeddings of the elements in the set of superclasses. To tackle the second problem, we restrict the superclasses to the two-step neighbourhood of the IsA relation, i.e. we have two abstraction levels. As further simplifications, we only consider IsA relations with a weight greater than or equal to one (the weight of a relation in ConceptNet is a measure of confidence in whether the relation is correct) and superclasses with a description length of at most three words.
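The snippet below is a hedged sketch of the WordNet procedure from Appendix A.1 using NLTK: follow the first hypernym chain of each object up to the root synset and merge the chains into a single child-to-parent map. It illustrates the described steps and is not the released extraction code.

```python
# Sketch of A.1: hypernym chains from NLTK's WordNet interface merged into a class tree.
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet') once

def hypernym_chain(word):
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return [word, "entity.n.01"]     # unmatched objects hang directly off the root
    chain, synset = [word], synsets[0]   # naive entity resolution: take the first synset
    while True:
        chain.append(synset.name())
        hypernyms = synset.hypernyms()
        if not hypernyms:
            break
        synset = hypernyms[0]            # take the first hypernym, as described in A.1
    return chain

def build_class_tree(objects):
    parent = {}
    for obj in objects:
        chain = hypernym_chain(obj)
        # Shared superclasses collapse into one node because dictionary keys are unique.
        for child, sup in zip(chain, chain[1:]):
            parent[child] = sup
    return parent

tree = build_class_tree(["goat", "sheep", "spider", "barn"])
print(tree["goat"])   # the first goat synset, e.g. 'goat.n.01'
```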
B Residual Q-learning

First, the adaptation of Q-learning to the setting described in Section 3 is explained. Then, results of the algorithm in the toy environment from Section 5 are presented.

B.1 Algorithm

In Section 4 we focused on how to adapt policy gradient methods to learn a policy over a hierarchy of abstract states s = (s_1, ..., s_n). Here we provide a similar adaptation of Q-learning [Watkins and Dayan, 1992] and present results from the toy environment. The sum approach naturally extends to Q-learning, but instead of summing over logits from each layer, we sum over action values. Let s_t = (s_{1,t}, ..., s_{n,t}) be the hierarchy of abstract states at time t; then the action value is given by

Q(\mathbf{s}_t, a) = \sum_{i=1}^{n} \mathrm{NN}_{\theta_i}(s_i, a),    (11)

where NN_{θi} is a neural network parameterised by θi. The idea of the residual Q-learning method is similar to the residual policy gradient, i.e. we want to learn an action value function on the most abstract level and then adapt it on lower levels as more information on the state becomes available.
Figure 7: Visualisation of the class tree constructed from a subset of entities appearing in the Textworld Commonsense environment. The subclass relations are extracted from WordNet.

Let s_t = (s_{1,t}, ..., s_{n,t}) be a hierarchy of abstract states at time t and R_i the reward function of the abstract MDP on the i-th level as defined in Section 3. Further define R_{n+1} ≡ 0. Then one can write the action value function for an episodic task with discount factor γ as

Q(\mathbf{s}, a) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t R_{1,t} \,\middle|\, S = \mathbf{s}, A = a\right]
= \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t \sum_{i=1}^{n} (R_{i,t} - R_{i+1,t}) \,\middle|\, S = \mathbf{s}, A = a\right]
= \sum_{i=1}^{n} Q^{\mathrm{res}}_i(s_{i,t}, s_{i+1,t}, a),

where Q^res_i is defined as

Q^{\mathrm{res}}_i(s_{i,t}, s_{i+1,t}, a) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t (R_{i,t} - R_{i+1,t}) \,\middle|\, S = \mathbf{s}, A = a\right].

The dependency of the reward function R_{i,t} on the state s_{i,t} and action a is omitted for brevity. At each layer of the abstraction, the residual action value function Q^res_i learns how to adapt the action values of the previous layer to approximate the discounted return for the now more detailed state-action pair. This information difference is characterised by the difference in the reward functions of both layers, R_i − R_{i+1}.
Instead, we approximate Ri+1 via This can be seen as performing Q-learning over the following the difference between the current and next state action value residual MDP Mires = (Si × Si+1 , A, Rires , Tires , γ) with of the previous abstraction level, i.e. Rires = Ri (si,t , a) − Ri+1 (si+1,t , a) and Ri+1 = Qi+1 (si+1,t , a) − γQi+1 (si+1,t+1 , a). (12) Tires = Ti (si,t+1 |si,t , a) · Ti+1 (si+1,t+1 |si+1,t , a). The adaptions made by Mnih et al. [2015] to leverage deep We cannot apply the update rule directly, since each transi- neural networks for Q-learning can be used to scale residual tion only contains a single sample for both reward functions, Q-learning to high-dimensional state spaces.
Figure 8: Training and generalisation results in the ideal environment. The abstract state perfectly determines the action to take in the base state (top). Noise is added to this relation, such that 40% of the time the optimal action is determined randomly beforehand (middle). In the ambiguous setting, not every abstract state has multiple substates (bottom). Experiments are run over five seeds.

B.2 Results in the Toy Environment

The results resemble the results already discussed for the policy gradient case (cf. Section 6.1). Without any noise or ambiguity in the state abstraction, the residual and sum approach learn as efficiently with respect to the number of samples and generalise as well as if the perfect abstraction were given (Figure 8, top). The sum approach struggles to learn the action values on the correct abstraction level if the abstraction is ambiguous, hampering its generalisation performance (Figure 8, bottom). Both the sum and the residual approach generalise worse in the presence of noise (Figure 8, middle). However, the achieved reward at test time is less influenced by noise for the residual approach. We did not evaluate the residual Q-learning approach on the more complex environments, as policy gradient methods have shown better performance in previous work [Murugesan et al., 2021].

C Toy environment

Here, a visualisation of the toy environment, details on the tree representing the environment and the network architecture used for the policy are given.

C.1 Environment Details

As explained in Section 5, the environment is determined by a tree where each node represents a state of the base MDP (leaves) or an abstract state (inner nodes). In a noise-free setting the optimal action for a base state is determined by a corresponding abstract state, while with noise the optimal action is sampled at random with probability σ or else determined by the corresponding abstract state. A visualisation of the environment and what changes for each condition can be seen in Figure 9 and is summarised in Table 3.

Table 3: Properties of the environment tree for the three settings: basic, noise and ambiguous, with noise probability σ.

Setting | Branching factor | σ | Abstraction ambiguity
Basic | [7, 10, 8, 8] | 0 | No
Noise | [7, 10, 8, 8] | 0.5 | No
Ambig. | [7, 10, 8, 1] | 0 | Yes

Figure 9: Simplified version of the tree used to define the toy environment. In the base method the agent can only observe the leaf states. The optimal action for each state is displayed by the number in each node. At the top the environment in the basic setting is displayed, in the middle with noise, and at the bottom with ambiguous abstraction.

C.2 Network architecture and hyperparameters

The policy is trained via REINFORCE [Williams, 1992] with a value function as baseline and entropy regularisation. The value function and policy are parameterised by two separate networks and the entropy regularisation parameter is β. For the baseline and the optimal abstraction method, the value network is a one-layer MLP of size 128 and the policy is a three-layer MLP of size (256, 128, 64). In case abstraction is available, the value network is kept the same but the policy network is duplicated (with a separate set of weights) for each abstraction level to parameterise the NN_{θi} (Section 4).

Table 4: Hyperparameters for training in the toy environment.

Hyperparameter | Value
number of episodes | 5000
evaluation frequency | 100
number of evaluation samples | 20
batch size | 32
state embedding size | 300
learning rate | 0.0001
β | 1
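As a pointer to what the objective described in C.2 looks like in code, here is a minimal, hedged sketch of a REINFORCE loss with a value baseline and entropy regularisation; tensor shapes and the β default are illustrative assumptions.

```python
# Sketch of the training objective in C.2: REINFORCE with a value baseline
# and entropy regularisation (coefficient beta).
import torch
import torch.nn.functional as F

def reinforce_loss(policy_dist, values, actions, returns, beta=1.0):
    """policy_dist: torch.distributions.Categorical over actions,
    values: value-network output V(s), actions/returns: tensors of shape [batch]."""
    advantage = returns - values.detach()                     # baseline-subtracted return
    policy_loss = -(advantage * policy_dist.log_prob(actions)).mean()
    value_loss = F.mse_loss(values, returns)                  # fit the baseline
    entropy_bonus = policy_dist.entropy().mean()              # encourage exploration
    return policy_loss + value_loss - beta * entropy_bonus
```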
D Wordcraft environment

D.1 Network architecture and hyperparameters

We build upon the network architecture and training algorithm used by Jiang et al. [2020]. Here we only present the changes necessary to compute the abstract policies πi defined in Section 4 and the chosen hyperparameters. To process a hierarchy of states s_t = (s_{t,1}, ..., s_{t,n}), one can simply add a dimension to the state tensor representing the abstraction level. Classical network layers such as the multi-layer perceptron (MLP) can be adapted to process such a state tensor by adding a weight dimension representing the weights for each abstraction level and using batched matrix multiplication. We tuned the learning rate and entropy regularisation parameter (β) by performing a grid search over possible parameter values. We decided to collapse layers based on the additional abstraction each layer contributed (Section 4). For the sum and residual approach, a learning rate of 0.0001 and β = 0.1 were used across experiments, and layers 9–15 of the class hierarchy were collapsed. For the baseline, a learning rate of 0.001 and β = 0.1 were used. Only in the unseen goal condition did the baseline with GloVe embeddings have to be trained with a learning rate of 0.0001.
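The following is a small sketch of the level-batched layer described above: the state tensor carries an extra abstraction-level dimension and every level gets its own weight matrix, applied in a single batched matrix multiplication. Shapes and initialisation are assumptions for illustration, not the exact layer used in the experiments.

```python
# Sketch of a linear layer with a separate weight matrix per abstraction level.
import torch
import torch.nn as nn

class LevelwiseLinear(nn.Module):
    def __init__(self, num_levels, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_levels, in_dim, out_dim) * in_dim ** -0.5)
        self.bias = nn.Parameter(torch.zeros(num_levels, out_dim))

    def forward(self, x):
        # x: [batch, num_levels, in_dim] -> [batch, num_levels, out_dim];
        # the einsum performs one matrix multiplication per abstraction level.
        return torch.einsum("bli,lio->blo", x, self.weight) + self.bias

x = torch.randn(32, 4, 300)                       # 32 states, 4 abstraction levels
layer = LevelwiseLinear(num_levels=4, in_dim=300, out_dim=128)
print(layer(x).shape)                             # torch.Size([32, 4, 128])
```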
D.2 Additional Results

Figure 10: Generalisation results for the Wordcraft environment with respect to unseen recipes. Experiments are run over five seeds.

In the Wordcraft environment, the results with respect to generalisation to unseen recipes (Figure 10) show similar patterns to the generalisation results presented in Section 6.3. With random embeddings, adding class abstraction improves generalisation, but not as much as when random embeddings are replaced with GloVe embeddings [Pennington et al., 2014]. The combination of class abstraction with GloVe embeddings does not lead to additional improvements. The objects that need to be combined in Wordcraft rarely follow any class rules, i.e. objects are not combined based on their class. Word embeddings capture more fine-grained semantics than just class knowledge [Pennington et al., 2014]; therefore, it is possible that class prior knowledge is less prone to overfitting. In a low-data setting it is possible that word embeddings contain spurious correlations which a policy can leverage to perform well during training but which do not generalise well.

To test this hypothesis we created a low-data setting in which training was performed with only 10% of all possible goal entities in the training set instead of the usual 80%. In Figure 11 one can see that all methods have a drop in performance (cf. Section 6.3), but the drop is substantially smaller for the methods with abstraction and random embeddings compared to the baseline with GloVe embeddings. The generalisation performance reverses, i.e. the baseline with GloVe embeddings now performs the worst.

Figure 11: Generalisation results for the Wordcraft environment with respect to unseen goal entities, trained on 10% of all available recipes instead of 80%. Experiments are run over five seeds.

To understand whether and how the learned policy makes use of the class hierarchy, we visualise a policy trained with the residual approach, random embeddings and access to the full class hierarchy extracted from WordNet. For the visualisation we choose recipes that follow a class rule. In Wordcraft one can create an airplane by combining a bird and something that is close to a machine. From Figure 13 one can see that the policy learns to pick the correct action on level nine by recognising that a vertebrate and a machine need to be combined.

E Runtime Analysis

Often an improved sample efficiency is achieved via an extension of an existing baseline, which introduces a computational overhead. If the method is tested in an environment where sampling is not the computational bottleneck, it remains an open question whether the improvement in sample efficiency also translates into higher reward after a fixed time budget. To test this, we measured the average time needed for one timestep over multiple seeds and fixed hardware, and plotted the reward over elapsed time for the baseline and for the residual method with abstract states based on WordNet. All experiments were run on a workstation with an Intel Xeon W2255 processor and a GeForce RTX 3080 graphics card. From Figure 12 we can see that the baseline as well as the extension via abstraction achieve the same reward after a fixed time. Even if computation time is an issue, the residual approach takes no more time than the baseline, and its generalisation performance in terms of normalised reward is better.

Figure 12: Normalised reward after time in seconds for the baseline and for the residual approach when abstraction is added from WordNet.
Figure 13: Visualisation of the policy trained via the residual approach, random embeddings and a class hierarchy extracted from WordNet. For each abstraction level the state representation is shown, which entity has been selected (-1 means no entity has been selected yet) and the abstracted goal entity. The probability with which each entity is picked is plotted, and the red bars indicate the optimal actions.