Meta-Reinforcement Learning for Mastering Multiple Skills and Generalizing across Environments in Text-based Games

Zhenjie Zhao
Nanjing University of Information Science and Technology
zzhaoao@nuist.edu.cn

Mingfei Sun
University of Oxford
mingfei.sun@cs.ox.ac.uk

Xiaojuan Ma
The Hong Kong University of Science and Technology
mxj@cse.ust.hk
Abstract

Text-based games can be used to develop task-oriented text agents for accomplishing tasks with high-level language instructions, which has potential applications in domains such as human-robot interaction. Given a text instruction, reinforcement learning is commonly used to train agents to complete the intended task owing to its convenience of learning policies automatically. However, because of the large space of combinatorial text actions, learning a policy network that generates an action word by word with reinforcement learning is challenging. Recent research works show that imitation learning provides an effective way of training a generation-based policy network. However, agents trained with imitation learning struggle to master a wide spectrum of task types or skills, and it is also difficult for them to generalize to new environments. In this paper, we propose a meta-reinforcement learning based method to train text agents through learning-to-explore. In particular, the text agent first explores the environment to gather task-specific information and then adapts the execution policy for solving the task with this information. On the publicly available testbed ALFWorld, we conducted a comparison study with imitation learning and show the superiority of our method.

1 Introduction

A text-based game, such as Zork (Infocom, 1980), is a text-based simulation environment that a player interacts with through text commands. For example, given the current text description of a game environment, users change the environmental state by inputting a text action, and the environment returns a text description of the next environmental state. Users have to take text actions to change the environmental state iteratively until an expected final state is achieved (Côté et al., 2018). Solving text-based games requires non-trivial natural language understanding/generalization and sequential decision making. Developing agents that can play text-based games automatically is promising for enabling a task-oriented, language-based human-robot interaction (HRI) experience (Scheutz et al., 2011). Supposing that a text agent can reason about a given command and generate a sequence of text actions for accomplishing the task, we can then use text as a proxy and connect the text inputs and outputs of the agent with multi-modal signals, such as vision and physical actions, to allow a physical robot to operate in the physical space (Shridhar et al., 2021).

Given a text instruction or goal, reinforcement learning (RL) (Sutton and Barto, 2018) is commonly used to train agents to finish the intended task automatically. In general, there are two approaches to train a policy network to obtain the corresponding text action: generation-based methods that generate a text action word by word, and choice-based methods that select the optimal action from a list of candidates (Côté et al., 2018). The list of action candidates in a choice-based method may be limited by pre-defined rules and hard to generalize to a new environment. In contrast, generation-based methods can generate more possibilities and potentially have a better generalization ability. Therefore, to allow a text agent to fully explore an environment and obtain the best performance, a generation-based method is needed (Yao et al., 2020). However, the combinatorial action space precludes reinforcement learning from working well on a generation-based policy network. Recent research shows that imitation learning (Ross et al., 2011) provides an effective way to train a generation-based policy network using demonstrations or dense reinforcement signals (Shridhar et al., 2021). However, it is still difficult for the trained policy to master multiple task types or skills and generalize across environments (Shridhar et al., 2021).
For example, an agent trained on the task type of slicing an apple cannot work on a task of pouring water. Such a lack of the ability to generalize precludes the agent from working in a real interaction scenario. To achieve a real-world HRI experience with text agents, two requirements should be fulfilled: 1) a trained agent should master multiple skills simultaneously and work on any task type that it has seen during training; 2) a trained agent should also generalize to unseen environments.

Meta-reinforcement learning (meta-RL) is a commonly used technique to train an agent that generalizes across multiple tasks through summarizing experience over those tasks. The underlying idea of meta-RL is to incorporate meta-learning into reinforcement learning training, such that the trained agent, e.g., a text-based agent, could master multiple skills and generalize across different environments (Finn et al., 2017; Liu et al., 2020). In this paper, we propose a meta-RL based method to train text agents through learning-to-explore. In particular, a text agent first explores an environment to gather task-specific information. It then updates the agent's policy towards solving the task with this task-specific information for better generalization performance. On a publicly available testbed, ALFWorld (Shridhar et al., 2021), we conducted experiments on all its six task types (i.e., pick & place, examine in light, clean & place, heat & place, cool & place, and pick two & place), where for each task type there is a set of unique environments sampled from the distribution defined by the task type (see Section 5.1 for statistics). Results suggest that our method generally masters multiple skills and enables better generalization performance on new environments compared to ALFWorld (Shridhar et al., 2021). We provide further analysis and discussion to show the importance of task diversity for meta-RL. The contributions of this paper are:

• From the perspective of human-robot interaction, we identify the generalization problem of training an agent to master multiple skills and generalize to new environments. We propose to use meta-RL methods to achieve it.

• We design an efficient learning-to-explore approach which enables a generation-based agent to master multiple skills and generalize across a wide spectrum of environments.

2 Related Work

2.1 Language-based Human-Robot Interaction

Enabling a robot to accomplish tasks with language goals is a long-term study of human-robot interaction (Scheutz et al., 2011), where the core problem is to ground language goals with multi-modal signals and generate an action sequence for the robot to accomplish the task. Because of the characteristic of sequential decision making, reinforcement learning (Sutton and Barto, 2018) is commonly used. Previous research works using reinforcement learning have studied the problem on simplified block worlds (Janner et al., 2018; Bisk et al., 2018), which could be far from realistic. The recent interest in embodied artificial intelligence (embodied AI) has contributed to several realistic simulation environments, such as Gibson (Xia et al., 2018), Habitat (Savva et al., 2019), RoboTHOR (Deitke et al., 2020), and ALFRED (Shridhar et al., 2020). However, because of physical constraints in a real environment, a gap between a simulation environment and the real world still exists (Deitke et al., 2020; Shridhar et al., 2021). Researchers have also explored the idea of finding a mapping between vision signals of a real robot and language signals directly (Blukis et al., 2020), but this mapping requires detailed annotated data, and it is usually expensive to obtain physical interaction data. An alternative method of deploying an agent on a real robot is to train the agent on an abstract text space, such as TextWorld (Côté et al., 2018), and then connect text with the multi-modal signals of the robot (Shridhar et al., 2021). For example, by connecting text with the simulated environment ALFRED (Shridhar et al., 2020), researchers have shown that the trained text agent has better generalization ability than training an embodied agent end-to-end directly (Shridhar et al., 2021). However, how to make a text agent generalize across different tasks, so that one robot can work on tasks of different types and in unseen environments, is still a challenging problem, and it is the focus of this paper.

2.2 Text-based Games

The success of deep reinforcement learning (RL) on Atari games (Mnih et al., 2015) inspires the use of RL on text-based games. There are a variety of ways to use deep reinforcement learning on text-based games.
For example, using the deep Q-learning (DQN) framework, Narasimhan et al. (2015) leverage the Long Short-Term Memory (LSTM) as the policy network to predict the action for each state. In (He et al., 2016), researchers propose the deep reinforcement relevance network (DRRN), which encodes states and actions separately and then calculates Q-values by integrating the information of the two channels. However, the compositional and combinatorial properties of natural language lead to large state and action spaces, which makes solving text-based games with deep reinforcement learning very challenging. To deal with this problem, in fiction-style text games, Adhikari et al. (2020) use a graph-aided transformer (GATA) to capture game dynamics so that the agent can plan well and select text actions more effectively. Ammanabrolu and Riedl (2019) learn a knowledge graph during the exploration of an agent, and use it to prune the action space. Furthermore, Murugesan et al. (2021) show that incorporating common sense knowledge also helps reduce the action space and allows an agent to choose an action more effectively. Recently, Yao et al. (2020) show that, given a text state, a fine-tuned language model GPT can generate a corresponding text action set, which significantly reduces the action space and also improves performance. Previous research works mainly focus on learning an agent that solves one text game effectively. However, in reality, we usually hope an agent can learn a wide spectrum of tasks and generalize well to unseen environments. Adolphs and Hofmann (2020) show that an actor-critic framework with action space pruning can train an agent that generalizes to unseen games belonging to the same family as the training games, in terms of environments and task descriptions. In this paper, with meta-reinforcement learning, we investigate whether an agent can master multiple task types and generalize to unseen environments.

2.3 Meta-reinforcement Learning

Meta-learning is a machine learning paradigm that tries to leverage common knowledge among tasks to generalize to new data (Thrun and Pratt, 1998; Vilalta and Drissi, 2002). Meta-reinforcement learning, in particular, augments Markov decision processes with particular task labels, and tries to use the shared experience of interacting with different tasks to adapt to a new task efficiently (Liu et al., 2020). In general, there are three ways of conducting meta-reinforcement learning: memory-based methods, optimization-based methods, and learning-to-explore. For memory-based methods, researchers have proposed RL2 (Duan et al., 2016), which uses a recurrent neural network (RNN) to encode a "fast" RL algorithm, and the RNN module is trained with another "slow" RL algorithm. Memory-based methods are usually hard to optimize and suffer from the sample efficiency problem (Duan et al., 2016). For optimization-based methods, in (Finn et al., 2017), researchers propose a model-agnostic meta-reinforcement learning algorithm that uses a nested optimization procedure to obtain maximal rewards with a limited number of sample trajectories. Optimization-based methods usually require on-policy reinforcement learning algorithms and are hard to combine with value-based methods (Finn et al., 2017), which also leads to the sample efficiency problem. Learning-to-explore is a newly proposed meta-reinforcement learning approach that can potentially leverage any reinforcement learning method with good optimization properties by decoupling an episode into two stages: exploration and execution (Rakelly et al., 2019; Liu et al., 2020). The exploration stage is used to recognize task-specific information, which could be useful for the execution stage for fast and efficient adaptation.

For embodied AI, using meta-reinforcement learning, researchers have explored improving the generalization ability of an agent to unseen environments (Wortsman et al., 2019). However, as aforementioned, deploying such an agent on a real robot is still a challenging problem owing to the domain gap between a simulation environment and a physical environment. In this paper, we instead use the learning-to-explore method of meta-reinforcement learning to increase the generalization ability of a text agent so that it can master multiple skills and work in new environments, which can potentially facilitate real-world human-robot interaction applications.

3 Problem Formulation

3.1 Text-based Game Preliminary

Given a language goal g, playing a text-based game can be modeled as a partially observable Markov decision process (POMDP) (S, P, A, Ω, O, R, γ) (Côté et al., 2018), where S is the set of environmental states, P is the set of transition probabilities, A is the set of actions, Ω is the set of observations, O is the set of observation probabilities, R is the reward function, and γ is the discount factor. If we input an action a_t to the environment, it will transition from the current state s_t to a new state s_{t+1} with probability P(s_{t+1} | s_t, a_t), output an observation o_{t+1} based on the new state with probability O(o_{t+1} | s_{t+1}), and return a reward R(g, a_t, s_t) depending on the goal g, the current action a_t, and the current state s_t. Given the initial environment state s_0 and a goal g, we want to learn a policy π(a | o, g) that can generate an action sequence (a_0, a_1, ..., a_T) to accomplish the task and obtain the maximal discounted reward Σ_{t=0}^{T} γ^t R(g, a_t, s_t). In text-based games, o and a refer to text sentences.
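For concreteness, the following minimal sketch illustrates this interaction loop and the discounted return defined above. The environment interface (reset/step) and the policy signature are illustrative assumptions rather than the exact ALFWorld API.

```python
# A minimal sketch of the POMDP interaction loop described above, assuming a
# hypothetical text environment `env` with reset()/step() and a policy
# `policy(observation, goal)` that returns a text action (both are assumptions).

def run_episode(env, policy, goal: str, gamma: float = 0.99, max_steps: int = 50) -> float:
    """Roll out one episode and return the discounted return sum_t gamma^t R(g, a_t, s_t)."""
    observation = env.reset(goal)          # initial observation o_0 for goal g
    discounted_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(observation, goal)             # text action a_t
        observation, reward, done = env.step(action)   # o_{t+1}, R(g, a_t, s_t), terminal flag
        discounted_return += discount * reward
        discount *= gamma
        if done:
            break
    return discounted_return
```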
3.2 Learning-to-Explore in Text-based Games

In meta-reinforcement learning, we consider a family of POMDPs {(S_µ, A_µ, Ω_µ, γ, O_µ, R, P_µ)} indexed by µ, where µ ∈ M denotes a task and M denotes the family of POMDPs or tasks. Here, we consider that the reward function is independent of tasks and can be applied to all POMDPs. The tasks in the family have task-dependent sets of states S_µ, actions A_µ, observations Ω_µ, observation probabilities O_µ, and dynamics P_µ. Following the setting in (Liu et al., 2020), given a goal g, a task-based meta-reinforcement learning problem consists of sampling a task µ ∼ p(µ) and running a trial, where a trial contains an exploration episode followed by several execution episodes. We also refer to a goal as a task type or a skill, because it usually constrains how an agent solves a task µ. We call a POMDP without the reward function an environment, contextualized with the task specifier µ, since it defines a game environment that an agent can interact with. A task denoted by µ then contains a task type, an environment, and a reward function. Given a set of training tasks M_train, we want to train a policy π(a | o, g) that can generalize well across a set of testing tasks M_test. For training, we first fit a task-specific feature vector z_µ′ using the exploration episode, and then use it to adapt to the task quickly during execution. The task-specific adaptation helps the policy π recognize which task type it works on and generalize well to a new, unseen environment.

4 Method

We use neural networks to map observations to actions. Given the general setting of meta-reinforcement learning through learning-to-explore, our method contains three modules: an execution policy neural network π_ψ, a task identifier neural network q_θ, and an exploration policy neural network p_φ, where ψ, θ, and φ denote the parameters of the three neural networks, respectively. As shown in Figure 1, the exploration policy p_φ is trained to generate a task-specific feature vector z_µ′, which is then input to the execution policy π_ψ for generating actions. During training, a task identifier is used to generate supervised signals z_µ for z_µ′, and it is not used during testing. Because of z_µ′, π_ψ can adapt quickly and generalize well to a new task.

[Figure 1 diagram: the task identifier q_θ(z_µ | µ), the exploration policy p_φ, the execution policy π_ψ, and the environment, with the signals exchanged among them.]
Figure 1: Overview of our method, where g is the language goal, µ denotes a task index, z_µ and z_µ′ are hidden feature vectors of a task, and a_t is the generated text action. The dotted-line box is only used during training. For simplicity, we did not draw the inputs of the roll-out trajectories.

π_ψ, q_θ, and p_φ are all encoder-decoder architectures. π_ψ takes a goal g and a K-step roll-out trajectory τ_t = (o_0, a_{t−K}, o_{t−K+1}, ..., a_{t−1}, o_t) from time t−K+1 to time t as inputs, and outputs the current action a_t, where o_0 is obtained by executing the "look" action at the beginning. o_0 is used because it is the only observation that lists the different areas of the room. q_θ takes a task index µ as input and outputs the task-specific feature z_µ, which is only used during training. p_φ takes a goal and a K-step roll-out trajectory τ_t = (o_0, a_{t−K}, o_{t−K+1}, ..., a_{t−1}, o_t) as inputs and outputs an estimated task-specific feature z_µ′.
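To make the input format concrete, the sketch below shows one way the K-step roll-out trajectory τ_t could be maintained, keeping the initial "look" observation o_0 together with the last K (action, observation) pairs. The TrajectoryBuffer class and the env and pi_psi interfaces referenced in the usage comment are hypothetical illustrations, not part of our implementation.

```python
from collections import deque

class TrajectoryBuffer:
    """Maintains tau_t = (o_0, a_{t-K}, o_{t-K+1}, ..., a_{t-1}, o_t) for the policy inputs."""
    def __init__(self, o0: str, k: int = 3):
        self.o0 = o0                      # initial "look" observation listing the room areas
        self.k = k                        # trajectory length K (set to 3 in our experiments)
        self.steps = deque(maxlen=k)      # last K (action, observation) pairs

    def append(self, action: str, observation: str) -> None:
        self.steps.append((action, observation))

    def as_input(self) -> list:
        # Flatten to (o_0, a_{t-K}, o_{t-K+1}, ..., a_{t-1}, o_t) for the encoders.
        flat = [self.o0]
        for action, observation in self.steps:
            flat.extend([action, observation])
        return flat

# Usage sketch (assumed interfaces): one execution step with a text environment `env`
# and an execution policy `pi_psi(z_mu, goal, tau)` that returns a text action.
# o0 = env.step("look"); tau = TrajectoryBuffer(o0, k=3)
# action = pi_psi(z_mu, goal, tau.as_input())
# observation = env.step(action); tau.append(action, observation)
```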
Our goal is to make the execution policy π(a_t | g, τ_t) generalizable across tasks. If we train π using imitation learning, it is critical to have enough training samples {(g, τ_t, a_t)} following some distribution to obtain good generalization performance. But because of the combinatorial complexity of τ_t, it is hard to obtain enough data for a_t. Borrowing from the conditional variational auto-encoder (CVAE) (Sohn et al., 2015), we factorize π with a task-specific hidden variable z and use z to facilitate the generation of a_t, namely,

    π(a_t | g, τ_t) = ∫_{z ∈ Z} p(z | g, τ_t) π(a_t | z, g, τ_t) dz,        (1)

where Z ∼ N(z_µ, σ²I) is assumed to follow a Gaussian distribution, the aforementioned task-specific feature vector z_µ is the mean vector, and σ² is the variance. During testing, we can then generate actions by first generating a task-specific hidden variable z with p(z | g, τ_t) and then generating the action with π(a_t | z, g, τ_t). Because z encodes task-specific features, it helps π generate more appropriate actions for the current task µ.

Optimizing (1) amounts to maximizing the evidence lower bound (ELBO) (Sohn et al., 2015):

    ELBO(a_t, g, τ_t) = E_{q(z | a_t, g, τ_t)}[log π(a_t | z, g, τ_t)] − KL(q(z | a_t, g, τ_t) || p(z | g, τ_t)),        (2)

where q(z | a_t, g, τ_t) is the approximate posterior probability of z and p(z | g, τ_t) is the prior probability of z. To implement (2), we use the execution policy network π_ψ(a_t | z_µ, g, τ_t), the task identifier q_θ(z_µ | µ), and the exploration policy network p_φ(z_µ′ | g, τ_t) to approximate the execution policy, the posterior, and the prior, respectively, and assume that both q_θ(z_µ | µ) and p_φ(z_µ′ | g, τ_t) are Gaussian. It is easy to show that the new objective is:

    E_{q_θ(z_µ | a_t, g, τ_t)}[log π_ψ(a_t | z_µ, g, τ_t)] − ||z_µ − z_µ′||₂² / (2σ²),        (3)

where we assume σ² is the same for both the posterior and the prior. In the following, we introduce the details of the execution policy network, the task identifier, and the exploration policy network.

4.1 Execution Policy

The architecture of the execution policy network is similar to the policy network in (Shridhar et al., 2021). In particular, a QANet (Yu et al., 2018) is used to first encode g and τ_t into a recurrent hidden state h_t and then decode h_t to get a_t. Different from (Shridhar et al., 2021), during encoding we concatenate the initial encoding h_RNN and z_µ as an input to obtain h_t, namely,

    h_RNN = Encode(g, τ_t),
    h_t = GRU(ReLU(W(h_RNN ⊕ z_µ) + b), h_{t−1}),

where ⊕ denotes the concatenation operation, W ∈ R^{d_e × 2d_e} is a weight matrix, b ∈ R^{d_e} is a bias vector, h_RNN ∈ R^{d_e}, h_t ∈ R^{d_h}, d_e is the dimension of z_µ, d_h is the dimension of h_t, GRU denotes a gated recurrent unit, and ReLU denotes the ReLU activation function. Compared to selecting text actions from a set of valid actions, generating text actions word by word is more likely to explore multiple possibilities for performing actions to achieve higher rewards (Yao et al., 2020). However, Shridhar et al. (2021) show that generation-based methods struggle to achieve good performance when trained from a sparse reinforcement learning signal in ALFWorld. Because it is relatively easy to obtain demonstrations from a text-based game, similar to (Shridhar et al., 2021), the imitation learning method DAgger (Ross et al., 2011) is used to train the generation-based execution policy π_ψ. In this case, optimizing the execution policy network amounts to optimizing the first term of (3).

4.2 Task Identifier

We use a task identifier q_θ(z_µ | µ) to approximate the posterior q(z | a_t, g, τ_t). The task identifier is used to generate task-specific features during training. We implement it as a simple two-layer fully connected network:

    z_µ = ReLU(W_2 ReLU(W_1 e(µ) + b_1) + b_2),

where e(µ) is the one-hot encoding of the task index µ, W_2 ∈ R^{d_e × d_e}, W_1 ∈ R^{d_e × N}, b_1, b_2 ∈ R^{d_e}, d_e is the dimension of the task embedding z_µ, and N is the number of training game environments.

4.3 Exploration Policy

We use an exploration policy network p_φ(z_µ′ | g, τ_t) to approximate the prior p(z | g, τ_t). The exploration policy needs to explore the environment to gather a task-specific trajectory within T^exp steps. Because we train the model end-to-end, it will optimize the agent to explore the environment in this fixed number of steps, which also saves time. The architecture is similar to the execution policy network. An encoder takes g and τ_t as inputs and generates a hidden state h_t, and the hidden state is then used to obtain z_µ′ via a fully connected layer:

    z_µ′ = ReLU(W h_t + b),

where W ∈ R^{d_e × d_h} and b ∈ R^{d_e}.
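A minimal PyTorch sketch of the three small components just defined (the h_RNN ⊕ z_µ fusion of Section 4.1, the two-layer task identifier of Section 4.2, and the fully connected head of Section 4.3) is given below. The module names and dimension arguments are assumptions for illustration only, and the QANet encoder/decoder is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskIdentifier(nn.Module):
    """q_theta(z_mu | mu): two-layer fully connected network over a one-hot task index."""
    def __init__(self, num_tasks: int, d_e: int):
        super().__init__()
        self.fc1 = nn.Linear(num_tasks, d_e)   # W_1 e(mu) + b_1
        self.fc2 = nn.Linear(d_e, d_e)         # W_2 (.) + b_2
    def forward(self, task_index: torch.Tensor) -> torch.Tensor:
        one_hot = F.one_hot(task_index, num_classes=self.fc1.in_features).float()
        return F.relu(self.fc2(F.relu(self.fc1(one_hot))))

class ExplorationHead(nn.Module):
    """Maps the encoder hidden state h_t to the estimated task feature z_mu' (Section 4.3)."""
    def __init__(self, d_h: int, d_e: int):
        super().__init__()
        self.fc = nn.Linear(d_h, d_e)          # W h_t + b
    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        return F.relu(self.fc(h_t))

class FusedRecurrentState(nn.Module):
    """h_t = GRU(ReLU(W (h_RNN concat z_mu) + b), h_{t-1}) from Section 4.1."""
    def __init__(self, d_e: int, d_h: int):
        super().__init__()
        self.fuse = nn.Linear(2 * d_e, d_e)    # W (h_RNN concat z_mu) + b
        self.gru = nn.GRUCell(d_e, d_h)
    def forward(self, h_rnn, z_mu, h_prev):
        return self.gru(F.relu(self.fuse(torch.cat([h_rnn, z_mu], dim=-1))), h_prev)
```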
For the exploration policy, in addition to obtaining z_µ′, we also decode h_t to get an exploration action a_t: p_φ(a_t | g, τ_t). In other words, we adopt a multi-task learning method to train the exploration network. In this way, the exploration policy also learns how to solve the problem, which could help the learning of z_µ′. We optimize the following multi-task objective:

    L = L_µ + L_dqn,        (4)

where L_µ is the task embedding loss and L_dqn is the DQN loss. In particular, L_µ is the second term in (3), except that we do not consider the coefficient 1/(2σ²). For the DQN loss L_dqn, we use the deep Q-learning (DQN) method to train the exploration policy. Unlike the execution policy network, we do not use demonstrations here because we want the policy network to explore the environment more. DQN is an off-policy method that can leverage a replay buffer to deal with the sample efficiency problem. Here, we use DQN for its simplicity, but it is possible to use other more sophisticated off-policy methods. Because it is generally difficult to train a generation-based text agent with only the sparse rewards provided by the environment, we adopt the choice-based method to train the text agent. We empirically make the reward function dense by adding the second term in Eq. (3) to the reward, R_new = 0.5 × R_old + 0.5 × ||z_µ − z_µ′||₂², to encourage per-step optimization, where R_old is the reward provided by the environment.

The training procedure of the proposed method, as presented in Algorithm 1, runs as follows: first, we randomly sample a batch of tasks M_B from M_train; second, with the task indices, we evaluate q_θ to obtain the task-specific features z_µ; third, the exploration agent explores M_B by taking actions with p_φ, and p_φ is updated according to Eq. (4) through DQN learning, with z_µ′ also obtained by p_φ during exploration; fourth, the execution agent takes actions with π_ψ and we update the likelihood (the first term in Eq. (3)) with demonstrations from the training data. The end-to-end training runs iteratively up to a maximal step M_step. In Algorithm 1, B denotes the sampling size of tasks, T^exp is the number of exploration steps, and T^exec is the number of execution steps.

Algorithm 1: The training procedure
    Input: training tasks M_train
    Output: execution policy π_ψ, exploration policy p_φ
    initialize hyper-parameters M_step, B, T^exp, T^exec
    initialize π_ψ, p_φ, and q_θ
    i ← 0
    while True do
        if i > M_step then
            break
        end
        randomly sample B games M_B from M_train
        // Evaluate the task identifier
        calculate z_µ with q_θ
        // Exploration
        execute "look" and get o_0
        for t = 1 : T^exp do
            a_t ← p_φ(a_t | g, τ_t)
            compose τ_t by adding a_t and o_t
            evaluate M_B with a_t and get o_{t+1}
            z_µ′ ← p_φ(z_µ′ | g, τ_t)
            calculate (4) and update
        end
        // Execution
        execute "look" and get o_0
        for t = 1 : T^exec do
            a_t ← π_ψ(a_t | z_µ, g, τ_t)
            compose τ_t by adding a_t and o_t
            get demonstrations from M_B
            calculate the likelihood of a_t using the demonstrations and update
            if done then
                break
            end
        end
        i ← i + B
    end
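The sketch below spells out the exploration-side quantities used in this procedure, namely the task embedding loss L_µ, the shaped reward R_new, and the combined objective of Eq. (4). It is an illustrative fragment under the stated assumptions; the DQN loss depends on the Q-network and replay buffer and is passed in rather than implemented.

```python
import torch

def task_embedding_loss(z_mu: torch.Tensor, z_mu_hat: torch.Tensor) -> torch.Tensor:
    """L_mu: squared L2 distance between the supervised task feature z_mu and its estimate z_mu'."""
    return torch.sum((z_mu - z_mu_hat) ** 2, dim=-1).mean()

def shaped_reward(r_old: float, z_mu: torch.Tensor, z_mu_hat: torch.Tensor) -> float:
    """R_new = 0.5 * R_old + 0.5 * ||z_mu - z_mu'||_2^2, the dense per-step reward used for DQN."""
    return 0.5 * r_old + 0.5 * torch.sum((z_mu - z_mu_hat) ** 2).item()

def exploration_loss(z_mu: torch.Tensor, z_mu_hat: torch.Tensor, dqn_loss: torch.Tensor) -> torch.Tensor:
    """Eq. (4): L = L_mu + L_dqn, the multi-task objective for the exploration policy."""
    return task_embedding_loss(z_mu, z_mu_hat) + dqn_loss
```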
5 Experiments

To demonstrate the generalization ability of our meta-reinforcement learning algorithm across tasks, we conducted a set of experiments on the ALFWorld platform (Shridhar et al., 2021). Text environments of ALFWorld are aligned with 3D simulated environments from ALFRED (Shridhar et al., 2020), which makes ALFWorld a good proxy for our human-robot interaction scenario.

5.1 Dataset

The ALFWorld dataset (Shridhar et al., 2021) contains six task types: pick & place, examine in light, clean & place, heat & place, cool & place, and pick two & place. While all the task types require some basic common sub-tasks, such as finding an object, picking it up, and placing it at a particular place, some task types require more complex interactions with certain objects (e.g., heating an object with a heat source). Each task type contains a set of training environments and two sets of test environments. The first test set (seen) contains environments that are different from, but sampled from the same game distributions as, the training set (e.g., the same rooms but with different scene layouts). The second test set (unseen) contains environments that do not appear in the training set (i.e., unseen rooms with different receptacles and scene layouts). The statistics of the dataset are shown in Table 1. The task types pick & place and pick two & place have more training environments than the others. Our generalization goal is to train a text agent on the training set of all tasks simultaneously, such that during testing, given any task type, the agent has good performance on both seen and unseen environments, i.e., the agent masters all six task types and generalizes well to both seen and unseen environments.

        task type          train   seen   unseen
        pick & place         790     35       24
        examine in light     308     13       18
        clean & place        650     27       31
        heat & place         459     16       23
        cool & place         533     25       21
        pick two & place     813     24       17
        all tasks           3553    140      134

Table 1: The statistics of the ALFWorld dataset.

5.2 Baseline and Implementation Details

We compare our method (denoted as Ours) with the state-of-the-art generation-based agent (denoted as ALFWorld). Transfer learning is another way to improve the generalization ability of an agent, but it usually considers transferring knowledge from a source task to a target task without the setting of multiple tasks (Zhuang et al., 2021); we leave it as a future direction to investigate. We adopt the implementation of ALFWorld from the original paper (Shridhar et al., 2021) and use their pre-trained model for conducting all comparison experiments. For the hyper-parameters in Algorithm 1, T^exp is set to 10 empirically, and M_step = 500,000 (50K), B = 10, and T^exec = 50 are kept as the default values of ALFWorld. The trajectory length K is set to 3 empirically. Following ALFWorld (Shridhar et al., 2021), we use beam search with width 10 for decoding. We ran all experiments on a server with an Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz, 32G memory, an Nvidia 2080Ti GPU, and Ubuntu 16.04.

5.3 Evaluation Metric

We use success rate as the evaluation metric for our experiment. In particular, for |M_test| text games being evaluated, if an agent can finish S tasks, then the success rate of the agent is sr = S / |M_test|. Similar to (Shridhar et al., 2021), we evaluate three times on the testing data and report averaged scores.

                             ALFWorld           Ours
        task type          seen   unseen    seen   unseen
        pick & place       46.7     34.7    51.4     50.0
        examine in light   25.7     22.2    38.5     22.2
        clean & place      44.4     39.8    48.1     54.8
        heat & place       58.3     44.9    50.0     56.5
        cool & place       38.7     47.6    44.0     76.2
        pick two & place   23.6     27.4    12.5     23.5
        all tasks          39.3     37.6    41.4     49.3

Table 2: Experiment results of the generalization ability on each individual task type and the union of them.

5.4 Results and Analysis

We show the performance of our model on both seen and unseen test sets in Table 2, compared with numbers computed using the code and model checkpoint provided by ALFWorld (Shridhar et al., 2021). We observe that in most experiment settings, our method outperforms ALFWorld. This is especially obvious in the unseen setting, where the testing environments contain unseen rooms with different receptacles and scene layouts; there, our method outperforms ALFWorld by a significant margin. This suggests that the task-specific features generated by our agent indeed enable the agent to learn from a wide spectrum of task types. The larger performance gap between our method and ALFWorld on the unseen test set (e.g., 49.3 vs. 37.6 when testing on the union of all task types) further suggests that the task-specific features generated by our method are useful when tackling completely unfamiliar environments.

On the other hand, we observe that our method's performance on the pick two & place tasks is lower than ALFWorld's. As mentioned in (Shridhar et al., 2021), the pick two & place task type is unique and considerably more difficult compared to other tasks, in the sense that it is the only task type which requires an agent to grasp and operate more than one object. Intuitively, this aligns with the common sense that a person who has learned to ride all kinds of bicycles can easily ride a new bicycle, but does not necessarily know how to drive a car. We suspect that the decrease in performance may be caused by the agent overfitting to the majority of training data, in which only a single object is picked up.
Namely, the current method can only work well on scenarios where a text-based game has the same difficulty level as the majority of training games, and it is still hard for it to generalize to tasks with a higher difficulty level. As a future direction, we plan to investigate the explainability of why an end-to-end trained agent works on certain tasks through counterfactuals (Pearl and Mackenzie, 2018), and to improve our method to specifically tackle problems where a certain dimension of the task representation is significantly different from, and unbalanced in, the majority of training data.

Finally, compared to a dedicated model trained specifically on one task type (Table 2 left in (Shridhar et al., 2021)), the performance of our method is generally 10% ∼ 20% behind, and there is still a lot of room for improvement to achieve human-level intelligence. However, our method shows that learning task-specific features through meta-reinforcement learning helps an agent generalize across a wide spectrum of task types, which is vital for real-world applications of human-robot interaction.

5.5 Discussion

To investigate whether different task types help improve each other's performance, we experimented with a setting where an agent is trained on the six task types separately with our method. The results are shown in Table 3. Compared to the setting where the agent is trained on the union of all task types (Table 2), the performance shows a significant drop on most of the task types. This trend is especially clear on the pick two & place tasks: when trained solely on this type of tasks, our agent produces a zero success rate. This suggests that for a meta-reinforcement learning based method like ours, it is essential to have a diverse set of task types as well as a large enough training dataset.

        task type          seen   unseen
        pick & place       57.1     25.0
        examine in light   23.1     11.1
        clean & place      51.9     58.1
        heat & place       31.3     30.4
        cool & place       12.0      9.5
        pick two & place    0.0      0.0

Table 3: Testing results of training a separate agent on each of the six task types.

6 Conclusion

We study the generalization issue of text-based games, and develop a meta-reinforcement learning method with a learning-to-explore approach. In particular, we first use an exploration policy network to learn a task-specific feature vector, and then use this feature vector to help another execution policy network adapt to a new task. To train the exploration and execution policy networks, we use a task identifier to embed a task index, and maximize the likelihood of the execution policy network end-to-end. To demonstrate the generalization ability of our method, we conducted a set of experiments on the publicly available testbed ALFWorld. In general, we find that our method has better generalization performance on a wide spectrum of task types and environments. We leave the investigation of explainability, the imbalance problem of task types, and the training speed as future research directions.

Acknowledgments

The authors thank Mohit Shridhar for clarifying the experiment section in ALFWorld, Xingdi (Eric) Yuan and Marc-Alexandre Côté for their insightful discussions and editing of the initial version of this paper, as well as constructive suggestions from all anonymous reviewers. Zhenjie Zhao is supported by the Startup Foundation for Introducing Talent of NUIST. This work is also supported by the Hong Kong General Research Fund (GRF) with grant No. 16204420.

References

Ashutosh Adhikari, Xingdi Yuan, Marc-Alexandre Côté, Mikuláš Zelinka, Marc-Antoine Rondeau, Romain Laroche, Pascal Poupart, Jian Tang, Adam Trischler, and Will Hamilton. 2020. Learning dynamic belief graphs to generalize on text-based games. In Advances in Neural Information Processing Systems, volume 33, pages 3045–3057. Curran Associates, Inc.

Leonard Adolphs and Thomas Hofmann. 2020. LeDeepChef: Deep reinforcement learning agent for families of text-based games. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7342–7349.

Prithviraj Ammanabrolu and Mark Riedl. 2019. Playing text-adventure games with graph-based deep reinforcement learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3557–3565, Minneapolis, Minnesota. Association for Computational Linguistics.
Yonatan Bisk, K. Shih, Yejin Choi, and Daniel Marcu. 2018. Learning interpretable spatial operations in a rich 3D blocks world. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).

Valts Blukis, Ross A. Knepper, and Yoav Artzi. 2020. Few-shot object grounding and mapping for natural language robot instruction following. In Proceedings of the Conference on Robot Learning.

Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, J. Moore, Matthew J. Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. 2018. TextWorld: A learning environment for text-based games. In Computer Games Workshop at ICML/IJCAI 2018.

Matt Deitke, Winson Han, Alvaro Herrasti, Aniruddha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, Luca Weihs, Mark Yatskar, and Ali Farhadi. 2020. RoboTHOR: An open simulation-to-real embodied AI platform. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. 2016. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv:1611.02779.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135. PMLR.

Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. 2016. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1621–1630, Berlin, Germany. Association for Computational Linguistics.

Infocom. 1980. Zork I. http://ifdb.tads.org/viewgame?id=0dbnusxunq7fw5ro.

Michael Janner, Karthik Narasimhan, and Regina Barzilay. 2018. Representation learning for grounded spatial reasoning. Transactions of the Association for Computational Linguistics, 6:49–61.

Evan Zheran Liu, Aditi Raghunathan, Percy Liang, and Chelsea Finn. 2020. Explore then execute: Adapting without rewards via factorized meta-reinforcement learning. arXiv:2008.02790.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Keerthiram Murugesan, Mattia Atzeni, Pavan Kapanipathi, Pushkar Shukla, Sadhana Kumaravel, Gerald Tesauro, Kartik Talamadupula, Mrinmaya Sachan, and Murray Campbell. 2021. Text-based RL agents with commonsense knowledge: New challenges, environments and baselines. Proceedings of the AAAI Conference on Artificial Intelligence, 35(10):9018–9027.

Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. 2015. Language understanding for text-based games using deep reinforcement learning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1–11, Lisbon, Portugal. Association for Computational Linguistics.

Judea Pearl and Dana Mackenzie. 2018. The book of why. Basic Books, New York.

Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. 2019. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5331–5340. PMLR.

Stephane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 627–635, Fort Lauderdale, FL, USA. PMLR.

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. 2019. Habitat: A platform for embodied AI research. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9338–9346.

Matthias Scheutz, Rehj Cantrell, and Paul Schermerhorn. 2011. Toward humanlike task-based dialogue processing for human robot interaction. AI Magazine, 32(4):77–84.

Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. 2020. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10740–10749.
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2021. ALFWorld: Aligning text and embodied environments for interactive learning. In Proceedings of the International Conference on Learning Representations (ICLR).

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.

Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement learning: An introduction. MIT Press.

Sebastian Thrun and Lorien Pratt, editors. 1998. Learning to learn. Kluwer Academic Publishers.

Ricardo Vilalta and Youssef Drissi. 2002. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77–95.

Mitchell Wortsman, Kiana Ehsani, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. Learning to learn how to learn: Self-adaptive visual navigation using meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6750–6759.

Fei Xia, Amir R. Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. 2018. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9068–9079.

Shunyu Yao, Rohan Rao, Matthew Hausknecht, and Karthik Narasimhan. 2020. Keep CALM and explore: Language models for action generation in text-based games. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8736–8754, Online. Association for Computational Linguistics.

Adams Wei Yu, David Dohan, Quoc Le, Thang Luong, Rui Zhao, and Kai Chen. 2018. Fast and accurate reading comprehension by combining self-attention and convolution. In International Conference on Learning Representations.

Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. 2021. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76.