Mutual Information Based Knowledge Transfer Under State-Action Dimension Mismatch
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Mutual Information Based Knowledge Transfer Under State-Action
Dimension Mismatch
Michael Wan Tanmay Gangwani Jian Peng
Computer Science Dept. Computer Science Dept. Computer Science Dept.
UIUC UIUC UIUC
mw3@illinois.edu gangwan2@illinois.edu jianpeng@illinois.edu
arXiv:2006.07041v1 [stat.ML] 12 Jun 2020
Abstract 1 INTRODUCTION
Deep reinforcement learning (RL), which combines the
Deep reinforcement learning (RL) algorithms rigor of RL algorithms with the flexibility of universal
have achieved great success on a wide variety function approximators such as deep neural networks,
of sequential decision-making tasks. However, has demonstrated a plethora of success stories in recent
many of these algorithms suffer from high times. These include computer and board games (Mnih
sample complexity when learning from scratch et al., 2015; Silver et al., 2016), continuous control (Lilli-
using environmental rewards, due to issues crap et al., 2015), and robotics (Rajeswaran et al., 2017),
such as credit-assignment and high-variance to name a few. Crucially though, these methods have
gradients, among others. Transfer learning, in been shown to be performant in the regime where an
which knowledge gained on a source task is agent can accumulate vast amounts of experience in the
applied to more efficiently learn a different but environment, usually modeled with a simulator. For real-
related target task, is a promising approach to world environments such as autonomous navigation and
improve the sample complexity in RL. Prior industrial processes, data generation is an expensive (and
work has considered using pre-trained teacher sometimes risky) procedure. To make deep RL algo-
policies to enhance the learning of the stu- rithms more sample-efficient, there is great interest in
dent policy, albeit with the constraint that the designing techniques for knowledge transfer, which en-
teacher and the student MDPs share the state- ables accelerating agent learning by leveraging either ex-
space or the action-space. In this paper, we isting trained policies (referred to as teachers), or using
propose a new framework for transfer learn- task demonstrations for imitation learning (Abbeel & Ng,
ing where the teacher and the student can have 2004). One promising idea for knowledge transfer in RL
arbitrarily different state- and action-spaces. is policy distillation (Rusu et al., 2015; Parisotto et al.,
To handle this mismatch, we produce embed- 2015; Hinton et al., 2015), where information from the
dings which can systematically extract knowl- teacher policy network is transferred to a student policy
edge from the teacher policy and value net- network to improve the learning process.
works, and blend it into the student networks. Prior work has incorporated policy distillation in a va-
To train the embeddings, we use a task-aligned riety of settings (Czarnecki et al., 2019). Some ex-
loss and show that the representations could amples include the transfer of knowledge from simple
be enriched further by adding a mutual in- to complex agents while following a curriculum over
formation loss. Using a set of challenging agents (Czarnecki et al., 2018), learning a centralized
simulated robotic locomotion tasks involving policy that captures shared behavior across tasks for
many-legged centipedes, we demonstrate suc- multi-task RL (Teh et al., 2017), distilling information
cessful transfer learning in situations when the from parent policies into a child policy for a genetically-
teacher and student have different state- and inspired RL algorithm (Gangwani & Peng, 2017), and
action-spaces. speeding-up large-scale population-based training using
multiple teachers (Schmitt et al., 2018). A common mo-
tif in these approaches is the use of Kullback-Leibler
(KL) divergence between the state-conditional action
Proceedings of the 36th Conference on Uncertainty in Artificial
Intelligence (UAI), PMLR volume 124, 2020.distributions of the teacher and student networks, as the embeddings must be aligned to serve that goal. Secondly,
minimization objective for knowledge transfer. While we would like the embeddings to be correlated with the
simple and intuitive, this restricts learning from teach- states encountered by the student policy. The embed-
ers that have the same output (action) space as the stu- dings are used to deterministically draw out knowledge
dent, since KL divergence is only defined for distribution from the teacher network. Therefore, a high correlation
over a common space. An alternative to knowledge shar- ensures that the most suitable teacher guidance is derived
ing in the action-space is information transfer through for each student state. We achieve this by maximizing
the embedding-space formed via the different layers of a the mutual information between the embeddings and stu-
deep neural network. (Liu et al., 2019) provides an ex- dent states. We evaluate our method on a set of challeng-
ample of this; it utilizes learned lateral connections be- ing robotic locomotion tasks modeled using the MuJoCo
tween intermediate layers of the teacher and student net- simulator. We demonstrate the successful transfer of
works. Although the action-spaces can now be different, knowledge from trained teachers to students, in the sce-
the state-space is still required to be identical between the nario of mismatched state- and action-space. This leads
teacher and the student, since the same input observation to appreciable gains in sample-efficiency, compared to
is fed to both the networks (Liu et al., 2019). RL from scratch using only the environmental rewards.
In our work, we present a transfer learning approach to
accelerate the training of the student policy, by leverag- 2 BACKGROUND
ing teacher policies trained in an environment with differ-
ent state- and action-space. Arguably, there is a huge po- We consider the RL setting where the environment is
tential for data-efficient student learning by tapping into modeled as an infinite-horizon discrete-time Markov De-
teachers trained on dissimilar, but related tasks. For in- cision Process (MDP). The MDP is characterized by the
stance, consider an available teacher policy for locomo- tuple (S, A, R, T , γ, p0 ), where S and A are the con-
tion of a quadruped robot, where the (97-dimensional) tinuous state- and action-space, respectively, γ ∈ [0, 1)
state-space is the set of joint-angles and joint-velocities is the discount factor, and p0 is the initial state distri-
and the (10-dimensional) action-space is the torques to bution. Given an action at ∈ A, the next state is sam-
the joints. If we wish to learn locomotion for a hexa- pled from the transition dynamics distribution, st+1 ∼
pod robot (state-dimension 139, action-dimension 16), T (st+1 |st , at ), and the agent receives a scalar reward
we conjecture that the learning could be kick-started by r(st , at ) determined by the reward function R. A policy
harnessing the information stored in the trained neural πθ (at |st ) defines the state-conditioned distribution over
network for the quadruped, since both the tasks are lo- actions. The RL objective is to learn the policy parame-
comotion for legged robots and therefore share an inher- ters (θ) to maximize theexpected
P∞ t discounted sum of re-
ent structure. However, the dissimilar state- and action- wards, η(πθ ) = Ep0 ,T ,π t=0 γ r(st , at ) .
space preclude the use of the knowledge transfer mecha-
Policy-gradient algorithms (Sutton et al., 2000) are
nisms proposed in prior work.
widely used to estimate the gradient of the RL objec-
Our approach deals with the mismatch in the state- and tive. Proximal policy optimization (PPO, Schulman et al.
action-space of the teacher and student in the following (2017)) is a model-free policy-gradient algorithm that
manner. To handle disparate actions, rather than using serves as an efficient approximation to trust-region meth-
divergence minimization in the action-space, we transfer ods (Schulman et al., 2015a). In each iteration of PPO,
knowledge by augmenting representations in the layers the rollout policy (πθold ) is used to collect sample trajec-
of the student network with representations from the lay- tories τ and the following surrogate loss is minimized
ers of the teacher network. This is similar to the knowl- over multiple epochs:
edge flow in (Liu et al., 2019) using lateral connections, h i
but with the important difference that we do not employ LθPPO = −Eτ min rt (θ) Ât , clip (rt (θ) , 1 − , 1 + ) Ât
learnable matrices to transform the teacher representa-
tion. The mismatch in the observation- or state-space where rt (θ) = ππθθ (a(at |s t)
t |st )
is the ratio of the action prob-
old
has not been considered in prior literature, to the best abilities under the current policy and rollout policy, and
of our knowledge. We manage this by learning an em- Ât is the estimated advantage. Variance in the policy-
bedding space which can be used to extract the neces- gradient estimates is reduced by employing the state-
sary information from the available teacher policy net- value function as a control variate (Mnih et al., 2016).
work. These embeddings are trained to adhere to two This is usually modeled as a neural network Vψ and up-
properties. Firstly, they must be task-aligned. Our RL dated using temporal difference learning:
objective is the maximization of cumulative discounted h i
rewards in the student environment, and therefore, the targ 2
Lψ
PPO = −E τ V ψ (st ) − V twhere Vttarg is the bootstrapped target value obtained with tween the input states of the target MDP and the embed-
TD(λ). To further reduce variance, Generalized Advan- ding vectors produced from them. The embeddings are
tage Estimation (GAE, Schulman et al. (2015b)) is used used to deterministically derive representations from the
when estimating advantage. The overall PPO minimiza- teacher network, and hence a high correlation helps to
tion objective then is: obtain the most appropriate teacher guidance for each of
the states encountered by the target policy. To this end,
LPPO (θ, ψ) = LθPPO + Lψ
PPO (1) we propose a mutual information maximization objec-
tive; this is detailed in subsection 3.2.
Although we use the PPO objective for our experiments,
our method can be readily combined with any on-policy
or off-policy actor-critic RL algorithm. 3.1 TASK-ALIGNED EMBEDDING SPACE
This section describes our approach for training the en-
3 METHOD
coder parameters (φ) such that the generated embeddings
are aligned with the RL objective. We begin by detail-
In this section, we outline our method for distilling
ing the architecture that we use for transfer of knowledge
knowledge from a pre-trained teacher policy to a stu-
from a teacher, pre-trained in source MDP, to a student
dent policy, in the hope that such knowledge sharing im-
policy in the target MDP with different state- and action-
proves the sample-efficiency of the student learning pro-
space. Inspired by the concept of knowledge-flow used
cess. Our problem setting is as follows. We assume that
in (Liu et al., 2019), we employ lateral connections be-
the teacher and the student policies operate in two differ-
tween the student and teacher networks, which augment
ent MDPs. All the MDP properties (S, A, R, T , γ, p0 )
the representations in the layers of the student with useful
could be different, provided that some high-level struc-
representations from the layers of the teacher. A crucial
tural commonality exists between the MDPs, such as the
benefit of this approach is that since information sharing
example of transfer from a quadruped robot to hexapod
happens through the hidden layers, the output (action)
robot introduced in Section 1. Henceforth, for notational
space of the source and target MDPs can be disparate,
convenience, we refer to the MDP of the teacher as the
as is the scenario in our experiments. It is also quite
source MDP, and that of the student as the target MDP.
straightforward to include multiple teachers in this ar-
We assume the availability of a teacher policy network
chitecture to distill diverse knowledge into a student; we
pre-trained in the source MDP. Crucially though, we do
leave this to future work.
not assume access to the source MDP for any further ex-
ploration, or for obtaining demonstration trajectories that We draw out knowledge from both the teacher policy and
could be used for training in the target MDP using cross- state-value networks. We denote the teacher policy and
domain imitation-learning techniques. We instead focus value network with πθ0 and Vψ0 , respectively, where the
on extracting representations from the teacher policy net- parameters (θ0 , ψ 0 ) are held fixed throughout the train-
work which are useful for learning in the target MDP. ing. Analogously, (θ, ψ) are the trainable parameters for
the student policy and value networks. Let Nπ denote
In this work, we address knowledge transfer when Ssrc 6=
the number of hidden layers in the teacher (and student)
Starg , where Ssrc and Starg denote the state-space of the
policy network, and NV be the number of hidden layers
source and target MDPs, respectively. To handle the
in the teacher (and student) value network. In general,
mismatch, we introduce a learned embedding-space pa-
the teacher and student networks could have a different
rameterized by an encoder function φ(·), and defined as
number of layers, but we assume them to be the same for
Semb := {φ(s) | s ∈ Starg }. Data points from this embed-
ease of exposition.
ding space are used to extract useful information from
the teacher policy network. Therefore, we further en- In the target MDP, the student policy observes a state
force that the dimension of the embedding space matches starg ∈ Starg , which is fed to the encoder to produce
the dimension of the state-space in the source MDP, i.e., the embedding φ(starg ) ∈ Semb . Since |Semb | = |Ssrc |,
|Semb | = |Ssrc |. Note that this does not necessitate that this embedding can be readily passed through the teacher
any embedding vector s ∈ Semb be a feasible input state networks to extract {zθj0 , 1 ≤ j ≤ Nπ }, representing
in the source MDP. To learn the encoder function φ(·), the pre-activation outputs of the Nπ hidden layers of the
we consider the following two desiderata. Firstly, the teacher policy network, and {zψj 0 , 1 ≤ j ≤ NV }, rep-
embeddings must be learned to facilitate our objective of resenting the pre-activation outputs of the NV hidden
maximizing the cumulative discount rewards in the tar- layers of the teacher value function network. To ob-
get MDP. In subsection 3.1, we show how to achieve this tain the pre-activation representations in the student net-
by utilizing the policy gradient to update embedding pa- works, we feed in the state starg and perform a weighted
rameters. Secondly, we wish for a high correlation be- linear combination of the appropriate outputs with thecorresponding pre-activations from the teacher networks. be different at different input states. To aid with this,
Concretely, to obtain the hidden layer outputs hjπθ and we utilize a surrogate objective that instead maximizes
hjVψ at layer j in the student networks, we have the fol- the correlation between starg and the embeddings φ(starg ),
lowing: defined using the principle of mutual information (MI).
If we view starg as a stochastic input s, the encoder output
hjπθ = σ pjθ zθj + (1 − pjθ )zθj0 is then also a random variable e, and the mutual informa-
(2) tion between the two is defined as:
hjVψ = σ pjψ zψj + (1 − pjψ )zψj 0
I(s; e) = H(s) − H(s|e)
where σ is the activation function, and pjθ , pjψ
∈ [0, 1]
are layer-specific learnable parameters denoting the mix- where H denotes the differential entropy. Direct maxi-
ing weights. In the target MDP, the student network is mizing of the MI is intractable due to the unknown condi-
optimized for the RL objective LPPO (θ, ψ), mentioned in tional densities. However, it is possible to obtain a lower
Equation 1. The outputs of the student policy and value bound to the MI using a variational distribution qω (s|e)
networks, and hence LPPO , depend on the encoder pa- that approximates the true conditional distribution p(s|e)
rameters (φ) through the representation sharing (Equa- as follows:
tion 2) enabled by the lateral connections stemming from I(s; e) = H(s) − H(s|e)
the pre-trained teacher network. Therefore, an intuitive
= H(s) + Es,e [log p(s|e)]
objective for shaping the embeddings such that they be-
come task-aligned is to optimize them using the original = H(s) + Es,e [log qω (s|e)]
RL loss gradient: φ ← φ − α∇φ LPPO (θ, ψ, φ, θ0 , ψ 0 ).
+ Ee DKL (p(s|e)||qω (s|e))
Note that LPPO (·) now also depends on the fixed teacher ≥ H(s) + Es,e [log qω (s|e)]
parameters (θ0 , ψ 0 ).
where the last inequality is due to the non-negativity of
The learnable mixing weights pjθ , pjψ ∈ [0, 1] control the
the KL divergence. This is known as the variational in-
influence of the teacher’s representation on the student
formation maximization algorithm (Agakov & Barber,
outputs – higher the value, lesser the impact. We ar-
2004). Re-writing in terms of target-MDP states and the
gue that a low value for these coefficients helps in the
encoder parameters, the surrogate objective jointly opti-
early phases of the training process by providing nec-
mizes over the variational and encoder parameters:
essary information to kick-start learning. At the end of
the training, however, we desire that the student becomes max Estarg [log qω (starg |φ(starg ))]
completely independent of the teacher, since this helps ω,φ
in faster test-time deployment of the agent. To encour-
where H(s) is omitted since it is a constant w.r.t the con-
age this, we introduce additional coupling-loss terms that
cerned parameters. In terms of the loss function to mini-
drive pjθ , pjψ towards 1 as the training progresses:
mize, we can succinctly write:
Nπ NV
1 X 1 X
j
LMI (φ, ω) = −Es∼ρπθ [log qω (s|φ(s))] (4)
Lcoupling =− log pθ − log pjψ (3)
Nπ j=1 NV j=1
where ρπθ is the state-visitation distribution of the stu-
Experimentally, we observe that although the student be- dent policy in the target MDP. In our experiments, we use
comes independent in the final stages of training, it is a multivariate Gaussian distribution (with a learned diag-
able to achieve the same level of performance that it onal covariance matrix) to model the variational distribu-
would if it could still rely on the teacher. tion qω . Although this simple model yields good perfor-
mance, more expressive model classes, such as mixture
3.2 ENRICHED EMBEDDINGS WITH MUTUAL density networks and flow-based models (Rezende &
INFORMATION MAXIMIZATION Mohamed, 2015) could be readily incorporated as well,
to learn complex and multi-modal distributions.
As outlined in the previous section, at each timestep of
the discrete-time target MDP, the representation distilled 3.3 OVERALL ALGORITHM
from the teacher networks is a fixed function f of the
embedding vector generated from the current input state: Figure 1 shows the schematic diagram of our complete
f (θ0 , ψ 0 , φ(starg )), where (θ0 , ψ 0 ) are fixed. It is desirable architecture and gradient flows, along with a description
to have a high degree of correlation between starg and of the implemented neural networks. We refer to our al-
f (θ0 , ψ 0 , φ(starg )) because, intuitively, the teacher repre- gorithm as MIKT, for Mutual Information based Knowl-
sentation that is the most useful for the student should edge Transfer. Algorithm 1 outlines the main steps of thePolicy Value-fn
Branch Branch
Val e
Ac i
(F d)
+ +
Figure 1: Schematic diagram of our complete architecture (best viewed in color). The encoder parameters φ (blue)
receive gradients from three sources: the policy-gradient loss LθPPO , the value function loss Lψ PPO , and the mutual
information loss LMI . The teacher networks (yellow) remain fixed throughout training and do not receive any gradients.
In the student networks (green), the pre-activation representations are linearly combined (using learnt mixing weights)
with the corresponding representations from the teacher (Equation 2). Note that this knowledge-flow occurs at all
layers, although we show it only once for clarity of exposition.
Algorithm 1: Mutual Information based Knowledge experience is then used to compute the RL loss (Equa-
Transfer (MIKT) tion 1) and the mutual information loss (Equation 4), en-
Input : θ0 , ψ 0 fixed teacher policy and value networks abling the calculation of gradients for the different pa-
θ, ψ: student policy and value networks rameters (Lines 3–6). Using both the losses to update
{p}: set of coupling parameters for policy and value the encoder (φ) helps us to satisfy the desiderata on the
networks embeddings – that they should be task-aligned and corre-
φ: encoder parameters lated with the states in the target MDP. The coupling pa-
ω: variational distribution parameters rameters {pj }, used for the weighted combination of the
representations in the teacher and student networks, are
for each iteration do updated with the coupling-loss (Equation 3) along with
1 Run πθ in target MDP and collect few trajectories τ the RL loss. In each iteration of the algorithm, the PPO
2 for each minibatch m ∈ τ do update ensures that the state-action visitation distribution
3 Update θ, ψ with ∇θ,ψ LPPO (θ, ψ, φ, θ0 , ψ 0 ) of the policy πθ is modified by only a small amount. This
4 Update φ with is because of the clipping on the importance-sampling ra-
∇φ LMI (φ, ω) + LPPO (θ, ψ, φ, θ0 , ψ 0 ) tio (Section 2) when obtaining the PPO gradient. In ad-
5 Update ω with ∇ω LMI (φ, ω) dition to this, we experimentally found that enforcing an
6 Update {p} using [Lcoupling + LPPO ] explicit KL-regularization on the policy further stabilizes
7 end learning. Let πθ , πθold denote the current and the rollout
8 end policy, respectively. The loss is then formalized as:
LKL (θ, θold ) = Es∼ρπθ DKL (πθ (·|s)||πθold (·|s))
old
training procedure. In each iteration, we run the policy
in the target MDP and collect a batch of trajectories. This(a) CentipedeFour (b) CentipedeSix (c) CentipedeEight (d) CpCentipedeSix (e) CpCentipedeEight (f) Ant
Figure 2: Our MuJoCo locomotion environments. The centipede agents are configured using the details in (Wang et al., 2018),
while Ant-v2 is a popular OpenAI Gym task.
4 RELATED WORK Different from these, our approach handles the mismatch
in the state-space by training an embedding space which
is utilized for efficient knowledge transfer. Rozantsev
The concepts of knowledge transfer and information et al. (2018) employ layer-wise weight regularization and
sharing between deep neural networks have been ex- evaluate on (un-)supervised tasks where the input dis-
tensively researched for a wide variety of tasks in ma- tributions for source and target domains have semantic
chine learning. In the context of reinforcement learning, similarity and are static. For RL tasks, the input distribu-
the popular paradigms for knowledge transfer include tions change dynamically as the student policy updates;
imitation-learning, meta-RL, and policy distillation; each it is unclear if enforcing similarity between the networks
of these being applicable under different settings and for all inputs by coupling the weights is ideal. Gamrian
assumptions. Imitation learning algorithms (Ng et al., & Goldberg (2018) use GANs to learn a mapping from
2000; Ziebart et al., 2008) utilize teacher demonstrations target states to source states. In addition to requiring that
to extract useful information (such as the teacher reward the source and the target domains have the same action-
function in inverse-RL methods) and use that to acceler- space, their method also relies on the exploratory sam-
ate student learning. In meta-RL approaches (Duan et al., ples collected in the source MDP for training the GAN.
2016; Finn et al., 2017), we are generally provided with a In contrast, we handle the action-space mismatch and do
distribution of tasks that share some structural similarity, not assume access to the source MDP for exploration.
and the objective is to discover this generalizable knowl-
edge for accelerating the process of learning on a new Our work also has connections to policy distillation
task. Our work is most closely related to policy distil- methods that use implicit teachers, rather than external
lation methods (Rusu et al., 2015; Parisotto et al., 2015; pre-trained models. In Czarnecki et al. (2018), the au-
Czarnecki et al., 2019), where pre-trained teacher net- thors recommend a curriculum over agents, rather than
works are available and can expedite learning in dissim- the usual curriculum over tasks. Such a curriculum
ilar (but related) student tasks. trains simple agents first, the knowledge of which is then
distilled into more complex agents over time. Akkaya
Prior work has considered teachers in various capac- et al. (2019) iterate on policy architectures by utiliz-
ities. Rusu et al. (2016) and Liu et al. (2019) utilize ing behavior-cloning with DAgger; the new architecture
learned cross-connections between intermediate layers (student) is trained using the old architecture (teacher).
of teacher networks—that have been pre-trained on var- Distillation has been used in multi-task RL (Teh et al.,
ious source tasks—and a student network to effectively 2017) to learn a centralized policy that captures gen-
transfer knowledge and enable more efficient learning on eralizable information from policies trained on individ-
a target task. Ahn et al. (2019) use an objective based on ual tasks. (Gangwani & Peng, 2017) combine ideas from
the mutual information between the corresponding lay- the genetic-algorithms literature and distillation to train
ers of teacher and student networks, and show gains in offspring policies that inherit the best traits of both
image classification tasks. In Hinton et al. (2015), in- the parent policies. Since all these approaches trans-
formation from a large model (teacher) is compressed fer information in the action-space by minimizing the
into a smaller model (student) using a distillation process KL-divergence between state-conditional action distribu-
that uses the temperature-regulated softmax outputs from tions, they share the limitation that the student can only
the teacher as targets to train the student. Schmitt et al. leverage a teacher with the same output (action) space.
(2018) propose a large-scale population-based training Our approach avoids this by using the representations in
pipeline that allows a student policy to leverage multiple the different layers of the neural network for knowledge
teachers specialized in different tasks. All these afore- sharing, enabling transfer-learning in many diverse sce-
mentioned methods work in the setting where the teacher narios as shown in our experiments.
and student share the input state (observation) space.CentipedeFour to CentipedeEight CentipedeSix to CentipedeEight CentipedeFour to CpCentipedeSix
3500 3500
2500 3000 3000
2000 2500 2500
2000 2000
1500
Reward
Reward
Reward
1500 1500
1000
1000 1000
500
500 500
MIKT MIKT MIKT
0 MLPP MLPP MLPP
0
VPG VPG 0 VPG
0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0
Timesteps 1e6 Timesteps 1e6 Timesteps 1e6
(a) (b) (c)
CentipedeSix to CpCentipedeEight CentipedeFour to Ant CentipedeSix to Ant
3000 3000
3000
2500 2500
2500
2000 2000
2000
Reward
Reward
Reward
1500 1500
1500
1000 1000
1000
500 500
500
MIKT 0 MIKT MIKT
0
MLPP MLPP MLPP
0 VPG VPG VPG
−500 −500
0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0
Timesteps 1e6 Timesteps 1e6 Timesteps 1e6
(d) (e) (f)
Figure 3: Performance of our transfer learning algorithm (MIKT) and the baselines (VPG, MLPP) on the MuJoCo
locomotion tasks. Each plot is titled “x to y”, where x is the source (teacher) MDP and y is the target (student) MDP.
(a) CentipedeFour to CentipedeEight, (b) CentipedeSix to CentipedeEight, (c) CentipedeFour to CpCentipedeSix, (d)
CentipedeSix to CpCentipedeEight, (e) CentipedeFour to Ant, (f) CentipedeSix to Ant.
5 EXPERIMENTS a centipede – it consists of repetitive torso bodies, each
having two legs. Figure 2 shows an illustration of the dif-
In this section, we perform experiments to quantify the ferent centipede agents. Please see (Wang et al., 2018)
efficacy of our algorithm, MIKT, for transfer learning in for a detailed description of the environment generation
RL, and also do some qualitative analysis. We address process. The agent is rewarded for running fast in a par-
the following questions: a) Can we do successful knowl- ticular direction. Table 1 includes the state and action
edge transfer between a teacher and a student with dif- dimensions of all the agents. Centipede-x refers to a
ferent state- and action-space? b) Are both the losses centipede with x legs; we use x ∈ {4, 6, 8}. We use
{LPPO , LMI } important for learning useful embeddings additional environments where the centipede is crippled
φ? c) How does task-similarity affect the benefits that (some legs disabled) and denote this by Cp-Centipede-
can be reaped from MIKT? x. Finally, we include the standard Ant-v2 task from the
MuJoCo suite. Note that all robots have separate state
Table 1: MuJoCo locomotion environments. and action dimensions. Intuitively though, these loco-
motion tasks share an inherent structure that could be ex-
Environment State Dimension Action Dimension ploited for transfer learning between the centipedes of
CentipedeFour 97 10 various types. We now demonstrate that our algorithm
CentipedeSix 139 16 achieves this successfully.
CentipedeEight 181 22
CpCentipedeSix 139 12 Baselines: We compare MIKT with two baselines: a)
CpCentipedeEight 181 18 Vanilla Policy Gradient (VPG), which learns the task in
Ant 111 8 the target MDP from scratch using only the environmen-
tal rewards. Any transfer learning algorithm which ef-
Environments: We evaluate using locomotion tasks fectively leverages the available teacher networks should
for legged robots, modeled in OpenAI Gym (Brock- be able to outperform this baseline that does not receive
man et al., 2016) using the MuJoCo physics simulator. any prior knowledge it can use. We use the standard
Specifically, we use the environments provided by Wang PPO (Schulman et al., 2017) algorithm for this baseline.
et al. (2018), where the agent structure resembles that of b) MLP Pre-trained (MLPP) In our setting, the teacherCentipedeFour to CentipedeEight CentipedeSix to CentipedeEight CentipedeFour to CpCentipedeSix
3500 3500
3000
3000 3000
2500
2500 2500
2000
2000 2000
Reward
Reward
Reward
1500
1500 1500
1000
1000 1000
500 500
MIKT MIKT 500 MIKT
MIKT w/o MI MIKT w/o MI MIKT w/o MI
0 0
MIKT w/o RL gradients MIKT w/o RL gradients 0 MIKT w/o RL gradients
0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0
Timesteps 1e6 Timesteps 1e6 Timesteps 1e6
(a) (b) (c)
CentipedeSix to CpCentipedeEight CentipedeFour to Ant CentipedeSix to Ant
3000 3000
3000
2500 2500
2500
2000 2000
2000
Reward
Reward
Reward
1500 1500
1500
1000 1000
1000
500 500
500
MIKT 0 MIKT 0 MIKT
MIKT w/o MI MIKT w/o MI MIKT w/o MI
0 MIKT w/o RL gradients MIKT w/o RL gradients MIKT w/o RL gradients
−500 −500
0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0
Timesteps 1e6 Timesteps 1e6 Timesteps 1e6
(d) (e) (f)
Figure 4: Ablation on the importance of each of {LPPO , LMI } for training the encoder φ. MIKT (blue) is compared
with two variants: MIKT w/o MI (LMI not used) and MIKT w/o RL gradients (LPPO not used).
and the student networks have dissimilar input and out- trained and randomly initialized parameters of the stu-
put dimensions (because the MDPs have different state- dent networks. This indicates that the MLPP strategy is
and action-spaces). A natural transfer learning strategy not productive for transfer learning across the RL loco-
is to remove the input and output layers from the pre- motion tasks considered. Finally, we note that our al-
trained teacher and replace them with new learnable lay- gorithm (MIKT) vastly outperforms the two baselines,
ers that match the dimensions required of the student pol- both achieving higher returns in earlier stages of train-
icy (analogously value) network. The middle stack of the ing and reaching much higher final performance. This
deep neural network is then fine-tuned with the RL loss. proves that firstly, these tasks do have a structural com-
Prior work has shown that such a transfer is effective in monality such that a teacher policy trained in one task
certain computer vision tasks. could be used advantageously to accelerate learning in a
different task; and secondly, that MIKT is a successful
approach for achieving such a knowledge transfer. This
5.1 EXPERIMENTAL RESULTS
works even when the teacher and student MDPs have dif-
ferent state- and action-spaces, and is realized by learn-
Figure 3 plots the learning curves for MIKT and our two
ing embeddings that are task-aligned and are optimized
baselines in different transfer learning experiments. Each
with a mutual information loss (Algorithm 1).
plot is titled “x to y”, where x is the source (teacher)
MDP and y is the target (student) MDP. We run each
experiment with 5 different random seeds and plot the 5.2 ABLATION STUDIES
average episodic returns (mean and standard deviation)
on the y-axis, against the number of timesteps of envi- Are gradients from both {LPPO , LMI } to the encoder
ronment interaction (2 million total) on the x-axis. VPG beneficial? To quantify this, we experiment with two
does not use utilize the pre-trained teachers. We ob- variants of our algorithm, each of which removes one
serve that its performance improves with the training it- of the components: MIKT w/o MI, which does not up-
erations, albeit at a sluggish pace. MLPP uses the mid- date φ with the mutual information loss proposed in Sec-
dle stack of the pre-trained teacher network as an ini- tion 3.2, and MIKT w/o RL gradients, which omits using
tialization and trains the input and output layers from the policy-gradient and the value function TD-error gra-
scratch. It only performs on par with VPG, potentially dient for the encoder. Figure 4 plots the performance of
due to the non-constructive interaction between the pre- these variants and compares it to MIKT (which includesboth the losses). We note that MIKT w/o MI generally action-spaces. We achieve this by learning an encoder to
struggles to learn in the early stages of training; see for produce embeddings that draw out useful representations
instance Figure 4 (c), (d). MIKT w/o RL gradients does from the teacher networks. We argue that training the en-
comparatively better early on in training, but it is evi- coder with both the RL-loss and the mutual information-
dent that MIKT is the most performant, both in terms of loss yields rich representations; we provide empirical
early training efficiency and the average episodic returns validation for this as well. Our experiments on a set
of the final policy. This supports our design choice of of challenging locomotion tasks involving many-legged
using both {LPPO , LMI } to update the encoder φ. centipedes show that MIKT is a successful approach for
achieving knowledge transfer when the teacher and stu-
How sensitive is MIKT to the task-similarity? It is
dent MDPs have mismatched state- and action-space.
reasonable to assume that the benefits of transfer learning
depend on the task-similarity between the teacher and the
student. To better understand this in the context of our al- References
gorithm, we consider learning in the CentipedeEight en-
Pieter Abbeel and Andrew Y Ng. Apprenticeship learn-
vironment using different types of teachers – Centipede-
ing via inverse reinforcement learning. In Proceed-
Four, CentipedeSix, Hopper. In Figure 5a, we notice
ings of the twenty-first international conference on
that the influence of the Centipede-{Four,Six} teachers
Machine learning, pp. 1, 2004.
is much more significant than the Hopper teacher. This is
likely because the motion of the centipedes shares sim- Felix Agakov and David Barber. Variational information
ilarity, whereas the Hopper (which is trained to hop) is maximization for neural coding. In International Con-
a dissimilar task and therefore less useful for transfer ference on Neural Information Processing, pp. 543–
learning. In Figure 5b we plot the value of the weight 548. Springer, 2004.
on the student representation, when doing a weighed lin- Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D
ear combination with the teacher (Section 3.1). We ob- Lawrence, and Zhenwen Dai. Variational information
serve with the Hopper teacher that, very early in training, distillation for knowledge transfer. In Proceedings of
the student learns to trust its own learned representations the IEEE Conference on Computer Vision and Pattern
rather than incorporate knowledge from the dissimilar Recognition, pp. 9163–9171, 2019.
teacher. Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej,
CentipedeEight with Various Teachers MIKT Normalized Student Weight Mateusz Litwin, Bob McGrew, Arthur Petron, Alex
3500 1.0
Paino, Matthias Plappert, Glenn Powell, Raphael
3000
0.9 Ribas, et al. Solving rubik’s cube with a robot hand.
Normalized Student Weight
2500
arXiv preprint arXiv:1910.07113, 2019.
0.8
2000
Greg Brockman, Vicki Cheung, Ludwig Pettersson,
Reward
1500 0.7
Jonas Schneider, John Schulman, Jie Tang, and Wo-
1000
0.6
jciech Zaremba. Openai gym, 2016.
500
MIKT w/ CentipedeFour MIKT w/ CentipedeFour Teacher
0
MIKT w/ CentipedeSix
MIKT w/ Hopper
0.5 MIKT w/ CentipedeSix Teacher
MIKT w/ Hopper Teacher
Wojciech Marian Czarnecki, Siddhant M Jayakumar,
0.0 0.5 1.0
Timesteps
1.5 2.0
1e6
0.00 0.25 0.50 0.75 1.00 1.25
Timesteps
1.50 1.75 2.00
1e6
Max Jaderberg, Leonard Hasenclever, Yee Whye Teh,
Simon Osindero, Nicolas Heess, and Razvan Pascanu.
(a) (b) Mix&match-agent curricula for reinforcement learn-
ing. arXiv preprint arXiv:1806.01780, 2018.
Figure 5: Training on CentipedeEight with different
teachers. (a) Transfer from a dissimilar teacher (Hop- Wojciech Marian Czarnecki, Razvan Pascanu, Simon
per) is less effective compared to using Centipede teach- Osindero, Siddhant M Jayakumar, Grzegorz Swirszcz,
ers. (b) Value of the weight on the student representa- and Max Jaderberg. Distilling policy distillation.
tion in the weighed linear combination. With the Hopper arXiv preprint arXiv:1902.02186, 2019.
teacher, the value rises sharply in the early stages, indi- Yan Duan, John Schulman, Xi Chen, Peter L Bartlett,
cating a low teacher contribution. Ilya Sutskever, and Pieter Abbeel. Rl2: Fast reinforce-
ment learning via slow reinforcement learning. arXiv
preprint arXiv:1611.02779, 2016.
6 CONCLUSION Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-
agnostic meta-learning for fast adaptation of deep net-
In this paper, we proposed an algorithm for transfer works. In Proceedings of the 34th International Con-
learning in RL where the teacher (source) and the stu- ference on Machine Learning-Volume 70, pp. 1126–
dent (task) agents can have arbitrarily different state- and 1135. JMLR. org, 2017.Shani Gamrian and Yoav Goldberg. Transfer learning Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gul-
for related reinforcement learning tasks via image-to- cehre, Guillaume Desjardins, James Kirkpatrick, Raz-
image translation. arXiv preprint arXiv:1806.07377, van Pascanu, Volodymyr Mnih, Koray Kavukcuoglu,
2018. and Raia Hadsell. Policy distillation. arXiv preprint
Tanmay Gangwani and Jian Peng. Policy optimization by arXiv:1511.06295, 2015.
genetic distillation. arXiv preprint arXiv:1711.01012, Andrei A Rusu, Neil C Rabinowitz, Guillaume Des-
2017. jardins, Hubert Soyer, James Kirkpatrick, Koray
Kavukcuoglu, Razvan Pascanu, and Raia Had-
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distill-
sell. Progressive neural networks. arXiv preprint
ing the knowledge in a neural network. arXiv preprint
arXiv:1606.04671, 2016.
arXiv:1503.02531, 2015.
Simon Schmitt, Jonathan J Hudson, Augustin Zidek,
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel,
Simon Osindero, Carl Doersch, Wojciech M Czar-
Nicolas Heess, Tom Erez, Yuval Tassa, David Silver,
necki, Joel Z Leibo, Heinrich Kuttler, Andrew Zisser-
and Daan Wierstra. Continuous control with deep rein-
man, Karen Simonyan, et al. Kickstarting deep rein-
forcement learning. arXiv preprint arXiv:1509.02971,
forcement learning. arXiv preprint arXiv:1803.03835,
2015.
2018.
Iou-Jen Liu, Jian Peng, and Alexander G Schwing. John Schulman, Sergey Levine, Pieter Abbeel, Michael
Knowledge flow: Improve upon your teachers. arXiv Jordan, and Philipp Moritz. Trust region policy op-
preprint arXiv:1904.05878, 2019. timization. In International Conference on Machine
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Learning, pp. 1889–1897, 2015a.
Andrei A Rusu, Joel Veness, Marc G Bellemare, John Schulman, Philipp Moritz, Sergey Levine, Michael
Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Jordan, and Pieter Abbeel. High-dimensional contin-
Georg Ostrovski, et al. Human-level control through uous control using generalized advantage estimation.
deep reinforcement learning. Nature, 518(7540):529– arXiv preprint arXiv:1506.02438, 2015b.
533, 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Radford, and Oleg Klimov. Proximal policy optimiza-
Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, tion algorithms. arXiv preprint arXiv:1707.06347,
David Silver, and Koray Kavukcuoglu. Asynchronous 2017.
methods for deep reinforcement learning. In Inter-
David Silver, Aja Huang, Chris J Maddison, Arthur
national Conference on Machine Learning, pp. 1928–
Guez, Laurent Sifre, George Van Den Driessche, Ju-
1937, 2016.
lian Schrittwieser, Ioannis Antonoglou, Veda Panneer-
Andrew Y Ng, Stuart J Russell, et al. Algorithms for shelvam, Marc Lanctot, et al. Mastering the game of
inverse reinforcement learning. In Icml, pp. 663–670, go with deep neural networks and tree search. nature,
2000. 529(7587):484–489, 2016.
Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdi- Richard S Sutton, David A McAllester, Satinder P Singh,
nov. Actor-mimic: Deep multitask and transfer rein- and Yishay Mansour. Policy gradient methods for re-
forcement learning. arXiv preprint arXiv:1511.06342, inforcement learning with function approximation. In
2015. Advances in neural information processing systems,
Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, pp. 1057–1063, 2000.
Giulia Vezzani, John Schulman, Emanuel Todorov, Yee Teh, Victor Bapst, Wojciech M Czarnecki, John
and Sergey Levine. Learning complex dexterous Quan, James Kirkpatrick, Raia Hadsell, Nicolas
manipulation with deep reinforcement learning and Heess, and Razvan Pascanu. Distral: Robust multitask
demonstrations. arXiv preprint arXiv:1709.10087, reinforcement learning. In Advances in Neural Infor-
2017. mation Processing Systems, pp. 4496–4506, 2017.
Danilo Jimenez Rezende and Shakir Mohamed. Vari- Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler.
ational inference with normalizing flows. arXiv Nervenet: Learning structured policy with graph neu-
preprint arXiv:1505.05770, 2015. ral networks. 2018.
Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and
Beyond sharing weights for deep domain adaptation. Anind K Dey. Maximum entropy inverse reinforce-
IEEE transactions on pattern analysis and machine ment learning. In AAAI, volume 8, pp. 1433–1438.
intelligence, 41(4):801–814, 2018. Chicago, IL, USA, 2008.APPENDIX
A Hyper-parameters
Table 2: Hyper-parameters used for all experiments.
Hyperparameter Value
Hidden Layers 2
Hidden Units 64
Activation tanh
Optimizer Adam
Learning Rate 3 x 10−4
Epochs per Iteration 10
Minibatch Size 64
Discount (γ) 0.99
GAE parameter (λ) 0.95
Clip range () 0.2
B Normalized Student Weights
CentipedeFour to CentipedeEight CentipedeSix to CentipedeEight CentipedeFour to CpCentipedeSix
1.0 1.0 1.0
0.9 0.9 0.9
Normalized Student Weight
Normalized Student Weight
Normalized Student Weight
0.8
0.8 0.8
0.7
0.7 0.7
0.6
0.6 0.6
0.5
0.5 0.5
MIKT Normalized Student Weight MIKT Normalized Student Weight MIKT Normalized Student Weight
0.4
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00
Timesteps 1e6 Timesteps 1e6 Timesteps 1e6
(a) (b) (c)
CentipedeSix to CpCentipedeEight CentipedeFour to Ant CentipedeSix to Ant
1.0 1.0 1.0
0.9 0.9
0.9
Normalized Student Weight
Normalized Student Weight
Normalized Student Weight
0.8 0.8
0.8
0.7 0.7
0.7
0.6 0.6
0.6 0.5
0.5
0.5 0.4
0.4
MIKT Normalized Student Weight MIKT Normalized Student Weight MIKT Normalized Student Weight
0.3
0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00
Timesteps 1e6 Timesteps 1e6 Timesteps 1e6
(d) (e) (f)
Figure 6: Plots of the normalized student weight throughout the course of training. Lower values indicate heavier
dependence on teacher representations. A value of 1 indicates the student is completely independent of the teacher.C KL-Regularization Ablation
CentipedeFour to CentipedeEight CentipedeSix to CentipedeEight CentipedeFour to CpCentipedeSix
3500 3500
3000
3000 3000
2500
2500 2500
2000
2000 2000
Reward
Reward
Reward
1500
1500 1500
1000 1000 1000
500 500 500
MIKT MIKT MIKT
0 0
MIKT w/o KL MIKT w/o KL 0 MIKT w/o KL
0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0
Timesteps 1e6 Timesteps 1e6 Timesteps 1e6
(a) (b) (c)
CentipedeSix to CpCentipedeEight CentipedeFour to Ant CentipedeSix to Ant
3500
3000
3000
3000
2500
2500 2500
2000
2000 2000
Reward
Reward
Reward
1500
1500
1500
1000 1000
1000
500 500
500
0 0
MIKT MIKT MIKT
0 MIKT w/o KL MIKT w/o KL MIKT w/o KL
−500 −500
0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0 0.0 0.5 1.0 1.5 2.0
Timesteps 1e6 Timesteps 1e6 Timesteps 1e6
(d) (e) (f)
Figure 7: MIKT with KL-regularization (blue) vs. MIKT without KL-regularization (green). MIKT still works well
without the KL-regularization.You can also read