"This item is a glaxefw, and this is a glaxuzb": Compositionality Through Language Transmission, using Artificial Neural Networks
Hugh Perkins (hp@asapp.com)
ASAPP (https://asapp.com)
1 World Trade Center, NY 10007 USA

arXiv:2101.11739v1 [cs.CL] 27 Jan 2021

Abstract

We propose an architecture and process for using the Iterated Learning Model ("ILM") with artificial neural networks. We show that ILM does not lead to the same clear compositionality as observed using DCGs, but does lead to a modest improvement in compositionality, as measured by holdout accuracy and topologic similarity. We show that ILM can lead to an anti-correlation between holdout accuracy and topologic rho. We demonstrate that ILM can increase compositionality when using non-symbolic high-dimensional images as input.

1 Introduction

Human languages are compositional. For example, if we wish to communicate the idea of a 'red box', we use one word to represent the color 'red', and one to represent the shape 'box'. We can use the same set of colors with other shapes, such as 'sphere'. This contrasts with a non-compositional language, where each combination of color and shape would have its own unique word, such as 'aefabg'. That we use words at all is a characteristic of compositionality: we could alternatively use a unique sequence of letters or phonemes for each possible thought or utterance.

Compositionality provides advantages over non-compositional language. Compositional language allows us to generalize concepts such as colors across different situations and scenarios. However, it is unclear what concrete mechanism led to human languages being compositional. In laboratory experiments using artificial neural networks, languages emerging between multiple communicating agents show some small signs of compositionality, but do not show the clear compositional behavior that human languages show. (Kottur et al., 2017) shows that agents do not learn compositionality unless they have to. In the context of referential games, (Lazaridou et al., 2018) showed that agent utterances had a topographic rho of 0.16-0.26, on a scale of 0 to 1, even whilst showing a task accuracy in excess of 98%.

In this work, following the ideas of (Kirby, 2001), we hypothesize that human languages are compositional because compositional languages are highly compressible, and can be transmitted across generations most easily. We extend the ideas of (Kirby, 2001) to artificial neural networks, and experiment with using non-symbolic inputs to generate each utterance.

We find that transmitting languages across generations using artificial neural networks does not lead to such clearly visible compositionality as was apparent in (Kirby, 2001). However, we were unable to confirm the null hypothesis that ILM using artificial neural networks does not increase compositionality across generations: we find that objective measures of compositionality do increase over several generations, reaching a relatively modest plateau.

Our key contributions are:

• propose an architecture for using ILM with artificial neural networks, including with non-symbolic input
• show that ILM with artificial neural networks does not lead to the same clear compositionality as observed using DCGs
• show that ILM does lead to a modest increase in compositionality for neural models
• show that two measures of compositionality, i.e. holdout accuracy and topologic similarity, can correlate negatively, in the presence of ILM
• demonstrate an effect of ILM on compositionality for non-symbolic high-dimensional inputs

2 Iterated Learning Method

(Kirby, 2001) hypothesized that compositionality in language emerges because languages need to be easy to learn, in order to be transmitted between generations. (Kirby, 2001) showed that, using simulated teachers and students equipped with a context-free grammar, the transmission of a randomly initialized language across generations caused the emergence of an increasingly compositional grammar. (Kirby et al., 2008) showed evidence for the same process in humans, who were each tasked with transmitting a language to another participant, in a chain.

(Kirby, 2001) termed this approach the "Iterated Learning Method" (ILM). Learning proceeds in a sequence of generations. In each generation, a teacher agent transmits a language to a student agent. The student agent then becomes the teacher agent for the next generation, and a new student agent is created. A language G is defined as a mapping G: M → U from a space of meanings M to a space of utterances U. G can be represented as a set of pairs of meanings and utterances G = {(m_1, u_1), (m_2, u_2), ..., (m_n, u_n)}. Transmission from teacher to student is imperfect, in that only a subset G_train of the full language G is presented to the student. Thus the student agent must generalize from the seen meaning/utterance pairs {(m_i, u_i) | m_i ∈ M_train,t ⊂ M} to the unseen meanings {m_i | m_i ∈ M \ M_train,t}. We represent the mapping from meaning m_i to utterance u_i by the teacher as f_T(·); similarly, we represent the student agent as f_S(·).

In ILM, each generation proceeds as follows:

• draw a subset of meanings M_train,t from the full set of meanings M
• invention: use the teacher agent to generate utterances U_train,t = {u_i,t = f_T(m_i) | m_i ∈ M_train,t}
• incorporation: the student memorizes the teacher's mapping from M_train,t to U_train,t
• generalization: the student generalizes from the seen meaning/utterance pairs G_train,t to determine utterances for the unseen meanings M \ M_train,t

In (Kirby, 2001), the agents are deterministic sets of DCG rules; see Figure 1 for two examples. For each pair of meaning and utterance (m_i, u_i) ∈ G_train,t, if (m_i, u_i) is already produced by the existing grammar rules, then no learning takes place; otherwise, a new grammar rule is added that maps from m_i to u_i. Then, in the generalization phase, rules are merged, where possible, to form a smaller set of rules consistent with the set of meaning/utterance pairs seen during training, G_train,t. The generalization phase uses a complex set of hand-crafted merging rules.

Figure 1: Two example sets of DCG rules. Each set produces the utterance 'abc' when presented with the meanings (a0, b3).

    Grammar 1:   S: (a0, b3) → abc

    Grammar 2:   S: (x, y) → A:y B:x
                 A: b3 → ab
                 B: a0 → c

The initial language at generation t_0 is randomly initialized, such that each u_t,i is a random sequence of letters. The meaning space comprises two attributes, each having 5 or 10 possible values, giving a total meaning space of 5^2 = 25 or 10^2 = 100 possible meanings.

(Kirby, 2001) examined the compositionality of the language after each generation, by looking for common substrings in the utterances for each attribute. An example language is shown in Table 1. In this language, there are two meaning attributes, a and b, taking values {a0, ..., a4} and {b0, ..., b4}.

Table 1: Example language generated by Kirby's ILM.

          a0    a1      a2    a3    a4
    b0    qda   bguda   lda   kda   ixcda
    b1    qr    bgur    lr    kr    ixcr
    b2    qa    bgua    la    ka    ixca
    b3    qu    bguu    lu    ku    ixcu
    b4    qp    bgup    lp    kp    ixcp

For example, attribute a could be color, and a0 could represent 'red'; whilst b could be shape, and b3 could represent 'square'. Then the word for 'red square', in the example language shown, would be 'qu'. We can see that, in the example, attribute a0 is associated with the prefix 'q', whilst attribute b3 tends to be associated with the suffix 'u'. The example language thus shows compositionality.
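For concreteness, the generational loop can be sketched in Python as follows. This is a minimal sketch: the agent interface (speak and learn methods) and the sampling fraction are our own illustrative assumptions, standing in for the DCG machinery and hand-crafted merging rules of (Kirby, 2001).

    import random

    def iterated_learning(meanings, make_agent, n_generations, train_fraction=0.5):
        """Minimal ILM loop: each generation, a fresh student learns the
        teacher's language from a subset of meanings, then becomes the
        teacher for the next generation."""
        teacher = make_agent()
        for t in range(n_generations):
            # draw a subset of meanings M_train,t from the full meaning space M
            m_train = random.sample(meanings, int(train_fraction * len(meanings)))
            # invention: the teacher produces an utterance for each sampled meaning
            pairs = [(m, teacher.speak(m)) for m in m_train]
            # incorporation + generalization: the student learns from the pairs
            student = make_agent()
            student.learn(pairs)
            teacher = student
        return teacher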
(Kirby et al., 2008) extended ILM to humans. They observed that ILM with humans could lead to degenerate grammars, where multiple meanings mapped to identical utterances. However, they showed that pruning duplicate utterances from the results of the generation phase, prior to presentation to the student, was sufficient to prevent the formation of such degenerate grammars.

3 ILM using Artificial Neural Networks

We seek to extend ILM to artificial neural networks, for example using RNNs. Differently from the DCG in (Kirby, 2001), artificial neural networks generalize over their entire support, for each training example, and learning is in general lossy and imperfect.

In the case of using ANNs, we first need to consider how to represent a single 'meaning'. Considering the example language depicted in Table 1 above, we can represent each attribute as a one-hot vector, and represent the set of two attributes as the concatenation of two one-hot vectors. More generally, we can represent a meaning as a single real-valued vector m. In this work, we will use 'thought vector' and 'meaning vector' as synonyms for 'meaning', in the context of ANNs.

We partition the meaning space M into M_train and M_holdout, such that M = M_train ∪ M_holdout. We denote a subset of M_train at generation t by M_train,t.

3.1 Naive ANN ILM

A naive attempt to extend ILM to artificial neural networks (ANNs) is to simply replace the DCG in ILM with an RNN; see Figure 2. In practice, we observed that this formulation leads to a degenerate grammar, where all meanings map to a single identical utterance.

Figure 2: Naive ILM using Artificial Neural Networks

ANNs generalize naturally, but learning is lossy and imperfect. This contrasts with a DCG, which does not generalize by itself: in the case of a DCG, generalization is implemented by applying certain hand-crafted rules. With careful crafting of the generalization rules, the DCG will learn a training set perfectly, and degenerate grammars are rare. In the case of an ANN, the lossy teacher-student training progressively smooths the outputs; in the limit of training over multiple generations, an ANN produces the same output, independent of the input: a degenerate grammar.

The first two rows of Table 2 show results for two meaning spaces: 2 attributes each with 33 possible values (denoted 33^2), and 5 attributes each with 10 possible values (denoted 10^5). The column 'Uniq' is a measure of the uniqueness of utterances over the meaning space, where 0 means all utterances are identical, and 1 means all utterances are distinct. We can see that the uniqueness values are near zero for both meaning spaces.

We tried the approach of (Kirby et al., 2008) of removing duplicate utterances prior to presentation to the student. Results for 'Nodups' are shown in the last two rows of Table 2. The uniqueness improved slightly, but was still near zero. Thus, in our experiments, the approach of (Kirby et al., 2008) did not prevent the formation of a degenerate grammar when using ANNs.

Table 2: Results using the naive ANN ILM architecture. 'Nodups': remove duplicates; 'Uniq': uniqueness; ρ: topographic similarity (see later). The termination criterion for teacher-student training is 98% accuracy.

    Meaning space   Nodups   Uniq    ρ      accH
    33^2            -        0.024   0.04   0.05
    10^5            -        0.024   0.08   0
    33^2            yes      0.039   0.1    0
    10^5            yes      0.05    0.1    0
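For reference, the 'Uniq' measure can be computed as in the sketch below. The exact normalization is our assumption, since the text only pins down the two endpoints (0 when all utterances are identical, 1 when all are distinct).

    def uniqueness(utterances):
        """Uniqueness of utterances over the meaning space: 0 when all
        utterances are identical, 1 when all are distinct. Utterances are
        hashable, e.g. strings or tuples of tokens."""
        n = len(utterances)
        if n <= 1:
            return 1.0
        return (len(set(utterances)) - 1) / (n - 1)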
3.2 Auto-encoder to enforce uniqueness

To prevent the formation of degenerate grammars, we propose to enforce uniqueness of utterances by mapping the generated utterances back into meaning space, and using a reconstruction loss on the reconstructed meanings.

Using a meaning-space reconstruction loss requires a way to map from generated utterances back to meaning space. One way to achieve this could be to back-propagate from a generated utterance back onto a randomly initialized meaning vector. However, this requires multiple back-propagation iterations in general, and we found this approach to be slow. We choose instead to introduce a second ANN, which learns to map from discrete utterances back to meaning vectors. Our architecture is thus an auto-encoder: we call the decoder the 'sender', which maps from a thought vector into discrete language, and we term the encoder the 'receiver'. We equip each agent with both a sender and a receiver network; see Figure 3.

Figure 3: Agent sender-receiver architecture

3.3 Neural ILM Training Procedure

We denote the teacher sender network as f_T,send(·), the student receiver network as f_S,recv(·), and the student sender network as f_S,send(·). The output of f_·,send(·) is non-normalized logits, representing a sequence of distributions over discrete tokens; these logits can be converted into discrete tokens by applying an argmax.

For teacher-student training, we use the sender network of the teacher to generate a set of meaning-utterance pairs, which represents a subset of the teacher's language. We present this language to the student, and train both the sender and the receiver network of the student on this new language.

The ILM training procedure is depicted in Figure 4. A single generation, at step t, proceeds as follows:

• meaning sampling: sample a subset of meanings M_train,t = {m_t,0, ..., m_t,N} ⊂ M_train, where M_train is a subset of the space of all meanings, i.e. M_train = M \ M_holdout
• teacher generation: use the teacher sender network to generate the set of utterances U_t = {u_t,0, ..., u_t,N}
• student supervised training: train the student sender and receiver networks supervised, using M_train,t and U_t
• student end-to-end training: train the student sender and receiver networks end-to-end, as an auto-encoder

Figure 4: Neural ILM Training Procedure

For the teacher generation, each utterance u_t,n is generated as f_T,send(m_t,n).

For the student supervised training, we train the student sender network f_S,send(·) to generate U_t given M_t, and the student receiver network f_S,recv(·) to recover M_t given U_t. Supervised training for each network terminates after N_sup epochs, or once training accuracy reaches acc_sup.

The student supervised training serves to transmit the language from the teacher to the student. The student end-to-end training enforces uniqueness of utterances, so that the language does not become degenerate. In the end-to-end step, we iterate over multiple batches, where for each batch j we:

• sample a set of meanings M_train,t,j = {m_t,j,0, ..., m_t,j,N_batch} ⊂ M_train
• train the sender and receiver as an auto-encoder, using an end-to-end loss function L_e2e, with the meanings M_train,t,j as both the input and the target ground truth

End-to-end training runs for either N_e2e batches, or until the end-to-end training accuracy reaches the threshold acc_e2e.
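Putting the steps of a single generation together, a minimal sketch follows. The Agent container (with sender and receiver attributes) and the train_supervised / train_autoencoder helpers are hypothetical stand-ins for the training loops just described, not the paper's actual code.

    import torch

    def ilm_generation(teacher, make_student, meanings,
                       train_supervised, train_autoencoder):
        """One neural ILM generation: the teacher generates utterances for a
        sample of meanings, the student learns them supervised, then trains
        end-to-end as an auto-encoder to keep utterances unique."""
        # teacher generation: u_n = argmax over the logits of f_T,send(m_n)
        with torch.no_grad():
            utterances = teacher.sender(meanings).argmax(dim=-1)
        student = make_student()
        # student supervised training: sender maps M_t -> U_t,
        # receiver maps U_t -> M_t
        train_supervised(student.sender, inputs=meanings, targets=utterances)
        train_supervised(student.receiver, inputs=utterances, targets=meanings)
        # student end-to-end training: auto-encode meanings through discrete
        # utterances, which prevents the language becoming degenerate
        train_autoencoder(student.sender, student.receiver, meanings)
        return student  # the student becomes the teacher at generation t+1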
3.4 Non-symbolic input

In the general case, the meanings m can be presented as raw non-symbolic stimuli x. Each raw stimulus x can be encoded by some network into a thought vector m. We denote such an encoding network a 'perception' network. As an example of a perception network, an image could be encoded using a convolutional neural network.

This presents a challenge when training the receiver network. One possible architecture would be for the receiver network to generate the original input x. We choose instead to share the perception network between the sender and receiver networks in each agent. During supervised training of the sender, using the language generated by the teacher, we train the perception and sender networks jointly. To train the receiver network, we hold the perception network weights constant, and train the receiver network to predict the output of the perception network, given the input utterance u and the target stimulus x; see Figure 5. Note that by setting the perception network to the identity operator, we recover the earlier supervised training steps.

Figure 5: Generalized Neural ILM Supervised Training

For end-to-end training with non-symbolic input, we use a referential task, e.g. as described in (Lazaridou et al., 2018). The sender network is presented with the output of the perception network, m, and generates an utterance u. The receiver network chooses, from among distractors, a target image which matches the image presented to the sender. The target image that the receiver network perceives could be the original stimulus presented to the sender, or it could be a stimulus which matches the original image in concept, but is not the same stimulus: for example, two images could contain the same shapes, with the same colors, but in different positions. Figure 6 depicts the architecture, with a single distractor; in practice, multiple distractors are typically used.

Figure 6: End-to-end Referential Task for Non-symbolic Inputs, where x_s is the input stimulus presented to the sender, x_tgt is the target input stimulus, and x_distr1 is a distractor stimulus.
3.5 Discrete versus soft utterances

When we train a sender and receiver network end-to-end, we can put a softmax on the output of the sender network f_·,send, to produce a probability distribution over the vocabulary for each token. We can feed these probability distributions directly into the receiver network f_·,recv, and train using a cross-entropy loss. We denote this scenario SOFTMAX.

Alternatively, we can sample discrete tokens from categorical distributions parameterized by the softmax output. We train the resulting end-to-end network using REINFORCE, with a moving-average baseline and entropy regularization. This scenario is denoted RL.
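A minimal sketch of the RL variant's loss for a single sampled utterance is shown below; the default entropy weight here is illustrative (the values we actually used are listed in the appendix), and the baseline update scheme is likewise an assumption.

    import torch

    def reinforce_loss(logits, reward, baseline, entropy_weight=0.01):
        """REINFORCE loss for one sampled utterance. logits has shape
        [seq_len, vocab_size]; reward is the task reward (e.g. whether the
        receiver picked the correct image); baseline is a running average
        of recent rewards, used to reduce gradient variance."""
        dist = torch.distributions.Categorical(logits=logits)
        tokens = dist.sample()                  # discrete utterance tokens
        log_prob = dist.log_prob(tokens).sum()  # log p(utterance)
        advantage = reward - baseline
        # maximize advantage-weighted log-likelihood, plus entropy bonus
        loss = -advantage * log_prob - entropy_weight * dist.entropy().sum()
        return loss, tokens

The moving-average baseline itself can be updated after each batch, e.g. baseline = 0.99 * baseline + 0.01 * reward, where the decay constant is again an assumption.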
3.6 Evaluation of Compositionality

We wish to use objective measures of compositionality. This is necessary because the compositional signal is empirically relatively weak. We assume access to the ground truth for the meanings, and use two approaches: topographic similarity, ρ, as defined in (Brighton and Kirby, 2006) and (Lazaridou et al., 2018); and holdout accuracy, acc_H.

ρ is the correlation between distance in meaning space and distance in utterance space, taken across multiple examples. For the distance metric, we use the L0 distance, for both meanings and utterances. That is, in meaning space, the distance between 'red square' and 'yellow square' is 1, and the distance between 'red square' and 'yellow circle' is 2. In utterance space, the distance between 'glaxefw' and 'glaxuzg' is 3. Considered as an edit distance, we allow substitutions, but neither insertions nor deletions. For the correlation measure, we use Spearman's Rank Correlation.

acc_H shows the ability of the agents to generalize to combinations of shapes and colors not seen in the training set. For example, the training set might contain examples of 'red square', 'yellow square', and 'yellow circle', but not 'red circle'. If the utterances were perfectly compositional, both as generated by the sender and as interpreted by the receiver, then we would expect performance on 'red circle' to be similar to the performance on 'yellow circle'. The performance on the holdout set, relative to the performance on the training set, can thus be interpreted as a measure of compositionality.

Note that when there is just a single attribute, it is not possible to exclude any values from training, otherwise the model would never have been exposed to the value at all. Therefore acc_H is only a useful measure of compositionality when there are at least 2 attributes.

We observe that one key difference between ρ and acc_H is that ρ depends only on the compositional behavior of the sender, whereas acc_H depends also on the compositional behavior of the receiver. As noted in (Lowe et al., 2019), it is possible for utterances generated by a sender to exhibit a particular behavior or characteristic without the receiver making use of this behavior or characteristic.
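As a concrete reference, ρ can be computed as in the sketch below. This is a minimal version which assumes fixed-length utterances (utterance length is fixed in our experiments), so that the substitution-only edit distance reduces to a Hamming distance.

    from itertools import combinations
    from scipy.stats import spearmanr

    def topographic_similarity(meanings, utterances):
        """Spearman correlation between pairwise meaning distances and
        pairwise utterance distances. Meanings are attribute tuples;
        utterances are equal-length token sequences."""
        def hamming(a, b):
            # substitution-only edit distance between equal-length sequences
            return sum(x != y for x, y in zip(a, b))
        pairs = list(combinations(range(len(meanings)), 2))
        m_dist = [hamming(meanings[i], meanings[j]) for i, j in pairs]
        u_dist = [hamming(utterances[i], utterances[j]) for i, j in pairs]
        return spearmanr(m_dist, u_dist).correlation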
4 Related Work

Work on emergent communications was revived recently, for example by (Lazaridou et al., 2016) and (Foerster et al., 2016). CITE and CITE showed emergent communications in a 2d world (CITE). Several works investigate the compositionality of the emergent language, such as CITE, CITE, CITE. (Kottur et al., 2017) showed that agents do not generate compositional languages unless they have to. (Lazaridou et al., 2018) used a referential game with high-dimensional non-symbolic input, and showed the resulting languages contained elements of compositionality, measured by topographic similarity. (Bouchacourt and Baroni, 2018) caution that agents may not be communicating what we think they are communicating, by using randomized images, and by investigating the effect of swapping the target image. (Andreas et al., 2017) proposed an approach to learn to translate from an emergent language into a natural language. Obtaining compositional emergent language can be viewed as disentanglement of the agent communications. (Locatello et al., 2019) prove that unsupervised learning of disentangled representations is fundamentally impossible without inductive biases on both the considered learning approaches and the data sets.

Kirby pioneered ILM in (Kirby, 2001), extending it to humans in (Kirby et al., 2008). (Griffiths and Kalish, 2007) proved that, for Bayesian agents, the iterated learning method converges to a distribution over languages that is determined entirely by the prior, which is somewhat aligned with the result in (Locatello et al., 2019) for disentangled representations. (Li and Bowling, 2019), (Cogswell et al., 2020), and (Ren et al., 2020) extend ILM to artificial neural networks, using symbolic inputs. Symbolic input vectors are by nature themselves compositional: typically, the concatenation of one-hot vectors of attribute values, or of per-attribute embeddings (e.g. (Kottur et al., 2017)). Thus, these works show that given compositional input, agents can generate compositional output. In our work, we extend ILM to high-dimensional, non-symbolic inputs. A concurrent work (Dagan et al., 2020) also extends ILM to image inputs, and takes an additional step in examining the effect of genetic evolution of the network architecture, in addition to the cultural evolution of the language that we consider in our own work.

(Andreas, 2019) provides a very general framework, TRE, for evaluating compositionality, along with a specific implementation that relates closely to the language representations used in the current work. It uses a learned linear projection to rearrange tokens within each utterance, and a relaxation to enable the use of gradient descent to learn the projection. Due to time pressure, we did not use TRE in our own work.

Our work on neural ILM relates to distillation (Ba and Caruana, 2014; Hinton et al., 2015), in which a large teacher network distills knowledge into a smaller student network. More recently, (Furlanello et al., 2018) showed that when the student network has identical size and architecture to the teacher network, distillation can still give an improvement in validation accuracy, on a vision and a language model. Our work relates also to self-training (He et al., 2019), in which learning proceeds in iterations, similar to ILM generations.
5 Experiments

5.1 Experiment 1: Symbolic Input

5.1.1 Dataset construction

We conduct experiments first on a synthetic concept dataset, built to resemble that of (Kirby, 2001). We experiment with meanings comprising a attributes, where each attribute can take one of k values; the set of all possible meanings M thus comprises k^a unique meanings, and we use the notation k^a to describe such a meaning space. We reserve a holdout set H of 128 meanings, which will not be presented during training. This leaves (k^a - 128) meanings for training and validation. In addition, we remove from the training set any meanings having 3 or more attributes in common with any meaning in the holdout set.

We choose two meaning spaces: 33^2 and 10^5. 33^2 is constructed to be similar in nature to (Kirby, 2001), whilst being large enough to train an RNN without immediately over-fitting: with 33 possible values per attribute, the number of possible meanings increases from 10^2 = 100 to 33^2 ≈ 1,000. In addition to not over-fitting, this allows us to set aside a reasonable holdout set of 128 examples. We experiment in addition with a meaning space of 10^5, which has a total of 100,000 possible meanings. We hypothesized that the much larger number of meanings prevents the network from simply memorizing each meaning, and thus forces the network to adopt a more compositional representation.
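A minimal sketch of this construction follows; the random seed and the shuffle-then-split scheme are illustrative assumptions, not the paper's actual sampling code.

    import random
    from itertools import product

    def build_meaning_space(k, a, holdout_size=128, seed=0):
        """Build the k^a meaning space, reserve a random holdout set, and
        drop from the training set any meaning sharing 3 or more attribute
        values with any holdout meaning."""
        rng = random.Random(seed)
        meanings = list(product(range(k), repeat=a))
        rng.shuffle(meanings)
        holdout = meanings[:holdout_size]
        def too_close(m):
            # count attribute values shared with each holdout meaning
            return any(sum(x == y for x, y in zip(m, h)) >= 3 for h in holdout)
        train = [m for m in meanings[holdout_size:] if not too_close(m)]
        return train, holdout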
5.1.2 Experimental Setup

The model architecture for the symbolic concept task is that depicted in Figure 4. The sender model converts each meaning into a many-hot representation, of dimension k · a, then projects the many-hot representation into an embedding space.

5.1.3 Results

Table 3 shows the results for the symbolic concept task. We can see that ILM improves the topographic similarity measure, for both the 33^2 and 10^5 meaning spaces, and for both the SOFTMAX and RL links. Interestingly, in the 10^5 meaning space, the increase in compositionality as measured by ρ is associated with a decrease in acc_H, for both SOFTMAX and RL. This could potentially indicate that ILM induces the sender to generate more compositional output, but that the receiver's understanding of the utterances becomes less compositional, in this scenario. It is interesting that ρ and acc_H can be inversely correlated in certain scenarios; this aligns somewhat with the findings in (Lowe et al., 2019). Interestingly, it is also not clear that using a 10^5 meaning space leads to more compositional utterances than the much smaller 33^2 meaning space.

Table 3: Results using the auto-encoder architecture on the synthetic concepts dataset. 'M': meaning space; 'L': end-to-end link type; 'E2E Tgt': termination criterion ('target') for end-to-end training; ρ: topographic similarity. Where ILM is used, it is run for 5 generations.

    M      L         E2E Tgt   ILM?   accH              ρ
    33^2   SOFTMAX   e=100k    -      0.97 +/- 0.02     0.23 +/- 0.01
    33^2   SOFTMAX   e=100k    yes    0.984 +/- 0.002   0.30 +/- 0.02
    33^2   RL        e=500k    -      0.39 +/- 0.01     0.18 +/- 0.01
    33^2   RL        e=500k    yes    0.52 +/- 0.04     0.238 +/- 0.008
    10^5   SOFTMAX   e=100k    -      0.97 +/- 0.01     0.22 +/- 0.02
    10^5   SOFTMAX   e=100k    yes    0.56 +/- 0.06     0.28 +/- 0.01
    10^5   RL        e=500k    -      0.65 +/- 0.17     0.17 +/- 0.02
    10^5   RL        e=500k    yes    0.449 +/- 0.004   0.28 +/- 0.01

5.2 Experiment 2: Images

5.2.1 Dataset

In Experiment One, we conserved the type of stimuli used in prior work on ILM, e.g. (Kirby, 2001), using highly structured input. In Experiment Two, we investigate the extent to which ILM shows a benefit using unstructured high-dimensional input.
We used OpenGL to create scenes containing colored objects, of various shapes, in different positions. In the previous task, using symbolic meanings, we required the listener to reconstruct the symbolic meaning. In the case of images, we use a referential task, as discussed in Section 3.4. The advantage of using a referential task is that we do not require the agents to communicate the exact position and color of each object, just which shapes and colors are present. If the agents agree on an ordering over shapes, then the number of attributes to be communicated is exactly equal to the number of objects in the images. The positions of the objects are randomized, to add noise to the images, and we also varied the color of the ground plane over each image.

Example images are shown in Figure 7. Each example comprises 6 images: one sender image, the target receiver image, and 4 distractor images. Each object in a scene was a different shape, and we varied the colors and the positions of each object; each shape was unique within each image. Two images were considered to match if the sets of shapes were identical, and if the objects with the same shapes were identically colored. The positions of the objects were irrelevant for the purposes of judging whether the images matched. We change only a single color in each distractor, so that we force the sender and receiver to communicate all object colors, not just one or two.

We create three datasets, for sets of 1, 2 or 3 objects respectively. Each dataset comprises 4096 training examples, and 512 holdout examples. In the case of two shapes and three shapes, we create the holdout set by setting aside combinations of shapes and colors which are never seen in the training set. That is, the color 'red' might have been seen for a cube, but not for a cylinder. In the case of just one shape, this would mean that the color had never been seen at all; so, for a single shape, we relax this requirement, and just use unseen geometrical configurations in the holdout set.

The dataset is constructed using OpenGL and Python. The code will be made available at https://github.com/asappresearch/neural-ilm.
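A sketch of the matching criterion described above follows; the scene representation (a list of (shape, color, position) tuples) is an assumption for illustration, not the actual data format.

    def images_match(scene_a, scene_b):
        """Two scenes match if the sets of shapes are identical and objects
        of the same shape are identically colored; positions are ignored.
        Shapes are unique within a scene, so a shape -> color dict is
        well-defined."""
        colors_a = {shape: color for shape, color, _ in scene_a}
        colors_b = {shape: color for shape, color, _ in scene_b}
        return colors_a == colors_b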
5.2.2 Experimental setup

The supervised learning of the student sender and receiver from the teacher-generated language is illustrated in Figure 5. The referential task architecture is depicted in Figure 6. Owing to time pressure, we experimented only with using RL; we chose RL over SOFTMAX because we felt that RL is more representative of the discrete nature of natural languages.

5.2.3 Results

Table 4: Results for the OpenGL datasets. 'Shapes' is the number of shapes, and 'Batches' is the total number of batches. For ILM, three generations are used, and the batches per generation is the total number of batches divided by the number of generations.

    Shapes   ILM?   Batches   accH             Holdout ρ
    1        -      300k      0.76 +/- 0.11    0.55 +/- 0.03
    1        yes    300k      0.95 +/- 0.03    0.69 +/- 0.04
    2        -      600k      0.21 +/- 0.03    0.46 +/- 0.2
    2        yes    600k      0.30 +/- 0.06    0.64 +/- 0.05
    3        -      600k      0.18 +/- 0.01    0.04 +/- 0.02
    3        yes    600k      0.23 +/- 0.02    0.19 +/- 0.04
Figure 7: Example referential task images, one example per row. The sender image and the correct receiver image are the first two images in each row.

Table 4 shows the results using the OpenGL datasets. We can see that, when training using the RL scenario, ILM shows an improvement in both acc_H and ρ across all three datasets. The increase in topographic similarity is associated with an improvement in holdout accuracy, across all scenarios, similar to the 33^2 symbolic concepts scenario.

Figure 8: Examples of individual runs up to 10 generations. '1 ilm', '2 ilm', and '3 ilm' denote ILM over the one, two and three shape datasets respectively. 'e2e acc' denotes end-to-end training accuracy, 'e2e holdout acc' denotes end-to-end accuracy on the holdout set (accH), and 'e2e rho' denotes the topologic similarity of the generated utterances (ρ).

Figure 8 shows examples of individual runs. The plots within each row are for the same dataset, i.e. one shape, two shapes, or three shapes. The first column shows the end-to-end accuracy, the second column shows the holdout accuracy acc_H, and the third column shows the topologic similarity ρ. We note firstly that the variance across runs is high, which makes evaluating trends challenging. Results in the table above were reported using five runs per scenario, pre-selecting which runs to use prior to running them.

We can see that end-to-end training accuracy is good for the one and two shapes scenarios, but the model struggles to achieve high training accuracy on the more challenging three shapes dataset. The holdout accuracy similarly falls dramatically, relative to the training accuracy, as the number of shapes in the dataset increases. Our original hypothesis was that the more challenging dataset, i.e. three shapes, would be harder to memorize, and would thus lead to better compositionality. That the holdout accuracy actually gets worse, compared to the training accuracy, with more shapes, was surprising to us.

Similarly, the topologic similarity actually becomes worse as we add more shapes to the dataset. This seems unlikely to be simply because the receiver struggles to learn anything at all, since the end-to-end training accuracy stays relatively high across all three datasets. We note that the ILM effect is only apparent over the first few generations, reaching a plateau after around 2-3 generations.

6 Conclusion

In this paper, we proposed an architecture to use the iterated learning method ("ILM") for neural networks, including for non-symbolic high-dimensional input. We showed that using ILM with neural networks does not lead to the same clear compositionality as observed for DCGs. However, we showed that ILM does lead to a modest increase in compositionality, as measured by both holdout accuracy and topologic similarity. We showed that holdout accuracy and topologic rho can be anti-correlated with each other, in the presence of ILM; thus caution is warranted when using only a single one of these measures. We showed that ILM leads to an increase in compositionality for non-symbolic high-dimensional input images.

Acknowledgements

Thank you to Angeliki Lazaridou for many interesting discussions and ideas that I've tried to use in this paper.

References
Jacob Andreas. 2019. Measuring compositionality in representation learning. arXiv preprint arXiv:1902.07181.

Jacob Andreas, Anca Dragan, and Dan Klein. 2017. Translating neuralese. arXiv preprint arXiv:1704.06960.

Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems 27, pages 2654-2662. http://papers.nips.cc/paper/5484-do-deep-nets-really-need-to-be-deep.pdf.

Diane Bouchacourt and Marco Baroni. 2018. How agents see things: On visual representations in an emergent language game. arXiv preprint arXiv:1808.10696.

Henry Brighton and Simon Kirby. 2006. Understanding linguistic evolution by visualizing the emergence of topographic mappings. Artificial Life 12(2):229-242.

Michael Cogswell, Jiasen Lu, Stefan Lee, Devi Parikh, and Dhruv Batra. 2020. Emergence of compositional language with deep generational transmission.

Gautier Dagan, Dieuwke Hupkes, and Elia Bruni. 2020. Co-evolution of language and agents in referential games. arXiv preprint arXiv:2001.03361.

Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems 29, pages 2137-2145.

Tommaso Furlanello, Zachary C Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born again neural networks. arXiv preprint arXiv:1805.04770.

Thomas L Griffiths and Michael L Kalish. 2007. Language evolution by iterated learning with Bayesian agents. Cognitive Science 31(3):441-480.

Junxian He, Jiatao Gu, Jiajun Shen, and Marc'Aurelio Ranzato. 2019. Revisiting self-training for neural sequence generation.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Simon Kirby. 2001. Spontaneous evolution of linguistic structure: an iterated learning model of the emergence of regularity and irregularity. IEEE Transactions on Evolutionary Computation 5(2):102-110.

Simon Kirby, Hannah Cornish, and Kenny Smith. 2008. Cumulative cultural evolution in the laboratory: An experimental approach to the origins of structure in human language. Proceedings of the National Academy of Sciences 105(31):10681-10686.

Satwik Kottur, José M.F. Moura, Stefan Lee, and Dhruv Batra. 2017. Natural language does not emerge 'naturally' in multi-agent dialog. arXiv preprint arXiv:1706.08502.

Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. 2018. Emergence of linguistic communication from referential games with symbolic and pixel input. arXiv preprint arXiv:1804.03984.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2016. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182.

Fushan Li and Michael Bowling. 2019. Ease-of-teaching and language structure from emergent communication. In Advances in Neural Information Processing Systems, pages 15851-15861.

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. 2019. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, PMLR, pages 4114-4124.

Ryan Lowe, Jakob Foerster, Y-Lan Boureau, Joelle Pineau, and Yann Dauphin. 2019. On the pitfalls of measuring emergent communication. arXiv preprint arXiv:1903.05168.

Yi Ren, Shangmin Guo, Matthieu Labeau, Shay B Cohen, and Simon Kirby. 2020. Compositional languages emerge in a neural iterated learning model. arXiv preprint arXiv:2002.01365.
Appendix: hyper-parameters

For all experiments, results and error bars are reported using five runs per scenario. We pre-select which runs to use for reporting before running them.

6.1 Experiment 1

For experiment 1, we use a batch size of 100, and an embedding size of 50. RNNs are chosen to be GRUs. We query the teacher for utterances for 40% of the training meaning space each generation. We use an utterance length of 6, and a vocabulary size of 4.

6.2 Experiment 2

For experiment 2, we use the same architecture as (Lazaridou et al., 2018), with the exception that we add a max-pooling layer after the convolutional network layers, with kernel size 8 by 8, and we replace the stride-2 convolutional layers by stride-1 convolutional layers, followed by 2-by-2 max-pooling layers. We use entropy regularization for both the sender and receiver networks, as per (Lazaridou et al., 2018). At test time, we take the argmax, instead of sampling.

Other hyper-parameters were as follows:

• optimizer: RMSProp
• convolutional layers: 8
• batch size: 32
• no gradient clipping
• utterance length: 6
• utterance vocabulary size: 100
• embedding size: 50
• RNN type: GRU
• number of RNN layers: 1
• dropout: 0.5
• supervised training fraction: 0.4
• number of supervised training steps: 200k
• number of end-to-end training steps: 200k
• sender entropy regularization: 0.01
• receiver entropy regularization: 0.001