"This item is a glaxefw, and this is a glaxuzb": Compositionality Through Language Transmission, using Artificial Neural Networks
Hugh Perkins (hp@asapp.com)
ASAPP (https://asapp.com)
1 World Trade Center, NY 10007 USA

arXiv:2101.11739v1 [cs.CL] 27 Jan 2021

Abstract

We propose an architecture and process for using the Iterated Learning Model ("ILM") with artificial neural networks. We show that ILM does not lead to the same clear compositionality as observed using DCGs, but does lead to a modest improvement in compositionality, as measured by holdout accuracy and topologic similarity. We show that ILM can lead to an anti-correlation between holdout accuracy and topologic rho. We demonstrate that ILM can increase compositionality when using non-symbolic high-dimensional images as input.

1 Introduction

Human languages are compositional. For example, if we wish to communicate the idea of a 'red box', we use one word to represent the color 'red', and one to represent the shape 'box'. We can use the same set of colors with other shapes, such as 'sphere'. This contrasts with a non-compositional language, where each combination of color and shape would have its own unique word, such as 'aefabg'. That we use words at all is a characteristic of compositionality: we could alternatively use a unique sequence of letters or phonemes for each possible thought or utterance.

Compositionality provides advantages over non-compositional language. Compositional language allows us to generalize concepts such as colors across different situations and scenarios. However, it is unclear what concrete mechanism led to human languages being compositional. In laboratory experiments using artificial neural networks, languages emerging between multiple communicating agents show some small signs of compositionality, but do not show the clear compositional behavior that human languages show. (Kottur et al., 2017) shows that agents do not learn compositionality unless they have to. In the context of referential games, (Lazaridou et al., 2018) showed that agent utterances had a topographic rho of 0.16-0.26, on a scale of 0 to 1, even whilst showing a task accuracy in excess of 98%.

In this work, following the ideas of (Kirby, 2001), we hypothesize that human languages are compositional because compositional languages are highly compressible, and can be transmitted across generations most easily. We extend the ideas of (Kirby, 2001) to artificial neural networks, and experiment with using non-symbolic inputs to generate each utterance.

We find that transmitting languages across generations using artificial neural networks does not lead to such clearly visible compositionality as was apparent in (Kirby, 2001). However, we were unable to confirm the null hypothesis that ILM using artificial neural networks does not increase compositionality across generations: we find that objective measures of compositionality do increase over several generations, reaching a relatively modest plateau.

Our key contributions are:

• propose an architecture for using ILM with artificial neural networks, including with non-symbolic input
• show that ILM with artificial neural networks does not lead to the same clear compositionality as observed using DCGs
• show that ILM does lead to a modest increase in compositionality for neural models
• show that two measures of compositionality, i.e. holdout accuracy and topologic similarity, can correlate negatively, in the presence of ILM
• demonstrate an effect of ILM on compositionality for non-symbolic high-dimensional inputs

2 Iterated Learning Method

(Kirby, 2001) hypothesized that compositionality in language emerges because languages need to be easy to learn, in order to be transmitted between generations. (Kirby, 2001) showed that, using simulated teachers and students equipped with a context-free grammar, the transmission of a randomly initialized language across generations caused the emergence of an increasingly compositional grammar. (Kirby et al., 2008) showed evidence for the same process in humans, who were each tasked with transmitting a language to another participant, in a chain.

(Kirby, 2001) termed this approach the "Iterated Learning Method" (ILM). Learning proceeds in a sequence of generations. In each generation, a teacher agent transmits a language to a student agent. The student agent then becomes the teacher agent for the next generation, and a new student agent is created. A language G is defined as a mapping G: M → U from a space of meanings M to a space of utterances U. G can be represented as a set of pairs of meanings and utterances G = {(m_1, u_1), (m_2, u_2), ..., (m_n, u_n)}. Transmission from teacher to student is imperfect, in that only a subset G_train of the full language G is presented to the student. Thus the student agent must generalize from the seen meaning/utterance pairs {(m_i, u_i) | m_i ∈ M_train,t ⊂ M} to the unseen meanings {m_i | m_i ∈ M \ M_train,t}. We represent the mapping from meaning m_i to utterance u_i by the teacher as f_T(·); similarly, we represent the student agent as f_S(·).

In ILM, each generation proceeds as follows:

• draw a subset of meanings M_train,t from the full set of meanings M
• invention: use the teacher agent to generate utterances U_train,t = {u_i,t = f_T(m_i) | m_i ∈ M_train,t}
• incorporation: the student memorizes the teacher's mapping from M_train,t to U_train,t
• generalization: the student generalizes from the seen meaning/utterance pairs G_train,t to determine utterances for the unseen meanings M \ M_train,t

In (Kirby, 2001), the agents are deterministic sets of DCG rules; see Figure 1 for two examples. For each pair of meaning and utterance (m_i, u_i) ∈ G_train,t, if (m_i, u_i) is already produced by the existing grammar rules, then no learning takes place; otherwise, a new grammar rule is added that maps from m_i to u_i. Then, in the generalization phase, rules are merged, where possible, to form a smaller set of rules consistent with the set of meaning/utterance pairs seen during training, G_train,t. The generalization phase uses a complex set of hand-crafted merging rules.

Figure 1: Two example sets of DCG rules. Each set produces the utterance 'abc' when presented with the meanings (a0, b3).

    Grammar 1:   S: (a0, b3) → abc

    Grammar 2:   S: (x, y) → A:y B:x
                 A: b3 → ab
                 B: a0 → c

The initial language at generation t_0 is randomly initialized, such that each u_t,i is a random sequence of letters. The meaning space comprises two attributes, each having 5 or 10 possible values, giving a total meaning space of 5^2 = 25 or 10^2 = 100 possible meanings.

(Kirby, 2001) examined the compositionality of the language after each generation, by looking for common substrings in the utterances for each attribute. An example language is shown in Table 1. In this language, there are two meaning attributes, a and b, taking values {a0, ..., a4} and {b0, ..., b4}.

Table 1: Example language generated by Kirby's ILM.

          a0    a1      a2    a3    a4
    b0    qda   bguda   lda   kda   ixcda
    b1    qr    bgur    lr    kr    ixcr
    b2    qa    bgua    la    ka    ixca
    b3    qu    bguu    lu    ku    ixcu
    b4    qp    bgup    lp    kp    ixcp

For example, attribute a could be color, and a0 could represent 'red'; whilst b could be shape, and b3 could represent 'square'. Then the word for 'red square', in the example language shown, would be 'qu'. We can see that, in the example, attribute a0 is associated with the prefix 'q', whilst attribute b3 tends to be associated with the suffix 'u'. The example language thus shows compositionality.
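For concreteness, the generational loop can be sketched in Python as follows. This is a minimal sketch: the agent interface (speak and learn methods) and the sampling fraction are our own illustrative assumptions, standing in for the DCG machinery and hand-crafted merging rules of (Kirby, 2001).

    import random

    def iterated_learning(meanings, make_agent, n_generations, train_fraction=0.5):
        """Minimal ILM loop: each generation, a fresh student learns the
        teacher's language from a subset of meanings, then becomes the
        teacher for the next generation."""
        teacher = make_agent()
        for t in range(n_generations):
            # draw a subset of meanings M_train,t from the full meaning space M
            m_train = random.sample(meanings, int(train_fraction * len(meanings)))
            # invention: the teacher produces an utterance for each sampled meaning
            pairs = [(m, teacher.speak(m)) for m in m_train]
            # incorporation + generalization: the student learns from the pairs
            student = make_agent()
            student.learn(pairs)
            teacher = student
        return teacher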
(Kirby et al., 2008) extended ILM to humans. They observed that ILM with humans could lead to degenerate grammars, where multiple meanings mapped to identical utterances. However, they showed that pruning duplicate utterances from the results of the generation phase, prior to presentation to the student, was sufficient to prevent the formation of such degenerate grammars.

3 ILM using Artificial Neural Networks

We seek to extend ILM to artificial neural networks, for example using RNNs. Differently from the DCG in (Kirby, 2001), artificial neural networks generalize over their entire support, for each training example, and learning is in general lossy and imperfect.

In the case of using ANNs, we first need to consider how to represent a single 'meaning'. Considering the example language depicted in Table 1 above, we can represent each attribute as a one-hot vector, and represent the set of two attributes as the concatenation of two one-hot vectors. More generally, we can represent a meaning as a single real-valued vector m. In this work, we will use 'thought vector' and 'meaning vector' as synonyms for 'meaning', in the context of ANNs.

We partition the meaning space M into M_train and M_holdout, such that M = M_train ∪ M_holdout. We denote a subset of M_train at generation t by M_train,t.

3.1 Naive ANN ILM

A naive attempt to extend ILM to artificial neural networks (ANNs) is to simply replace the DCG in ILM with an RNN; see Figure 2. In practice, we observed that this formulation leads to a degenerate grammar, where all meanings map to a single identical utterance.

Figure 2: Naive ILM using Artificial Neural Networks

ANNs generalize naturally, but learning is lossy and imperfect. This contrasts with a DCG, which does not generalize by itself: in the case of a DCG, generalization is implemented by applying certain hand-crafted rules. With careful crafting of the generalization rules, the DCG will learn a training set perfectly, and degenerate grammars are rare. In the case of an ANN, the lossy teacher-student training progressively smooths the outputs; in the limit of training over multiple generations, an ANN produces the same output, independent of the input: a degenerate grammar.

The first two rows of Table 2 show results for two meaning spaces: 2 attributes each with 33 possible values (denoted 33^2), and 5 attributes each with 10 possible values (denoted 10^5). The column 'Uniq' is a measure of the uniqueness of utterances over the meaning space, where 0 means all utterances are identical, and 1 means all utterances are distinct. We can see that the uniqueness values are near zero for both meaning spaces.

We tried the approach of (Kirby et al., 2008) of removing duplicate utterances prior to presentation to the student. Results for 'Nodups' are shown in the last two rows of Table 2. The uniqueness improved slightly, but was still near zero. Thus, in our experiments, the approach of (Kirby et al., 2008) did not prevent the formation of a degenerate grammar when using ANNs.

Table 2: Results using the naive ANN ILM architecture. 'Nodups': remove duplicates; 'Uniq': uniqueness; ρ: topographic similarity (see later). The termination criterion for teacher-student training is 98% accuracy.

    Meaning space   Nodups   Uniq    ρ      accH
    33^2            -        0.024   0.04   0.05
    10^5            -        0.024   0.08   0
    33^2            yes      0.039   0.1    0
    10^5            yes      0.05    0.1    0
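For reference, the 'Uniq' measure can be computed as in the sketch below. The exact normalization is our assumption, since the text only pins down the two endpoints (0 when all utterances are identical, 1 when all are distinct).

    def uniqueness(utterances):
        """Uniqueness of utterances over the meaning space: 0 when all
        utterances are identical, 1 when all are distinct. Utterances are
        hashable, e.g. strings or tuples of tokens."""
        n = len(utterances)
        if n <= 1:
            return 1.0
        return (len(set(utterances)) - 1) / (n - 1)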
3.2 Auto-encoder to enforce uniqueness

To prevent the formation of degenerate grammars, we propose to enforce uniqueness of utterances by mapping the generated utterances back into meaning space, and using a reconstruction loss on the reconstructed meanings.

Using a meaning-space reconstruction loss requires a way to map from generated utterances back to meaning space. One way to achieve this could be to back-propagate from a generated utterance back onto a randomly initialized meaning vector. However, this requires multiple back-propagation iterations in general, and we found this approach to be slow. We choose instead to introduce a second ANN, which learns to map from discrete utterances back to meaning vectors. Our architecture is thus an auto-encoder: we call the decoder the 'sender', which maps from a thought vector into discrete language, and we term the encoder the 'receiver'. We equip each agent with both a sender and a receiver network; see Figure 3.

Figure 3: Agent sender-receiver architecture

3.3 Neural ILM Training Procedure

We denote the teacher sender network as f_T,send(·), the student receiver network as f_S,recv(·), and the student sender network as f_S,send(·). The output of f_·,send(·) is non-normalized logits, representing a sequence of distributions over discrete tokens; these logits can be converted into discrete tokens by applying an argmax.

For teacher-student training, we use the sender network of the teacher to generate a set of meaning-utterance pairs, which represents a subset of the teacher's language. We present this language to the student, and train both the sender and the receiver network of the student on this new language.

The ILM training procedure is depicted in Figure 4. A single generation, at step t, proceeds as follows:

• meaning sampling: sample a subset of meanings M_train,t = {m_t,0, ..., m_t,N} ⊂ M_train, where M_train is a subset of the space of all meanings, i.e. M_train = M \ M_holdout
• teacher generation: use the teacher sender network to generate the set of utterances U_t = {u_t,0, ..., u_t,N}
• student supervised training: train the student sender and receiver networks supervised, using M_train,t and U_t
• student end-to-end training: train the student sender and receiver networks end-to-end, as an auto-encoder

Figure 4: Neural ILM Training Procedure

For the teacher generation, each utterance u_t,n is generated as f_T,send(m_t,n).

For the student supervised training, we train the student sender network f_S,send(·) to generate U_t given M_t, and the student receiver network f_S,recv(·) to recover M_t given U_t. Supervised training for each network terminates after N_sup epochs, or once training accuracy reaches acc_sup.

The student supervised training serves to transmit the language from the teacher to the student. The student end-to-end training enforces uniqueness of utterances, so that the language does not become degenerate. In the end-to-end step, we iterate over multiple batches, where for each batch j we:

• sample a set of meanings M_train,t,j = {m_t,j,0, ..., m_t,j,N_batch} ⊂ M_train
• train the sender and receiver as an auto-encoder, using an end-to-end loss function L_e2e, with the meanings M_train,t,j as both the input and the target ground truth

End-to-end training runs for either N_e2e batches, or until the end-to-end training accuracy reaches the threshold acc_e2e.
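Putting the steps of a single generation together, a minimal sketch follows. The Agent container (with sender and receiver attributes) and the train_supervised / train_autoencoder helpers are hypothetical stand-ins for the training loops just described, not the paper's actual code.

    import torch

    def ilm_generation(teacher, make_student, meanings,
                       train_supervised, train_autoencoder):
        """One neural ILM generation: the teacher generates utterances for a
        sample of meanings, the student learns them supervised, then trains
        end-to-end as an auto-encoder to keep utterances unique."""
        # teacher generation: u_n = argmax over the logits of f_T,send(m_n)
        with torch.no_grad():
            utterances = teacher.sender(meanings).argmax(dim=-1)
        student = make_student()
        # student supervised training: sender maps M_t -> U_t,
        # receiver maps U_t -> M_t
        train_supervised(student.sender, inputs=meanings, targets=utterances)
        train_supervised(student.receiver, inputs=utterances, targets=meanings)
        # student end-to-end training: auto-encode meanings through discrete
        # utterances, which prevents the language becoming degenerate
        train_autoencoder(student.sender, student.receiver, meanings)
        return student  # the student becomes the teacher at generation t+1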
3.4 Non-symbolic input

In the general case, the meanings m can be presented as raw non-symbolic stimuli x. Each raw stimulus x can be encoded by some network into a thought vector m. We denote such an encoding network a 'perception' network. As an example of a perception network, an image could be encoded using a convolutional neural network.

This presents a challenge when training the receiver network. One possible architecture would be for the receiver network to generate the original input x. We choose instead to share the perception network between the sender and receiver networks in each agent. During supervised training of the sender, using the language generated by the teacher, we train the perception and sender networks jointly. To train the receiver network, we hold the perception network weights constant, and train the receiver network to predict the output of the perception network, given the input utterance u and the target stimulus x; see Figure 5. Note that by setting the perception network to the identity operator, we recover the earlier supervised training steps.

Figure 5: Generalized Neural ILM Supervised Training

For end-to-end training with non-symbolic input, we use a referential task, e.g. as described in (Lazaridou et al., 2018). The sender network is presented with the output of the perception network, m, and generates an utterance u. The receiver network chooses, from among distractors, a target image which matches the image presented to the sender. The target image that the receiver network perceives could be the original stimulus presented to the sender, or it could be a stimulus which matches the original image in concept, but is not the same stimulus: for example, two images could contain the same shapes, with the same colors, but in different positions. Figure 6 depicts the architecture, with a single distractor; in practice, multiple distractors are typically used.

Figure 6: End-to-end Referential Task for Non-symbolic Inputs, where x_s is the input stimulus presented to the sender, x_tgt is the target input stimulus, and x_distr1 is a distractor stimulus.
3.5 Discrete versus soft utterances

When we train a sender and receiver network end-to-end, we can put a softmax on the output of the sender network f_·,send, to produce a probability distribution over the vocabulary for each token. We can feed these probability distributions directly into the receiver network f_·,recv, and train using a cross-entropy loss. We denote this scenario SOFTMAX.

Alternatively, we can sample discrete tokens from categorical distributions parameterized by the softmax output. We train the resulting end-to-end network using REINFORCE, with a moving-average baseline and entropy regularization. This scenario is denoted RL.
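A minimal sketch of the RL variant's loss for a single sampled utterance is shown below; the default entropy weight here is illustrative (the values we actually used are listed in the appendix), and the baseline update scheme is likewise an assumption.

    import torch

    def reinforce_loss(logits, reward, baseline, entropy_weight=0.01):
        """REINFORCE loss for one sampled utterance. logits has shape
        [seq_len, vocab_size]; reward is the task reward (e.g. whether the
        receiver picked the correct image); baseline is a running average
        of recent rewards, used to reduce gradient variance."""
        dist = torch.distributions.Categorical(logits=logits)
        tokens = dist.sample()                  # discrete utterance tokens
        log_prob = dist.log_prob(tokens).sum()  # log p(utterance)
        advantage = reward - baseline
        # maximize advantage-weighted log-likelihood, plus entropy bonus
        loss = -advantage * log_prob - entropy_weight * dist.entropy().sum()
        return loss, tokens

The moving-average baseline itself can be updated after each batch, e.g. baseline = 0.99 * baseline + 0.01 * reward, where the decay constant is again an assumption.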
3.6 Evaluation of Compositionality

We wish to use objective measures of compositionality. This is necessary because the compositional signal is empirically relatively weak. We assume access to the ground truth for the meanings, and use two approaches: topographic similarity, ρ, as defined in (Brighton and Kirby, 2006) and (Lazaridou et al., 2018); and holdout accuracy, acc_H.

ρ is the correlation between distance in meaning space and distance in utterance space, taken across multiple examples. For the distance metric, we use the L0 distance, for both meanings and utterances. That is, in meaning space, the distance between 'red square' and 'yellow square' is 1, and the distance between 'red square' and 'yellow circle' is 2. In utterance space, the distance between 'glaxefw' and 'glaxuzg' is 3. Considered as an edit distance, we allow substitutions, but neither insertions nor deletions. For the correlation measure, we use Spearman's Rank Correlation.

acc_H shows the ability of the agents to generalize to combinations of shapes and colors not seen in the training set. For example, the training set might contain examples of 'red square', 'yellow square', and 'yellow circle', but not 'red circle'. If the utterances were perfectly compositional, both as generated by the sender and as interpreted by the receiver, then we would expect performance on 'red circle' to be similar to the performance on 'yellow circle'. The performance on the holdout set, relative to the performance on the training set, can thus be interpreted as a measure of compositionality.

Note that when there is just a single attribute, it is not possible to exclude any values from training, otherwise the model would never have been exposed to the value at all. Therefore acc_H is only a useful measure of compositionality when there are at least 2 attributes.

We observe that one key difference between ρ and acc_H is that ρ depends only on the compositional behavior of the sender, whereas acc_H depends also on the compositional behavior of the receiver. As noted in (Lowe et al., 2019), it is possible for utterances generated by a sender to exhibit a particular behavior or characteristic without the receiver making use of this behavior or characteristic.
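As a concrete reference, ρ can be computed as in the sketch below. This is a minimal version which assumes fixed-length utterances (utterance length is fixed in our experiments), so that the substitution-only edit distance reduces to a Hamming distance.

    from itertools import combinations
    from scipy.stats import spearmanr

    def topographic_similarity(meanings, utterances):
        """Spearman correlation between pairwise meaning distances and
        pairwise utterance distances. Meanings are attribute tuples;
        utterances are equal-length token sequences."""
        def hamming(a, b):
            # substitution-only edit distance between equal-length sequences
            return sum(x != y for x, y in zip(a, b))
        pairs = list(combinations(range(len(meanings)), 2))
        m_dist = [hamming(meanings[i], meanings[j]) for i, j in pairs]
        u_dist = [hamming(utterances[i], utterances[j]) for i, j in pairs]
        return spearmanr(m_dist, u_dist).correlation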
4 Related Work

Work on emergent communications was revived recently, for example by (Lazaridou et al., 2016) and (Foerster et al., 2016). CITE and CITE showed emergent communications in a 2d world (CITE). Several works investigate the compositionality of the emergent language, such as CITE, CITE, CITE. (Kottur et al., 2017) showed that agents do not generate compositional languages unless they have to. (Lazaridou et al., 2018) used a referential game with high-dimensional non-symbolic input, and showed the resulting languages contained elements of compositionality, measured by topographic similarity. (Bouchacourt and Baroni, 2018) caution that agents may not be communicating what we think they are communicating, by using randomized images, and by investigating the effect of swapping the target image. (Andreas et al., 2017) proposed an approach to learn to translate from an emergent language into a natural language. Obtaining compositional emergent language can be viewed as disentanglement of the agent communications. (Locatello et al., 2019) prove that unsupervised learning of disentangled representations is fundamentally impossible without inductive biases on both the considered learning approaches and the data sets.

Kirby pioneered ILM in (Kirby, 2001), extending it to humans in (Kirby et al., 2008). (Griffiths and Kalish, 2007) proved that, for Bayesian agents, the iterated learning method converges to a distribution over languages that is determined entirely by the prior, which is somewhat aligned with the result in (Locatello et al., 2019) for disentangled representations. (Li and Bowling, 2019), (Cogswell et al., 2020), and (Ren et al., 2020) extend ILM to artificial neural networks, using symbolic inputs. Symbolic input vectors are by nature themselves compositional: typically, the concatenation of one-hot vectors of attribute values, or of per-attribute embeddings (e.g. (Kottur et al., 2017)). Thus, these works show that given compositional input, agents can generate compositional output. In our work, we extend ILM to high-dimensional, non-symbolic inputs. A concurrent work (Dagan et al., 2020) also extends ILM to image inputs, and takes an additional step in examining the effect of genetic evolution of the network architecture, in addition to the cultural evolution of the language that we consider in our own work.

(Andreas, 2019) provides a very general framework, TRE, for evaluating compositionality, along with a specific implementation that relates closely to the language representations used in the current work. It uses a learned linear projection to rearrange tokens within each utterance, and a relaxation to enable the use of gradient descent to learn the projection. Due to time pressure, we did not use TRE in our own work.

Our work on neural ILM relates to distillation (Ba and Caruana, 2014; Hinton et al., 2015), in which a large teacher network distills knowledge into a smaller student network. More recently, (Furlanello et al., 2018) showed that when the student network has identical size and architecture to the teacher network, distillation can still give an improvement in validation accuracy, on a vision and a language model. Our work relates also to self-training (He et al., 2019), in which learning proceeds in iterations, similar to ILM generations.
5 Experiments

5.1 Experiment 1: Symbolic Input

5.1.1 Dataset construction

We conduct experiments first on a synthetic concept dataset, built to resemble that of (Kirby, 2001). We experiment with meanings comprising a attributes, where each attribute can take one of k values; the set of all possible meanings M thus comprises k^a unique meanings, and we use the notation k^a to describe such a meaning space. We reserve a holdout set H of 128 meanings, which will not be presented during training. This leaves (k^a - 128) meanings for training and validation. In addition, we remove from the training set any meanings having 3 or more attributes in common with any meaning in the holdout set.

We choose two meaning spaces: 33^2 and 10^5. 33^2 is constructed to be similar in nature to (Kirby, 2001), whilst being large enough to train an RNN without immediately over-fitting: with 33 possible values per attribute, the number of possible meanings increases from 10^2 = 100 to 33^2 ≈ 1,000. In addition to not over-fitting, this allows us to set aside a reasonable holdout set of 128 examples. We experiment in addition with a meaning space of 10^5, which has a total of 100,000 possible meanings. We hypothesized that the much larger number of meanings prevents the network from simply memorizing each meaning, and thus forces the network to adopt a more compositional representation.
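A minimal sketch of this construction follows; the random seed and the shuffle-then-split scheme are illustrative assumptions, not the paper's actual sampling code.

    import random
    from itertools import product

    def build_meaning_space(k, a, holdout_size=128, seed=0):
        """Build the k^a meaning space, reserve a random holdout set, and
        drop from the training set any meaning sharing 3 or more attribute
        values with any holdout meaning."""
        rng = random.Random(seed)
        meanings = list(product(range(k), repeat=a))
        rng.shuffle(meanings)
        holdout = meanings[:holdout_size]
        def too_close(m):
            # count attribute values shared with each holdout meaning
            return any(sum(x == y for x, y in zip(m, h)) >= 3 for h in holdout)
        train = [m for m in meanings[holdout_size:] if not too_close(m)]
        return train, holdout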
5.1.2 Experimental Setup

The model architecture for the symbolic concept task is that depicted in Figure 4. The sender model converts each meaning into a many-hot representation, of dimension k · a, then projects the many-hot representation into an embedding space.

5.1.3 Results

Table 3 shows the results for the symbolic concept task. We can see that ILM improves the topographic similarity measure, for both the 33^2 and 10^5 meaning spaces, and for both the SOFTMAX and RL links. Interestingly, in the 10^5 meaning space, the increase in compositionality as measured by ρ is associated with a decrease in acc_H, for both SOFTMAX and RL. This could potentially indicate that ILM induces the sender to generate more compositional output, but that the receiver's understanding of the utterances becomes less compositional, in this scenario. It is interesting that ρ and acc_H can be inversely correlated in certain scenarios; this aligns somewhat with the findings in (Lowe et al., 2019). Interestingly, it is also not clear that using a 10^5 meaning space leads to more compositional utterances than the much smaller 33^2 meaning space.

Table 3: Results using the auto-encoder architecture on the synthetic concepts dataset. 'M': meaning space; 'L': end-to-end link type; 'E2E Tgt': termination criterion ('target') for end-to-end training; ρ: topographic similarity. Where ILM is used, it is run for 5 generations.

    M      L         E2E Tgt   ILM?   accH              ρ
    33^2   SOFTMAX   e=100k    -      0.97 +/- 0.02     0.23 +/- 0.01
    33^2   SOFTMAX   e=100k    yes    0.984 +/- 0.002   0.30 +/- 0.02
    33^2   RL        e=500k    -      0.39 +/- 0.01     0.18 +/- 0.01
    33^2   RL        e=500k    yes    0.52 +/- 0.04     0.238 +/- 0.008
    10^5   SOFTMAX   e=100k    -      0.97 +/- 0.01     0.22 +/- 0.02
    10^5   SOFTMAX   e=100k    yes    0.56 +/- 0.06     0.28 +/- 0.01
    10^5   RL        e=500k    -      0.65 +/- 0.17     0.17 +/- 0.02
    10^5   RL        e=500k    yes    0.449 +/- 0.004   0.28 +/- 0.01

5.2 Experiment 2: Images

5.2.1 Dataset

In Experiment One, we conserved the type of stimuli used in prior work on ILM, e.g. (Kirby, 2001), using highly structured input. In Experiment Two, we investigate the extent to which ILM shows a benefit using unstructured high-dimensional input.
We used OpenGL to create scenes containing colored objects, of various shapes, in different positions. In the previous task, using symbolic meanings, we required the listener to reconstruct the symbolic meaning. In the case of images, we use a referential task, as discussed in Section 3.4. The advantage of using a referential task is that we do not require the agents to communicate the exact position and color of each object, just which shapes and colors are present. If the agents agree on an ordering over shapes, then the number of attributes to be communicated is exactly equal to the number of objects in the images. The positions of the objects are randomized, to add noise to the images, and we also varied the color of the ground plane over each image.

Example images are shown in Figure 7. Each example comprises 6 images: one sender image, the target receiver image, and 4 distractor images. Each object in a scene was a different shape, and we varied the colors and the positions of each object; each shape was unique within each image. Two images were considered to match if the sets of shapes were identical, and if the objects with the same shapes were identically colored. The positions of the objects were irrelevant for the purposes of judging whether the images matched. We change only a single color in each distractor, so that we force the sender and receiver to communicate all object colors, not just one or two.

We create three datasets, for sets of 1, 2 or 3 objects respectively. Each dataset comprises 4096 training examples, and 512 holdout examples. In the case of two shapes and three shapes, we create the holdout set by setting aside combinations of shapes and colors which are never seen in the training set. That is, the color 'red' might have been seen for a cube, but not for a cylinder. In the case of just one shape, this would mean that the color had never been seen at all; so, for a single shape, we relax this requirement, and just use unseen geometrical configurations in the holdout set.

The dataset is constructed using OpenGL and Python. The code will be made available at https://github.com/asappresearch/neural-ilm.
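A sketch of the matching criterion described above follows; the scene representation (a list of (shape, color, position) tuples) is an assumption for illustration, not the actual data format.

    def images_match(scene_a, scene_b):
        """Two scenes match if the sets of shapes are identical and objects
        of the same shape are identically colored; positions are ignored.
        Shapes are unique within a scene, so a shape -> color dict is
        well-defined."""
        colors_a = {shape: color for shape, color, _ in scene_a}
        colors_b = {shape: color for shape, color, _ in scene_b}
        return colors_a == colors_b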
5.2.2 Experimental setup

The supervised learning of the student sender and receiver from the teacher-generated language is illustrated in Figure 5. The referential task architecture is depicted in Figure 6. Owing to time pressure, we experimented only with using RL; we chose RL over SOFTMAX because we felt that RL is more representative of the discrete nature of natural languages.

5.2.3 Results

Table 4: Results for the OpenGL datasets. 'Shapes' is the number of shapes, and 'Batches' is the total number of batches. For ILM, three generations are used, and the batches per generation is the total number of batches divided by the number of generations.

    Shapes   ILM?   Batches   accH             Holdout ρ
    1        -      300k      0.76 +/- 0.11    0.55 +/- 0.03
    1        yes    300k      0.95 +/- 0.03    0.69 +/- 0.04
    2        -      600k      0.21 +/- 0.03    0.46 +/- 0.2
    2        yes    600k      0.30 +/- 0.06    0.64 +/- 0.05
    3        -      600k      0.18 +/- 0.01    0.04 +/- 0.02
    3        yes    600k      0.23 +/- 0.02    0.19 +/- 0.04
Figure 7: Example referential task images, one example per row. The sender image and the correct receiver image are the first two images in each row.

Table 4 shows the results using the OpenGL datasets. We can see that, when training using the RL scenario, ILM shows an improvement in both acc_H and ρ across all three datasets. The increase in topographic similarity is associated with an improvement in holdout accuracy, across all scenarios, similar to the 33^2 symbolic concepts scenario.

Figure 8: Examples of individual runs up to 10 generations. '1 ilm', '2 ilm', and '3 ilm' denote ILM over the one, two and three shape datasets respectively. 'e2e acc' denotes end-to-end training accuracy, 'e2e holdout acc' denotes end-to-end accuracy on the holdout set (accH), and 'e2e rho' denotes the topologic similarity of the generated utterances (ρ).

Figure 8 shows examples of individual runs. The plots within each row are for the same dataset, i.e. one shape, two shapes, or three shapes. The first column shows the end-to-end accuracy, the second column shows the holdout accuracy acc_H, and the third column shows the topologic similarity ρ. We note firstly that the variance across runs is high, which makes evaluating trends challenging. Results in the table above were reported using five runs per scenario, pre-selecting which runs to use prior to running them.

We can see that end-to-end training accuracy is good for the one and two shapes scenarios, but the model struggles to achieve high training accuracy on the more challenging three shapes dataset. The holdout accuracy similarly falls dramatically, relative to the training accuracy, as the number of shapes in the dataset increases. Our original hypothesis was that the more challenging dataset, i.e. three shapes, would be harder to memorize, and would thus lead to better compositionality. That the holdout accuracy actually gets worse, compared to the training accuracy, with more shapes, was surprising to us.

Similarly, the topologic similarity actually becomes worse as we add more shapes to the dataset. This seems unlikely to be simply because the receiver struggles to learn anything at all, since the end-to-end training accuracy stays relatively high across all three datasets. We note that the ILM effect is only apparent over the first few generations, reaching a plateau after around 2-3 generations.

6 Conclusion

In this paper, we proposed an architecture to use the iterated learning method ("ILM") for neural networks, including for non-symbolic high-dimensional input. We showed that using ILM with neural networks does not lead to the same clear compositionality as observed for DCGs. However, we showed that ILM does lead to a modest increase in compositionality, as measured by both holdout accuracy and topologic similarity. We showed that holdout accuracy and topologic rho can be anti-correlated with each other, in the presence of ILM; thus caution is warranted when using only a single one of these measures. We showed that ILM leads to an increase in compositionality for non-symbolic high-dimensional input images.

Acknowledgements

Thank you to Angeliki Lazaridou for many interesting discussions and ideas that I've tried to use in this paper.

References
Jacob Andreas. 2019. Measuring compositionality in representation learning. arXiv preprint arXiv:1902.07181.

Jacob Andreas, Anca Dragan, and Dan Klein. 2017. Translating neuralese. arXiv preprint arXiv:1704.06960.

Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems 27, pages 2654-2662. http://papers.nips.cc/paper/5484-do-deep-nets-really-need-to-be-deep.pdf.

Diane Bouchacourt and Marco Baroni. 2018. How agents see things: On visual representations in an emergent language game. arXiv preprint arXiv:1808.10696.

Henry Brighton and Simon Kirby. 2006. Understanding linguistic evolution by visualizing the emergence of topographic mappings. Artificial Life 12(2):229-242.

Michael Cogswell, Jiasen Lu, Stefan Lee, Devi Parikh, and Dhruv Batra. 2020. Emergence of compositional language with deep generational transmission.

Gautier Dagan, Dieuwke Hupkes, and Elia Bruni. 2020. Co-evolution of language and agents in referential games. arXiv preprint arXiv:2001.03361.

Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems 29, pages 2137-2145.

Tommaso Furlanello, Zachary C Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born again neural networks. arXiv preprint arXiv:1805.04770.

Thomas L Griffiths and Michael L Kalish. 2007. Language evolution by iterated learning with Bayesian agents. Cognitive Science 31(3):441-480.

Junxian He, Jiatao Gu, Jiajun Shen, and Marc'Aurelio Ranzato. 2019. Revisiting self-training for neural sequence generation.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Simon Kirby. 2001. Spontaneous evolution of linguistic structure: an iterated learning model of the emergence of regularity and irregularity. IEEE Transactions on Evolutionary Computation 5(2):102-110.

Simon Kirby, Hannah Cornish, and Kenny Smith. 2008. Cumulative cultural evolution in the laboratory: An experimental approach to the origins of structure in human language. Proceedings of the National Academy of Sciences 105(31):10681-10686.

Satwik Kottur, José M.F. Moura, Stefan Lee, and Dhruv Batra. 2017. Natural language does not emerge 'naturally' in multi-agent dialog. arXiv preprint arXiv:1706.08502.

Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. 2018. Emergence of linguistic communication from referential games with symbolic and pixel input. arXiv preprint arXiv:1804.03984.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2016. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182.

Fushan Li and Michael Bowling. 2019. Ease-of-teaching and language structure from emergent communication. In Advances in Neural Information Processing Systems, pages 15851-15861.

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. 2019. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, PMLR, pages 4114-4124.

Ryan Lowe, Jakob Foerster, Y-Lan Boureau, Joelle Pineau, and Yann Dauphin. 2019. On the pitfalls of measuring emergent communication. arXiv preprint arXiv:1903.05168.

Yi Ren, Shangmin Guo, Matthieu Labeau, Shay B Cohen, and Simon Kirby. 2020. Compositional languages emerge in a neural iterated learning model. arXiv preprint arXiv:2002.01365.
Appendix: hyper-parameters

For all experiments, results and error bars are reported using five runs per scenario. We pre-select which runs to use for reporting before running them.

6.1 Experiment 1

For experiment 1, we use a batch size of 100, and an embedding size of 50. RNNs are chosen to be GRUs. We query the teacher for utterances for 40% of the training meaning space each generation. We use an utterance length of 6, and a vocabulary size of 4.

6.2 Experiment 2

For experiment 2, we use the same architecture as (Lazaridou et al., 2018), with the exception that we add a max-pooling layer after the convolutional network layers, with kernel size 8 by 8, and we replace the stride-2 convolutional layers by stride-1 convolutional layers, followed by 2-by-2 max-pooling layers. We use entropy regularization for both the sender and receiver networks, as per (Lazaridou et al., 2018). At test time, we take the argmax, instead of sampling.

Other hyper-parameters were as follows:

• optimizer: RMSProp
• convolutional layers: 8
• batch size: 32
• no gradient clipping
• utterance length: 6
• utterance vocabulary size: 100
• embedding size: 50
• RNN type: GRU
• number of RNN layers: 1
• dropout: 0.5
• supervised training fraction: 0.4
• number of supervised training steps: 200k
• number of end-to-end training steps: 200k
• sender entropy regularization: 0.01
• receiver entropy regularization: 0.001