Identity-Based Patterns in Deep Convolutional Networks: Generative Adversarial Phonology and Reduplication
Gašper Beguš
University of California, Berkeley
begus@berkeley.edu

arXiv:2009.06110v2 [cs.CL] 17 Jul 2021

Abstract

This paper models unsupervised learning of an identity-based pattern (or copying) in speech called reduplication from raw continuous data with deep convolutional neural networks. We use the ciwGAN architecture (Beguš, 2021a), in which learning of meaningful representations in speech emerges from a requirement that the CNNs generate informative data. We propose a technique to wug-test CNNs trained on speech and, based on four generative tests, argue that the network learns to represent an identity-based pattern in its latent space. By manipulating only two categorical variables in the latent space, we can actively turn an unreduplicated form into a reduplicated form with no other substantial changes to the output in the majority of cases. We also argue that the network extends the identity-based pattern to unobserved data. Exploration of how meaningful representations of identity-based patterns emerge in CNNs, and of how latent space variables outside the training range correlate with identity-based patterns in the output, has general implications for neural network interpretability.

1 Introduction

The relationship between symbolic representations and connectionism has been subject to ongoing discussion in computational cognitive science. Phonology offers a unique testing ground in this debate, as it is concerned with the first discretization that human language users perform: from continuous phonetic data to discretized mental representations of meaning-distinguishing sounds called phonemes.

Identity-based patterns, repetition, or copying have been at the center of this debate (Marcus et al., 1999). Reduplication is a morphophonological process in which phonological content (phonemes) gets copied from a word (also called the base) with some added meaning (Inkelas and Zoll, 2005; Urbanczyk, 2017). It can be total, which means that all phonemes in a word get copied (e.g. /pula/ → [pula-pula]), or partial, where only a subset of segments gets copied (e.g. /pula/ → [pu-pula]).

Reduplication is indeed unique among processes in natural language because combining learned entities based on training data distributions does not yield the desired outputs. For example, a learner can be presented with pairs of bare and reduplicated words, such as /pala/ ∼ /papala/ and /tala/ ∼ /tatala/. The learner can then be tested on providing a reduplicated variant of a novel unobserved item with an initial sound /k/ that they have not been exposed to (e.g. /kala/). If the learner learns the reduplication pattern, they will output [kakala]. If the learner simply learns that /pa/ and /ta/ are optional constituents that can be attached to words based on the data distribution, they will output [pakala] or [takala]. Reduplication is thus an identity-based pattern (similar to copying), which is computationally more challenging to learn (Gasser, 1993), both in connectionist (Brugiapaglia et al., 2020) and non-connectionist frameworks (Savitch, 1989; Dolatian and Heinz, 2020). In /k_i a_j k_i a_j la/, the two sounds in the reduplicative morpheme, /k_i/ and /a_j/, need to be in an identity relationship with the first two segments of the base, /k_i/ and /a_j/: the learner needs to copy rather than recombine learned elements.
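As a minimal illustration of this contrast (not part of the paper's models; the functions below are a hypothetical toy over orthographic strings), a copying rule generalizes to the unseen onset /k/, while a learner that merely re-attaches memorized prefixes does not:

```python
# Toy contrast between an identity-based (copying) rule and a purely
# distributional strategy, using orthographic stand-ins for phonemes.

def reduplicate(word: str) -> str:
    """Partial CV reduplication: copy the word's first consonant+vowel."""
    return word[:2] + word

def attach_memorized_prefix(word: str) -> str:
    """Recombination: prepend a prefix observed in training (/pa/ or /ta/)."""
    observed_prefixes = ["pa", "ta"]
    return observed_prefixes[0] + word

print(reduplicate("pala"))              # papala (matches the training pairs)
print(reduplicate("kala"))              # kakala (correct generalization)
print(attach_memorized_prefix("kala"))  # pakala (recombination, not copying)
```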
Marcus et al. (1999) argue that connectionist models such as simple neural networks are unable to learn a simple reduplication pattern that 7-month-old human infants are able to learn (see also Gasser 1993). According to Marcus et al. (1999), the behavioral outcomes of their experiments cannot be modeled by simple counting, attention to statistical trends in the input, attention to repetition, or connectionist (simple neural network) computational models. Instead, they argue, the results support the claim that human infants employ "algebraic rules" (Marcus et al., 1999; Marcus, 2001; Berent, 2013) to learn reduplication patterns (for a discussion, see, among others, McClelland and Plaut 1999; Endress et al. 2007).
With the development of neural network architectures, several studies revisited the claim that neural networks are unable to learn reduplicative patterns (Alhama and Zuidema, 2018; Prickett et al., 2018; Nelson et al., 2020; Brugiapaglia et al., 2020), arguing that identity-based patterns can indeed be learned with more complex architectures. [Footnote 1: Wilson (2018, 2020) proposes another approach that allows modeling reduplication. For a non-connectionist computational model of reduplication, see Dolatian and Heinz (2018, 2020).] All these computational experiments, however, operate on an already discretized level, and most of them model reduplication with supervised learning.

Examples like [pu-pula] and [pula-pula] are discretized representations of reduplication, using characters to represent continuous sounds. Most, if not all, computational models of reduplication, to the author's knowledge, model reduplication as character or feature manipulation: the inputs to the models are either characters representing phones or phonemes, or discrete binary featural representations of phonemes. For example, a seq2seq model treats reduplication as a pairing between an input unreduplicated sequence of "characters" (such as /tala/) and an output, a reduplicated sequence (such as /tatala/). Already abstracted and discretized phonemes or "characters", however, are not the primary input to language-learning infants. The primary linguistic data of most hearing infants is raw continuous speech. Hearing infant learners need to acquire reduplication from continuous speech data that is substantially more complex than already discretized characters or binary features.

Furthermore, most of the existing models of reduplication are also supervised. Seq2seq models, for example, are fully supervised: the training consists of pairs of unreduplicated (input) and reduplicated (output) strings of characters or binary features. While performance can be tested on unobserved data or even on unobserved segments, the training is nevertheless supervised. Human language learners do not have access to input-output pairings: they are only presented with positive, surface, and continuous acoustic data.
While equivalents of copying/identity-based patterns can be constructed in the visual domain, we are not aware of studies that test identity-based visual patterns with deep convolutional neural networks.

In this paper, we model reduplication, one of the computationally most challenging processes, from raw unlabeled acoustic data with deep convolutional networks in the GAN framework. The advantage of the GAN framework for cognitive modeling is that the network has to learn to output raw acoustic data from a latent noise distribution without directly accessing the training data. We argue that CNNs discretize continuous phonetic data and encode linguistically meaningful units into individual latent variables. The emergence of a discretized representation of an identity-based pattern (reduplication) is induced by a model which forces the Generator network to output informative data (ciwGAN; Section 4). Additionally, we add an inductive bias towards symbolic-like representations by binarizing the code variables with which the Generator encodes meaningful representations. We also test whether a deep convolutional network learns reduplication without the two inductive biases (without the requirement on the Generator to output informative data and without binarization of the latent space) in the bare WaveGAN architecture (Section 5).

The experiments bear implications for the discussion between symbolic and connectionist approaches to language modeling by testing the emergence of rule-like symbolic representations within the connectionist framework from raw speech in an unsupervised manner. Results of the experiments suggest that both models, ciwGAN and WaveGAN, learn the identity-based patterns, but the inductive biases for informative representation and binarization facilitate learning and yield better results. We discuss properties of symbolic-like representations and how they emerge in the models: discretization, causality (in the sense that manipulation of individual elements results in the desired outcome), and categoricity (for discussions of these and other aspects of the debate, see Rumelhart et al. 1986; McClelland et al. 1986; Fodor and Pylyshyn 1988; Minsky 1991; Dyer 1991; Marcus et al. 1999; Marcus 2001; Manning 2003; Berent 2013; Maruyama 2021).
How can we test learning of reduplication in a deep convolutional network that is trained only on raw positive data? We propose a technique to test the ability of the Generator to produce forms absent from the training data set. For example, we train the networks on acoustic data of items such as /pala/ ∼ /papala/ and /tala/ ∼ /tatala/, but test reduplication on acoustic forms of items such as /sala/, which is never reduplicated in the training data. Using the technique proposed in Beguš (2020), we can identify latent variables that correspond to some phonetic or phonological representation such as reduplication. By manipulating and interpolating a single latent variable, we can actively generate data with and without reduplication. In fact, we can observe a direct relationship between a single latent variable (out of 100) and reduplication, which with interpolation gradually disappears from the output. Additionally, we can identify latent variables that correspond to [s] in the output. By forcing both reduplication and [s] in the output through latent space manipulation, we can "wug-test" the network's learning of reduplication on unobserved data. In other words, we can observe what the network outputs if we force it to produce reduplication and an [s] at the same time. A comparison of generated outputs with human outputs that were withheld from training reveals a high degree of similarity. We perform an additional computational experiment to replicate the results of the first experiment (from Section 4). In the replication experiment, evidence for learning of the reduplicative pattern also emerges. To the author's knowledge, this is the first attempt to model reduplication with neural network architectures trained on raw acoustic speech data.
The computational experiments reveal another property of representation learning in deep neural networks: we argue that the network extracts information from the training data and represents a continuous acoustic identity-based pattern with a discretized representation. Out of 100 variables, the network encodes reduplication with one or two variables, which is suggested by the fact that a small subset of variables is substantially more strongly correlated with the presence of reduplication. In other words, there is a near categorical drop in regression estimates between one variable and the rest of the latent space. Setting the identified variables to values well beyond the training range results in near categorical presence of the desired property in the output. This technique (proposed for non-identity-based patterns in Beguš 2020) allows us to directly explore how the networks encode dependencies in data, their underlying values, and interactions between variables, and thus to get a better understanding of how exactly deep convolutional networks encode meaningful representations.

Recent developments in zero-resource speech modeling (Dunbar et al., 2017, 2019, 2020) enable modeling of speech processes in an unsupervised manner from raw acoustic data. Several proposals exist for modeling unsupervised lexical learning (Kamper et al., 2014; Lee et al., 2015; Chung et al., 2016), including generative models such as variational autoencoders (Chung et al., 2016; Baevski et al., 2020; Niekerk et al., 2020) and GANs (Beguš, 2021a). This framework allows not only unsupervised lexical term discovery, but also phone-level identification (Eloff et al., 2019; Shain and Elsner, 2019; Chung et al., 2016; Chorowski et al., 2019). While zero-resource speech modeling has yielded promising results in unsupervised labeling, the proposals generally do not model phonological or morphophonological processes. This paper thus also tests the applicability of the unsupervised speech processing framework for cognitive modeling and network interpretability.

2 Model

Generative Adversarial Networks (Goodfellow et al., 2014) are a neural network architecture with two main components: the Generator network and the Discriminator network. The Generator is trained on generating data from some latent space that is randomly distributed. The Discriminator takes real training data and the Generator's outputs and estimates which inputs are real and which are generated. The minimax training, where the Generator is trained on maximizing the Discriminator's error rate and the Discriminator is trained on minimizing its own error rate, results in the Generator outputting data such that the Discriminator's success in distinguishing them from real data is low.
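Schematically (a standard formulation, not reproduced from the paper), the Generator G and Discriminator D play the minimax game of Goodfellow et al. (2014); the WaveGAN-based models used here instead train the Discriminator as a critic estimating the Wasserstein distance (Arjovsky et al., 2017; see below):

```latex
% Standard GAN minimax objective (Goodfellow et al., 2014):
\min_G \max_D V(D,G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)]
  + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

% Wasserstein variant used in WaveGAN-based models (Arjovsky et al., 2017),
% where \mathcal{D} is the set of 1-Lipschitz functions:
\min_G \max_{D \in \mathcal{D}}
  \mathbb{E}_{x \sim p_{\mathrm{data}}}[D(x)]
  - \mathbb{E}_{z \sim p_z}[D(G(z))]
```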
It has been shown that GANs not only learn to produce innovative data that resemble speech, but also learn to encode phonetic and phonological representations in the latent space (Beguš, 2020). The major advantage of the GAN architecture for modeling speech is that the Generator network does not have direct access to the training data and is not trained on replicating data (unlike in the autoencoder architecture; Räsänen et al. 2016; Eloff et al. 2019; Shain and Elsner 2019). Instead, the network has to learn to generate data from noise in a completely unsupervised manner, without ever directly accessing the training data.

In the first experiment, we use the ciwGAN (Categorical InfoWaveGAN) model proposed in Beguš (2021a). The ciwGAN model combines the WaveGAN and InfoGAN architectures. WaveGAN, proposed by Donahue et al. (2019), is a Deep Convolutional Generative Adversarial Network (DCGAN; proposed by Radford et al. 2016) adapted for time-series audio data. The basic architecture is the same as in DCGAN, the main difference being that in the WaveGAN proposal the deep convolutional networks take one-dimensional time-series data as inputs or outputs. The structures of the Generator and the Discriminator networks in the ciwGAN architecture are taken from Donahue et al. (2019). InfoGAN (Chen et al., 2016) is an extension of the GAN architecture that aims to maximize mutual information between the latent space and generated outputs. The Discriminator/Q-network learns to retrieve the Generator's latent categorical or continuous codes (Chen et al., 2016) in addition to estimating the realness of generated outputs and real training data.

Beguš (2021a) proposes a model that combines these two proposals and introduces a new latent space structure (in the fiwGAN architecture). Because we are primarily interested in a simple binary classification between bare and reduplicated forms, we use the ciwGAN variant of the proposal. The model introduces a separate deep convolutional Q-network that learns to retrieve the Generator's internal representations. Separating the Discriminator and the Q-network into two networks is advantageous from the cognitive modeling perspective: the architecture features a separate network that models speech production (the Generator) and a separate network that models speech categorization (the Q-network). The latter introduces an inductive bias that forces the Generator to output informative data and encode linguistically meaningful properties into its code variables. The network learns to generate data such that, by manipulating these code variables, we can force the desired linguistic property in the output (Beguš, 2021a).

The architecture involves three networks: a Generator that takes latent codes (a one-hot vector) and uniformly distributed z-variables and generates waveforms, a Discriminator that distinguishes real from generated outputs, and a Q-network that takes generated outputs and estimates the latent code (one-hot vector) used by the Generator. More specifically, the Generator network is a deep convolutional network that takes as its input 100 latent variables (see Figure 1). [Footnote 2: The number of latent variables was adopted from Radford et al. (2016) and Donahue et al. (2019). Probing how the number of z-variables affects learning of speech representations is left for future work.] Two of the 100 variables are code variables (c1 and c2) that constitute a one-hot vector. The remaining 98 z-variables are uniformly distributed on the interval (−1, 1). The Generator learns to take as input the 2 code variables and the 98 latent variables and to output 16384 samples that constitute just over one second of audio sampled at 16 kHz, through five convolutional layers. The Discriminator network takes real and generated data (again in the form of 16384 samples constituting just over one second of audio) and learns to estimate the Wasserstein distance between generated and real data (according to the proposal in Arjovsky et al. 2017) through five convolutional layers. In the majority of InfoGAN proposals, the Discriminator and the Q-network share convolutions. Beguš (2021a) introduces a separate Q-network (also in Rodionov 2018). [Footnote 3: For all details about the architecture, see Beguš (2021a).]

The Q-network is in its structure identical to the Discriminator network, but its final layer is fully connected to nodes that correspond to the number of categorical variables (Beguš, 2021a). In the ciwGAN architecture, the Q-network is trained on estimating the latent code variables with a softmax function (Beguš, 2021a). In other words, the Q-network takes the Generator's outputs (waveforms) and estimates the Generator's latent code variables c1 and c2.
The weights of both the Generator network and the Q-network are updated according to the Q-network's loss function: the cross-entropy between the actual one-hot vector (c1 and c2) used by the Generator and the one-hot vector estimated with a softmax in the Q-network's final layer is minimized. This forces the Generator to output informative data.
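The following is a schematic numpy sketch of the latent space and the Q-network's cross-entropy objective as described above; it is an illustration only (the released implementation is a TensorFlow model, and q_logits below is a random stand-in for the Q-network's output on generated audio):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(batch, n_cat=2, n_z=98):
    """ciwGAN-style latent input: a one-hot code (c1, c2) plus z ~ U(-1, 1)."""
    code = np.eye(n_cat)[rng.integers(0, n_cat, batch)]
    z = rng.uniform(-1.0, 1.0, (batch, n_z))
    return np.concatenate([code, z], axis=1), code  # 100 variables in total

def q_loss(code, q_logits):
    """Cross-entropy between the Generator's code and the Q-network's softmax
    estimate; both G and Q are updated to minimize it, which forces the
    Generator to make the code recoverable from its waveform outputs."""
    p = np.exp(q_logits - q_logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)  # softmax over the code nodes
    return float(-np.mean(np.sum(code * np.log(p + 1e-9), axis=1)))

latent, code = sample_latent(batch=4)
q_logits = rng.normal(size=(4, 2))  # stand-in for Q(G(latent))
print(q_loss(code, q_logits))
```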
The advantage of the ciwGAN architecture is that the network not only learns to output innovative data that resemble speech in the input, but also provides meaningful representations of the data in an unsupervised manner. For example, as will be argued in Section 4, the ciwGAN network encodes reduplication as a meaningful category: it learns to assign a unique code to bare and to reduplicated items. This encoding emerges in an unsupervised fashion from the requirement that the Generator output data such that unique information is retrievable from its acoustic outputs. Given the structure of the training data, the Generator is most informative if it encodes the presence of reduplication in the code variables.

[Figure 1: (left) The ciwGAN architecture as proposed in Beguš (2021a) and used in this paper, with training data as described in Section 3. (right) The structure of the Generator in the ciwGAN architecture as proposed in Beguš (2021a) (based on Donahue et al. 2019).]

To replicate the results and to test learning of an identity-based pattern without binarization and without the requirement on the Generator to output informative data, we run an independent experiment on the bare WaveGAN (Donahue et al., 2019) architecture using the same training data. The difference between the two architectures is that the bare GAN architecture does not involve a Q-network, and its latent space includes only latent variables uniformly distributed on the interval (−1, 1).

Beguš (2020) and Beguš (2021a) also propose a technique for latent space interpretability in GANs: manipulating individual variables to values well beyond the training range can reveal underlying representations of different parts of the latent space. We use this technique throughout the paper to evaluate learning of reduplication.

3 Reduplication in training data

The training data was constructed to test a simple reduplication pattern, common in human languages: partial CV reduplication, found in languages such as Paamese, Roviana, and Tawala, among others (Inkelas and Zoll, 2005). Base items are of the shape C1V2C3V4 (C = consonant; V = vowel), e.g. /tala/. Reduplicated forms are of the shape C1V2C1V2C3V4, where the first syllable (C1V2) is repeated. The items were constructed so that C1 contains a voiceless stop /p, t, k/, a voiced stop /b, d, g/, the labiodental voiced fricative /v/, or a nasal /m, n/. The vowels V2 and V4 consist of /ɑ (ə), i, u/. C3 consists of /l, ɹ, j/. All permutations of these elements were created. Stress was always placed on V2 in the base forms and on the same syllable in reduplicated forms ([ˈpʰɑlə] ∼ [pəˈpʰɑlə]). Because the reader of the training data was a speaker of American English, the training data is phonetically even more complex. The major phonetic effects in the training data include (i) reduction of the vowel in the unstressed reduplicated forms and in the final syllable (e.g. from [ɑ] to [ʌ/ə]) and (ii) deaspiration of voiceless stops in the unstressed reduplication syllable (e.g. from [pʰ] to [p]). The training data includes two unique repetitions of each item and two repetitions of the corresponding reduplicated forms. Table 1 illustrates the training data.
The training data also includes base forms C1V2C3V4 with the initial consonant C1 being the fricative [s]. These items, however, always appear unreduplicated in the training data; the purpose of the [s]-initial items is to test how the network extends the reduplicative pattern to novel unobserved data. All 27 permutations of sV2C3V4 were included.
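For concreteness, the item inventory can be reconstructed as follows (a sketch with orthographic stand-ins for the IPA symbols; the counts match the figures reported in this section):

```python
from itertools import product

C1 = ["p", "t", "k", "b", "d", "g", "v", "m", "n"]
V = ["a", "i", "u"]        # stand-ins for /ɑ, i, u/
C3 = ["l", "r", "j"]       # "r" stands in for /ɹ/

bases = ["".join(s) for s in product(C1, V, C3, V)]            # C1 V2 C3 V4
bases = [b for b in bases if b[:2] not in ("ti", "tu", "ki")]  # excluded items
redup = [b[:2] + b for b in bases]                             # C1 V2 C1 V2 C3 V4
s_bases = ["s" + "".join(s) for s in product(V, C3, V)]        # never reduplicated

# 2 repetitions of each base + 2 of each reduplicated form + 132 [s]-items:
print(len(s_bases), 2 * len(bases) + 2 * len(redup) + 132)     # 27, 996
```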
To increase the representation of [s]-initial words, four or five repetitions of each unique [s]-initial base were used in training. [Footnote 4: Items [ˈsala], [ˈsuru], and [ˈsuju] each miss one repetition (four altogether).] Altogether, 132 repetitions of the 27 unique unreduplicated words with an initial [s] were used in training.

                   Shape            Example
  voiceless C1     C1V2C3V4         [ˈpʰɑli]
                   C1V2C1V2C3V4     [pʌˈpʰɑli]
  voiced C1        C1V2C3V4         [ˈbɑli]
                   C1V2C1V2C3V4     [bʌˈbɑli]
  C1 = [m, n, v]   C1V2C3V4         [ˈmɑli]
                   C1V2C1V2C3V4     [mʌˈmɑli]
  C1 = [s]         C1V2C3V4         [ˈsɑli]
                   C1V2C1V2C3V4     —

[Table 1: A schematic illustration of the training data in the International Phonetic Alphabet.]

The sibilant fricative [s] was chosen as C1 for testing learning of reduplication because its frication noise is acoustically prominent and sufficiently different from the C1s in the training data both acoustically and phonologically. This satisfies the requirement that a model learn to generalize to novel segments and feature values (Berent, 2013; Prickett et al., 2018). [Footnote 5: For an "across the board" generalization, Berent (2013) requires that generalization occur to segments fully absent from the inventory. It is challenging to elicit reduplication of segments that are fully absent from the training data in the proposed models. Even in human subject experiments testing the "across the board" generalization, subjects need to be exposed to the novel segment at least as a prompt. In our case, the novel segment needs to be part of the training data, but only in unreduplicated forms.] In phonological terms, the model is tested on a novel feature (sibilant fricative or [±strident]; Hayes 2009): the training data did not contain any bare or reduplicated forms with other sibilant fricatives. To make the learning even more complex, voiceless fricatives ([f, θ, ʃ]) are altogether absent from the training data. All voiced fricatives except for [v] are absent too. The spectral properties of the voiced non-sibilant fricative [v] in the training data (and in Standardized American English in general) are so substantially different from those of the voiceless sibilant fricative [s] that we kept those items in the training data. We excluded all items with the initial sequences /ti/, /tu/, and /ki/ from the training data, because the acoustic properties of these sequences, especially the frication of the aspiration of /t/ and /k/, are similar to those of the frication noise in /s/. Altogether, 996 unique sliced items used in training were recorded in a sound-attenuated booth by a female speaker of American English with a MixPre 6 (SoundDevices) preamp/recorder and an AKG C544L head-mounted microphone.
4 CiwGAN (Beguš, 2021a)

The Generator features two latent code variables, c1 and c2, and 98 uniformly distributed variables z (Figure 1). In the training phase, the two code variables (c1 and c2) compose a one-hot vector with two levels: [0, 1] and [1, 0]. This means that the network can encode two categories in its latent space structure that correspond to some meaningful feature of the data. The Q-network forces the Generator to encode information in its latent space. In other words, the loss function of the Q-network forces the Generator to output data such that the Q-network is effective in retrieving the latent code c1 and c2 from the Generator's acoustic outputs alone. Nothing in the training data pairs base and reduplicated forms. There is no overt connection between the bases and their reduplicated correspondents. Yet the structure of the data is such that, given two categories, the most informative way for the Generator to encode unique information in its acoustic outputs is to associate one unique code with base forms and another with reduplicated forms. The Generator would thus have a meaningful unique representation of reduplication that arises in an unsupervised manner, exclusively from the requirement on the Generator to output informative data.

To test whether the Generator encodes reduplication in the latent codes, we train the network for 15,920 steps (approximately 5,114 epochs) with the data described in Section 3. The choice of the number of steps is based on two objectives: first, the output data should approximate speech to a degree that allows acoustic analysis; second, the Generator network should not be trained to the degree that it replicates the data completely. As such, overfitting rarely occurs in the GAN architecture (Adlam et al., 2019; Donahue et al., 2019). The best evidence against overfitting in the ciwGAN architecture comes from the fact that the Generator outputs data that violate training distributions substantially (see Section 4.2 below) (Beguš, 2021a,b). Despite these guidelines, the choice of the number of steps is somewhat arbitrary (for discussion, see Beguš 2020).
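Schematically, the generative test can be set up as follows (a sketch; generator is a hypothetical stand-in for the trained ciwGAN Generator, which maps a batch of 100 latent variables to 16384-sample waveforms):

```python
import numpy as np

rng = np.random.default_rng(1)

def latent_with_code(code, n=100, n_z=98):
    """A batch of latent vectors with the categorical code pinned to a fixed
    value, either within the training range ([1, 0], [0, 1]) or beyond it
    ([5, 0], [0, 5]), while the z-variables are sampled as in training."""
    c = np.tile(np.asarray(code, dtype=float), (n, 1))
    z = rng.uniform(-1.0, 1.0, (n, n_z))
    return np.concatenate([c, z], axis=1)

for code in ([1, 0], [0, 1], [5, 0], [0, 5]):
    batch = latent_with_code(code)   # shape (100, 100)
    # audio = generator(batch)       # annotate outputs for reduplication
```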
We generate 100 outputs for each latent code, [0, 1] and [1, 0] (200 in total), and annotate them for presence or absence of reduplication. All annotations here and in other sections are performed by the author in Praat (Boersma and Weenink, 2015). Distinguishing unreduplicated from reduplicated outputs is very salient; for less salient annotations, we provide waveforms and spectrograms (e.g. Figures 4 and 6). [Footnote 6: The code is available at https://github.com/gbegus/fiwGAN-ciwGAN. The generated data and checkpoints are available at https://doi.org/10.17605/osf.io/zbjcp.]

There is a significant correlation between the two levels of the latent code and the presence of reduplication. Counts are given in Table 2. When the code is set to [1, 0], 78% of the generated outputs are base forms; when set to [0, 1], 60% of the outputs are reduplicated (odds ratio = 5.27, p < 0.0001, Fisher Exact Test). When the latent codes are set to [0, 5] and [5, 0], we get a near categorical distribution of bare and reduplicated forms. For [5, 0], the Generator outputs an unreduplicated bare form in 98% of samples. For [0, 5], it outputs a reduplicated form in 87% of outputs (odds ratio = 308.3, p < 0.0001, Fisher Exact Test). These outcomes suggest that the Generator encodes reduplication in its latent codes, and they again confirm that manipulating latent variables well beyond the training range reveals the underlying learned representations in deep convolutional networks (as proposed in Beguš 2020; Beguš 2021a).

  Code     Bare   Redup.   % Redup.
  [1, 0]   78     22       22%
  [0, 1]   40     60       60%
  [5, 0]   98     2        2%
  [0, 5]   13     87       87%

[Table 2: Counts of bare and reduplicated (redup.) outputs when the latent codes c1 and c2 are set to [1, 0], [0, 1], [5, 0], and [0, 5].]
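The reported statistics can be checked against the counts in Table 2; a sketch with scipy (note that scipy's fisher_exact returns the sample odds ratio, whereas the values in the text match R's fisher.test, which reports the conditional maximum-likelihood estimate):

```python
from scipy.stats import fisher_exact

# Rows: latent code [1, 0] vs. [0, 1]; columns: (bare, reduplicated).
odds, p = fisher_exact([[78, 22], [40, 60]])
print(odds, p)   # sample OR ~ 5.32 (text: conditional MLE 5.27), p < 0.0001

# Rows: latent code [5, 0] vs. [0, 5].
odds, p = fisher_exact([[98, 2], [13, 87]])
print(odds, p)   # sample OR ~ 328 (text: conditional MLE 308.3), p < 0.0001
```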
4.1 Interpolation

That the Generator uses the latent codes to encode reduplication is further suggested by another generative test, performed on interpolated values of the latent code. To test how exactly the relationship between the latent codes (c1 and c2) works, we created sets of generated outputs based on interpolated values of the codes c1 and c2. We manipulate c1 and c2 from the value 1.5 towards 0 in increments of 0.125. For example, we start with [1.5, 0] and interpolate first to [0, 0] (e.g. [1.375, 0], [1.25, 0], etc.). From [0, 0] we further interpolate in increments of 0.125 to [0, 1.5] (e.g. [0, 0.125], [0, 0.25]). All other variables in the latent space are kept constant across all interpolated values. Each such set thus contains 25 generated samples.
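The interpolation path can be written out as follows (a sketch of the grid described above; the remaining 98 z-variables are held fixed within each set):

```python
import numpy as np

def code_interpolation(step=0.125, top=1.5):
    """Path [1.5, 0] -> [0, 0] -> [0, 1.5] in increments of 0.125."""
    down = [(float(v), 0.0) for v in np.arange(top, -1e-9, -step)]    # 13 settings
    up = [(0.0, float(v)) for v in np.arange(step, top + 1e-9, step)] # 12 settings
    return down + up

codes = code_interpolation()
print(len(codes), codes[0], codes[12], codes[-1])
# 25 (1.5, 0.0) (0.0, 0.0) (0.0, 1.5)
```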
We generate 100 such sets (altogether 2,500 outputs) and analyze each output. In 55 of the 100 sets, the output was either bare or reduplicated throughout the interpolated values and did not change. As suggested by Section 4 and Table 2, the number of bare and reduplicated forms for each level rises to near categorical values as the variables approach values of 5.

In the other 45/100 sets, the output changes from the base form to a reduplicated form at some point as the codes are interpolated. If the network had only learned to randomly associate base and reduplicated forms with each endpoint of the latent code, we would expect base forms to be unrelated to reduplicated forms. For example, a base form [ˈkʰulu] could turn into reduplicated [dəˈdɑlə]. An acoustic analysis of the generated sets, however, suggests that the latent code directly corresponds to reduplication. In approximately 25 out of the 45 sets (55.6%) of generated outputs that undergo the change from base to reduplicated form (or 25% of the total sets), the base form is identical to the reduplicated form, the only major difference between the two being the presence of reduplication (waveforms and spectrograms of the 25 outputs are in Figure 6). This proportion would likely be even higher with a higher interpolation resolution (higher than 0.125), and because we do not count cases in which major changes of sounds occur besides the addition of the reduplication syllable (for example, if [ˈnɑɹi] changes to [nʊˈnuɹi], we count the output as unsuccessful). In the remaining 20 outputs, several outputs undergo changes in which several segments or their features are kept constant, but the degree to which they differ can vary (e.g. [ˈpʰilə] ∼ [pəˈpʰiɹi], [ˈtʰiju] ∼ [dəˈdɑji], [ˈnɑɹə] ∼ [dəˈdɑɹi], or [ˈpʰiɹə] ∼ [təˈtʰɑli]).

Under the null hypothesis, if the Generator learned to pair the base and reduplicated forms randomly, each base form could be associated with any of the 243 unique reduplicated forms with probability 1/243 (0.004). Even if we assume, very conservatively, that each base form could be associated only with a subgroup of reduplicated consonants (C1; e.g. voiceless stops, voiced stops, [m], [n], [v]), disregarding the vowel and disregarding changes in the base, the probability of both forms being identical would still be only 0.2 (for each of the five subgroups). In both cases, the ratio of identical base-reduplication pairs, while not categorical, is highly significant (CI = [0.4, 0.7], p < 0.0001 for both cases according to the Exact Binomial Test).
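The binomial comparison can be reproduced as follows (a sketch; scipy's exact binomial test against the conservative null probability of 0.2, with a Clopper-Pearson confidence interval):

```python
from scipy.stats import binomtest

# 25 of the 45 changing sets kept the base identical under reduplication.
res = binomtest(25, 45, p=0.2)
print(res.pvalue)                                # p < 0.0001
print(res.proportion_ci(confidence_level=0.95))  # ~ (0.40, 0.71); text: [0.4, 0.7]
```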
Figure 2 illustrates how, keeping the latent space constant except for the manipulation of the latent code with which the Generator represents reduplication, the generated outputs gradually transition from the base forms [ˈpʰiɹu] and [ˈdɑji] to the reduplicated forms [pəˈpʰiɹu] and [dəˈdɑji]. [Footnote 7: Exact vowel quality estimation in the generated outputs is challenging, especially in the short vocalic elements of reduced vowels in the reduplicative syllables. For this reason, we default transcriptions to [ə].] Other major properties of the output are unchanged.

[Figure 2: Waveforms showing how interpolation of the latent codes c1 and c2 has a direct effect on the presence of reduplication: as the values are interpolated from [1.5, 0] to [0, 1.5], the reduplication gradually appears/disappears from the output. Waveforms on the left represent reduplication of [ˈpʰiɹu] to [pəˈpʰiɹu]; waveforms on the right represent reduplication of [ˈdɑji] to [dəˈdɑji].]

This interpolative generative test again suggests that the network learns reduplication and encodes the process in the latent codes. By interpolating the codes, we can actively force reduplication in the output with no other substantial changes in the majority of cases.

4.2 Reduplication of unobserved data

To test whether the ciwGAN network learns to generalize the reduplicative pattern to unobserved data, we use latent space manipulation to force reduplication at the same time as the presence of [s] in the output. Items with [s] as the initial consonant (e.g. [ˈsiju]) appear only in bare forms in the training data. In Sections 4 and 4.1, we established that the network uses the latent code (c1 and c2) to represent reduplication. Following Beguš (2020) and Beguš (2021a), we can force any phonetic property in the output by manipulating the latent variables well beyond the training range. Reduplication is forced by setting the latent code to values higher than [0, 1]. We can simultaneously force [s] in the output to test the network's performance on reduplication in unseen data.

To identify the latent variables with which the Generator encodes the sound [s] in the output, we generate 1,000 samples with randomly sampled latent variables, but with the latent code variables (c1 and c2) set at [0, 1] and [1, 0] (500 samples each, with the same latent variable structure of the remaining 98 variables across the two conditions). We annotate the outputs for presence of [s] in the two sets and fit the data to a Lasso logistic regression model in the glmnet package (Simon et al., 2011). Presence of [s] is the dependent variable, coded as a success; the independent variables are the 98 latent variables uniformly distributed on the interval (−1, 1) (for the technique, see Beguš 2020). Lambda is computed with 10-fold cross-validation.
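The paper fits this model in R's glmnet; the sketch below is a rough scikit-learn analogue on synthetic stand-in data (the arrays z and has_s are hypothetical placeholders for the 1,000 annotated outputs, with a planted signal on z90):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(2)
z = rng.uniform(-1, 1, (1000, 98))                                  # latent variables
has_s = (z[:, 89] + 0.3 * rng.normal(size=1000) > 0.3).astype(int)  # toy [s] labels

# L1-penalized logistic regression with 10-fold cross-validated regularization.
lasso = LogisticRegressionCV(Cs=20, cv=10, penalty="l1",
                             solver="saga", max_iter=5000).fit(z, has_s)

estimates = np.abs(lasso.coef_.ravel())
print("highest estimate: z%d" % (estimates.argmax() + 1))  # z90 (1-indexed)
```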
Estimates of the Lasso regression model (Figure 3) suggest that z90, the variable with the highest regression estimate, is one of the variables with which the Generator encodes the presence of [s] in the output. For a generative test providing evidence that Lasso regression estimates correlate with the network's internal representations, see Beguš (2020).

[Figure 3: Absolute Lasso regression estimates (sorted from highest on the right-hand side) for a ciwGAN model identifying presence of [s] after 1000 transcribed outputs, 500 for each latent code (with the same latent variable structure of the remaining 98 variables across the two conditions). Variable z90 is identified as the variable corresponding to presence of [s] (the variable with the highest regression estimates).]

We can thus set z90 to marginal levels well beyond the training range and the latent code (c1, c2) to levels well beyond [0, 1] in order to force reduplication and [s] in the output simultaneously. For example, when the latent code is set to [0, 3] (which forces reduplication in the output) and z90 to 4 (forcing [s] in the output), the network outputs a reduplicated [səˈsiji] (among other outputs), even though items containing an [s] are never reduplicated in the training data.
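Schematically, this wug-test amounts to pinning both the code and the identified z-variable well beyond their training ranges (a sketch; generator is again a hypothetical stand-in for the trained Generator):

```python
import numpy as np

rng = np.random.default_rng(3)

z = rng.uniform(-1.0, 1.0, (100, 98))
z[:, 89] = 4.0                            # z90: force [s]
code = np.tile([0.0, 3.0], (100, 1))      # beyond [0, 1]: force reduplication
latent = np.concatenate([code, z], axis=1)
# audio = generator(latent)               # outputs include reduplicated [s]-forms
```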
When the code is set to an even higher value, [0, 7.25], and z90 to 7, the network outputs [səˈsiɹu] in a different output. The spectrograms in Figure 4 show a clear period of frication noise characteristic of a sibilant fricative [s], interrupted by a reduplicative vowel and followed by a repeated period of frication noise characteristic of [s].

In fact, at the values [0, 7.25] and z90 = 7, the network generates approximately 33 (out of 100 tested, or 33%) outputs that can reliably be analyzed as reduplicated forms with the initial sV- reduplication unseen in the training data. The other 67 outputs are reduplicated forms containing other C1s or unreduplicated [s]-forms. No outputs were observed in which the C1 of the reduplication syllable and the C1 of the base would be substantially different. While all the cases in which z90 is manipulated involve a front vowel [i] in the base item, we can also elicit reduplication for other vowels. For example, we identify the variable z4 as corresponding to an [s] and a low vowel [ɑ] in the output (with the same technique as described for z90 above, but with presence of [sɑ] as the dependent variable in the Lasso regression model). By manipulating z4 to 9.5 (forcing [sɑ] in the output) and setting the latent codes to [0, 7.5], we get [səˈsɑɹu] in the output (Figure 4).

For comparison, the same L1 speaker of English who read the words in the training data read the reduplicated [səˈsiji] and [səˈsɑɹu], which were not included in the training data. Figure 4 parallels the generated reduplicated forms based on unobserved data (which were elicited by forcing [s] and reduplication in the output) with the recordings of the same reduplicated forms read by a human speaker. The spectrograms show clear acoustic parallels between the generated outputs and the recordings read by a human speaker (who read the words prior to the computational experiments and did not hear or analyze the generated outputs).
[Figure 4: Waveforms and spectrograms (0–8000 Hz) of reduplicated forms containing an [s], which were absent from the training data. The generated forms on the left are paired with recordings of a female speaker reading reduplicated forms that were absent from the training data. (left) When the latent code is set to [0, 3] and z90 to 4, the network outputs a reduplicated [səˈsiji]. (right) When the latent code is set to [0, 7.5] and z4 to 9.5, we get [səˈsɑɹu]. (bottom) In the bare GAN architecture, when z5 (forcing reduplication) is set to −9.25 and z17 (forcing [s] in the output) to −9.0, the Generator outputs a reduplicated [səˈsiɹi].]

5 Replication: Bare WaveGAN (Donahue et al., 2019)

To test whether the learning of reduplicative patterns in GANs is a robust or an idiosyncratic property of the model presented in Section 4, we conduct a replication experiment. We introduce two crucial differences in the replication experiment: we train the Generator without the requirement to produce informative data and without binary latent codes. We use the model in Donahue et al. (2019), which features a "bare" GAN architecture for audio data: only the Generator and Discriminator networks, without the Q-network. This architecture has the potential to inform us how GANs represent reduplicative patterns without an explicit requirement to learn informative data, i.e. without an explicit requirement to encode some salient feature of the training data in the latent space. The data used for training are the same as in the experiment in Section 3. We train the network for 15,930 steps, or approximately 5,118 epochs, which is almost identical to the number of steps/epochs in the ciwGAN experiment (Section 4).

5.1 Identifying variables

Testing the learning of reduplication in the bare GAN architecture requires that we force reduplication and the presence of [s] in the output simultaneously. To identify which latent variables correspond to the two properties, we use the same technique as described in Section 4. We generate and annotate 500 outputs of the Generator network with randomly sampled latent variables. We annotate the presence of [s] and the presence of reduplication. The annotations are fit to a Lasso logistic regression (as in Section 4.2): presence of reduplication or of [s] are the dependent variables, and each of the 100 latent z-variables is an independent predictor. Lambda values were computed with 10-fold cross-validation. Regression estimates are given in Figure 5.
The plots illustrate a steep drop in regression estimates between the few latent variables with the highest estimates and the rest of the latent space. In fact, in both models, one or two variables per model emerge with substantially higher regression estimates: z91 and z5 when the dependent variable is PRESENCE OF REDUPLICATION, and z17 when the dependent variable is PRESENCE OF [s] in the output. We can assume the Generator network uses these variables to encode the presence of reduplication and of [s], respectively.

It has been argued in Beguš (2020) that GANs learn to encode phonetic and phonological representations with a subset of latent variables. The discretized representation of continuous phonetic properties in the latent space appears even more radical in the present case. For example, in Beguš (2020), the presence of [s] as a sound in the output is represented by at least seven latent variables, each of which likely controls different spectral properties of the frication noise. In the present experiment, the Generator appears to learn to encode the presence of [s] with a single latent variable, as suggested by the steep drop in regression estimates after the first variables with the highest estimates. For a generative test showing that regression estimates correlate with actual rates of a given property in generated data, see Beguš (2020). Such a near-categorical cutoff is likely a consequence of the training data in the present case being considerably less variable than TIMIT (used for training in Beguš 2020). The network also represents an identity-based process, reduplication, with only two latent variables, and features a substantial drop in regression estimates after these two variables. This discretized representation thus emerges even without the requirement on the Generator to output informative data.

In the replication experiment too, the Generator network outputs reduplicated forms for unobserved data when both reduplication and [s] are forced in the output via latent space manipulation, but significantly less so than in the ciwGAN architecture. When z91 (forcing reduplication) and z17 (forcing [s] in the output) are set to the value −8.5, a higher level in absolute terms than in the generated samples in the ciwGAN architecture (7 and 7.25), the network outputs only one reduplicated form with [s]-reduplication out of 100 generated outputs. By comparison, the proportion of [s]-reduplication in the ciwGAN architecture is 33/100, a significantly higher ratio (OR = 48.1, p < 0.0001; Fisher Exact Test). When z5 (forcing reduplication) is set to −9.25 and z17 (forcing [s] in the output) to −9.0, the proportion of reduplicated [s]-items is slightly higher (4/100), but still significantly lower than in the ciwGAN architecture (OR = 11.7, p < 0.0001; Fisher Exact Test). Despite these lower proportions of reduplicated [s] in the output, the bare GAN network nevertheless extends reduplication to novel unobserved data. Figure 4 illustrates an example of a reduplicated [s]-item from the Generator network trained in the bare GAN architecture: [səˈsiɹi]. The spectrogram reveals a clear period of frication noise characteristic of an [s], followed by a reduplicative vowel period, followed by another period of frication.
[Figure 5: Absolute Lasso regression estimates (sorted from highest on the right-hand side) for two models identifying (a) presence of reduplication and (b) presence of [s] in the generated outputs of the bare GAN model (Section 5).]

6 Discussion

We perform four generative tests to model learning of reduplication in deep convolutional networks: (i) a test of the proportion of outputs when the latent codes are manipulated to marginal values, (ii) a test interpolating latent variables, (iii) a test of reduplication on unobserved data in the ciwGAN architecture, and (iv) a replication test of reduplication on unobserved data in the bare WaveGAN architecture. All four tests suggest that deep convolutional networks learn a simple identity-based pattern in speech called reduplication, i.e. a process that copies some phonological material to express new meaning. The ciwGAN network learns to encode a meaningful representation, the presence of reduplication, in its latent codes. There is a near one-to-one correspondence between the two latent codes c1 and c2 and reduplication. By interpolating the latent codes, we cause the bare form to gradually turn into a reduplicated form with no other major changes in the output in the majority of cases. These results are close to what would be considered the appearance of symbolic computation or algebraic rules.

Additional evidence that an approximation of symbolic computation emerges comes from the bare GAN experiment: there is a substantial drop in regression estimates after the first one or two latent variables with the highest regression estimates, suggesting that even without the requirement to produce informative data, the network discretizes the continuous and highly variable phonetic feature, the presence of reduplication, and uses a small subset of the latent space to represent this morphophonological property.

Finally, we can force the Generator to output reduplication at nearly categorical levels. When the latent codes are set to marginal levels outside the training range (e.g. to [5, 0] or [0, 5]), the outputs are almost categorically unreduplicated or reduplicated (at 98% for [5, 0]). Beguš (2021a) shows that even higher values (e.g. 15) result in performance at 100% for a subset of variables. Not all aspects of the models in this paper are categorical (e.g. interpolation of the latent codes does not always change an unreduplicated to a reduplicated form without other major changes). Improving performance on this particular task is left for future work. The inability to derive categorical processes has long been an argument against connectionist approaches to language modeling. The results of these experiments add to the work suggesting that manipulating variables to extreme marginal values results in near categorical or categorical outputs (depending on the value) of a desired property (Beguš, 2020; Beguš, 2021a).

In sum, three properties of rule-like symbolic representations emerge in the deep convolutional networks tested here: discretized representations, the ability to generate a desired property by manipulating a small number of variables, and near categoricity for a subset of representations.
These symbolic-like outcomes are facilitated by two inductive biases: the binary nature of the latent codes and the requirement on the Generator to output informative data (forced by the Q-network). At least a subset of these properties also emerges in the bare WaveGAN architecture that lacks these biases, but at reduced performance.

Encoding an identity-based pattern as a meaningful representation in the latent space emerges in a completely unsupervised manner in the ciwGAN architecture, only from the requirement that the Generator output informative data.
Reduplicated and unreduplicated forms are never paired in the training data. The network is fed bare and reduplicated forms randomly. This unsupervised training approximates conditions in language acquisition (for hearing learners): the human language learner needs to represent reduplication and to pair bare and reduplicated forms from raw unlabeled acoustic data. The ciwGAN learns to group reduplicated and unreduplicated forms and to assign a unique representation to the process of reduplication. In fact, the one-hot vector (c1 and c2) that the Generator learns to associate with reduplication in training can be modeled as a representation of the unique meaning/function that reduplication adds, in line with an approach that represents unique semantics with one-hot vectors (e.g. in Steinert-Threlkeld and Szymanik 2020).

The paper also argues that deep convolutional networks can learn a simple identity-based pattern (copying) from raw continuous data and extend the pattern to novel unobserved data. While the network was not trained on reduplicated items that start with an [s], we were able to elicit reduplication in the output following a technique proposed in Beguš (2020). First, we identify variables that correspond to some phonetic/phonological representation, such as the presence of [s]. We argue that setting single variables well above the training range can reveal the underlying value of each latent variable and force the desired property in the output. We can thus force both [s] and reduplication in the output simultaneously. For example, the network outputs [səˈsiju] if we force both reduplication and [s] in the output; however, it never sees [səˈsiju] in the training data, only [ˈsiju] and other reduplicated forms, none of which included an [s]. Thus, these experiments again confirm that the network uses individual latent variables to represent linguistically meaningful properties (Beguš, 2020; Beguš, 2021a). Setting these individual variables to values well above the training interval reveals their underlying values. By manipulating these individual variables, we can explore how the representations are learned, as well as how interactions between different variables work (for example, between the representation of reduplication and the presence of [s]). The results of this study suggest that the deep convolutional network is capable of encoding not only different phonetic properties in individual latent variables, but also processes as abstract as copying or reduplication.

One of the advantages of probing learning in deep convolutional neural networks trained on speech data with GANs is that the innovative outputs violate the training data in structured and highly informative ways. The innovative outputs with reduplication of [s]-initial forms, such as [səˈsiju], can be directly paralleled to acoustic outputs read by an L1 speaker of American English that were absent from the training data. Acoustic analysis shows a high degree of similarity between the generated reduplicated forms and the human recordings, meaning that the network learns to output novel data that are linguistically interpretable and resemble human speech processes even though they are absent from the training data. Thus, the results of the experiments have implications for cognitive models of speech acquisition. It appears that one of the processes that has long been held to be a hallmark of symbolic computation in language, reduplication, can emerge in deep convolutional networks without language-specific components in the model, even when they are trained on raw acoustic inputs.

The present paper tests a simple partial reduplicative pattern in which only a CV is copied and appears before the base item. This is perhaps computationally the simplest reduplicative pattern. The training data are also highly controlled and recorded by a single speaker. We can use well-understood identity-based patterns in speech with various degrees of complexity (longer reduplication, embedding into non-reduplicative patterns) to further test how inductive biases and hyperparameter/architecture choices interact with learning in deep convolutional networks.
Finally, learning biases in the ciwGAN model can be (superficially) compared to learning biases in human subjects in future work. This paper suggests that the Generator provides informative outputs even if trained on comparatively small data sets (for a similar conclusion for other processes, see Beguš 2021b). This means we can use the same training data to probe learning in CNNs and in human artificial grammar learning experiments (for a methodology, see Beguš 2021b). While these comparisons are necessarily superficial at this point, they can provide insights into common learning biases between human learners and computational models.
Acknowledgements

This work was supported by a grant to new faculty at the University of Washington. I would like to thank Ella Deaton for recording and preparing stimuli, as well as anonymous reviewers and the Action Editor for useful comments on earlier versions of this paper.

References

Ben Adlam, Charles Weill, and Amol Kapoor. 2019. Investigating under and overfitting in Wasserstein Generative Adversarial Networks. In ICML Understanding and Improving Generalization in Deep Learning Workshop (2019). arXiv 1910.14137v1.

Raquel G. Alhama and Willem H. Zuidema. 2018. Pre-wiring and pre-training: What does a neural network need to learn truly general identity rules? Journal of Artificial Intelligence Research, 61:927–946.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia. PMLR.

Alexei Baevski, Steffen Schneider, and Michael Auli. 2020. vq-wav2vec: Self-supervised learning of discrete speech representations. In International Conference on Learning Representations, pages 1–12.

Gašper Beguš. 2020. Generative adversarial phonology: Modeling unsupervised phonetic and phonological learning with neural networks. Frontiers in Artificial Intelligence, 3:44.

Gašper Beguš. 2021a. CiwGAN and fiwGAN: Encoding information in acoustic data to model lexical learning with Generative Adversarial Networks. Neural Networks, 139:305–325.

Gašper Beguš. 2021b. Local and non-local dependency learning and emergence of rule-like representations in speech data by Deep Convolutional Generative Adversarial Networks. Computer Speech & Language, page 101244.

Iris Berent. 2013. The phonological mind. Trends in Cognitive Sciences, 17(7):319–327.

Paul Boersma and David Weenink. 2015. Praat: Doing phonetics by computer [computer program]. Version 5.4.06. Retrieved 21 February 2015 from http://www.praat.org/.

Simone Brugiapaglia, Matthew Liu, and Paul Tupper. 2020. Generalizing outside the training set: When can neural networks learn identity effects?

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable representation learning by information maximizing Generative Adversarial Nets. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2172–2180. Curran Associates, Inc.

Jan Chorowski, Ron J. Weiss, Samy Bengio, and Aäron van den Oord. 2019. Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12):2041–2053.

Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-Yi Lee, and Lin-Shan Lee. 2016. Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder. In Interspeech 2016, pages 765–769.

Hossep Dolatian and Jeffrey Heinz. 2018. Modeling reduplication with 2-way finite-state transducers. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 66–77, Brussels, Belgium. Association for Computational Linguistics.

Hossep Dolatian and Jeffrey Heinz. 2020. Computing and classifying reduplication with 2-way finite-state transducers. Journal of Language Modelling, 8(1):179–250.

Chris Donahue, Julian J. McAuley, and Miller S. Puckette. 2019. Adversarial audio synthesis. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.