How Familiar Does That Sound? Cross-Lingual Representational Similarity Analysis of Acoustic Word Embeddings
Badr M. Abdullah, Iuliia Zaitova, Tania Avgustinova, Bernd Möbius, Dietrich Klakow
Department of Language Science and Technology (LST)
Saarland Informatics Campus, Saarland University, Germany
arXiv:2109.10179v1 [cs.CL] 21 Sep 2021

Abstract

How do neural networks “perceive” speech sounds from unknown languages? Does the typological similarity between the model’s training language (L1) and an unknown language (L2) have an impact on the model representations of L2 speech signals? To answer these questions, we present a novel experimental design based on representational similarity analysis (RSA) to analyze acoustic word embeddings (AWEs), i.e., vector representations of variable-duration spoken-word segments. First, we train monolingual AWE models on seven Indo-European languages with various degrees of typological similarity. We then employ RSA to quantify the cross-lingual similarity by simulating native and non-native spoken-word processing using AWEs. Our experiments show that typological similarity indeed affects the representational similarity of the models in our study. We further discuss the implications of our work on modeling speech processing and language similarity with neural networks.

1 Introduction

Mastering a foreign language is a process that requires (human) language learners to invest time and effort. If the foreign language (L2) is very distant from our native language (L1), not much of our prior knowledge of language processing would be relevant in the learning process. On the other hand, learning a language that is similar to our native language is much easier since our prior knowledge becomes more useful in establishing the correspondences between L1 and L2 (Ringbom, 2006). In some cases where L1 and L2 are closely related and typologically similar, it is possible for an L1 speaker to comprehend L2 linguistic expressions to a great degree without prior exposure to L2. The term receptive multilingualism (Zeevaert, 2007) has been coined in the sociolinguistics literature to describe this ability of a listener to comprehend utterances of an unknown but related language without being able to produce it.

Human speech perception has been an active area of research in the past five decades which has produced a wealth of documented behavioral studies and experimental findings. Recently, there has been a growing scientific interest in the cognitive modeling community to leverage the recent advances in speech representation learning to formalize and test theories of speech perception using computational simulations on the one hand, and to investigate whether neural networks exhibit similar behavior to humans on the other hand (Räsänen et al., 2016; Alishahi et al., 2017; Dupoux, 2018; Scharenborg et al., 2019; Gelderloos et al., 2020; Matusevych et al., 2020b; Magnuson et al., 2020).

In the domain of modeling non-native speech perception, Schatz and Feldman (2018) have shown that a neural speech recognition system (ASR) predicts Japanese speakers’ difficulty with the English phonemic contrast /l/-/ɹ/ as well as English speakers’ difficulty with the Japanese vowel length distinction. Matusevych et al. (2021) have shown that a model of non-native spoken word processing based on neural networks predicts lexical processing difficulty of English-speaking learners of Russian. The latter model is based on word-level representations that are induced from naturalistic speech data known in the speech technology community as acoustic word embeddings (AWEs). AWE models map variable-duration spoken-word segments onto fixed-size representations in a vector space such that instances of the same word type are (ideally) projected onto the same point in space (Levin et al., 2013). In contrast to word embeddings in natural language processing (NLP), an AWE encodes information about the acoustic-phonetic and phonological structure of the word, not its semantic content.

Although the model of Matusevych et al. (2021) has shown similar effects to what has been observed in behavioral studies (Cook et al., 2016), it remains unclear to what extent AWE models can predict a facilitatory effect of language similarity on cross-language spoken-word processing. In this paper, we present a novel experimental design to probe the receptive multilingual knowledge of monolingual AWE models (i.e., trained without L2 exposure) using the representational similarity analysis (RSA) framework. In a controlled experimental setup, we use AWE models to simulate spoken-word processing of native speakers of seven languages with various degrees of typological similarity. We then employ RSA to characterize how language similarity affects the emergent representations of the models when tested on the same spoken-word stimuli (see Figure 1 for an illustrated overview of our approach). Our experiments demonstrate that neural AWE models of different languages exhibit a higher degree of representational similarity if their training languages are typologically similar.

Figure 1: An illustrated example of our experimental design. A set of $N$ spoken-word stimuli from language $\lambda$ are embedded using the encoder $F^{(\lambda)}$ which was trained on language $\lambda$ to obtain a (native) view of the data: $\mathbf{X}^{(\lambda/\lambda)} \in \mathbb{R}^{D \times N}$. Simultaneously, the same stimuli are embedded using encoders trained on other languages, namely $F^{(\alpha)}$ and $F^{(\beta)}$, to obtain two different (non-native) views of the data: $\mathbf{X}^{(\lambda/\alpha)}$ and $\mathbf{X}^{(\lambda/\beta)}$. We quantify the cross-lingual similarity between two languages by measuring the association between their embedding spaces using the representational similarity analysis (RSA) framework.

2 Proposed Methodology

A neural AWE model can be formally described as an encoder function $F : \mathcal{A} \rightarrow \mathbb{R}^{D}$, where $\mathcal{A}$ is the (continuous) space of acoustic sequences and $D$ is the dimensionality of the embedding. Given an acoustic word signal represented as a temporal sequence of $T$ acoustic events $\mathbf{A} = (\mathbf{a}_1, \mathbf{a}_2, \dots, \mathbf{a}_T)$, where $\mathbf{a}_t \in \mathbb{R}^{k}$ is a spectral vector of $k$ coefficients, an embedding is computed as

$$\mathbf{x} = F(\mathbf{A}; \theta) \in \mathbb{R}^{D} \tag{1}$$

Here, $\theta$ are the parameters of the encoder, which are learned by training the AWE model in a monolingual supervised setting. That is, the training spoken-word segments (i.e., speech intervals corresponding to spoken words) are sampled from utterances of native speakers of a single language where the word identity of each segment is known. The model is trained with an objective that maps different spoken segments of the same word type onto similar embeddings. To encourage the model to abstract away from speaker variability, the training samples are obtained from multiple speakers, while the resulting AWEs are evaluated on a held-out set of speakers.

Our research objective in this paper is to study the discrepancy of the representational geometry of two AWE encoders that are trained on different languages when tested on the same set of (monolingual) spoken stimuli. To this end, we train AWE encoders on different languages where the training data and conditions are comparable across languages with respect to size, domain, and speaker variability. We therefore have access to several encoders $\{F^{(\alpha)}, F^{(\beta)}, \dots, F^{(\omega)}\}$, where the superscripts $\{\alpha, \beta, \dots, \omega\}$ denote the language of the
training samples.

Now consider a set of $N$ held-out spoken-word stimuli produced by native speakers of language $\lambda$: $\mathbf{A}^{(\lambda)}_{1:N} = \{\mathbf{A}^{(\lambda)}_{1}, \dots, \mathbf{A}^{(\lambda)}_{N}\}$. First, each acoustic word stimulus in this set is mapped onto an embedding using the encoder $F^{(\lambda)}$, which yields a matrix $\mathbf{X}^{(\lambda/\lambda)} \in \mathbb{R}^{D \times N}$. Since the encoder $F^{(\lambda)}$ was trained on language $\lambda$, we refer to it as the native encoder and consider the matrix $\mathbf{X}^{(\lambda/\lambda)}$ as a simulation of a native speaker’s phonetic perceptual space. To simulate the phonetic perceptual space of a non-native speaker, say a speaker of language $\alpha$, the stimuli $\mathbf{A}^{(\lambda)}_{1:N}$ are embedded using the encoder $F^{(\alpha)}$, which yields a matrix $\mathbf{X}^{(\lambda/\alpha)} \in \mathbb{R}^{D \times N}$. Here, we read the superscript notation $(\lambda/\alpha)$ as word stimuli of language $\lambda$ encoded by a model trained on language $\alpha$. Thus, the two matrices $\mathbf{X}^{(\lambda/\lambda)}$ and $\mathbf{X}^{(\lambda/\alpha)}$ represent two different views of the same stimuli. Our main hypothesis is that the cross-lingual representational similarity between emergent embedding spaces should reflect the acoustic-phonetic and phonological similarities between the languages $\lambda$ and $\alpha$. That is, the more distant languages $\lambda$ and $\alpha$ are, the more dissimilar their corresponding representation spaces are.

To quantify this cross-lingual representational similarity, we use Centered Kernel Alignment (CKA) (Kornblith et al., 2019). CKA is a representation-level similarity measure that emphasizes the distributivity of information and therefore it obviates the need to establish the correspondence mapping between single neurons in the embeddings of two different models. Moreover, CKA has been shown to be invariant to orthogonal transformation and isotropic scaling, which makes it suitable for our analysis when comparing different languages and learning objectives. Using CKA, we quantify the similarity between languages $\lambda$ and $\alpha$ as

$$\mathrm{sim}(\lambda, \alpha) := \mathrm{CKA}(\mathbf{X}^{(\lambda/\lambda)}, \mathbf{X}^{(\lambda/\alpha)}) \tag{2}$$

Here, $\mathrm{sim}(\lambda, \alpha) \in [0, 1]$ is a scalar that measures the correlation between the responses of the two encoders, i.e., native $F^{(\lambda)}$ and non-native $F^{(\alpha)}$, when tested with spoken-word stimuli $\mathbf{A}^{(\lambda)}_{1:N}$. $\mathrm{sim}(\lambda, \alpha) = 1$ is interpreted as perfect association between the representational geometry of models trained on languages $\lambda$ and $\alpha$, while $\mathrm{sim}(\lambda, \alpha) = 0$ indicates that no association can be established. Similarly, we quantify the cross-lingual representational similarity between languages $\lambda$ and $\beta$ as

$$\mathrm{sim}(\lambda, \beta) := \mathrm{CKA}(\mathbf{X}^{(\lambda/\lambda)}, \mathbf{X}^{(\lambda/\beta)}) \tag{3}$$

If $\mathrm{sim}(\lambda, \alpha) > \mathrm{sim}(\lambda, \beta)$, then we interpret this as an indication that the native phonetic perceptual space of language $\lambda$ is more similar to the non-native phonetic perceptual space of language $\alpha$, compared to that of language $\beta$. Note that while $\mathrm{CKA}(\cdot, \cdot)$ is a symmetric metric (i.e., $\mathrm{CKA}(\mathbf{X}, \mathbf{Y}) = \mathrm{CKA}(\mathbf{Y}, \mathbf{X})$), our established similarity metric $\mathrm{sim}(\cdot, \cdot)$ is not symmetric (i.e., $\mathrm{sim}(\lambda, \alpha) \neq \mathrm{sim}(\alpha, \lambda)$). To estimate $\mathrm{sim}(\alpha, \lambda)$, we use word stimuli of language $\alpha$ and collect the matrices $\mathbf{X}^{(\alpha/\alpha)}$ and $\mathbf{X}^{(\alpha/\lambda)}$. Then we compute

$$\mathrm{sim}(\alpha, \lambda) := \mathrm{CKA}(\mathbf{X}^{(\alpha/\alpha)}, \mathbf{X}^{(\alpha/\lambda)}) \tag{4}$$

When we apply the proposed experimental pipeline across $M$ different languages, the effect of language similarity can be characterized by constructing a cross-lingual representational similarity matrix (xRSM), which is an asymmetric $M \times M$ matrix where each cell represents the correlation (or agreement) between two embedding spaces.

3 Acoustic Word Embedding Models

We investigate three different approaches of training AWE models that have been previously introduced in the literature. In this section, we formally describe each one of them.

3.1 Phonologically Guided Encoder

The phonologically guided encoder (PGE) is a sequence-to-sequence model in which the network is trained as a word-level acoustic model (Abdullah et al., 2021). Given an acoustic sequence $\mathbf{A}$ and its corresponding phonological sequence $\varphi = (\varphi_1, \dots, \varphi_\tau)$, the acoustic encoder $F$ is trained to take $\mathbf{A}$ as input and produce an AWE $\mathbf{x}$, which is then fed into a phonological decoder $G$ whose goal is to generate the sequence $\varphi$ (Fig. 2a). The objective is to minimize a categorical cross-entropy loss at each timestep in the decoder, which is equivalent to

$$\mathcal{L} = -\sum_{i=1}^{\tau} \log P_G(\varphi_i \mid \varphi_{<i}, \mathbf{x})$$

where $P_G$ is the probability of the phone $\varphi_i$ at timestep $i$, conditioned on the previous phone sequence $\varphi_{<i}$ [...]

Figure 2: A visual illustration of the different learning objectives for training AWE encoders: (a) phonologically guided encoder (PGE): a sequence-to-sequence network with a phonological decoder, (b) correspondence auto-encoder (CAE): a sequence-to-sequence network with an acoustic decoder, and (c) contrastive siamese encoder (CSE): a contrastive network trained via triplet margin loss.

[...] has been explored in the AWEs literature with different underlying architectures (Settle and Livescu, [...]
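The similarity measure of Eq. (2) to (4) in Section 2 can be illustrated with a minimal sketch of linear CKA. This is not the authors' implementation: it assumes the linear-kernel variant that Section 5 reports, it is written in plain Python only to stay self-contained (in practice a vectorized NumPy or PyTorch version would be preferable), and the helper names `gram_linear`, `center`, `frob_inner`, and `linear_cka` are hypothetical. Matrices follow the $D \times N$ convention of the text, so each column is one embedded stimulus.

```python
import math
import random


def gram_linear(X):
    """Linear-kernel Gram matrix K[i][j] = <x_i, x_j> over the N stimuli.

    X is a D x N matrix stored as a list of D rows, so each *column* is
    one embedded stimulus (the X in R^{D x N} convention of the text).
    """
    n = len(X[0])
    cols = [[row[j] for row in X] for j in range(n)]
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def center(K):
    """Double-center a Gram matrix: H K H with H = I - (1/N) 1 1^T."""
    n = len(K)
    row_mean = [sum(r) / n for r in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + total for j in range(n)]
            for i in range(n)]


def frob_inner(A, B):
    """Frobenius inner product of two equally sized matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def linear_cka(X, Y):
    """Linear CKA between two views (D1 x N and D2 x N) of the same N stimuli."""
    Kc, Lc = center(gram_linear(X)), center(gram_linear(Y))
    return frob_inner(Kc, Lc) / math.sqrt(frob_inner(Kc, Kc) * frob_inner(Lc, Lc))


# Toy data standing in for a native view X and two other views (assumption:
# random numbers, not real acoustic word embeddings).
random.seed(0)
D, N = 4, 6
X = [[random.gauss(0, 1) for _ in range(N)] for _ in range(D)]
# Permuting the embedding dimensions is an orthogonal transformation and
# multiplying by 2.5 is an isotropic scaling; CKA should ignore both.
Y = [[2.5 * v for v in X[i]] for i in (2, 0, 3, 1)]
Z = [[random.gauss(0, 1) for _ in range(N)] for _ in range(D)]

print(round(linear_cka(X, X), 6))  # 1.0
print(round(linear_cka(X, Y), 6))  # 1.0
print(linear_cka(X, Z))            # a value between 0 and 1
```

Under these assumptions, sim(λ, α) would be obtained as `linear_cka(X_native, X_nonnative)`, where both arguments embed the same λ stimuli. The asymmetry of sim arises because sim(α, λ) is computed on a different stimulus set, not because CKA itself is asymmetric.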
[...] (DEU). We acknowledge that our language sample is not typologically diverse. However, one of our objectives in this paper is to investigate whether the cross-lingual similarity of AWEs can predict the degree of mutual intelligibility between related languages. Therefore, we focus on the Slavic languages in this sample (CZE, POL, RUS, and BUL), which are known to be typologically similar and mutually intelligible to various degrees.

To train our AWE models, we obtain time-aligned spoken-word segments using the Montreal Forced Aligner (McAuliffe et al., 2017). Then, we sample 42 speakers of balanced gender from each language. For each language, we sample ~32k spoken-word segments that are longer than 3 phonemes in length and shorter than 1.1 seconds in duration (see Table 2 in Appendix A for word-level summary statistics of the data). For each word type, we obtain an IPA transcription using the grapheme-to-phoneme (G2P) module of the automatic speech synthesizer, eSpeak. Each acoustic word segment is parametrized as a sequence of 39-dimensional Mel-frequency spectral coefficients where frames are extracted over intervals of 25ms with 10ms overlap.

4.2 Architecture and Hyperparameters

Acoustic Encoder: We employ a 2-layer recurrent neural network with a bidirectional Gated Recurrent Unit (BGRU) of hidden state dimension of 512, which yields a 1024-dimensional AWE.

Training Details: All models in this study are trained for 100 epochs with a batch size of 256 using the ADAM optimizer (Kingma and Ba, 2015) and an initial learning rate (LR) of 0.001. The LR is reduced by a factor of 0.5 if the mean average precision (mAP) for word discrimination on the validation set does not improve for 10 epochs. The epoch with the best validation performance during training is used for evaluation on the test set.

Implementation: We build our models using PyTorch (Paszke et al., 2019) and use FAISS (Johnson et al., 2017) for efficient similarity search during evaluation. Our code is publicly available on GitHub.

4.3 Quantitative Evaluation

We evaluate the models using the standard intrinsic evaluation of AWEs: the same-different word discrimination task (Carlin et al., 2011). This task aims to assess the ability of a model to determine whether or not two given speech segments correspond to the same word type, which is quantified using a retrieval metric (mAP) reported in Table 1.

            Encoder type
Language    PGE     CAE     CSE
CZE         78.3    76.1    82.9
POL         67.6    63.5    73.8
RUS         64.3    57.7    71.0
BUL         72.1    68.9    78.4
POR         74.5    72.2    80.4
FRA         65.6    64.5    68.5
DEU         67.9    70.3    75.8

Table 1: mAP performance on evaluation sets.

5 Representational Similarity Analysis

Figure 3 shows the cross-lingual representational similarity matrices (xRSMs) across the three different models using the linear CKA similarity metric (the xRSMs obtained using non-linear CKA with an RBF kernel are shown in Figure 6 in Appendix B and show very similar trends to those observed in Figure 3). Warmer colors indicate a higher similarity between two representational spaces. One can observe striking differences between the PGE-CAE models on the one hand, and the CSE model on the other hand. The PGE-CAE models yield representations that are cross-lingually more similar to each other compared to those obtained from the contrastive CSE model. For example, the highest similarity score from the PGE model is sim(RUS, BUL) = 0.748, which means that the representations of the Russian word stimuli from the Bulgarian model $F^{(\mathrm{BUL})}$ exhibit the highest representational similarity to the representations of the native Russian model $F^{(\mathrm{RUS})}$. On the other hand, the lowest similarity score from the PGE model is sim(POL, DEU) = 0.622, which shows that the German model’s view of the Polish word stimuli is the view that differs the most from the view of the native Polish model. Likewise, the highest similarity score we observe from the CAE model is sim(BUL, RUS) = 0.763, while the lowest similarity is sim(DEU, BUL) = 0.649. If we compare these scores to those of the contrastive CSE model, we observe significant cross-language differences as the highest similarity scores are sim(POL, RUS) = sim(POR, RUS) = 0.269, while the lowest is sim(BUL, DEU) = 0.170. This discrepancy between the contrastive model and the
other models suggests that training AWE models with a contrastive objective hinders the ability of the encoder to learn high-level phonological abstractions and therefore contrastive encoders are more sensitive to the cross-lingual variation during inference compared to their sequence-to-sequence counterparts.

PGE (rows: language of the stimuli; columns: language of the encoder)
        CZE     POL     RUS     BUL     POR     FRA     DEU
CZE     -       0.745   0.733   0.734   0.676   0.699   0.651
POL     0.731   -       0.712   0.722   0.669   0.668   0.622
RUS     0.723   0.720   -       0.748   0.694   0.681   0.624
BUL     0.719   0.724   0.745   -       0.700   0.691   0.638
POR     0.684   0.705   0.721   0.718   -       0.688   0.649
FRA     0.696   0.691   0.683   0.700   0.679   -       0.663
DEU     0.683   0.675   0.658   0.663   0.658   0.661   -

CAE (rows: language of the stimuli; columns: language of the encoder)
        CZE     POL     RUS     BUL     POR     FRA     DEU
CZE     -       0.762   0.747   0.736   0.706   0.717   0.703
POL     0.759   -       0.721   0.721   0.693   0.700   0.681
RUS     0.745   0.735   -       0.761   0.721   0.695   0.690
BUL     0.742   0.731   0.763   -       0.731   0.706   0.692
POR     0.737   0.722   0.735   0.736   -       0.726   0.705
FRA     0.739   0.735   0.705   0.710   0.737   -       0.717
DEU     0.687   0.680   0.659   0.649   0.660   0.658   -

CSE (rows: language of the stimuli; columns: language of the encoder)
        CZE     POL     RUS     BUL     POR     FRA     DEU
CZE     -       0.264   0.244   0.225   0.208   0.203   0.180
POL     0.248   -       0.269   0.245   0.219   0.205   0.179
RUS     0.220   0.257   -       0.234   0.212   0.189   0.178
BUL     0.207   0.241   0.238   -       0.198   0.190   0.170
POR     0.234   0.266   0.269   0.254   -       0.234   0.202
FRA     0.193   0.220   0.200   0.202   0.197   -       0.179
DEU     0.180   0.195   0.193   0.183   0.177   0.182   -

Figure 3: The cross-lingual representational similarity matrix (xRSM) for each model: PGE (Left), CAE (Middle), and CSE (Right). Each row corresponds to the language of the spoken-word stimuli while each column corresponds to the language of the encoder. Note that the matrices are not symmetric. For example in the PGE model, the cell at row CZE and column POL holds the value of sim(CZE, POL) = $\mathrm{CKA}(\mathbf{X}^{(\mathrm{CZE}/\mathrm{CZE})}, \mathbf{X}^{(\mathrm{CZE}/\mathrm{POL})})$ = 0.745, while the cell at row POL and column CZE holds the value of sim(POL, CZE) = $\mathrm{CKA}(\mathbf{X}^{(\mathrm{POL}/\mathrm{POL})}, \mathbf{X}^{(\mathrm{POL}/\mathrm{CZE})})$ = 0.731.

5.1 Cross-Lingual Comparison

To get further insights into how language similarity affects the model representations of non-native spoken-word segments, we apply hierarchical clustering on the xRSMs in Figure 3 using the Ward algorithm (Ward, 1963) with Euclidean distance. The result of the clustering analysis is shown in Figure 4. Surprisingly, the generated trees from the xRSMs of the PGE and CAE models are identical, which could indicate that these two models induce similar representations when trained on the same data. Diving a level deeper into the cluster structure of these two models, we observe that the Slavic languages form a pure cluster. This could be explained by the observation that some of the highest pair-wise similarity scores among the PGE models are observed between Russian and Bulgarian, i.e., sim(RUS, BUL) = 0.748 and sim(BUL, RUS) = 0.745, and Czech and Polish, i.e., sim(CZE, POL) = 0.745. The same trend can be observed in the CAE models, i.e., sim(RUS, BUL) = 0.761, sim(BUL, RUS) = 0.763, and sim(CZE, POL) = 0.762. Within the Slavic cluster, the West Slavic languages Czech and Polish are grouped together, while Russian (East Slavic) is first grouped with Bulgarian (South Slavic) before joining the West Slavic group to make the Slavic cluster. Although Russian and Bulgarian belong to two different Slavic branches, we observe that this pair forms the first sub-cluster in both trees, at a distance smaller than that of the West Slavic cluster (Czech and Polish). At first, this might seem surprising as we would expect the West Slavic cluster to be formed at a lower distance given the similarities among the West Slavic languages which facilitate cross-language speech comprehension, as documented by sociolinguistic studies (Golubovic, 2016). However, Russian and Bulgarian share typological features at the acoustic-phonetic and phonological levels which distinguish them from West Slavic languages. (Note that our models do not have access to word orthography. Thus, the similarity cannot be due to the fact that Russian and Bulgarian use Cyrillic script.) We further elaborate on these typological features in §6. Even though Portuguese was grouped with French in the cluster analysis, which one might expect given that both are Romance languages descended from Latin, it is worth pointing out that the representations of the Portuguese word stimuli from the Slavic models show a higher similarity to the representations of the native Portuguese model compared to those obtained from the French model (with only two exceptions, the Czech PGE model and the Polish CAE model). We also provide an explanation of why this might be the case in §6.

The generated language cluster from the CSE model does not show any clear internal structure with respect to language groups since all cluster pairs are grouped at a much higher distance compared to the PGE and CAE models. Furthermore, the Slavic languages in the generated tree do not
form a pure cluster since Portuguese was placed inside the Slavic cluster. We also do not observe the West Slavic group as in the other two models since Polish was grouped first with Russian, and not Czech. We believe that this unexpected behavior of the CSE models can be attributed to the previously reported poor performance of contrastive AWEs in capturing word-form similarity (Abdullah et al., 2021).

Figure 4: Hierarchical clustering analysis on the cross-lingual representational similarity matrices (using linear CKA) of the three models: PGE (Left), CAE (Middle), and CSE (Right). Leaves are marked by language group (Slavic, Romance, Germanic).

Moreover, it is interesting to observe that German seems to be the most distant language to the other languages in our study. This observation holds across all three encoders since the representations of the German word stimuli by non-native models are the most dissimilar compared to the representations of the native German model.

5.2 Cross-Model Comparison

To verify our hypothesis that the models trained with decoding objectives (PGE and CAE) learn similar representations when trained on the same data, we conduct a similarity analysis between the models in a setting where we obtain views of the spoken-word stimuli from native models, then compare these views across different learning objectives while keeping the language fixed using CKA as we do in our cross-lingual comparison. The results of this analysis are shown in Figure 5. We observe that across different languages the pairwise representational similarity of the PGE-CAE models (linear CKA = 0.721) is very high in comparison to that of PGE-CSE models (linear CKA = 0.239) and CAE-CSE models (linear CKA = 0.230). (CKA scores are averaged over languages.) Although the non-linear CKA similarity scores tend to be higher, the general trend remains identical in both measures. This finding validates our hypothesis that the PGE and CAE models yield similar representation spaces when trained on the same data even though their decoding objectives operate over different modalities. That is, the decoder of the PGE model aims to generate the word’s phonological structure in the form of a sequence of discrete phonological units, while the decoder of the CAE model aims to generate an instance of the same word represented as a sequence of (continuous) spectral vectors. Moreover, the sequences that these decoders aim to generate vary in length (the mean phonological sequence length is ~6 phonemes while the mean spectral sequence length is ~50 vectors). Although the CAE model has no access to abstract phonological information of the word-forms it is trained on, it seems that this model learns non-trivial knowledge about word phonological structure as demonstrated by the representational similarity of its embeddings to those of a word embedding model that has access to word phonology (i.e., PGE) across different languages.

6 Discussion

6.1 Relevance to Related Work

Although the idea of representational similarity analysis (RSA) has originated in the neuroscience literature (Kriegeskorte et al., 2008), researchers in NLP and speech technology have employed a similar set of techniques to analyze emergent representations of multi-layer neural networks. For example, RSA has previously been employed to analyze the correlation between neural and symbolic representations (Chrupała and Alishahi, 2019), contextualized word representations (Abnar et al., 2019; Abdou et al., 2019; Lepori and McCoy, 2020; Wu et al., 2020), representations of self-supervised speech models (Chung et al., 2021) and visually-grounded speech models (Chrupała et al., 2020). We take the inspiration from previous work and
apply RSA in a unique setting: to analyze the impact of typological similarity between languages on the representational geometry. Our analysis focuses on neural models of spoken-word processing that are trained on naturalistic speech corpora. Our goal is two-fold: (1) to investigate whether or not neural networks exhibit predictable behavior when tested on speech from a different language (L2), and (2) to examine the extent to which the training strategy affects the emergent representations of L2 spoken-word stimuli. To the best of our knowledge, our study is the first to analyze the similarity of emergent representations in neural acoustic models from a cross-linguistic perspective.

Figure 5: Cross-model CKA scores: linear CKA (left) and non-linear CKA (right). Each point in this plot is the within-language representational similarity of two models trained with different objectives when tested on the same (monolingual) word stimuli. Each red point is the average CKA score per model pair. [Panels: Cross-model CKA (linear) and Cross-model CKA (RBF 0.5); model pairs: PGE vs. CAE, PGE vs. CSE, CAE vs. CSE.]

6.2 A Typological Discussion

Given the cross-linguistic nature of our study, we choose to discuss our findings on the representational similarity analysis from a language typology point of view. Language typology is a sub-field within linguistics that is concerned with the study and categorization of the world’s languages based on their linguistic structural properties (Comrie, 1988; Croft, 2002). Since acoustic word embeddings are word-level representations induced from actual acoustic realizations of word-forms, we focus on the phonetic and phonological properties of the languages in our study. We emphasize that our goal is not to use the method we propose in this paper as a computational approach for phylogenetic reconstruction, that is, discovering the historical relations between languages. However, typological similarity usually correlates with phylogenetic relatedness since languages that have diverged from a common historical ancestor inherited most of their typological features from the ancestor language (see Bjerva et al. (2019) for a discussion on how typological similarity and genetic relationships interact when analyzing neural embeddings).

Within the language sample we study in this paper, four of these languages belong to the Slavic branch of Indo-European languages. Compared to the Romance and Germanic branches of Indo-European, Slavic languages are known to be remarkably more similar to each other not only at lexical and syntactic levels, but also at the pre-lexical level (acoustic-phonetic and phonological features). These cross-linguistically shared features between Slavic languages include rich consonant inventories, phonemic iotation and complex consonant clusters. The similarities at different linguistic levels facilitate spoken intercomprehension, i.e., the ability of a listener to comprehend an utterance (to some degree) in a language that is unknown, but related to their mother tongue. Several sociolinguistic studies of mutual intelligibility have reported a higher degree of intercomprehension among speakers of Slavic languages compared to other language groups in Europe (Golubovic, 2016; Gooskens et al., 2018). On the other hand, and even though French and Portuguese are both Romance languages, they are less mutually intelligible compared to Slavic language pairs as they have diverged in their lexicons and phonological structures (Gooskens et al., 2018). This shows that cross-language speech intelligibility is not mainly driven by shared historical relationships, but by contemporary typological similarities.

Therefore, and given the documented phonological similarities between Slavic languages, it is not surprising that Slavic languages form a pure cluster in our clustering analysis over the xRSMs of two of the models we investigate. Furthermore, the grouping of Czech-Polish and Russian-Bulgarian in the Slavic cluster can be explained if we consider word-specific suprasegmental features. Besides the fact that Czech and Polish are both West Slavic languages that form a spatial continuum of language variation, both languages have fixed stress. The word stress in Czech is always on the initial syllable, while Polish has penultimate stress, that is, stress falls on the syllable preceding the last syllable of the word (Sussex and Cubberley, 2006). On the other hand, Russian and Bulgarian belong to different Slavic sub-groups: Russian is East Slavic while Bulgarian is South
Slavic. The representational similarity between Russian and Bulgarian can be attributed to the typological similarities between Bulgarian and East Slavic languages. In contrast to West Slavic languages, the stress in Russian and Bulgarian is free (it can occur on any syllable in the word) and movable (it moves between syllables within morphological paradigms). Slavic languages with free stress tend to have a stronger contrast between stressed and unstressed syllables, and vowels in unstressed syllables undergo a process that is known as vowel-quality alternation, or lexicalized vowel reduction (Barry and Andreeva, 2001). For example, consider the Russian word ruka (‘hand’). In the singular nominative case the word-form is phonetically realized as ruká [rʊˈka] while in the singular accusative case rúku [ˈrukʊ] (Sussex and Cubberley, 2006). Here, the unstressed high back vowel /u/ is realized as [ʊ] in both word-forms. Although vowel-quality alternations in Bulgarian follow different patterns, Russian and Bulgarian vowels in unstressed syllables are reduced in temporal duration and quality. Therefore, the high representational similarity between Russian and Bulgarian could be explained if we consider their typological similarities in word-specific phonetics and prosody.

The Portuguese language presents us with an interesting case study of language variation. From a phylogenetic point of view, Portuguese and French have diverged from Latin and therefore they are both categorized as Romance languages. The clustering of the xRSMs of the PGE and CAE models groups Portuguese and French together at a higher distance compared to the Slavic group, while Portuguese was grouped with Slavic languages when analyzing the representations of the contrastive CSE model. Similar to Russian and Polish, sibilant consonants (e.g., /ʃ/, /ʒ/) and the velarized (dark) L-sound (i.e., /ɫ/) are frequent in Portuguese. We hypothesize that contrastive training encourages the model to pay more attention to the segmental information (i.e., individual phones) in the speech signal at the expense of phonotactics (i.e., phone sequences). Given that contrastive learning is a predominant paradigm in speech representation learning (van den Oord et al., 2018; Schneider et al., 2019), we encourage further research to analyze whether or not speech processing models trained with contrastive objectives exhibit a similar behavior to that observed in human listeners and closely examine their plausibility for cognitive modeling.

6.3 Implications on Cross-Lingual Transfer Learning

A recent study on zero-resource AWEs has shown that cross-lingual transfer is more successful when the source (L1) and target (L2) languages are more related (Jacobs and Kamper, 2021). We conducted a preliminary experiment on the cross-lingual word discrimination performance of the models in our study and observed a similar effect. However, we also observed that cross-lingual transfer using the contrastive model is less effective compared to the models trained with decoding objectives. From our reported RSA analysis, we showed that the contrastive objective yields models that are cross-lingually dissimilar compared to the other objectives. Therefore, future work could investigate the degree to which our proposed RSA approach predicts the effectiveness of different models in a cross-lingual zero-shot scenario.

7 Conclusion

We presented an experimental design based on representational similarity analysis (RSA) whereby we analyzed the impact of language similarity on representational similarity of acoustic word embeddings (AWEs). Our experiments have shown that AWE models trained using decoding objectives exhibit a higher degree of representational similarity if their training languages are typologically similar. We discussed our findings from a typological perspective and highlighted pre-lexical features that could have an impact on the models’ representational geometry. Our findings provide evidence that AWE models can predict the facilitatory effect of language similarity on cross-language speech perception and complement ongoing efforts in the community to assess their utility in cognitive modeling. Our work can be further extended by considering speech segments below the word-level (e.g., syllables, phonemes), incorporating semantic representations into the learning procedure, and investigating other neural architectures.

Acknowledgements

We thank the anonymous reviewers for their constructive comments and insightful feedback. We further extend our gratitude to Miriam Schulz and Marius Mosbach for proofreading the paper. This research is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), Project ID 232722074 – SFB 1102.
References

Mostafa Abdou, Artur Kulmizev, Felix Hill, Daniel M. Low, and Anders Søgaard. 2019. Higher-order comparisons of sentence encoder representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5838–5845, Hong Kong, China. Association for Computational Linguistics.

Badr M. Abdullah, Marius Mosbach, Iuliia Zaitova, Bernd Möbius, and Dietrich Klakow. 2021. Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study. In Proc. Interspeech 2021, pages 4194–4198.

Samira Abnar, Lisa Beinborn, Rochelle Choenni, and Willem Zuidema. 2019. Blackbox meets blackbox: Representational similarity & stability analysis of neural language models and brains. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 191–203, Florence, Italy. Association for Computational Linguistics.

Afra Alishahi, Marie Barking, and Grzegorz Chrupała. 2017. Encoding of phonology in a recurrent neural model of grounded speech. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 368–378, Vancouver, Canada. Association for Computational Linguistics.

William Barry and Bistra Andreeva. 2001. Cross-language similarities and differences in spontaneous speech patterns. Journal of the International Phonetic Association, 31(1):51–66.

Johannes Bjerva, Robert Östling, Maria Han Veiga, Jörg Tiedemann, and Isabelle Augenstein. 2019. What do language representations really represent? Computational Linguistics, 45(2):381–389.

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1994. Signature verification using a "siamese" time delay neural network. In Proc. NIPS.

Michael A Carlin, Samuel Thomas, Aren Jansen, and Hynek Hermansky. 2011. Rapid evaluation of speech representations for spoken term discovery. In Proc. Interspeech.

Grzegorz Chrupała and Afra Alishahi. 2019. Correlating neural and symbolic representations of language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2952–2962, Florence, Italy. Association for Computational Linguistics.

Grzegorz Chrupała, Bertrand Higy, and Afra Alishahi. 2020. Analyzing analytical methods: The case of phonology in neural models of spoken language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4146–4156, Online. Association for Computational Linguistics.

Yu-An Chung, Yonatan Belinkov, and James Glass. 2021. Similarity analysis of self-supervised speech representations. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3040–3044. IEEE.

Bernard Comrie. 1988. Linguistic typology. Annual Review of Anthropology, 17:145–159.

Svetlana V Cook, Nick B Pandža, Alia K Lancaster, and Kira Gor. 2016. Fuzzy nonnative phonolexical representations lead to fuzzy form-to-meaning mappings. Frontiers in Psychology, 7:1345.

William Croft. 2002. Typology and Universals. Cambridge University Press.

Emmanuel Dupoux. 2018. Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition, 173:43–59.

Lieke Gelderloos, Grzegorz Chrupała, and Afra Alishahi. 2020. Learning to understand child-directed and adult-directed speech. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1–6, Online. Association for Computational Linguistics.

Jelena Golubovic. 2016. Mutual intelligibility in the Slavic language area. Groningen: Center for Language and Cognition.

Charlotte Gooskens, Vincent J van Heuven, Jelena Golubović, Anja Schüppert, Femke Swarte, and Stefanie Voigt. 2018. Mutual intelligibility between closely related languages in Europe. International Journal of Multilingualism, 15(2):169–193.

Christiaan Jacobs and Herman Kamper. 2021. Multilingual Transfer of Acoustic Word Embeddings Improves When Training on Languages Related to the Target Zero-Resource Language. In Proc. Interspeech, pages 1549–1553.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.

H. Kamper, W. Wang, and Karen Livescu. 2016. Deep convolutional acoustic word embeddings using word-pair side information. In Proc. ICASSP.

Herman Kamper. 2019. Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models. In Proc. ICASSP.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proc. ICLR.

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR.

Nikolaus Kriegeskorte, Marieke Mur, and Peter A Bandettini. 2008. Representational similarity analysis-connecting the branches of systems neuroscience. Frontiers in systems neuroscience, 2:4.

Michael Lepori and R. Thomas McCoy. 2020. Picking BERT’s brain: Probing for linguistic dependencies in contextualized embeddings using representational similarity analysis. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3637–3651, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Keith Levin, Katharine Henry, Aren Jansen, and Karen Livescu. 2013. Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings. In Proc. ASRU.

James S Magnuson, Heejo You, Sahil Luthra, Monica Li, Hosung Nam, Monty Escabi, Kevin Brown, Paul D Allopenna, Rachel M Theodore, Nicholas Monto, et al. 2020. Earshot: A minimal neural network model of incremental human speech recognition. Cognitive science, 44(4):e12823.

Yevgen Matusevych, Herman Kamper, and Sharon Goldwater. 2020a. Analyzing autoencoder-based acoustic word embeddings. In Bridging AI and Cognitive Science Workshop@ ICLR 2020.

Yevgen Matusevych, Herman Kamper, Thomas Schatz, Naomi Feldman, and Sharon Goldwater. 2021. A phonetic model of non-native spoken word processing. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1480–1490, Online. Association for Computational Linguistics.

Yevgen Matusevych, Thomas Schatz, Herman Kamper, Naomi Feldman, and Sharon Goldwater. 2020b. Evaluating computational models of infant phonetic learning across languages. In Proc. CogSci.

Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. 2017. Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In Interspeech.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. In Proc. NeurIPS.

Okko Räsänen, Tasha Nagamine, and Nima Mesgarani. 2016. Analyzing distributional learning of phonemic categories in unsupervised deep neural networks. Annual Conference of the Cognitive Science Society (CogSci). Cognitive Science Society (U.S.). Conference, 2016:1757–1762.

Håkan Ringbom. 2006. Cross-linguistic similarity in foreign language learning. Multilingual Matters.

Odette Scharenborg, Nikki van der Gouw, Martha Larson, and Elena Marchiori. 2019. The representation of speech in deep neural networks. In International Conference on Multimedia Modeling, pages 194–205. Springer.

Thomas Schatz and Naomi H Feldman. 2018. Neural network vs. hmm speech recognition systems as models of human cross-linguistic phonetic perception. In Proceedings of the conference on cognitive computational neuroscience.

Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised Pre-Training for Speech Recognition. In Proc. Interspeech 2019, pages 3465–3469.

Tanja Schultz, Ngoc Thang Vu, and Tim Schlippe. 2013. GlobalPhone: A multilingual text and speech database in 20 languages. In Proc. ICASSP.

Shane Settle and Karen Livescu. 2016. Discriminative acoustic word embeddings: Recurrent neural network-based approaches. In Proc. IEEE Spoken Language Technology Workshop (SLT).

Roland Sussex and Paul Cubberley. 2006. The slavic languages. Cambridge University Press.

Aäron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. ArXiv, abs/1807.03748.

Joe H Ward. 1963. Hierarchical grouping to optimize an objective function. Journal of the American statistical association, 58(301):236–244.

John Wu, Yonatan Belinkov, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2020. Similarity analysis of contextual word representation models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4638–4655.

Ludger Zeevaert. 2007. Receptive multilingualism. JD Ten Thije & L. Zeevaert (Eds.), 6:103–137.
Appendices

A Experimental Data Statistics

Table 2 shows word-level summary statistics of our experimental data extracted from the GlobalPhone speech database (GPS).

B Representational Similarity with Non-Linear CKA

We provide the cross-lingual representational similarity matrices (xRSMs) constructed by applying the non-linear CKA measure with a radial basis function (RBF) in Figure 6 below. We observe similar trends to those presented in Figure 3.

Applying hierarchical clustering on the xRSMs in Figure 6 yields the language clusters shown in Figure 7. We observe that the clusters are identical to those shown in Figure 4, except for the CSE model where Portuguese is no longer in the Slavic group. However, Portuguese remains the non-Slavic language that is the most similar to Slavic languages.
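To make the non-linear CKA measure mentioned above more concrete, the following is a minimal sketch of kernel CKA with an RBF kernel, following the general formulation of Kornblith et al. (2019). It is not the authors' implementation. The function names and the bandwidth heuristic (a fraction `sigma_frac` of the median pairwise distance) are assumptions made for illustration; the text above does not state which bandwidth was used.

```python
import numpy as np


def rbf_kernel(x, sigma_frac=0.5):
    """RBF Gram matrix for the rows of x (n_samples x dim).

    Assumption: sigma^2 is set to sigma_frac^2 times the median
    squared pairwise distance between samples.
    """
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negatives
    off_diag = sq_dists[~np.eye(len(x), dtype=bool)]
    sigma_sq = (sigma_frac ** 2) * np.median(off_diag)
    return np.exp(-sq_dists / (2.0 * sigma_sq))


def center_gram(k):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 11^T."""
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ k @ h


def kernel_cka(x, y, sigma_frac=0.5):
    """Kernel CKA between two sets of embeddings of the same n stimuli."""
    kc = center_gram(rbf_kernel(x, sigma_frac))
    lc = center_gram(rbf_kernel(y, sigma_frac))
    hsic_xy = np.sum(kc * lc)  # equals trace(kc @ lc) for symmetric matrices
    return hsic_xy / np.sqrt(np.sum(kc * kc) * np.sum(lc * lc))
```

Under these assumptions, `x` and `y` would hold the embeddings of the same spoken-word stimuli produced by two different encoders, so that one call to `kernel_cka(x, y)` fills one cell of an xRSM. Because the bandwidth scales with the median distance, the score is unchanged when either embedding space is rescaled by a constant.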
Language  Lang. group   #Train spkrs  #Train samples  #Eval samples  #Phones/word (mean ± SD)  Word dur. in sec (mean ± SD)  Token-Type ratio
CZE       West Slavic   42            32,063          9,228          6.77 ± 2.28               0.492 ± 0.176                 0.175
POL       West Slavic   42            31,507          9,709          6.77 ± 2.27               0.488 ± 0.175                 0.192
RUS       East Slavic   42            31,892          9,005          7.56 ± 2.77               0.496 ± 0.163                 0.223
BUL       South Slavic  42            31,866          9,063          7.24 ± 2.60               0.510 ± 0.171                 0.179
POR       Romance       42            32,164          9,393          6.95 ± 2.36               0.526 ± 0.190                 0.134
FRA       Romance       42            32,497          9,656          6.24 ± 1.95               0.496 ± 0.163                 0.167
DEU       Germanic      42            32,162          9,865          6.57 ± 2.47               0.435 ± 0.178                 0.150

Table 2: Summary statistics of our experimental data.

[Figure 6: three heatmap panels with rows and columns labeled CZE, POL, RUS, BUL, POR, FRA, DEU]

Figure 6: The cross-lingual representational similarity matrix (xRSM) for each model: PGE (Left), CAE (Middle), and CSE (Right), obtained using the non-linear CKA measure. Each row corresponds to the language of the spoken-word stimuli while each column corresponds to the language of the encoder.

[Figure 7: three dendrogram panels over Czech, Polish, Russian, Bulgarian, Portuguese, French, and German; legend: Slavic, Romance, Germanic]

Figure 7: Hierarchical clustering analysis on the cross-lingual representational similarity matrices (using non-linear CKA) of the three models: PGE (Left), CAE (Middle), and CSE (Right).
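As a rough illustration of how a dendrogram like those in Figure 7 can be derived from an xRSM, the sketch below applies Ward linkage (Ward, 1963) using SciPy. Several choices here are assumptions and not taken from the paper: the asymmetric similarity matrix is symmetrized by averaging it with its transpose, similarity is converted to distance as 1 − similarity, and the helper name `cluster_languages` and the 4×4 toy matrix with placeholder labels are hypothetical. The toy values are invented and are not results from this study.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_languages(xrsm, n_clusters):
    """Ward clustering of a (possibly asymmetric) similarity matrix.

    Assumptions: symmetrize by averaging with the transpose and use
    1 - similarity as the distance. Ward linkage formally expects
    Euclidean distances, so this is only a heuristic.
    """
    sim = 0.5 * (xrsm + xrsm.T)
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=True)
    tree = linkage(condensed, method="ward")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return tree, labels


# Hypothetical toy similarities for four placeholder languages A, B, C, D.
toy_xrsm = np.array([
    [1.00, 0.80, 0.30, 0.25],
    [0.78, 1.00, 0.28, 0.30],
    [0.32, 0.30, 1.00, 0.75],
    [0.27, 0.26, 0.77, 1.00],
])
tree, labels = cluster_languages(toy_xrsm, n_clusters=2)
```

The returned `tree` can be passed to `scipy.cluster.hierarchy.dendrogram` for plotting, while `labels` assigns each row of the matrix to a flat cluster.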