Does Typological Blinding Impede Cross-Lingual Sharing?
Johannes Bjerva
Department of Computer Science
Aalborg University
jbjerva@cs.aau.dk

Isabelle Augenstein
Department of Computer Science
University of Copenhagen
augenstein@di.ku.dk

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 480–486, April 19–23, 2021. ©2021 Association for Computational Linguistics

Abstract

Bridging the performance gap between high- and low-resource languages has been the focus of much previous work. Typological features from databases such as the World Atlas of Language Structures (WALS) are a prime candidate for this, as such data exists even for very low-resource languages. However, previous work has only found minor benefits from using typological information. Our hypothesis is that a model trained in a cross-lingual setting will pick up on typological cues from the input data, thus overshadowing the utility of explicitly using such features. We verify this hypothesis by blinding a model to typological information, and investigate how cross-lingual sharing and performance are impacted. Our model is based on a cross-lingual architecture in which the latent weights governing the sharing between languages are learnt during training. We show that (i) preventing this model from exploiting typology severely reduces performance, while a control experiment reaffirms that (ii) encouraging sharing according to typology somewhat improves performance.

Figure 1: A PoS tagger is exposed (or blinded with gradient reversal, −λ) to typological features. Observing α values tells us how typology affects sharing.

1 Introduction

Most languages in the world have little access to NLP technology due to data scarcity (Joshi et al., 2020). Nonetheless, high-quality multilingual representations can be obtained using only a raw text signal, e.g. via multilingual language modelling (Devlin et al., 2019). Furthermore, structural similarities of languages are to a large extent documented in typological databases such as the World Atlas of Language Structures (WALS, Dryer and Haspelmath (2013)). Hence, developing models which can make use of typological similarities of languages is an important direction in order to alleviate language technology inequalities.

While previous work has attempted to use typological information to inform NLP models, our work differs significantly from such efforts in that we blind a model to this information. Most previous work includes language information as features, by using language IDs or language embeddings (e.g. Ammar et al. (2016); O'Horan et al. (2016); Östling and Tiedemann (2017); Ponti et al. (2019); Oncevay et al. (2020)). Notably, limited effects are usually observed from including typological features explicitly. For instance, de Lhoneux et al. (2018) observe positive cross-lingual sharing effects only in a handful of their settings. We therefore hypothesise that relevant typological information is learned as a by-product of cross-lingual training. Hence, although models do benefit from this information, it is not necessary to provide it explicitly in a high-resource scenario, where there is abundant training data. This is confirmed by Bjerva and Augenstein (2018a), who find that, e.g., language embeddings trained on a morphological task can encode morphological features from WALS.

In contrast with previous work, we blind a model to typological information, by using adversarial
techniques based on gradient reversal (Ganin and Lempitsky, 2014). We evaluate on the structured prediction and classification tasks in XTREME (Hu et al., 2020), yielding a total of 40 languages and 4 tasks. We show that when a model is blinded to typological signals relating to syntax and morphology, performance on related NLP tasks drops significantly. For instance, the mean accuracy across 40 languages for POS tagging drops by 1.8% when blinding the model to morphological features.

2 Model

An overview of the model is shown in Figure 1. We model each task in this paper using the following steps. First, contextual representations are extracted using multilingual BERT (m-BERT, Devlin et al. (2019)), a transformer-based model (Vaswani et al., 2017), trained with shared wordpieces across languages. We either blind m-BERT to typological features, with an added adversarial component based on gradient reversal (Ganin and Lempitsky, 2014), or expose it to them via multi-task learning (MTL, Caruana (1997)). Representations from m-BERT are fed to a latent multi-task architecture learning network (Ruder et al., 2019), which includes the α parameters we seek to investigate. The model learns which parameters to share between languages (e.g. α_{es,fr} denotes sharing between Spanish and French).

2.1 Sharing Architecture

Our sharing architecture is based on that of Ruder et al. (2019), which has latent variables, learned during training, governing which layers and subspaces are shared between tasks, to what extent, as well as the relative weighting of different task losses. We are most interested in the parameters which control the sharing between the hidden layers allocated to each task, referred to as α parameters (Ruder et al., 2019). Consider a setting with two tasks A and B. The outputs h_{A,k} and h_{B,k} of the k-th layer for tasks A and B interact through the α parameters, for which the output is defined as:

\begin{equation}
\begin{bmatrix} \tilde{h}_{A,k} \\ \tilde{h}_{B,k} \end{bmatrix}
=
\begin{bmatrix} \alpha_{AA} & \alpha_{AB} \\ \alpha_{BA} & \alpha_{BB} \end{bmatrix}
\begin{bmatrix} h_{A,k}^\top \\ h_{B,k}^\top \end{bmatrix}
\tag{1}
\end{equation}

where \tilde{h}_{A,k} is a linear combination of the activations for task A at layer k, weighted with the learned αs. While their model is an MTL model, we choose to interpret this differently by considering each language as a task, yielding α ∈ R^{l×l}, where l is the number of languages for the given task. Each activation \tilde{h}_{A,k} is then a linear combination of the language-specific activations h_{A,k}. These are used for prediction in the downstream tasks, as in the baselines from Hu et al. (2020). Crucially, this model allows us to draw conclusions about parameter sharing between languages by observing the α parameters under the blinding and prediction conditions. We combine this insight with observing downstream task performance in order to draw conclusions about the effects of typological feature blinding and prediction.

2.2 Blinding/Exposing a Model to Typology

We introduce a component which can either blind or expose the model to typological features. We implement this as a single task-specific layer per feature, using the [CLS] token from the m-BERT model, without access to any of the soft sharing between languages from the α-layers. Each layer optimises a categorical cross-entropy loss function (L_typ). For this task, we predict typological features drawn from WALS (Dryer and Haspelmath, 2013), inspired by previous work (Bjerva and Augenstein, 2018a). Unlike previous work, we also blind the model to such features by including a gradient reversal layer (Ganin and Lempitsky, 2014), which multiplies the gradient of the typological prediction task with a negative constant (−λ), inspired by previous work on adversarial learning (Goodfellow et al., 2014; Zhang et al., 2019; Chen et al., 2019). We hypothesise that using a gradient reversal layer for typology will yield typology-invariant features, and that this will reduce performance on tasks for which the typological feature at hand is important. For instance, we expect that blinding a model to syntactic features will severely reduce performance for tasks which rely heavily on syntax, such as POS tagging.

3 Cross-Lingual Experiments

We investigate the effects of typological blinding, using typological parameters as presented in WALS (Dryer and Haspelmath, 2013). The experiments are run on XTREME (Hu et al., 2020), which includes up to 40 languages from 12 language families and two isolates. We experiment on the following languages (ISO 639-1 codes): af, ar, bg, bn, de, el, en, es, et, eu, fa, fi, fr, he, hi, hu, id, it, ja, jv, ka, kk, ko, ml, mr, ms, my, nl, pt, ru, sw,
ta, te, th, tl, tr, ur, vi, yo, and zh. We experiment on four tasks: POS (part-of-speech tagging), NER (named entity recognition), XNLI (cross-lingual natural language inference), and PAWS-X (paraphrase identification). Our general setup for the structured prediction tasks (POS and NER) is that we train on all available languages, and downsample to 1,000 samples per language. For the classification tasks XNLI and PAWS-X, we train on the English training data and fine-tune on the development sets, as no training data is available for other languages. Hence, typological differences will be the main factor in our results, rather than differences in dataset sizes.

3.1 Typological Prediction and Blinding

We first investigate whether prohibiting or allowing access to typological features has an effect on model performance using our architecture. We hypothesise that our multilingual model will leverage signals related to the linguistic nature of a task when optimising its sharing parameters α.

There exists a growing body of work on prediction of typological features (Daumé III and Campbell, 2007; Murawaki, 2017; Bjerva and Augenstein, 2018b; Bjerva et al., 2019a,b), most notably in a recent shared task on the subject (Bjerva et al., 2020). While we are inspired by this direction of research, our contribution is not concerned with the accuracy of the prediction of such features, and this is therefore not evaluated in detail in the paper. Moreover, an increasing amount of work measures the correlation of the predictive performance of cross-lingual models with typological features as a way of probing what a model has learned about typology (Malaviya et al., 2017; Choenni and Shutova, 2020; Gerz et al., 2018; Nooralahzadeh et al., 2020; Zhao et al., 2020). In contrast to such post-hoc approaches, our experimental setting allows for measuring the impact of typology on cross-lingual sharing performance in a direct manner, as part of the model architecture.

Syntactic Features. We first blind/expose the model to syntactic features from WALS (Dryer and Haspelmath, 2013). We take the set of word order features which are annotated for all languages in our experiments, resulting in 33 features. This includes features such as 81A: Order of Subject, Object and Verb, which encodes what the preferred word ordering is (if any) in a transitive clause. For all features, we exclude feature values which do not occur for our set of languages. We hypothesise that performance will drop for all four tasks, as they all require syntactic understanding.

Morphological Features. We next attempt to blind/expose the model to the morphological features in WALS. We use the same approach as above, resulting in a total of 8 morphological features. This includes features such as 26A: Prefixing vs. Suffixing in Inflectional Morphology, indicating to what extent a language uses prefixing or suffixing morphology. We hypothesise that mainly the POS tagging task will suffer under this condition, whereas other tasks require morphology only to some extent.

Phonological Features. We next consider a control experiment, in which we attempt to blind/expose the model to phonological features in WALS. We arrive at a total of 15 phonological features, such as 1A: Consonant Inventories, which indicates the size of the consonant inventory of a language. We expect performance to remain relatively unaffected under this condition, as phonology ought to have little importance given a textual input.

Genealogical Features. Finally, we attempt to use what one might consider to be language metadata: we blind/expose the model to the language family a language belongs to. This can be seen as a type of proxy for language similarity, and correlates relatively strongly with structural similarities between languages. Because of this correlation with structural similarities, we expect blinding under this condition to only slightly reduce performance for all tasks, as previous work has shown this type of relationship not to be central in language representations (Bjerva et al., 2019c).

3.2 Results

In general, we observe a drop in performance when blinding the model to relevant typological information, and an increase in performance when exposing the model to it (Table 1). For phonological blinding or prediction, none of the four tasks is noticeably affected. Although, e.g., both the syntactic and morphological prediction tasks increase performance on POS tagging, it is not straightforward to draw conclusions on which of these is the most efficient, as there is a substantial correlation between syntactic and morphological features. As for XNLI and PAWS-X, performance notably drops under both the syntactic and genealogical blinding tasks.
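To make the two mechanisms behind these results concrete, the following minimal NumPy sketch illustrates the α-sharing of Eq. 1 (generalised to languages as tasks) and the gradient reversal used for blinding. The dimensions, the uniform α initialisation, and λ = 0.5 are illustrative assumptions only, not the configuration used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 "languages as tasks", hidden size 4 (values illustrative).
l, d = 3, 4
h = rng.normal(size=(l, d))        # h[i]: layer-k activations for language i

# Eq. 1 generalised to l languages: alpha in R^{l x l} mixes the
# language-specific activations into shared ones, h_tilde = alpha @ h.
alpha = np.full((l, l), 1.0 / l)   # latent sharing weights (uniform init here)
h_tilde = alpha @ h                # each row: linear combination over languages

# Gradient reversal (Ganin and Lempitsky, 2014): identity on the forward
# pass; on the backward pass, multiply the incoming gradient by -lambda.
lam = 0.5
def grad_reversal_backward(grad_typ_loss, lam=lam):
    # Pushes the shared features towards typology-invariance.
    return -lam * grad_typ_loss

g = np.ones(d)                     # hypothetical gradient of L_typ w.r.t. features
print(grad_reversal_backward(g))   # [-0.5 -0.5 -0.5 -0.5]
```

In a full training setup, alpha would be learned jointly with the downstream task, and the reversed gradient would flow from the typology classifier back into the encoder; the sketch only shows the two local computations.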
Model                   POS    NER    XNLI   PAWS-X
+ Syntactic Blind.      85.3−  76.4   64.2−  80.6−
+ Morphological Blind.  85.0−  77.2   64.9   81.4
+ Phonological Blind.   86.7   77.1   65.0   81.6
+ Genealogical Blind.   86.1   77.0   64.7   81.1
m-BERT baseline         86.8   77.3   65.1   81.7
+ Syntactic Pred.       87.0   77.5   65.3+  81.9+
+ Morphological Pred.   87.2+  77.3   65.2   81.7
+ Phonological Pred.    86.7   77.1   65.0   81.7
+ Genealogical Pred.    87.0   77.6   65.3+  81.8

Table 1: Typological Blinding and Prediction. Mean POS accuracy, NER F1 scores, XNLI accuracy and PAWS-X accuracy across all languages. + and − indicate significantly better or worse performance respectively, as determined by a one-tailed t-test (p < 0.01).

Figure 2: PoS tagging results per language family across blinding and prediction conditions.

Figure 2 shows results for PoS tagging under prediction and blinding across language families, following the same scheme as Hu et al. (2020). Interestingly, the syntactic and morphological blinding settings are robust across all language families, yielding a drop in accuracy across the board. All other conditions yield mixed results. This further strengthens our argument that preventing a model from learning syntactic and morphological features can be severely detrimental.

3.3 The Effect of Typology on Latent Architecture Learning

The results show that preventing access to typological features hampers performance, whereas providing access improves performance. We now turn to an analysis of how the model shares parameters across languages in this setting. Our hypothesis is that blinding will prevent models from sharing parameters between similar languages, in spite of typological similarities. Concretely, we expect that the drop in POS tagging performance under morphological blinding is caused by lower α weights between languages which are morphologically similar, and higher α weights between languages which are dissimilar. Recall that these parameters are latent variables learned by the model, regulating the amount of sharing between languages (see Eq. 1). We investigate the correlations between the α sharing parameters and two proxies of language similarity. We focus on the POS task, as the results from the typological blinding and prediction experiments were the most pronounced here, as both morphological and syntactic blinding affected performance.

Our first measure of language similarity is based on Bjerva et al. (2019c), who introduce what they refer to as structural similarity. This is based on
dependency statistics from the Universal Dependencies treebank (Zeman et al., 2020), resulting in vectors which describe how different syntactic relations are used in each language. Previous work has shown that this measure of similarity correlates strongly with that learned in embedded language spaces during multilingual training. In addition to considering these dependency statistics, we also use language embeddings drawn from Östling and Tiedemann (2017). For each language similarity measure, we calculate its pairwise Pearson correlation with the α values learned under each condition.

Model                   Struct.  Lang. Emb.
Syntactic Blind.        0.31     0.27
Morphological Blind.    0.34     0.29
Phonological Blind.     0.40     0.41
Genealogical Blind.     0.29     0.31
No blind./pred.         0.43     0.40
Syntactic Pred.         0.52     0.53
Morphological Pred.     0.49     0.56
Phonological Pred.      0.41     0.39
Genealogical Pred.      0.47     0.38

Table 2: Pearson correlations between α weights and language similarity measures.

Table 2 shows that the correlations between α weights and similarities increase when predicting typological features, and decrease when the model is blinded to such features. Hence, when the model has indirect access to, e.g., the SVO word ordering features of languages, sharing also reflects this.

4 Discussion

We have shown that blinding a multilingual model to typological features severely affects sharing across a relatively large language sample, and for several NLP tasks. The effects on model performance, as evaluated over 40 languages and 4 tasks from XTREME (Hu et al., 2020), were the largest for POS tagging. The fact that smaller effects were observed for NER could be because this task relies more on memorising NEs than on (morpho-)syntactic cues (Augenstein et al., 2017). Furthermore, the relatively small effects on XNLI and PAWS-X can also be interpreted as evidence that typology is less important in these tasks than in more traditional linguistic analysis.

A potential critique of our approach is that it merely blinds the model to language identities. This could be the case if only some latent representation of, e.g., "SVO" ordering is used to represent a language identity. However, previous work has shown that morphological information is encoded by the type of model we investigate. Hence, since we only blind features in a single category at a time, we expect that the model's representation of language identities is unaffected.

Not only do we observe a drop in performance when blinding a model to syntactic features, but we also observe that the α sharing weights in our model do not appear to correlate with linguistic similarities in this setting. Conversely, encouraging a model to consider typology, by jointly optimising it for typological feature prediction, improves performance in general. Furthermore, the α weights in this scenario converge towards correlating with the structural similarities of languages. This is in line with recent work which has found that m-BERT uses fine-grained syntactic distinctions in its cross-lingual representation space (Chi et al., 2020).

We interpret this as evidence that typology can be a necessity for modelling in NLP. Our results furthermore corroborate previous work in that we only find moderate benefits from including typological information explicitly. We expect that this is to a large degree due to the typological similarities of languages being encoded implicitly, based on correlations between patterns in the input data. As low-resource languages often do not have access to any substantial amount of raw text, but often do have annotations in WALS, we expect that using typological information can go some way towards building truly language-universal models.

5 Conclusions

We have shown that preventing access to typology can impede the performance of cross-lingual sharing models. Investigating the latent weights governing the sharing between languages shows that this prevents the model from sharing between typologically similar languages, which is otherwise learned based on patterns in the input. We therefore expect that using typological information can be of particular interest for building truly language-universal models for low-resource languages.

Acknowledgements

This research has received funding from the Swedish Research Council (grant No 2019-04129), and the NVIDIA Corporation (Titan Xp GPU).
References

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Many languages, one parser. Transactions of the Association for Computational Linguistics, 4:431–444.

Isabelle Augenstein, Leon Derczynski, and Kalina Bontcheva. 2017. Generalisation in named entity recognition: A quantitative analysis. Computer Speech & Language, 44:61–83.

Johannes Bjerva and Isabelle Augenstein. 2018a. From phonology to syntax: Unsupervised linguistic typology at different levels with language embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 907–916, New Orleans, Louisiana. Association for Computational Linguistics.

Johannes Bjerva and Isabelle Augenstein. 2018b. Tracking typological traits of Uralic languages in distributed language representations. In Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages, pages 76–86, Helsinki, Finland. Association for Computational Linguistics.

Johannes Bjerva, Yova Kementchedjhieva, Ryan Cotterell, and Isabelle Augenstein. 2019a. A probabilistic generative model of linguistic typology. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1529–1540, Minneapolis, Minnesota. Association for Computational Linguistics.

Johannes Bjerva, Yova Kementchedjhieva, Ryan Cotterell, and Isabelle Augenstein. 2019b. Uncovering probabilistic implications in typological knowledge bases. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3924–3930, Florence, Italy. Association for Computational Linguistics.

Johannes Bjerva, Robert Östling, Maria Han Veiga, Jörg Tiedemann, and Isabelle Augenstein. 2019c. What do language representations really represent? Computational Linguistics, 45(2):381–389.

Johannes Bjerva, Elizabeth Salesky, Sabrina J. Mielke, Aditi Chaudhary, Celano Giuseppe, Edoardo Maria Ponti, Ekaterina Vylomova, Ryan Cotterell, and Isabelle Augenstein. 2020. SIGTYP 2020 shared task: Prediction of typological features. In Proceedings of the Second Workshop on Computational Research in Linguistic Typology, pages 1–11, Online. Association for Computational Linguistics.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Steven Chen, Nicholas Carlini, and David Wagner. 2019. Stateful detection of black-box adversarial attacks. arXiv preprint arXiv:1907.05587.

Ethan A. Chi, John Hewitt, and Christopher D. Manning. 2020. Finding universal grammatical relations in multilingual BERT. arXiv preprint arXiv:2005.04511.

Rochelle Choenni and Ekaterina Shutova. 2020. What does it mean to be language-agnostic? Probing multilingual sentence encoders for typological properties. arXiv preprint arXiv:2009.12862.

Hal Daumé III and Lyle Campbell. 2007. A Bayesian model for discovering typological implications. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 65–72, Prague, Czech Republic. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Yaroslav Ganin and Victor Lempitsky. 2014. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495.

Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. Association for Computational Linguistics.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. arXiv preprint arXiv:2004.09095.

Miryam de Lhoneux, Johannes Bjerva, Isabelle Augenstein, and Anders Søgaard. 2018. Parameter sharing between dependency parsers for related languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4992–4997.

Chaitanya Malaviya, Graham Neubig, and Patrick Littell. 2017. Learning language representations for typology prediction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2529–2535, Copenhagen, Denmark. Association for Computational Linguistics.

Yugo Murawaki. 2017. Diachrony-aware induction of binary latent representations from typological features. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 451–461, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Farhad Nooralahzadeh, Giannis Bekoulis, Johannes Bjerva, and Isabelle Augenstein. 2020. Zero-shot cross-lingual transfer with meta learning. In Proceedings of EMNLP. Association for Computational Linguistics.

Helen O'Horan, Yevgeni Berzak, Ivan Vulić, Roi Reichart, and Anna Korhonen. 2016. Survey on the use of typological information in natural language processing. arXiv preprint arXiv:1610.03349.

Arturo Oncevay, Barry Haddow, and Alexandra Birch. 2020. Bridging linguistic typology and multilingual machine translation with multi-view language representations. In Proceedings of EMNLP. Association for Computational Linguistics. arXiv preprint arXiv:2004.14923.

Robert Östling and Jörg Tiedemann. 2017. Continuous multilinguality with language vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 644–649, Valencia, Spain. Association for Computational Linguistics.

Edoardo Maria Ponti, Helen O'Horan, Yevgeni Berzak, Ivan Vulić, Roi Reichart, Thierry Poibeau, Ekaterina Shutova, and Anna Korhonen. 2019. Modeling language variation and universals: A survey on typological linguistics for natural language processing. Computational Linguistics, 45(3):559–601.

Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2019. Latent multi-task architecture learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4822–4829.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Daniel Zeman, Joakim Nivre, Mitchell Abrams, et al. 2020. Universal Dependencies 2.6. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Huan Zhang, Hongge Chen, Zhao Song, Duane Boning, Inderjit S. Dhillon, and Cho-Jui Hsieh. 2019. The limitations of adversarial training and the blind-spot attack. arXiv preprint arXiv:1901.04684.

Wei Zhao, Steffen Eger, Johannes Bjerva, and Isabelle Augenstein. 2020. Inducing language-agnostic multilingual representations. arXiv preprint arXiv:2008.09112.