On the ability of monolingual models to learn language-agnostic representations
Leandro Rodrigues de Souza (1), Rodrigo Nogueira (1,2,3), Roberto Lotufo (1,2)
(1) Faculty of Electrical and Computer Engineering, University of Campinas (UNICAMP)
(2) NeuralMind Inteligência Artificial
(3) David R. Cheriton School of Computer Science, University of Waterloo

arXiv:2109.01942v2 [cs.CL] 25 Oct 2021

Abstract

Pretrained multilingual models have become a de facto default approach for zero-shot cross-lingual transfer. Previous work has shown that these models are able to achieve cross-lingual representations when pretrained on two or more languages with shared parameters. In this work, we provide evidence that a model can achieve language-agnostic representations even when pretrained on a single language. That is, we find that monolingual models pretrained and finetuned on different languages achieve competitive performance compared to the ones that use the same target language. Surprisingly, the models show similar performance on the same task regardless of the pretraining language. For example, models pretrained on distant languages such as German and Portuguese perform similarly on English tasks.

1 Introduction

Pretrained language models, such as BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020), still rely on high-quality labeled datasets for achieving good results in NLP tasks. Since creating such datasets can be challenging (Biswas et al., 2009; Sabou et al., 2012), many resource-lean languages suffer from a lack of data for training.

Multilingual pretraining has been developed to overcome this issue, giving rise to multilingual models like mBERT (Devlin et al., 2019), XLM (Conneau et al., 2018a, 2019) and mT5 (Xue et al., 2021). Unlike monolingual models, they are pretrained with mixed batches from a wide range of languages. The resulting model, finetuned on a task in a high-resource language, often performs well on the same task in a different language never seen during finetuning. Such models thus exhibit the so-called zero-shot cross-lingual transfer ability.

This approach has become the de facto default language transfer paradigm, with multiple studies and benchmarks reporting high transfer performance (Pires et al., 2019; Hu et al., 2020). At first, it was speculated that a shared vocabulary with common terms between languages would result in a shared input representation space, making it possible for the model to attain similar representations across languages (Pires et al., 2019). However, more recent work has shown that having a shared vocabulary is not critical for cross-lingual transfer (Artetxe et al., 2020; Conneau et al., 2020).

Conneau et al. (2020) released a large study on bilingual models to identify the most important characteristics that enable cross-lingual transfer. The authors found that a shared vocabulary with anchor points (common terms in both languages) plays a minor role. Sharing the Transformer's parameters when training on datasets from two languages seems to be critical. They also found an interesting property of monolingual models: they have similar representations in the first layers of the network, which can be aligned with a simple orthogonal mapping function.

Similar findings have been presented by Artetxe et al. (2020). They developed a procedure to learn representations for a specific language (i.e., a word embedding layer) aligned with the pretrained English layers of a Transformer model. This enables the model to transfer knowledge between languages by only swapping its lexicon.

As it becomes clearer that models trained on different languages achieve similar representations and that cross-lingual transfer can be reached with quite simple methods, we raise the following question: would monolingual models be able to learn from labeled datasets in a foreign language (i.e., a language different from the one used for pretraining) without any adjustments at all?

We ground our hypothesis on the fact that masked language modeling has been shown to produce strong representations even in different domains (Lu et al., 2021; Zügner et al., 2021). To the best of our knowledge, our work is the first to experiment with monolingual models learning from a dataset in a different language.
Based on these observations, we design an experiment to test whether monolingual models can leverage concepts acquired during pretraining and learn a new language and a new task from finetuning only. We pretrain a model on a source language and finetune it on a different language. We also evaluate models without pretraining, as a control experiment, to discard the hypothesis that the model is learning patterns only from the finetuning dataset.

Main contribution. We demonstrate that monolingual models can learn a task in a foreign language. Our results show that, although these models have inferior performance compared to the standard approach (i.e., pretraining and finetuning a model on the same language), they can still learn representations for a downstream task and perform well compared to models without pretraining. This raises the hypothesis that MLM pretraining provides the model with some language-agnostic properties that can be leveraged regardless of the language of the task.

In contrast to the current literature, the monolingual models used in this work do not rely on shared parameters during pretraining with multilingual datasets. They are pretrained, instead, on a single-language corpus and finetuned on a task in a different (foreign) language. The results raise questions about language-agnostic representations acquired during MLM pretraining and contribute to future directions in cross-lingual knowledge transfer research.

2 Related Work

Multilingual pretraining has been widely adopted as a cross-lingual knowledge transfer paradigm (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2019; Xue et al., 2021). Such models are pretrained on a multilingual corpus using masked language modeling objectives. The resulting models can be finetuned on high-resource languages and evaluated on a wide set of languages without additional training.

Pires et al. (2019) first speculated that the generalization ability across languages is due to the mapping of common word pieces, such as numbers and web addresses, to a shared space.

This hypothesis was tested by Artetxe et al. (2020). The experiment consisted of pretraining an English BERT model's lexicon on a target-language dataset: only the word embeddings are trainable. The authors then transferred knowledge by taking a BERT model finetuned on English and swapping its lexicon to the desired language. They obtained results similar to those of Pires et al. (2019), confirming the hypothesis that the model can learn representations that are language-agnostic and that depend on neither a shared vocabulary nor joint pretraining.

Conneau et al. (2020) have also demonstrated that a shared vocabulary with anchor points contributes little to language transfer. In their experiments, the authors carried out an extensive analysis of the results of bilingual models to identify the factors that contribute most to language transfer. The results indicate that the main component is the sharing of model parameters, which achieves good results even in the absence of a parallel corpus or a shared vocabulary. They also found that monolingual models pretrained with MLM have similar representations that can be aligned with a simple orthogonal mapping.
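The "simple orthogonal mapping" mentioned above is usually computed as an orthogonal Procrustes problem. The sketch below is ours, not the authors': it assumes two embedding matrices X and Y whose rows are vectors for paired words taken from two monolingual models, and the sizes are illustrative only.

    import numpy as np

    def orthogonal_map(X, Y):
        """Orthogonal matrix W minimizing ||X @ W - Y||_F (Procrustes solution)."""
        # SVD of the cross-covariance between the two embedding spaces.
        U, _, Vt = np.linalg.svd(X.T @ Y)
        return U @ Vt

    # Toy stand-ins: 1,000 word pairs with 768-dimensional embeddings, playing
    # the role of rows extracted from two monolingual BERT models.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 768))   # embeddings from model A (hypothetical)
    Y = rng.normal(size=(1000, 768))   # embeddings from model B (hypothetical)

    W = orthogonal_map(X, Y)           # W is orthogonal: W.T @ W is the identity
    aligned = X @ W                    # model A's space rotated into model B's space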
Zhao et al. (2020) observed that having a cross-lingual training objective contributes to the model learning cross-lingual representations, while monolingual objectives result in language-specific subspaces. These results indicate a negative correlation between the universality of a model and its ability to retain language-specific information, regardless of the architecture. Our experiments show that single-language pretraining still enables the model to achieve competitive performance in other languages via finetuning (on supervised data) only.

Libovický et al. (2019) found that mBERT word embeddings are more similar for languages of the same family, resulting in language-specific spaces that cannot be directly used for zero-shot cross-lingual tasks. They also showed that good cross-lingual representations can be achieved with a small parallel dataset. We show, however, that a monolingual model can achieve competitive performance in a different language, without parallel corpora, suggesting that it contains language-agnostic properties.

While multilingual models exhibit great performance across languages (Hu et al., 2020), monolingual models still perform better in their main language. For instance, Chi et al. (2020) distill a monolingual model into a multilingual one and achieve good cross-lingual performance. We take a different approach by using the monolingual model itself instead of extracting knowledge from it.
Rust et al. (2020) compared multilingual and monolingual models on monolingual tasks (i.e., tasks whose language is the same as the monolingual model's). They found that both the size of the pretraining data in the target language and the vocabulary have a positive correlation with the monolingual models' performance. Based on our results, we hypothesize that a model pretrained with MLM on a large monolingual corpus develops both language-specific and language-agnostic properties, with the latter predominating over the former.

3 Methodology

Our method consists of a pretrain-finetune approach that uses a different language for each stage. We call the language used for pretraining our models the source language. We refer to the target language as a second language, different from the one used for pretraining. We apply the following steps (sketched in code further below):

1. Pretrain a monolingual model on the source language with the masked language modeling (MLM) objective, using a large, unlabeled dataset.

2. Finetune and evaluate the model on a downstream task with a labeled dataset in the target language.

The novelty of our approach is to perform a cross-lingual evaluation using monolingual models instead of bilingual or multilingual ones. We aim to assess whether the model is able to rely on its masked language pretraining to achieve good representations for a task even when finetuned on a different language. If successful, this would suggest that MLM pretraining provides the model with representations for more abstract concepts rather than only a specific language.
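To make the experimental design concrete, the following sketch enumerates the full grid of source-pretrained models (plus the non-pretrained control) against target-language tasks. It is an illustration only: the checkpoint identifiers, dataset names, and the train_on_task helper are hypothetical placeholders, not artifacts released with the paper.

    # Sketch of the cross-lingual evaluation grid: every source-language model
    # (plus a randomly initialized control) is finetuned and evaluated on every
    # target-language task. All names below are placeholders.

    SOURCE_MODELS = {
        "none": None,          # control: no pretraining, random initialization
        "en": "bert-en-base",  # hypothetical checkpoint identifiers
        "de": "bert-de-base",
        "pt": "bert-pt-base",
        "vi": "bert-vi-base",
    }

    TARGET_TASKS = [
        ("en", "squad_v1", "qa"),   ("en", "mnli", "nli"),
        ("de", "germanquad", "qa"), ("de", "xnli_de", "nli"),
        ("pt", "faquad", "qa"),     ("pt", "assin2", "nli"),
        ("vi", "viquad", "qa"),     ("vi", "xnli_vi", "nli"),
    ]

    def run_grid(train_on_task):
        """train_on_task(checkpoint, dataset, task_type) -> score; assumed given."""
        results = {}
        for source, checkpoint in SOURCE_MODELS.items():
            for target, dataset, task_type in TARGET_TASKS:
                # The pretraining language and the task language may differ;
                # that mismatch is exactly the condition the paper evaluates.
                results[(source, target, task_type)] = train_on_task(
                    checkpoint, dataset, task_type
                )
        return results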
Pretraining data. Our monolingual models are pretrained on a large unlabeled corpus, using the source language's vocabulary. Some high-resource languages, such as English, have a strong presence in datasets of other languages, which are often created by crawling web resources. This may influence the model's transfer ability, because it has seen some examples from the foreign language during pretraining. However, the corpora used to pretrain our models contain a very small amount of sentences in other languages. For instance, the Portuguese pretraining corpus has only 14,928 sentences (0.01%) in Vietnamese.

Control experiment. To discard the hypothesis that the monolingual model learns patterns from the finetuning dataset alone, instead of relying on more general concepts from both finetuning and pretraining, we perform a control experiment. We train the models on the target-language tasks without any pretraining. If models with monolingual pretraining have significantly better results, we may conclude that they use knowledge from pretraining instead of only learning patterns from the finetuning data.

Evaluation tasks. We follow a selection similar to Artetxe et al. (2020) and use two types of downstream tasks for evaluation: natural language inference (NLI) and question answering (QA). Even though a classification task highlights the model's ability to understand the relationship between sentences, it has been shown that a model may learn superficial cues to perform well (Gururangan et al., 2018). Because of that, we also select question answering, which requires natural language understanding as well.

4 Experiments

In this section, we outline the models, datasets and tasks we use in our experiments.

4.1 Models

We perform experiments with four base models, listed in Table 1. The experiments run on the Base version of the BERT architecture (Devlin et al., 2019), with 12 layers, 768 hidden dimensions and 12 attention heads. We use models initialized with random weights. We also report the number of parameters (in millions) and the pretraining dataset sizes in Table 1.

4.2 Pretraining procedure

The selected models are pretrained from random weights using a monolingual tokenizer created from data in their native language. We also select models that have been pretrained using the Masked Language Modeling (MLM) objective as described by Devlin et al. (2019).
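As an illustration of this pretraining setup, the sketch below wires a randomly initialized BERT-Base model to an MLM data collator using the Hugging Face Transformers library. This is an assumption-laden sketch: the tokenizer path is a placeholder, and the paper does not state that this particular library or code was used.

    from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                              DataCollatorForLanguageModeling)

    # Monolingual WordPiece tokenizer built from the source-language corpus
    # (the path below is a placeholder, not an artifact released by the paper).
    tokenizer = BertTokenizerFast.from_pretrained("tokenizers/pt-wordpiece")

    # BERT-Base configuration initialized with random weights, matching Section 4.1.
    config = BertConfig(
        vocab_size=tokenizer.vocab_size,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
    )
    model = BertForMaskedLM(config)

    # MLM objective as in Devlin et al. (2019): 15% of the input tokens are
    # selected for masking and become the prediction targets.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )

    # The model, tokenizer, and collator would then be handed to a standard
    # training loop (e.g., transformers.Trainer) over the unlabeled source corpus.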
Model name                          Initialization   Language          # of params.   Data size
BERT-EN (Devlin et al., 2019)       Random           English (en)      108 M          16 GB
BERT-PT (ours)                      Random           Portuguese (pt)   108 M          16 GB
BERT-DE (Chan et al., 2020)         Random           German (de)       109 M          12 GB
BERT-VI (Nguyen and Nguyen, 2020)   Random           Vietnamese (vi)   135 M          20 GB

Table 1: Pretrained models used in this work.

4.3 Finetuning procedure

We use the AdamW optimizer (Loshchilov and Hutter, 2019) and finetune for 3 epochs, saving a checkpoint after every epoch. Results are reported using the best checkpoint based on validation metrics. The initial learning rate is 3e-5, followed by a linear decay with no warm-up steps.

For NLI, we use a maximum input length of 128 tokens and a batch size of 32. For QA, except for BERT-VI, the input sequence is restricted to 384 tokens with a document stride of 128 tokens and a batch size of 16. For BERT-VI, we use an input sequence of 256 tokens and a document stride of 85, because BERT-VI was pretrained on sequences of 256 tokens, which limits its positional token embeddings to this size.

We chose this hyperparameter configuration based on preliminary experiments with the English and Portuguese models trained on question answering and natural language inference tasks in their native languages. We do not perform any additional hyperparameter search.

For the control experiments, we finetune the models for 30 epochs using the same hyperparameters, with a patience of 5 epochs over the main validation metric. This longer training aims to discard the hypothesis that the pretrained models merely have an initial weight distribution that is favorable compared to a random initialization, which could otherwise be the main reason for their better performance.
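A minimal sketch of the finetuning configuration above, written against the Hugging Face Trainer API purely for illustration; the paper does not include its training code, and the model and dataset arguments are assumed to be prepared elsewhere.

    from transformers import Trainer, TrainingArguments

    def build_trainer(model, train_dataset, eval_dataset, output_dir, batch_size=32):
        """Assemble a Trainer with the hyperparameters reported in Section 4.3."""
        args = TrainingArguments(
            output_dir=output_dir,                    # placeholder path
            num_train_epochs=3,                       # finetune for 3 epochs
            learning_rate=3e-5,                       # initial learning rate
            lr_scheduler_type="linear",               # linear decay ...
            warmup_steps=0,                           # ... with no warm-up steps
            per_device_train_batch_size=batch_size,   # 32 for NLI, 16 for QA
            evaluation_strategy="epoch",              # evaluate after every epoch
            save_strategy="epoch",                    # save a checkpoint after every epoch
            load_best_model_at_end=True,              # keep the best validation checkpoint
        )
        # AdamW is the Trainer's default optimizer (Loshchilov and Hutter, 2019).
        return Trainer(model=model, args=args,
                       train_dataset=train_dataset, eval_dataset=eval_dataset)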
4.3.1 Finetuning tasks

Question Answering. We select the following tasks for finetuning: SQuAD v1.1 for English (Rajpurkar et al., 2016); FaQuAD for Portuguese (Sayama et al., 2019); GermanQuAD for German (Chan et al., 2020); and ViQuAD for Vietnamese (Nguyen et al., 2020).

Natural Language Inference. We use MNLI for English (Williams et al., 2018); ASSIN2 for Portuguese (Real et al., 2020); and XNLI for both Vietnamese and German (Conneau et al., 2018b).

Dataset sources: SQuAD v1.1: https://raw.githubusercontent.com/rajpurkar/SQuAD-explorer/master/dataset/; FaQuAD: https://raw.githubusercontent.com/liafacom/faquad/master/data/; GermanQuAD: https://www.deepset.ai/datasets; ViQuAD: https://sites.google.com/uit.edu.vn/uit-nlp/datasets-projects; MNLI: https://cims.nyu.edu/~sbowman/multinli/; ASSIN2: https://sites.google.com/view/assin2/; XNLI: https://github.com/facebookresearch/XNLI
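The question answering results in the next section are reported as F1 scores. The paper does not reproduce its evaluation script, so the following is only a minimal sketch of the usual SQuAD-style token-overlap F1 that such benchmarks report (the official scripts additionally normalize punctuation and articles).

    from collections import Counter

    def qa_f1(prediction, gold):
        """Token-overlap F1 between a predicted answer span and a gold answer."""
        pred_tokens = prediction.lower().split()
        gold_tokens = gold.lower().split()
        if not pred_tokens or not gold_tokens:
            return float(pred_tokens == gold_tokens)
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    # Example: partial overlap yields a score between 0 and 1.
    print(qa_f1("University of Campinas", "the University of Campinas"))  # about 0.857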
5 Results

Table 2 reports the results for the question answering experiments and Table 3 for natural language inference. For every experiment, we indicate the pretraining (source) and finetuning (target) languages used. We denote by None the models of our control experiment, which are not pretrained.

We use a color scheme for better visualization. Results highlighted in red are considered the upper bound (models pretrained and finetuned on the same language); orange is used for models finetuned on a language different from the pretraining one; blue is used for the lower-bound experiments (control experiments).

Pretraining    Finetune and Evaluation
               en        de        pt        vi
None           21.50     13.38     27.23     13.94
en             88.32     39.28     45.22     50.97
de             78.25     68.63     43.85     36.41
pt             79.31     44.28     59.28     55.99
vi             76.03     29.56     40.00     79.53

Table 2: Results (F1 score) for question answering tasks. Results highlighted in red are considered the upper bound, while blue is used for the lower bound. Only cells of the same column are comparable, since they belong to the same task.

Pretraining    Finetune and Evaluation
               en        de        pt        vi
None           61.54     59.48     70.22     54.09
en             83.85     67.71     82.56     64.20
de             71.22     78.27     77.45     47.98
pt             75.56     65.06     86.56     63.15
vi             72.05     66.63     80.64     76.25
# of classes   3         3         2         3

Table 3: Results (accuracy) for natural language inference in each language with different pretrainings. The last row shows the number of classes for each task.

Models leverage pretraining, even when finetuned on different languages. For all selected languages, models can take advantage of their pretraining even when finetuned on a different language. If we compare, for instance, the results on SQuAD (finetuning and evaluation in English in Table 2), models pretrained on any of the languages outperform the control experiment by at least 50 points in F1 score. A very similar observation can be drawn from the results in the other languages.

For a specific task, models pretrained on different languages achieve similar results. An interesting observation is that the results obtained in a foreign language are quite similar across pretraining languages. Taking SQuAD as an example, when using models pretrained on the other languages, the F1 scores are all around 78. The same can be observed on FaQuAD (finetuning and evaluation in Portuguese in Table 2), for which the results are close to an F1 score of 43.

German-Vietnamese pairs have lower performance than other language pairs. The German-Vietnamese combination yields the lowest performance among models finetuned on a foreign language. For question answering, the Vietnamese model (BERT-VI) finetuned on German is 10 F1 points below the English pretrained model, and the BERT-DE model scores about 14 points lower on the ViQuAD task. One explanation is that German is more distant from Vietnamese than English or Portuguese are. However, since there is no strong consensus in the linguistic community about the similarity of languages, it is not straightforward to test this hypothesis.

6 Discussion

Previous work has shown that using bilingual or multilingual pretrained models results in cross-lingual representations, enabling a model to learn a task in any of the languages used during pretraining. Our results show, however, that we can still achieve good performance in a foreign language (i.e., a language not seen during pretraining).

Our experiments suggest that, during MLM pretraining, the model learns language-agnostic representations and that both language and task knowledge can be acquired during finetuning only, with much smaller datasets than those used for pretraining.

6.1 Monolingual models' performance in foreign languages

The most interesting finding in this work is that a monolingual model can learn a different language during the finetuning procedure. Differently from the literature, we use a model pretrained on a single language and target a task in a different language (English models learning Portuguese tasks and vice versa).

We see that those models achieve good results compared to models pretrained and finetuned on the same language, and largely outperform models without pretraining. This finding suggests that MLM pretraining is not only about learning a language model, but also about learning concepts that can be applied to many languages. This has been explored in more recent work that applies language models to different modalities (Lu et al., 2021).

It is also noteworthy that these models could learn a different language through finetuning even with small datasets (for instance, FaQuAD contains only 837 training examples; GermanQuAD has 11,518; and ViQuAD comprises 18,579). The English version of BERT achieved a performance close to the upper bound for Portuguese on both the NLI (82.56% accuracy) and QA (45.22 F1) tasks with small datasets. This, together with the low performance of the control models, shows that the model leverages most of its pretraining for a task.

6.2 Similar performance regardless of the pretraining language

Another finding is that similar results were achieved by models pretrained on a language different from the task's. We can draw three clusters from the results: the lower-bound cluster, containing results for models without pretraining; the upper bound, with models pretrained on the same language as the task; and a third cluster, closer to the upper bound, with models pretrained on a different language.
Since the models are comparable in pretraining data size and number of parameters, we hypothesize that this observation is explained by the language-agnostic concepts learned during pretraining, as opposed to language-specific ones. The performance gap between the upper-bound and third clusters may be explained by the language-specific properties not available to the latter group.

This conclusion can also be drawn from the experiments performed by Artetxe et al. (2020). The authors demonstrated that merely swapping the lexicon enabled knowledge transfer between two languages. Since the embedding layer corresponds to a small portion of the total number of parameters of the model, it seems that language-specific properties represent a small portion of the whole model as well. Here, we go even further by showing that not even tokenizers and token embeddings need to be adapted to achieve cross-lingual transfer.

7 Conclusion

Recent work has shown that training a model on two or more languages with shared parameters is important, but also that representations from monolingual models pretrained with MLM can be aligned with simple mapping functions.

In this context, we perform experiments with monolingual models on foreign-language tasks. Differently from previous work, we pretrain our models on a single language and evaluate their ability to learn representations for a task in a different language during finetuning, i.e., without using any cross-lingual transfer technique.

Our results suggest that models pretrained with MLM achieve language-agnostic representations that can be leveraged for target tasks. We argue that MLM pretraining contributes more to representations of abstract concepts than to a specific language itself.

The experiments conducted in this paper raise questions about the language-agnostic properties of language models and open directions for future work. Studying the language-independent representations created by MLM pretraining may result in new ways to leverage this pretraining procedure for enabling models to perform in a wider range of tasks. More experiments on a wider range of languages and tasks are needed to strengthen our findings, which may lead to innovative language-agnostic methods for language models.

Acknowledgments

This research was funded by a grant from Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) #2020/09753-5.

References

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online. Association for Computational Linguistics.

Priyanka Biswas, Monojit Choudhury, and Kalika Bali. 2009. Complex linguistic annotation - no easy way out! A case from Bengali and Hindi POS labeling tasks. In Proceedings of the Third Linguistic Annotation Workshop, pages 10–18. Association for Computational Linguistics.

Branden Chan, Stefan Schweter, and Timo Möller. 2020. German's next language model. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6788–6796, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Zewen Chi, Li Dong, Furu Wei, Xianling Mao, and Heyan Huang. 2020. Can monolingual pretrained models help cross-lingual classification? In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 12–17, Suzhou, China. Association for Computational Linguistics.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018a. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, pages 2475–2485.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018b. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Emerging cross-lingual structure in pretrained language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6022–6034, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International Conference on Machine Learning, pages 4411–4421. PMLR.

Jindřich Libovický, Rudolf Rosa, and Alexander Fraser. 2019. How language-neutral is multilingual BERT? arXiv preprint arXiv:1911.03310.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization.

Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. 2021. Pretrained transformers as universal computation engines.

Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1037–1042.

Kiet Nguyen, Vu Nguyen, Anh Nguyen, and Ngan Nguyen. 2020. A Vietnamese dataset for evaluating machine reading comprehension. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2595–2605, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 4996–5001.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. CoRR, abs/1606.05250.

Livy Real, Erick Fonseca, and Hugo Gonçalo Oliveira. 2020. The ASSIN 2 shared task: A quick overview. In Computational Processing of the Portuguese Language, pages 406–412, Cham. Springer International Publishing.

Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych. 2020. How good is your tokenizer? On the monolingual performance of multilingual language models.

Marta Sabou, Kalina Bontcheva, and Arno Scharl. 2012. Crowdsourcing research opportunities: Lessons from natural language processing. In Proceedings of the 12th International Conference on Knowledge Management and Knowledge Technologies, i-KNOW '12, New York, NY, USA. Association for Computing Machinery.

H. F. Sayama, A. V. Araujo, and E. R. Fernandes. 2019. FaQuAD: Reading comprehension dataset in the domain of Brazilian higher education. In 2019 8th Brazilian Conference on Intelligent Systems (BRACIS), pages 443–448.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer.

Wei Zhao, Steffen Eger, Johannes Bjerva, and Isabelle Augenstein. 2020. Inducing language-agnostic multilingual representations. arXiv preprint arXiv:2008.09112.

Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, and Stephan Günnemann. 2021. Language-agnostic representation learning of source code from structure and context. In International Conference on Learning Representations.