gaBERT — an Irish Language Model

James Barry¹ and Joachim Wagner² and Lauren Cassidy¹ and Alan Cowap³ and Teresa Lynn¹ and Abigail Walsh¹ and Mícheál J. Ó Meachair⁴ and Jennifer Foster²

¹˒²˒³ School of Computing, Dublin City University
¹ ADAPT Centre
³ SFI Centre for Research Training in Machine Learning at Dublin City University
⁴ Fiontar & Scoil na Gaeilge, Dublin City University
¹ {firstname.lastname}@adaptcentre.ie  ² {firstname.lastname}@dcu.ie
³ alan.cowap2@mail.dcu.ie  ⁴ micheal.omeachair@dcu.ie

arXiv:2107.12930v2 [cs.CL] 28 Jul 2021

Abstract

The BERT family of neural language models has become highly popular due to its ability to provide sequences of text with rich context-sensitive token encodings which are able to generalise well to many Natural Language Processing tasks. Over 120 monolingual BERT models covering over 50 languages have been released, as well as a multilingual model trained on 104 languages. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT and show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance. We release gaBERT and related code to the community.

1 Introduction

The technique of fine-tuning a self-supervised language model has become ubiquitous in Natural Language Processing, because models trained in this way have advanced evaluation scores on many tasks (Radford et al., 2018; Peters et al., 2018; Devlin et al., 2019). Arguably the most successful architecture has been BERT (Devlin et al., 2019), which uses stacks of transformers to predict the identity of a masked token and to predict whether two sequences are contiguous. It has spawned many variants (Liu et al., 2019; Lan et al., 2019) and the nascent field of BERTology (Jawahar et al., 2019; Chi et al., 2020; Rogers et al., 2020). Initially, BERT models were released for English and Chinese, as well as a multilingual model (mBERT) trained on Wikipedia text from 104 languages. Since then, a number of monolingual BERT models have been released (Nozza et al., 2020). In this paper, we describe gaBERT, a monolingual BERT model for the Irish language.

Although Irish is the first official language of the Republic of Ireland, only a minority, 1.5% of the population or 73,803 people (CSO, 2016), use it in their everyday lives outside of the education system. As Irish is the less dominant language in a bilingual community, the availability of Irish language technology is important, since it makes it easier for Irish speakers, and those who wish to be Irish speakers, to use the language in practice. Building on recent progress in Irish NLP (Uí Dhonnchadha and Van Genabith, 2006; Lynn et al., 2012, 2015; Walsh et al., 2019), we release gaBERT, an Irish BERT model trained on approximately 7.9 million sentences. There are many downstream applications where we envision gaBERT will be useful, including grammar checking, computer-aided language learning, predictive text, search and translation.
While there is evidence to suggest that dedicated monolingual models can be superior to a single multilingual model for within-language downstream tasks (de Vries et al., 2019; Virtanen et al., 2019; Farahani et al., 2020), other studies (Wu and Dredze, 2020; Rust et al., 2020; Chau et al., 2020) suggest that mBERT is a good choice for low-resourced languages. We compare gaBERT to mBERT and to the monolingual Irish WikiBERT, trained on Irish-language Wikipedia, on the task of universal dependency parsing. We find that parsing accuracy improves by over 4 LAS points when using gaBERT instead of mBERT or WikiBERT. We then use the gaBERT training data for continued pretraining of mBERT, resulting in a recovery of 2.6 LAS points over the off-the-shelf version. We also compile a small test set which compares the models on their ability to predict a masked token. We abstract away from vocabulary differences between the models by masking pronouns, which are in all the models' vocabularies. This evaluation clearly reveals the benefit of using our Irish data.

When training gaBERT, we examine the effect of vocabulary size, corpus filtering type and tokenisation model, finding that document-level filtering, with a SentencePiece tokeniser (Kudo and Richardson, 2018) and a vocabulary size of 30k, is the best configuration. We release the code used to run our experiments through GitHub¹ and our models through the Hugging Face (Wolf et al., 2020) model repository.²

¹ https://github.com/jbrry/Irish-BERT
² https://huggingface.co/DCU-NLP
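For illustration, a released gaBERT checkpoint can be loaded through the Hugging Face transformers library. This is a minimal sketch: the model identifier below is our assumption of the name published under the DCU-NLP namespace (check https://huggingface.co/DCU-NLP for the exact identifiers), and the example sentence is illustrative.

```python
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "DCU-NLP/bert-base-irish-cased-v1"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode an Irish sentence and obtain contextual subword representations.
inputs = tokenizer("Is teanga Cheilteach í an Ghaeilge.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, n_subwords, 768])
```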
2 Data

We use the following corpora to train gaBERT:

CoNLL17 The Irish data from the CoNLL17 raw text collection (Ginter et al., 2017), released as part of the 2017 CoNLL Shared Task on UD Parsing (Zeman et al., 2017).

IMT A wide collection of Irish texts used in Irish machine translation experiments (Dowling et al., 2018, 2020), including legal text, general administration and data crawled from public body websites.

NCI The New Corpus for Ireland (Kilgarriff et al., 2006), which contains a wide range of texts in Irish, including fiction, news reports, informative texts, and official documents.

OSCAR The unshuffled Irish portion of the OSCAR corpus (Ortiz Suárez et al., 2019), a collection of documents from CommonCrawl sorted by language ID.

ParaCrawl The Irish portion of ParaCrawl v7 (ga-en bitext pair) (Bañón et al., 2020), a collection of parallel corpora crawled from multilingual websites, with the sentences aligned and then filtered.

Wikipedia Text from Irish Wikipedia, an online encyclopedia with articles covering a wide variety of topics. We use the slightly older articles from https://dumps.wikimedia.org/gawiki/20210520/ (27.7 MB).

Corpus       Num. Sents   Size (MB)
CoNLL17       1,686,526         138
IMT           1,352,989         124
NCI           1,621,031         174
OSCAR           793,573          89
ParaCrawl     3,056,198         380
Wikipedia       685,827          38
Overall       9,196,144         941

Table 1: Sentence counts and plain text file size in megabytes for each corpus after tokenisation and segmentation but before applying sentence filtering.

The sentence counts in each corpus are listed in Table 1, after tokenisation and segmentation but before the filtering described below. Some of these corpora overlap with others: e.g. the CoNLL17 corpus contains CommonCrawl data that is also used in the OSCAR corpus, and Irish Wikipedia raw text data that is also used in the Wikipedia corpus. See Appendix A for more information on the content of these corpora, including licence information.
3 Corpus Pre-processing

This section describes the pre-processing steps required by each of the corpora included in the training data.

CoNLL17 The CoNLL17 corpus is already tokenised: the files are converted from CoNLL-U format to plain text, where each row in the text column of a CoNLL-U sentence is treated as a token.

IMT, OSCAR and ParaCrawl The text files from the IMT, OSCAR and ParaCrawl corpora contain raw sentences requiring tokenisation. We describe the tokenisation process for these corpora in Section 3.1.

Wikipedia For the Wikipedia articles, the Irish Wikipedia dump is downloaded and the WikiExtractor tool³ is then used to extract the articles to plain text. Article headers are included in the extracted text files. Once the articles have been converted to plain text, they are tokenised using the tokeniser described in Section 3.1.

NCI As many of the NCI segments marked up with <s> tags contain multiple sentences, we treat each segment boundary as a sentence boundary and further split segments into sentences recursively, finding the best split point according to a set of heuristics described in Appendix B.2, splitting the segment in half and applying the same procedure to each half until no suitable split point is found.

³ https://github.com/attardi/wikiextractor
3.1 Tokenisation and Segmentation

Raw texts from the IMT, OSCAR, ParaCrawl and Wikipedia corpora are tokenised and segmented with UDPipe (Straka and Straková, 2017), trained on a combination of the Irish-IDT and English-EWT corpora from version 2.7 of the Universal Dependencies (UD) treebanks (Zeman et al., 2020). We include the English-EWT treebank in the training data to expose the tokeniser to more incidences of the punctuation symbols which are prevalent in our pre-training data. This also comes with the benefit of supporting the tokenisation of code-mixed data. We upsample the Irish-IDT treebank by ten times to offset the larger English-EWT treebank size. This tokeniser is applied to all corpora apart from the NCI, which is already tokenised by Kilgarriff et al. (2006), and the CoNLL17 corpus, which is already tokenised in CoNLL-U format.
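As an illustration of this step, the UDPipe Python bindings can be used to tokenise and segment raw text with a trained model. This is a sketch only: the model path stands in for the custom Irish-IDT/English-EWT tokeniser described above and is hypothetical.

```python
from ufal.udpipe import Model, Pipeline  # pip install ufal.udpipe

# Hypothetical path standing in for the custom tokeniser described above,
# trained on Irish-IDT (upsampled ten times) plus English-EWT.
model = Model.load("ga_idt_en_ewt.udpipe")

# "tokenize" runs only the tokeniser/segmenter; "horizontal" output gives
# one space-separated tokenised sentence per line.
pipeline = Pipeline(model, "tokenize", Pipeline.NONE, Pipeline.NONE, "horizontal")

raw_text = "Tá an aimsir go maith inniu. An bhfuil tú ag obair?"
for sentence in pipeline.process(raw_text).splitlines():
    print(sentence)
```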
4 Experimental Setup

After initial corpus pre-processing, all corpora are merged and we use the WikiBERT pipeline (Pyysalo et al., 2020) to create pretraining data for BERT. We experiment with four corpus filter settings and seven subword unit vocabularies. We perform a greedy search for the optimal parameters, first choosing the filter settings, then the size of the BERT vocabulary, and finally considering a WordPiece vocabulary instead of a SentencePiece vocabulary, as well as the union of the two vocabularies.

4.1 Corpus Filtering

The WikiBERT pipeline contains a number of filters for document-level filtering. As we are working with data sources where there may not be clear document boundaries, or where there are no line breaks over a large number of sentences, document-level filtering may be inadequate for such texts. Consequently, we also experiment with the OpusFilter library (Aulamo et al., 2020), which filters out individual sentences, giving us the flexibility to filter noisy sentences without discarding full documents.

We train different BERT models with varying levels of filtering:

• No-filter: all collected texts are included in the pre-training data.
• Document-filter: the default document-level filtering used in the WikiBERT pipeline.
• OpusFilter-basic: we use OpusFilter with the basic filtering described in Appendix B.3.
• OpusFilter-basic-char-lang: we use OpusFilter with basic filtering as well as the character-script and language ID filters described in Appendix B.3.

For each of these filtering configurations, we train a BERT model for 500,000 pre-training steps using all of the sentences which remain after filtering.

4.2 Vocabulary Creation

To create the vocabulary for our models, we experiment with the SentencePiece tokeniser and the WordPiece tokeniser. We take the highest-performing model from the filtering experiments in Section 4.1 and use that filtering setting for our vocabulary experiments. For the SentencePiece tokeniser, we use vocabulary sizes of 15K, 20K, 30K, 40K and 50K and train a BERT model on the data produced by each of those vocabularies. We then train a WordPiece tokeniser, keeping the vocabulary size that works best for the SentencePiece tokeniser, and train a corresponding BERT model. Furthermore, we train a BERT model using the union of the two vocabularies, i.e. our best-performing SentencePiece vocabulary and the WordPiece vocabulary.
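A sketch of training the winning subword vocabulary with the SentencePiece library follows; the input path is hypothetical, and the exact training flags used for gaBERT are in the released repository.

```python
import sentencepiece as spm  # pip install sentencepiece

# "filtered_corpus.txt" is a hypothetical path: one pre-processed
# sentence per line, after the Document-filter step.
spm.SentencePieceTrainer.train(
    input="filtered_corpus.txt",
    model_prefix="gabert_sp30k",
    vocab_size=30000,  # the best-performing size in Section 6.1
)

sp = spm.SentencePieceProcessor(model_file="gabert_sp30k.model")
print(sp.encode("Tá fáilte romhat.", out_type=str))  # subword pieces
```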
4.3 BERT Pretraining Parameters

We use the original BERT implementation of Devlin et al. (2019). For the development experiments, we train our BERT models for 500,000 steps with a sequence length of 128. We use whole-word masking and the default hyperparameters and model architecture of BERT-Base (Devlin et al., 2019), except for a lower batch size of 32 in order to train on NVIDIA RTX 6000 GPUs with 24 GB RAM. This corresponds to a bidirectional transformer (Vaswani et al., 2017) with 12 layers, 12 attention heads, a hidden layer size of 768, and GELU activations (Hendrycks and Gimpel, 2016).

4.4 Final Models

For the final gaBERT model, we train for 900k steps with sequence length 128 and a further 100k steps with sequence length 512. We train on a TPU-v2-8 with 128 GB of memory on Google Compute Engine⁴ and increase the batch size to 128.

Furthermore, we train ELECTRA (Clark et al., 2020) on the same data on a TPU-v3-8. ELECTRA replaces the MLM pre-training objective of BERT with a binary classification task discriminating between authentic tokens and alternative tokens generated by a smaller model, for higher training efficiency. We use the default settings of the "Base" configuration of their implementation.⁵ As with BERT, we train for 1M steps and evaluate every 100k steps. However, we train on more data per step, as the batch size is increased to 256 and a sequence length of 512 is used throughout.

⁴ TPU access was kindly provided to us through the Google TPU Research Cloud.
⁵ https://github.com/google-research/electra
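As a reference, the architecture stated above can be expressed as a transformers BertConfig. This is a sketch of the stated hyperparameters only, not the authors' released training setup; batch size and step counts live in the training loop, not in the config.

```python
from transformers import BertConfig

config = BertConfig(
    vocab_size=30000,             # the best SentencePiece vocabulary (Section 4.2)
    hidden_size=768,              # hidden layer size
    num_hidden_layers=12,         # transformer layers
    num_attention_heads=12,       # attention heads
    hidden_act="gelu",            # GELU activations
    max_position_embeddings=512,  # the final 100k steps use sequence length 512
)
print(config)
```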
5 Evaluation Measures

This section describes the evaluation measures we use to assess the BERT models in our experiments.

5.1 Dependency Parsing

The evaluation measure we use to make development decisions is dependency parsing LAS. To obtain this measure, we fine-tune a given BERT model on the task of dependency parsing and measure LAS on the development set of the Irish-IDT treebank (Lynn and Foster, 2016; McGuinness et al., 2020) in version 2.8 of UD. As parser performance also depends on the initialisation of the additional layers added by the parser, we report the median of five runs with different random initialisations, and in figures we show a box plot.⁶

The Irish-IDT treebank consists of 4,005 training sentences, 451 development sentences and 454 test sentences. For the dependency parser, we use a multitask model comprising a graph-based parser with biaffine attention (Dozat and Manning, 2016) as well as additional classifiers for predicting POS tags and morphological features. We use the AllenNLP library (Gardner et al., 2018) to develop our multitask model. The trained BERT model is used as an encoder, and its representations are passed to the appropriate multitask components.

5.2 Cloze Test

To compile a test set for a cloze task, 100 strings of Irish text (4–77 words in length) containing the pronouns 'é', 'í' or 'iad' are selected from Irish corpora and online publications. One of these pronouns is masked in each string, providing a context cue for the cloze test. The pronouns 'é' ('him/it'), 'í' ('her/it') and 'iad' ('them') usually appear as single words, bound by whitespace or punctuation. Less commonly, they appear with an initial mutation, e.g. 'hí', or a suffix, e.g. 'iadsan'. Where such constructions occur in the cloze test, the language models must predict a subword unit in order to form a plausible string. Each language model is evaluated on the token it predicts for each masked token.⁷

Following Rönnqvist et al. (2019), the models are evaluated on their ability to generate the original masked token, and a manual evaluation of the models is performed wherein predictions are classified into the following exclusive categories:

• Match: The predicted token fits the context grammatically and semantically. This may occur when the model predicts the original token or another token which also fits the context cue. In order for a predicted pronoun to render the given cue grammatically correct, it must agree in number and gender with the noun phrase it refers to.
• Mismatch: The predicted token is a valid Irish word but is unsuitable given the context.
• Copy: The predicted token is an implausible repetition of another token in the context.
• Gibberish: The predicted token is not a valid Irish word.

⁶ The box plots do not show the variation to be expected when pre-training BERT with different random initialisations multiple times, as obtaining multiple BERT models for each candidate setting would be too computationally expensive.
⁷ Due to limitations of Hugging Face's "fill-mask" pipeline module, we can only have a single mask per input sequence, and only a single subword unit can be predicted for each mask. All the masked tokens exist in the vocabularies of the BERT models and are therefore possible predictions.
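A minimal sketch of how such a cloze query can be issued with Hugging Face's fill-mask pipeline is shown below. The model identifier is assumed as before, and the cue is illustrative rather than taken from the actual test set.

```python
from transformers import pipeline

# Assumed model identifier, as in the earlier loading example.
fill_mask = pipeline("fill-mask", model="DCU-NLP/bert-base-irish-cased-v1")

# Illustrative cue ('Seán bought a book and read [MASK].'); a plausible
# answer, agreeing with the masculine noun 'leabhar', would be 'é'.
for prediction in fill_mask("Cheannaigh Seán leabhar agus léigh sé [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 4))
```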
6 Results

6.1 Development Results

This section reports the results of our parameter search for corpus filtering and BERT vocabulary, as well as the effect of restricting the training data to publicly available data.

Filter Settings The overall number of sentences remaining after applying each filter is shown in Table 2. If no filtering is applied, we have 9,192,060 sentences.⁸ The Document-filter removes about 1.2 million sentences, while OpusFilter-basic removes only about 200,000 sentences. OpusFilter-basic-char-lang removes the most sentences.

Filter                       Num. Sents
No-filter                     9,192,060
Document-filter               7,947,664
OpusFilter-basic              8,974,681
OpusFilter-basic-char-lang    7,696,428

Table 2: The number of sentences which remain after applying each filter.

The results of training a dependency parser with the gaBERT model produced by each setting are shown in Figure 1. Document-filter, which keeps all ParaCrawl articles and discards more Wikipedia articles than the other filters, has the highest LAS score. As the BERT model requires contiguous text for its next-sentence-prediction task, filtering out full documents may be more appropriate than filtering individual sentences. The two OpusFilter configurations perform marginally worse than the Document-filter. In the case of OpusFilter-basic-char-lang, perhaps the lower overall number of sentences translates to lower LAS scores. Finally, No-filter performs in the same range as the two OpusFilter configurations but has the lowest median score, suggesting that some level of filtering is beneficial.

[Figure 1: Dependency parsing LAS for each filter type. Each box plot shows five LAS scores obtained by fine-tuning the respective BERT model five times with different initialisations to obtain five different parsers.]

Vocabulary Settings The results of the five runs testing different vocabulary sizes are shown in Figure 2. A vocabulary size of 30K performs best for the SentencePiece tokeniser, which outperforms the WordPiece tokeniser with the same vocabulary size. The union of the two vocabularies results in 32,314 entries and does not perform as well as the two vocabularies on their own.

[Figure 2: Dependency parsing LAS for each vocabulary type. Each box plot shows five LAS scores obtained by fine-tuning the respective BERT model five times with different initialisations to obtain five different parsers.]

Final Training Runs Figure 3 shows the development LAS when training gaBERT and gaELECTRA in their final configuration (§4.4). The best gaBERT checkpoint is reached at step 1 million, which may indicate that there are still gains to be made from training for more steps. The highest median LAS for gaELECTRA is reached at step 400k. It is worth noting that although the two models are compared at the same number of steps, gaBERT is trained with a sequence length of 128 for the first 900k steps and 512 for the remaining 100k, whereas gaELECTRA uses a sequence length of 512 throughout. We also use a larger batch size of 256 for gaELECTRA than the 128 used for gaBERT, thus the two models have been exposed to differing numbers of training tokens at each step.

[Figure 3: Dependency parsing LAS for each model type. Every 100k steps, we show the median of five LAS scores obtained from fine-tuning the respective model five times with different initialisations.]

⁸ This number differs slightly from the number in Table 1 due to using an earlier Wikipedia dump (20/04/2021) for the filtering experiment.
6.2 Model Comparison

We compare our two final models (gaBERT and gaELECTRA) with off-the-shelf mBERT and WikiBERT-ga, as well as with an mBERT model obtained by continued pre-training on our corpora (mBERT-cp).

Dependency Parsing Table 3 shows the results for dependency parsing.⁹ Using mBERT off-the-shelf results in a test set LAS of 79.8. The WikiBERT-ga model performs at a similar level. By training mBERT for more steps on our corpora, LAS can be improved by 2.6 points. Our gaBERT model has the highest LAS of 83.9, and our gaELECTRA model follows closely behind with a score of 83.3. It is interesting to note that the gaBERT model performs better than gaELECTRA, despite seeing less training data per step. The last two rows compare gaBERT, on v2.5 of the treebank, with the system of Chau et al. (2020), who augment the mBERT vocabulary with the 99 most frequent Irish tokens and fine-tune on Irish Wikipedia.

                              LAS
Model                UD    Dev    Test
mBERT                2.8   81.5   79.8
WikiBERT             2.8   81.5   79.7
mBERT-cp             2.8   84.1   82.4
gaBERT               2.8   85.5   83.9
gaELECTRA            2.8   85.0   83.3
Chau et al. (2020)   2.5   -      76.2
gaBERT               2.5   -      77.5

Table 3: LAS in dependency parsing (UD v2.8) for selected models. Median of five fine-tuning runs.

⁹ A competitive parser, UDPipe, trained on the Irish-IDT treebank 2.8 without external embeddings, achieves 72.59 LAS on the test set.

Cloze Test Table 4 shows the accuracy of each model with regard to predicting the original masked token. mBERT-cp is the most accurate, with gaBERT and gaELECTRA close behind.

Model       Original Token Predicted
mBERT                             16
WikiBERT                          53
mBERT-cp                          78
gaBERT                            75
gaELECTRA                         75

Table 4: The number of times the original masked token was predicted.

The prediction of subword units proved to be a challenge for most of the models. mBERT was the most likely to predict subword units, while gaBERT was the least likely. With regard to the pronouns selected for this task, 'é' tends to occur approximately three times more frequently in Irish text than both 'í' and 'iad', so language models can be expected to predict 'é' more often. mBERT was most heavily biased towards the token 'é', rarely predicting 'í' or 'iad', while gaBERT was the most likely of the models to predict 'í' and 'iad'.

Table 5 shows the manual evaluation of the tokens generated by each model. Given that many of the context cues in the test could have several plausible answers, and that there is variation in the quality and quantity of context clues provided, the manual classification of predictions as match, mismatch, copy or gibberish provides further clarity on the language models' capabilities. These results echo those of the original masked token prediction evaluation in so far as they rank the models in the same order. Further detail, examples and analysis of the cloze test can be found in Appendix C.

Model       Match   Mism.   Copy   Gib
mBERT          41      42      4    13
WikiBERT       62      31      1     6
mBERT-cp       85      12      1     2
gaBERT         83      14      2     1
gaELECTRA      82       8      1     9

Table 5: The number of matches, mismatches, copies and gibberish predictions produced by each model.

7 Conclusions

We release gaBERT, a BERT model trained on over 7.9 million Irish sentences. We plan to further explore the space of possible gaBERT models by experimenting with other fine-tuning tasks and
examining the contribution of each corpus.
8 Ethical Considerations

No dataset is released with this paper; however, most of the corpora are publicly available, as described in our Data Availability Statement (Appendix A). Furthermore, where an anonymised version of a dataset was available, it was used. We release the gaBERT and gaELECTRA language models, based on the BERT-Base (Devlin et al., 2019) and ELECTRA-Base (Clark et al., 2020) autoencoder architectures respectively. We note that an autoregressive architecture may be susceptible to training data extraction, and that larger language models may be more susceptible (Carlini et al., 2021). However, gaBERT and gaELECTRA are autoencoder architectures and smaller language models, which may help mitigate this potential vulnerability.

We now address specific Ethics Questions.¹⁰

1. Does the paper describe the characteristics of the dataset in enough detail for a reader to understand which speaker populations the technology could be expected to work for? Yes, §2 "Data" describes each of the six datasets used to train gaBERT and gaELECTRA. Furthermore, the use of 'ga', the ISO 639-1:2002¹¹ language code for Gaeilge (Irish), in the model names gaBERT and gaELECTRA emphasises that Irish speakers, those who wish to be Irish speakers, and those with an interest in languages (particularly low-resource languages) are populations that we hope these language models will benefit.

2. Do the claims in the paper match the experimental results, in terms of how far the results can be expected to generalize? Yes, we specify the downstream tasks on which we have trained gaBERT and gaELECTRA and presented the results. We note that many other downstream tasks can be performed which we hope will benefit from these models.

3. Does the paper describe the steps taken to evaluate the quality of the dataset? Yes, §2 "Data", §3 "Corpus Pre-processing", the filtering experiments, and Appendix B all speak to the quality of the datasets used. In addition, the following caveat (or similar) will be specified with the released code on GitHub and the released model card on Hugging Face:

   We note that some data used to pretrain gaBERT and gaELECTRA was scraped from the web, which potentially contains ethically problematic text (bias, hate, adult content, etc.). Consequently, downstream tasks/applications using gaBERT or gaELECTRA should be thoroughly tested with respect to ethical considerations.

4. Does the paper describe how the technology would be deployed in actual use cases? There are many downstream tasks which can use gaBERT and gaELECTRA, including machine translation, educational applications, predictive text, search ("information retrieval"), and games. The authors hope gaBERT and gaELECTRA will contribute to the ongoing effort to preserve the Irish language as a living language in the technological age. Supporting a low-resource language like Irish in a bilingual community will make it easier for Irish speakers, and those who wish to be Irish speakers, to use the language in practice.

5. Does the task carried out by the computer match how it would be deployed? The fine-tuning tasks used to evaluate gaBERT and gaELECTRA are used as a proxy for other NLP downstream tasks. We release the gaBERT and gaELECTRA models and code publicly, and look forward to seeing what the development community creates.

6. Does the paper address possible harms when the technology is being used as intended and functioning correctly? The model could produce unsuitable text outputs when used to generate text, and this could be particularly problematic if deployed in particular settings such as a school. To combat this possibility, we will include the caveat stated in Ethical Consideration 3 above.

7. Does the paper address possible harms when the technology is being used as intended but giving incorrect results? As stated in the previous answer, a caveat will be included.

8. Does the paper address possible harms following from potential misuse of the technology? The possible harms due to potential misuse are no different to those from any small language model released publicly; we anticipate no new expected misuse potential with the release of gaBERT and gaELECTRA.

9. If the system learns from user input once deployed, does the paper describe checks and limitations to the learning? gaBERT and gaELECTRA are static models and do not learn from user input when deployed.

10. Are any of the possible harms you've identified likely to fall disproportionately on populations that already experience marginalization or are otherwise vulnerable? Nothing above and beyond those already there. It could be argued that the Irish language is marginalized and vulnerable, so we are opening it up to both benefits and potential harms. But, as with every language model released, we release it in the hope it will be used for good, while acknowledging the potential for misuse. It might be reasonable to characterize Irish speakers, and those who wish to be, as one example of "populations that already experience marginalization or are otherwise vulnerable", and gaBERT and gaELECTRA are intended to support them and the Irish language. Go n-éirí an bóthar libh.

¹⁰ https://2021.emnlp.org/call-for-papers/ethics-faq
¹¹ https://www.iso.org/iso-639-language-codes.html

Acknowledgements

This research is supported by Science Foundation Ireland (SFI) through the ADAPT Centre for Digital Content Technology, which is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. This research is also supported through the SFI Frontiers for the Future programme (19/FFP/6942) and SFI Grant 18/CRT/6183, as well as by the Irish Government Department of Culture, Heritage and the Gaeltacht under the GaelTech Project. For the purpose of Open Access, the authors have applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.

References

Mikko Aulamo, Sami Virpioja, and Jörg Tiedemann. 2020. OpusFilter: A configurable parallel corpus filtering toolbox. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 150–156, Online. Association for Computational Linguistics.

Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L. Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, and Jaume Zaragoza. 2020. ParaCrawl: Web-scale acquisition of parallel corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4555–4567, Online. Association for Computational Linguistics.

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2021. Extracting training data from large language models.

Ethan C. Chau, Lucy H. Lin, and Noah A. Smith. 2020. Parsing with multilingual BERT, a small corpus, and a small treebank. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1324–1334, Online. Association for Computational Linguistics.

Ethan A. Chi, John Hewitt, and Christopher D. Manning. 2020. Finding universal grammatical relations in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5564–5577, Online. Association for Computational Linguistics.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In Proceedings of The Eighth International Conference on Learning Representations (ICLR).

CSO. 2016. Census of Population 2016 – Profile 10 Education, Skills and the Irish Language. Central Statistics Office.

Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nissim. 2019. BERTje: A Dutch BERT model. ArXiv 1912.09582v1.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Meghan Dowling, Sheila Castilho, Joss Moorkens, Teresa Lynn, and Andy Way. 2020. A human evaluation of English-Irish statistical and neural machine translation. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 431–440, Lisboa, Portugal. European Association for Machine Translation.

Meghan Dowling, Teresa Lynn, Alberto Poncelas, and Andy Way. 2018. SMT versus NMT: Preliminary comparisons for Irish. In Proceedings of the AMTA 2018 Workshop on Technologies for MT of Low Resource Languages (LoResMT 2018), pages 12–20, Boston, MA. Association for Machine Translation in the Americas.

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. CoRR, abs/1611.01734.

Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, and Mohammad Manthouri. 2020. ParsBERT: Transformer-based model for Persian language understanding. ArXiv 2005.12515v1.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics.

Filip Ginter, Jan Hajič, Juhani Luotolahti, Milan Straka, and Daniel Zeman. 2017. CoNLL 2017 shared task - automatically annotated raw texts and word embeddings. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). ArXiv 1606.08415v1.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.

Adam Kilgarriff, Michael Rundell, and Elaine Uí Dhonnchadha. 2006. Efficient corpus development for lexicography: building the New Corpus for Ireland. Language Resources and Evaluation, 40:127–152.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv 1909.11942v6.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv 1907.11692v1.

Teresa Lynn, Ozlem Cetinoglu, Jennifer Foster, Elaine Uí Dhonnchadha, Mark Dras, and Josef van Genabith. 2012. Irish treebanking and parsing: A preliminary evaluation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 1939–1946, Istanbul, Turkey. European Language Resources Association (ELRA).

Teresa Lynn and Jennifer Foster. 2016. Universal Dependencies for Irish. In Proceedings of the Second Celtic Language Technology Workshop (CLTW 2016), pages 79–92, Paris, France.

Teresa Lynn, Kevin Scannell, and Eimear Maguire. 2015. Minority language Twitter: Part-of-speech tagging and analysis of Irish tweets. In Proceedings of the Workshop on Noisy User-generated Text, pages 1–8, Beijing, China. Association for Computational Linguistics.

Sarah McGuinness, Jason Phelan, Abigail Walsh, and Teresa Lynn. 2020. Annotating MWEs in the Irish UD treebank. In Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020), pages 126–139, Barcelona, Spain (Online). Association for Computational Linguistics.

Debora Nozza, Federico Bianchi, and Dirk Hovy. 2020. What the [MASK]? Making sense of language-specific BERT models. ArXiv 2003.02912v1.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7), pages 9–16, Cardiff, United Kingdom. Leibniz-Institut für Deutsche Sprache.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, USA. Association for Computational Linguistics.

Sampo Pyysalo, Jenna Kanerva, Antti Virtanen, and Filip Ginter. 2020. WikiBERT models: Deep transfer learning for many languages. ArXiv 2006.01538v1.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.

Samuel Rönnqvist, Jenna Kanerva, Tapio Salakoski, and Filip Ginter. 2019. Is multilingual BERT fluent in language generation? In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, pages 29–36, Turku, Finland. Linköping University Electronic Press.

Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych. 2020. How good is your tokenizer? On the monolingual performance of multilingual language models.

Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada. Association for Computational Linguistics.

E. Uí Dhonnchadha and J. Van Genabith. 2006. A part-of-speech tagger for Irish using finite-state morphology and constraint grammar disambiguation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), Genoa, Italy. European Language Resources Association (ELRA).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), pages 6000–6010, Red Hook, NY, USA. Curran Associates Inc.

Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. ArXiv 1912.07076v1.

Abigail Walsh, Teresa Lynn, and Jennifer Foster. 2019. Ilfhocail: A lexicon of Irish MWEs. In Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019), pages 162–168, Florence, Italy. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Shijie Wu and Mark Dredze. 2020. Are all languages created equal in multilingual BERT? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120–130, Online. Association for Computational Linguistics.

Daniel Zeman, Joakim Nivre, Mitchell Abrams, et al. 2020. Universal Dependencies 2.7. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Daniel Zeman, Martin Popel, Milan Straka, Jan Hajič, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, Francis Tyers, et al. 2017. CoNLL 2017 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–19, Vancouver, Canada. Association for Computational Linguistics.
A Data Licenses

This appendix provides specific details of the licence for each of the datasets used in the experiments.

A.1 CoNLL17

The Irish annotated CoNLL17 corpus can be found here: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989 (Ginter et al., 2017). The automatically generated annotations on the raw text data are available under the CC BY-SA-NC 4.0 licence. Wikipedia texts are available under the CC BY-SA 3.0 licence. Texts from Common Crawl are subject to the Common Crawl Terms of Use, the full details of which can be found here: https://commoncrawl.org/terms-of-use/full/.

A.2 IMT

The Irish Machine Translation datasets contain text from the following sources:

• Text crawled from the Citizens Information website, containing Irish Public Sector Data licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence: https://www.citizensinformation.ie/ga/.
• Text crawled from the Comhairle na Gaelscolaíochta website: https://www.comhairle.org/gaeilge/.
• Text crawled from the FÁS website (http://www.fas.ie/), accessed in 2017. The website has since been dissolved.
• Text crawled from the Galway County Council website: http://www.galway.ie/ga/.
• Text crawled from https://www.gov.ie/ga/, the central portal for government services and information.
• Text crawled from articles on the Irish Times website.
• Text crawled from the Kerry County Council website: https://ciarrai.ie/.
• Text crawled from the Oideas Gael website: http://www.oideas-gael.com/ga/.
• Text crawled from articles generated by Teagasc, available under PSI licence.
• Text generated by Conradh na Gaeilge, shared with us for research purposes.
• The Irish text from a parallel English–Irish corpus of legal texts from the Department of Justice. This dataset is available for reuse on the ELRC-SHARE repository under a PSI licence: https://elrc-share.eu
• Text from the Directorate-General for Translation (DGT), available for download from the European Commission website. Reuse of the texts is subject to the Terms of Use found on the website: https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory.
• Text reports and notices generated by Dublin City Council, shared with us for research purposes.
• Text uploaded to ELRC-SHARE via the National Relay Station, shared with us for research purposes.
• Text reports and reference files generated by the Language Commissioner, available on ELRC-SHARE under PSI licence: https://elrc-share.eu/.
• Text generated by the magazine Nós, shared with us for research purposes.
• Irish texts available for download on OPUS, under various licences: https://opus.nlpl.eu/
• Text generated from in-house translation provided by the then-titled Department of Culture, Heritage and Gaeltacht (DCHG), provided for research purposes. The anonymised dataset is available on ELRC-SHARE under a CC BY 4.0 licence: https://elrc-share.eu/.
• Text reports created by Údarás na Gaeilge, uploaded to ELRC-SHARE and available under PSI licence: https://elrc-share.eu/.
• Text generated by the University Times, shared with us for research purposes.
A.3 NCI

The corpus is compiled and owned by Foras na Gaeilge, and is provided to us for research purposes.

A.4 OSCAR

The unshuffled version of the Irish part of the OSCAR corpus was provided to us by the authors for research purposes.

A.5 ParaCrawl

Text from ParaCrawl v7, available here: https://www.paracrawl.eu/v7. The texts themselves are not owned by ParaCrawl; the actual packaging of these parallel data is under the Creative Commons CC0 licence ("no rights reserved").

A.6 Wikipedia

The texts used are available under a CC BY-SA 3.0 licence and/or a GNU Free Documentation License.

B Corpus Pre-processing

This appendix provides specific details on corpus pre-processing and on the OpusFilter filters used.

B.1 NCI

Foras na Gaeilge provided us with a .vert file¹² containing 33,088,532 tokens in 3,485 documents. We extract the raw text from the first tab-separated column and carry out the following conversions (number of events in brackets):

• Replace " with a neutral double quote (4,408).
• Replace the standard XML/HTML entities quot, lt, gt and amp, tokenised into three tokens, e.g. "& quot ;", with the appropriate characters (128).
• Replace the numeric HTML entities 38, 60, 147, 148, 205, 218, 225, 233, 237, 243 and 250, again spanning three tokens, e.g. "& #38 ;", with the appropriate Unicode characters (3,679).
• Repeat from the start until the text does not change.

We do not modify the seven occurrences of \x\x13, as it is not clear from their contexts how they should be replaced. After pre-processing and treating all whitespace as token separators (e.g. in the NCI token "go leor"), we obtain 33,472,496 tokens from the NCI.

¹² MD5 7be5c0e9bc473fb83af13541b1cd8d20

B.2 Sentence Boundary Detection

Many of the NCI segments marked up with <s> tags contain multiple sentences. We treat each segment boundary as a sentence boundary and further split segments into sentences recursively, finding the best split point according to the following heuristics, splitting the segment into two halves and applying the same procedure to each half until no suitable split point is found.

• Reject if the left half contains no letters and is short. This covers cases where the left half is only a decimal number.
• Reject if the right half has no letters and is short, or is an ellipsis.
• Reject if the right half's first letter, skipping enumerations, is lowercase.
• Reject if the left half only contains a Roman numeral (in addition to the full stop).
• Reject if inside round, square, curly or angle brackets and the brackets are not far away from the candidate split point.
• If sentence-ending punctuation is followed by two quote tokens, we also consider splitting between the quotes and prefer this split point if not rejected by the above rules.
• If sentence-ending punctuation is followed by a closing bracket, we also consider splitting after the bracket and prefer this split point if not rejected by the above rules.
• If a question mark is followed by more question marks, we also consider splitting after the end of the sequence of question marks and prefer this split point if not rejected by the above rules.
• If an exclamation mark is followed by more exclamation marks, we also consider splitting after the end of the sequence of exclamation marks and prefer this split point if not rejected by the above rules.
• If a full stop is the first full stop in the overall segment, the preceding token is "1", there are more tokens before this "1" and the token directly before "1" is not a comma or semicolon, we assume that this is an enumeration following a heading and prefer splitting before the "1".
• We do not insert new sentence boundaries at a full stop after "DR", "Prof" and "nDr", and, if followed by a decimal number, after "No", "Vol" and "Iml".
• Splitting after a full stop following a decimal number in all other cases is dispreferred, giving the largest penalty to small numbers, as these are most likely to be part of enumerations. An exception is "Airteagal" followed by a token ending with a full stop, a number, a full stop, another number and another full stop. Here, we implemented a preference for splitting after the first separated full stop, assuming the last number is part of an enumeration.
• Prefer a split point balancing the lengths of the halves in characters.
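The following is a minimal Python sketch of this recursive procedure. It implements only two of the rejection heuristics and the balance preference as stand-ins; the full implementation in the released repository applies the complete list above.

```python
def split_segment(tokens):
    """Recursively split a token list at the best sentence boundary."""
    candidates = []
    for i, tok in enumerate(tokens[:-1]):
        if tok not in {".", "!", "?"}:
            continue
        left, right = tokens[: i + 1], tokens[i + 1 :]
        # Stand-ins for two of the rejection heuristics listed above.
        if len(left) < 4 and not any(c.isalpha() for c in "".join(left)):
            continue  # left half is short and contains no letters
        if right[0][:1].islower():
            continue  # right half starts with a lowercase letter
        # Prefer split points that balance the halves in characters.
        balance = abs(len(" ".join(left)) - len(" ".join(right)))
        candidates.append((balance, i + 1))
    if not candidates:
        return [tokens]
    _, best = min(candidates)
    return split_segment(tokens[:best]) + split_segment(tokens[best:])

segment = "Bhí sé fuar . Ach bhí an ghrian ag taitneamh !".split()
print(split_segment(segment))
# [['Bhí', 'sé', 'fuar', '.'], ['Ach', 'bhí', 'an', 'ghrian', 'ag', 'taitneamh', '!']]
```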
B.3 OpusFilter Filters

For OpusFilter-basic, we include the following filters:

• LengthFilter: filter sentences containing more than 512 words.
• LongWordFilter: filter sentences containing words longer than 40 characters.
• HTMLTagFilter: filter sentences containing HTML tags.
• PunctuationFilter: filter sentences which are over 60% punctuation.
• DigitsFilter: filter sentences which are over 60% numeric symbols.

For OpusFilter-basic-char-lang, we use the same filters as in OpusFilter-basic but add the following character-script and language ID filters:

• CharacterScoreFilter: filter sentences which are below a ratio r of Latin characters, where r ∈ {1.0}.
• LanguageIDFilter: filter sentences where the language ID tools have a confidence score lower than c, where c ∈ {0.8}.
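For illustration, the five basic filters can be approximated with plain Python predicates. This is a hedged re-implementation using the thresholds stated above, not the OpusFilter API itself.

```python
import re

def keep_sentence(sentence: str) -> bool:
    words = sentence.split()
    if len(words) > 512:                          # LengthFilter
        return False
    if any(len(w) > 40 for w in words):           # LongWordFilter
        return False
    if re.search(r"</?\w+[^>]*>", sentence):      # HTMLTagFilter
        return False
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return False
    punct = sum(1 for c in chars if not c.isalnum())
    if punct / len(chars) > 0.6:                  # PunctuationFilter
        return False
    digits = sum(1 for c in chars if c.isdigit())
    if digits / len(chars) > 0.6:                 # DigitsFilter
        return False
    return True

print(keep_sentence("Tá sé seo ceart go leor."))  # True
print(keep_sentence("!!! ??? ... ###"))           # False (mostly punctuation)
```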
C Cloze Test Examples

C.1 Prediction Classification

Table 6 provides one example per classification category of the masked token predictions generated by the language models during our cloze test evaluation.

In the match example in Table 6, the original meaning ('What are those radical roots?') differs from the meaning of the resulting string ('What about those radical roots?') in which the masked token is replaced by the token predicted by mBERT-cp. However, the latter construction is grammatically and semantically acceptable.

In the mismatch example in Table 6, the predicted token is a valid Irish word, but the resulting generated text is nonsensical.

Though technically grammatical, the predicted token in the copy example in Table 6 results in a string with an unnatural repetition of a noun phrase where a pronoun would be highly preferable ('Seán bought a book and he read a book.').

In the gibberish example in Table 6, the predicted token does not form a valid Irish word, and the string resulting from the token prediction is ungrammatical and meaningless.

C.2 Effect of Length of Context on Accuracy of Prediction

In order to observe the effect that the amount of context provided has on the accuracy of the models, Table 7 shows the proportion of matches achieved by each language model when the results are segmented by the length of the context cues. All the models tested are least accurate on the group of short context cues. All except mBERT achieved their highest accuracy on the group of long sentences.

C.3 Easy and Difficult Context Cues

A context cue may be considered easy or difficult based on:

• whether the tokens occur frequently in the training data;
• the number of context clues;
• the distance of the context clues from the masked token.

Two Irish-language context cues which vary in terms of difficulty are exemplified below.