Cross-Lingual Named Entity Recognition Using Parallel Corpus: A New Approach Using XLM-RoBERTa Alignment

Bing Li, Yujie He, Wenjin Xu
Microsoft
libi@microsoft.com, yujh@microsoft.com, wenjinxu@microsoft.com

Abstract

We propose a novel approach for cross-lingual Named Entity Recognition (NER) zero-shot transfer using parallel corpora. We built an entity alignment model on top of XLM-RoBERTa to project the entities detected on the English part of the parallel data onto the target-language sentences, whose accuracy surpasses all previous unsupervised models. With the alignment model we can generate a pseudo-labeled NER data set in the target language to train a task-specific model. Unlike translation-based methods, this approach benefits from the natural fluency and nuances of original target-language corpora. We also propose a modified loss function that is similar to the focal loss but assigns weights in the opposite direction, to further improve model training on the noisy pseudo-labeled data set. We evaluated the proposed approach on 4 target languages over benchmark data sets and obtained competitive F1 scores compared to the most recent SOTA models. We also discuss the impact of parallel corpus size and domain on the final transfer performance.

1 Introduction

Named entity recognition (NER) is a fundamental task in natural language processing which seeks to classify words in a sentence into predefined semantic types. Because the ground-truth labels exist at the word level, supervised training of NER models often requires a large amount of human annotation effort. In real-world use cases where one needs to build multilingual models, the required human labor scales at least linearly with the number of languages, or even worse for low-resource languages. Cross-lingual transfer for Natural Language Processing (NLP) tasks has been widely studied in recent years (Conneau et al., 2018; Kim et al., 2017; Ni et al., 2017; Xie et al., 2018; Ni and Florian, 2019; Wu and Dredze, 2019; Bari et al., 2019; Jain et al., 2019), in particular zero-shot transfer, which leverages the advances in a high-resource language such as English to benefit other low-resource languages. In this paper, we focus on cross-lingual transfer for the NER task, more specifically using parallel corpora and pretrained multilingual language models such as mBERT (Devlin, 2018) and XLM-RoBERTa (XLM-R) (Lample and Conneau, 2019; Conneau et al., 2020).

Our motivations are threefold. (1) Parallel corpora are a great resource for transfer learning and are plentiful for many language pairs. Some recent research focuses on completely unsupervised machine translation (e.g. word alignment (Conneau et al., 2017)) for cross-lingual NER; however, inaccurate translations can harm transfer performance. For example, in the word-to-word translation approach, word ordering may not be well preserved during translation, and such gaps in translation quality can hurt model performance on downstream tasks. (2) A method can still provide business value even if it only works for major languages with sufficient parallel corpora, as long as its performance is satisfying. It is a common situation in industry practice that a heavily customized task needs to be extended into major markets without annotating large amounts of data in other languages. (3) Previous attempts using parallel corpora are mostly based on heuristics and statistical models (Jain et al., 2019; Xie et al., 2018; Ni et al., 2017); recent breakthroughs in multilingual language models have not been applied to such scenarios yet. Our work bridges this gap and revisits the topic with new technologies.
We propose a novel semi-supervised method for cross-lingual NER transfer, bridged by parallel corpora. First, we train an NER model on a source-language data set, in this case English, assuming that we have labeled task-specific data. Second, we label the English part of the parallel corpus with this model. Then, we project those recognized entities onto the target language, i.e. we label the span of the same entity in the target-language portion of the parallel corpus. In this step we leverage the recent XLM-R model (Lample and Conneau, 2019; Conneau et al., 2020), which makes a major distinction between our work and previous attempts. Lastly, we use the resulting pseudo-labeled data to train the task-specific model in the target language directly. For the last step we explored the option of continuing training from a multilingual model fine-tuned on English NER data, to maximize the benefits of model transfer. We also tried a series of methods to mitigate the noisy-label issue inherent in this semi-supervised approach.

The main contributions of this paper are as follows:

• We leverage the powerful multilingual model XLM-R for entity alignment. It is trained in a supervised manner with easy-to-collect data, in sharp contrast to previous attempts that mainly rely on unsupervised methods and human-engineered features.

• Since the pseudo-labeled data set typically contains a lot of noise, we propose a novel loss function inspired by the focal loss (Lin et al., 2017). Instead of using the native focal loss, we go in the opposite direction and weight hard examples less, as those are more likely to be noise.

• By leveraging existing natural parallel corpora, we obtain competitive F1 scores for NER transfer on multiple languages. We also show that the domain of the parallel corpus is critical to an effective transfer.
2 Related Works

There are different ways to conduct zero-shot multilingual transfer. In general, there are two categories: model-based transfer and data-based transfer. Model-based transfer often uses the source language to train an NER model with language-independent features, then directly applies the model to the target language for inference (Wu and Dredze, 2019; Wu et al., 2020). Data-based transfer focuses on combining a source-language task-specific model, translations, and entity projection to create weakly-supervised training data in the target language. Some previous attempts include using annotation projection on aligned parallel corpora or translations between a source and a target language (Ni et al., 2017; Ehrmann et al., 2011), or utilizing the Wikipedia hyperlink structure to obtain anchor texts and contexts as weak labels (Al-Rfou et al., 2015; Tsai et al., 2016). Different variants exist in annotation projection; e.g. Ni et al. (2017) used a maximum entropy alignment model and data selection to project English annotated labels onto parallel target-language sentences. Other work used bilingual mapping combined with lexical heuristics, or used an embedding approach to perform word-level translation, from which annotation projection naturally follows (Mayhew et al., 2017; Xie et al., 2018; Jain et al., 2019). This kind of translation + projection approach is used not just in NER but in other NLP tasks as well, such as relation extraction (Faruqui and Kumar, 2015). There are obvious limitations to the translation + projection approach: word- or phrase-based translation makes annotation projection easier but sacrifices native fluency and language nuances. In addition, orthographic and phonetic features for entity matching may only be applicable to languages that are alike, and require extensive human feature engineering. To address these limitations, we propose a novel approach which utilizes machine translation training data combined with a pretrained multilingual language model for entity alignment and projection.
3 Model Design

We describe the entity alignment model component and the full training pipeline of our work in the following sections.

3.1 Entity Alignment Model

Translation from a source language to a target language may break the word ordering, therefore an alignment model is needed to project entities from the source-language sentence onto the target language, so that the labels from the source language can be zero-shot transferred. In this work, we use the XLM-R (Lample and Conneau, 2019; Conneau et al., 2020) series of models, which introduced the translation language model (TLM) pretraining task for the first time. TLM trains the model to predict a masked word using information from both the context and the parallel sentence in another language, making the model acquire strong cross-lingual, and potentially alignment, capability.

Our alignment model is constructed by concatenating the English name of the entity and the target-language sentence, as segment A and segment B of the input sequence respectively. For the token-level outputs of segment B, we predict 1 if the token is inside the translated entity and 0 otherwise. This formulation transforms the entity alignment problem into a token classification task. An implicit assumption is that the entity name will still be a consecutive phrase after translation. The model structure is illustrated in Fig 1.

Figure 1: Entity alignment model. The query entity on the left is 'Cologne', and the German sentence on the right is 'Köln liegt in Deutschland', which is 'Cologne is located in Germany' in English; 'Köln' is the German translation of 'Cologne'. The model predicts the word span aligned with the query entity.
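To make the formulation concrete, the following is a minimal sketch of how such an alignment query can be encoded and decoded with the HuggingFace Transformers API. The helper name and the span-recovery logic are our own illustration, not the authors' released code, and a freshly initialized classification head would of course need to be fine-tuned first.

```python
# Minimal sketch of the alignment query: segment A = English entity name,
# segment B = target-language sentence; 2-way token classification on B.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=2
)

def align_entity(entity_en: str, target_sentence: str):
    """Return the target-language phrase aligned with the English entity,
    or None if no token of segment B is predicted as inside the entity."""
    enc = tokenizer(entity_en, target_sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]            # (seq_len, 2)
    preds = logits.argmax(-1).tolist()
    # sequence_ids() is 1 exactly for tokens of the second text (segment B).
    seq_ids = enc.sequence_ids(0)
    hits = [i for i, (s, p) in enumerate(zip(seq_ids, preds)) if s == 1 and p == 1]
    if not hits:
        return None                                # alignment failed; discard later
    # Implicit assumption from the paper: the entity stays one consecutive phrase.
    span_ids = enc["input_ids"][0][hits[0]: hits[-1] + 1]
    return tokenizer.decode(span_ids).strip()

# After fine-tuning on alignment data, e.g.:
# align_entity("Cologne", "Köln liegt in Deutschland") -> "Köln"
```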
3.2 Cross-Lingual Transfer Pipeline

Fig 2 shows the whole training/evaluation pipeline, which includes 5 stages: (1) fine-tune a pretrained language model on CoNLL2003 (Sang and Meulder, 2003) to obtain an English NER model; (2) infer labels for the English sentences in the parallel corpus; (3) run the entity alignment model from the previous subsection to find the detected English entities in the target language, filtering out examples that fail during the alignment process; (4) fine-tune the multilingual model with the data generated in step (3); (5) evaluate the new model on the target-language test sets.

Figure 2: Training pipeline diagram. Yellow pages represent English documents while light blue pages represent German documents. Steps 1 and 5 use the original CoNLL data for training and testing respectively; steps 2, 3, and 4 use machine translation data from the OPUS website. The pretrained model is either mBERT or XLM-R. The final model is first fine-tuned on the English NER data set and then fine-tuned on the target-language pseudo-labeled NER data set.
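As an overview, the five stages can be summarized in pseudocode. Every callable below (train_ner, infer_entities, align_entity, fine_tune, evaluate) is a hypothetical placeholder standing in for the corresponding stage, passed in as a parameter; this is a sketch of the control flow, not the paper's implementation.

```python
def cross_lingual_transfer(conll_en, parallel_corpus, target_test_set,
                           train_ner, infer_entities, align_entity,
                           fine_tune, evaluate):
    # (1) Fine-tune a pretrained multilingual model on English NER data.
    en_model = train_ner(pretrained="xlm-roberta-large", data=conll_en)

    # (2) Label the English half of each parallel sentence pair.
    pseudo_labeled = []
    for en_sent, tgt_sent in parallel_corpus:
        entities = infer_entities(en_model, en_sent)

        # (3) Project each detected entity onto the target sentence;
        #     drop the pair if any entity fails to align.
        spans = [align_entity(e, tgt_sent) for e in entities]
        if entities and all(spans):
            pseudo_labeled.append((tgt_sent, list(zip(entities, spans))))

    # (4) Continue fine-tuning the English model on the pseudo-labeled data.
    tgt_model = fine_tune(en_model, pseudo_labeled)

    # (5) Evaluate on the target-language test set.
    return evaluate(tgt_model, target_test_set)
```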
4 Experiments And Discussions

4.1 Parallel Corpus

In our method, we leveraged the availability of large-scale parallel corpora to transfer the NER knowledge obtained in English to other languages. Existing parallel corpora are easier to obtain than annotated NER data. We used parallel corpora crawled from the OPUS website (Tiedemann, 2012). In our experiments, we used the following data sets:

• Ted2013: volunteer transcriptions and translations from the TED web site, created as a training data resource for the International Workshop on Spoken Language Translation 2013.

• OpenSubtitles: a collection of translated movie subtitles that contains 62 languages (Lison and Tiedemann, 2016).

• WikiMatrix: parallel sentences mined from Wikipedia in different languages; only pairs with scores above 1.05 are used (Schwenk et al., 2019).

• UNPC: manually translated United Nations documents from 1994 to 2014 (Ziemski et al., 2016).

• Europarl: a parallel corpus extracted from the European Parliament web site (Koehn, 2005).

• WMT-News: a parallel corpus of news test sets provided by WMT for training SMT, which contains 18 languages (http://www.statmt.org/wmt19/translation-task.html).

• NewsCommentary: a parallel corpus of news commentaries provided by WMT for training SMT, which contains 12 languages (http://opus.nlpl.eu/News-Commentary.php).

• JW300: parallel sentences mined from the magazines Awake! and Watchtower (Agić and Vulić, 2019).

In this work, we focus on 4 languages: German, Spanish, Dutch and Chinese. We randomly select data points from all data sets above with equal weights. There might be slight differences in data distribution between languages due to data availability and relevance.
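For concreteness, equal-weight sampling across the corpora can be sketched as follows. The data layout (each corpus as a list of sentence pairs) and the function name are illustrative assumptions, since the paper does not describe its preprocessing code.

```python
# Sketch: draw a fixed budget of sentence pairs, split evenly across corpora.
import random

def sample_parallel(corpora, n_total, seed=13):
    """corpora: dict mapping corpus name -> list of (en, tgt) pairs."""
    random.seed(seed)
    per_corpus = n_total // len(corpora)
    sample = []
    for pairs in corpora.values():
        # If a corpus is smaller than its share, take everything it has,
        # which yields the slight distribution differences noted above.
        k = min(per_corpus, len(pairs))
        sample.extend(random.sample(pairs, k))
    random.shuffle(sample)
    return sample
```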
4.2 Alignment Model Training

The objective of the alignment model is to find the entity in a foreign sentence given its English name. We feed the English name and the sentence as segment A and segment B into the XLM-R model (Lample and Conneau, 2019; Conneau et al., 2020). Unlike the NER task, the alignment task has no requirement for label completeness, since we only need one entity to be labeled in one training example. The training data set can be created from Wikipedia documents, where anchor texts in hyperlinks naturally indicate the locations of entities, and one can get the English entity name via linking by Wikipedia entity Id. An alternative way to get the English name for mentions in another language is through a state-of-the-art translation system. We took the latter approach to keep it simple, and leveraged Microsoft's Azure Cognitive Services to do the translation.

During training, we also added negative examples with faked English entities which did not appear in the other language's sentence. The intuition is to force the model to focus on the English entity (segment A) and its translation, instead of doing pure NER and picking out any entity in the other language's sentence (segment B). We also added examples of noun phrases or nominal entities to make the model more robust.

We generated a training set of 30K samples with 25% negatives for each language and trained an XLM-R-large model (Conneau et al., 2020) for 3 epochs with batch size 64. The initial learning rate is 5e-5, and the other hyperparameters were the defaults from the HuggingFace Transformers library for the token classification task. The precision/recall/F1 on the reserved test set reached 98%. The model training was done on 2 Tesla V100 GPUs and took about 20 minutes.
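The following minimal sketch shows how the 0/1 token labels of one alignment training example, including a faked-entity negative, can be constructed, assuming the gold target span is available as character offsets. The helper name and data format are illustrative, not taken from the paper's code.

```python
# Sketch: build token-level labels for segment B via the offset mapping.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

def make_example(entity_en, target_sentence, span=None):
    """span: (char_start, char_end) of the entity in target_sentence,
    or None for a negative example with a faked English entity."""
    enc = tokenizer(entity_en, target_sentence, return_offsets_mapping=True)
    labels = []
    for seq_id, (start, end) in zip(enc.sequence_ids(), enc["offset_mapping"]):
        if seq_id != 1:
            labels.append(-100)          # special tokens / segment A: ignored
        elif span and start < span[1] and end > span[0]:
            labels.append(1)             # token overlaps the gold target span
        else:
            labels.append(0)
    enc["labels"] = labels
    del enc["offset_mapping"]            # offsets are not a model input
    return enc

# Positive: 'Köln' covers characters 0-4 of the target sentence.
pos = make_example("Cologne", "Köln liegt in Deutschland", span=(0, 4))
# Negative: a faked entity that does not occur in the target sentence.
neg = make_example("Berlin", "Köln liegt in Deutschland", span=None)
```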
4.3 Cross-Lingual Transfer

We used the CoNLL2003 (Sang and Meulder, 2003) and CoNLL2002 (http://lcg-www.uia.ac.be/conll2002/ner/) data sets to test our cross-lingual transfer method for German, Spanish and Dutch. We ignored the training sets in those languages and only evaluated our model on the test sets. For Chinese, we used People's Daily (https://github.com/zjy-ucas/ChineseNER) as the major evaluation set, and we also report numbers on the MSRA (Levow, 2006) and Weibo (Peng and Dredze, 2015) data sets in the next section. One notable difference for the People's Daily data set is that it only covers three entity types, LOC, ORG and PER, so we suppressed the MISC type from English during the transfer by training the English NER model with 'MISC' marked as 'O'.

To enable cross-lingual transfer, we first trained an English teacher model using the CoNLL2003 EN training set with XLM-R-large as the base model. We trained with focal loss (Lin et al., 2017) for 5 epochs. We then ran inference with this model on the English part of the parallel data.

Finally, with the alignment model, we projected the entity labels onto the other languages. To ensure the quality of the target-language training data, we discarded examples where any English entity failed to map to tokens in the target language. We also discarded examples with overlapping target entities, because they would cause conflicts in the token labels. Furthermore, when one entity is mapped to multiple target mentions, we only keep the example if all the target mention phrases are the same; this accommodates the situation where the same entity is mentioned more than once in one sentence.
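These filtering rules can be summarized in a short sketch. The span representation, one candidate list per English entity with (start, end, label, phrase) tuples, is our own illustration of the rules described above, not the authors' implementation.

```python
# Sketch of the three filtering rules applied to projected examples.
def keep_example(projections):
    """projections: one entry per English entity; each entry is a list of
    candidate target spans as (start, end, label, phrase) tuples.
    Returns the accepted spans, or None if the example is discarded."""
    spans = []
    for candidates in projections:
        # Rule 1: every English entity must map to at least one target span.
        if not candidates:
            return None
        # Rule 3: multiple mappings are kept only when the mention phrases
        # are identical (the same entity repeated in one sentence).
        phrases = {phrase for _, _, _, phrase in candidates}
        if len(phrases) > 1:
            return None
        spans.extend(candidates)
    # Rule 2: overlapping target spans would conflict at the token level.
    spans.sort()
    for (s1, e1, *_), (s2, e2, *_) in zip(spans, spans[1:]):
        if s2 < e1:
            return None
    return spans
```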
As the last step, we fine-tuned the multilingual model pre-trained on the English data set, with the lower n (0, 3, 6, etc.) layers frozen, on the target-language pseudo-labeled data. We used both mBERT (Devlin et al., 2019; Devlin, 2018) and XLM-R (Conneau et al., 2020) with about 40K training samples for 1 epoch. The results are shown in Table 1. All the inference, entity projection and model training experiments were done on 2 Tesla V100 32G GPUs, and the whole pipeline takes about 1-2 hours. All numbers are reported as the average of 5 random runs with the same settings.

                         DE                    ES                    NL                    ZH
Model                    P     R     F1        P     R     F1        P     R     F1        P     R     F1
Bari et al. (2019)       -     -     65.24     -     -     75.9      -     -     74.6      -     -     -
Wu and Dredze (2019)     -     -     71.1      -     -     74.5      -     -     79.5      -     -     -
Moon et al. (2019)       -     -     71.42     -     -     75.67     -     -     80.38     -     -     -
Wu et al. (2020)         -     -     73.16     -     -     76.75     -     -     80.44     -     -     -
Wu et al. (2020)         -     -     73.22     -     -     76.94     -     -     80.89     -     -     -
Wu et al. (2020)         -     -     74.82     -     -     79.31     -     -     82.90     -     -     -
Our Models
mBERT zero-transfer      67.6  77.4  72.1      72.4  78.2  75.2      77.8  79.3  78.6      64.1  65.0  64.6
mBERT fine-tune          73.1  76.2  74.6      77.7  77.6  77.6      80.5  76.7  78.6      80.8  63.3  71.0
                                     (+2.5)                (+2.4)                (+0.0)                (+6.4)
XLM-R zero-transfer      67.9  79.8  73.4      79.8  81.9  80.8      82.2  80.3  81.2      68.7  65.5  67.1
XLM-R fine-tune          75.5  78.4  76.9      76.9  81.0  78.9      81.2  78.4  79.7      77.8  65.9  71.3
                                     (+3.5)                (-1.9)                (-1.5)                (+4.2)

Table 1: Cross-lingual transfer results on German, Spanish, Dutch and Chinese. Experiments are done with both the mBERT and XLM-RoBERTa models. For each of them, we compare the zero-transfer result (trained on CoNLL2003 English only) and the fine-tune result using zero-transfer pseudo-labeled target-language NER data. Test sets for German, Spanish and Dutch are from CoNLL2003 and CoNLL2002; the People's Daily data set is used for Chinese.

For the loss function, we used something similar to the focal loss (Lin et al., 2017) but with the opposite weight assignment. The focal loss was designed to weight hard examples more. This intuition holds true only when the training data is clean. In scenarios such as the cross-lingual transfer task, the pseudo-labeled training data contains a lot of noise propagated from the upstream stages of the pipeline, in which case those 'hard' examples are more likely to be errors or outliers and can hurt the training process. We went in the opposite direction and lowered their weights instead, so that the model can focus on the less noisy labels. More specifically, we multiply the regular cross-entropy loss by a weight (1 + p_t)^γ instead of (1 − p_t)^γ; for the hyper-parameter γ we experimented with values from 1 to 5, and the value of 4 works best.
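The re-weighted loss can be written directly on top of the token-level cross entropy. Below is a minimal PyTorch sketch, together with a helper for the lower-layer freezing used in the last fine-tuning step; the function names are ours, and the attribute paths assume a HuggingFace BERT/XLM-R-style encoder.

```python
# Sketch of the re-weighted loss: the focal weight (1 - p_t)^gamma is replaced
# by (1 + p_t)^gamma, so confident (easy) tokens are up-weighted and
# likely-noisy hard tokens contribute less.
import torch
import torch.nn.functional as F

def reweighted_loss(logits, labels, gamma=4.0, ignore_index=-100):
    """logits: (N, C) token logits; labels: (N,) gold label ids."""
    log_probs = F.log_softmax(logits, dim=-1)
    mask = labels != ignore_index
    safe_labels = labels.clamp(min=0)            # avoid gathering at -100
    # p_t: model probability of the gold class for each token.
    pt = log_probs.gather(1, safe_labels.unsqueeze(1)).squeeze(1).exp()
    ce = F.cross_entropy(logits, labels, reduction="none",
                         ignore_index=ignore_index)
    loss = ((1.0 + pt) ** gamma) * ce
    return loss[mask].mean()

def freeze_lower_layers(model, n):
    """Freeze the lowest n encoder layers during target-language fine-tuning."""
    for layer in model.base_model.encoder.layer[:n]:
        for param in layer.parameters():
            param.requires_grad = False
```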
From Table 1, we can see that for the mBERT model, fine-tuning with pseudo-labeled data has significant effects on all languages except NL. The largest improvement is in Chinese, a 6.4% increase in F1 on top of the zero-transfer result; this number is 2.5% for German and 2.4% for Spanish. The same experiment with the XLM-R model shows a different pattern: F1 increased by 3.5% for German but dropped a little on Spanish and Dutch after fine-tuning. For Chinese, we see an improvement of 4.2%, comparable with mBERT. The negative result on Spanish and Dutch is probably because XLM-R already has very good pretraining and language alignment for those European languages, which can be seen from the high zero-transfer numbers; a relatively noisy data set therefore does not bring much gain. On the contrary, Chinese is a relatively distant language from the perspective of linguistics, so the added value of task-specific fine-tuning with natural data is larger.

Another pattern we observed from Table 1 is that across all languages, the fine-tuning step with pseudo-labeled data is more beneficial to precision than to recall: we observed a consistent improvement in precision but a small drop in recall in most cases.

5 Discussions

5.1 Impact of the amount of parallel data

One advantage of the parallel corpora method is high data availability compared to the supervised approach. A natural question is therefore whether more data is beneficial for the cross-lingual transfer task. To answer this question, we did a series of experiments with the number of training examples varying from 5K to 200K: the model F1 score increases with the amount of data at the beginning and plateaus around 40K. All the numbers displayed in Table 1 are reported for training on a generated data set of around 40K sentences. One possible explanation for the plateaued performance is the propagated error in the pseudo-labeled data set. Domain mismatch may also limit the effectiveness of transfer between languages; we discuss this topic further in the next section.
5.2 Impact of the domain of parallel data

Learnings from the machine translation community show that the quality of neural machine translation models usually depends strongly on the domain they are trained on, and the performance of a model can drop significantly when it is evaluated on a different domain (Koehn and Knowles, 2017). A similar observation explains the challenges of NER cross-lingual transfer. In NER transfer, the first domain mismatch comes from the natural gap in entity distributions between corpora in different languages: many entities only live inside the ecosystem of a specific group of languages and may not translate naturally to others. The second domain mismatch is between the parallel corpora and the NER data set: the English model might not have good domain adaptation ability and could perform well on the CoNLL2003 data set but poorly on the parallel data set.

To study the impact of the domain of parallel data on transfer performance, we did an experiment on Chinese using parallel data from different domains. We picked three representative data sets from OPUS (Tiedemann, 2012): OpenSubtitles, UN (contains UN and UNPC) and News (contains WMT and News-Commentary), plus one with all three combined. OpenSubtitles comes from movie subtitles, and its language style is informal and oral. UN comes from United Nations reports, and its language style is formal and politically flavored. News data comes from newspapers, and its content is more diverse and closer to the CoNLL data sets. We evaluated F1 on three Chinese test sets, Weibo, MSRA and People's Daily, where Weibo contains messages from social media, MSRA is more formal and political, and People's Daily consists of newspaper articles.

Figure 3: F1 scores evaluated on the three test sets using parallel data from different domains. The blue column on the left is the result of zero-shot model transfer. To its right are F1 scores for the 3 different domains and all domains combined.

From Fig 3 we see that OpenSubtitles performs best on Weibo but poorly on the other two. On the contrary, UN performs worst on Weibo but better on the other two. The News domain performs best on People's Daily, which is consistent with intuition because both come from newspaper articles. The all-domains-combined approach has decent performance on all three test sets.
Different domains of data have quite large gaps in the density and distribution of entities; for example, OpenSubtitles contains more sentences that do not have any entity. In the experiments above, we filtered the data to keep the same ratio of 'empty' sentences across all domains. We also examined the differences in type distributions. In Table 2, we report entity counts by type and domain. OpenSubtitles has very few ORG and LOC entities, whereas the UN data has very few PER entities. News and the all-domains data are more balanced.

Domain                  PER      ORG      LOC      All
OpenSubtitles           24,036   3,809    5,196    33,041
UN                      1,094    25,875   12,718   39,687
News                    10,977   9,568    28,168   48,713
All Domains Combined    10,454   14,269   17,412   42,135

Table 2: Entity count by type in the pseudo-labeled Chinese NER training data set. We list the domains that were extracted from different parallel data sources; All Domains Combined is a combination of all resources.

In Fig 4, we show the evaluation on People's Daily by type, to understand how the parallel data domain impacts transfer performance for different entity types. News data has the best performance on all types, and OpenSubtitles has a very bad result on ORG. All these observations are consistent with the type distribution in Table 2.
Figure 4: F1 score by type, evaluated on the People's Daily data set. As in Fig 3, we compare the results using different domains of parallel data for the NER transfer.

6 Ablation Study

To better understand the contribution of each stage of the training process, we conducted an ablation study on the German data set with the mBERT model. We compared 6 different settings: (1) the approach proposed in this work, i.e. fine-tune the English model on pseudo-labeled data with the new loss, denoted as the re-weighted (RW) loss; (2) zero-transfer, i.e. direct model transfer trained only on English NER data; (3) fine-tune the English model on pseudo-labeled data with the regular cross-entropy (CE) loss; (4) skip the fine-tuning on English and directly fine-tune mBERT on pseudo-labeled data; (5)(6) fine-tune the model directly on mixed English and pseudo-labeled data simultaneously, with the RW and CE losses respectively.

Setting                          Test P   Test R   F1
Sequential fine-tune with RW     73.1     76.2     74.6
Zero-transfer                    67.6     77.4     72.1
Sequential fine-tune with CE     71.2     74.3     72.7
Skip fine-tune on English        71.0     73.6     72.3
Fine-tune on En/De mixed (CE)    73.1     75.9     74.5
Fine-tune on En/De mixed (RW)    65.5     42.6     51.6

Table 3: Ablation study results evaluated on the CoNLL2003 German NER data. All experiments used the mBERT-base model. RW denotes the re-weighted loss we propose in this paper; CE denotes the regular cross-entropy loss.

From Table 3, we see that both pretraining on English and fine-tuning with pseudo-labeled German data are essential to get the best score. The RW loss performed better in sequential fine-tuning than in simultaneous training with mixed English and German data. This is probably because the noise portion in the English training set is much smaller than in the German pseudo-labeled training set; using the RW loss on English data fails to exploit the fine-grained information in some hard examples and results in an insufficiently optimized model. Another observation is that by training mBERT on a combination of English and German data with the cross-entropy loss, we can get almost the same score as our best model, which is trained in two stages.

7 Conclusion

In this paper, we proposed a new method for NER cross-lingual transfer with parallel corpora. By leveraging the XLM-R model for entity projection, we are able to make the whole pipeline automatic and free from human-engineered features or data, so that it can be applied to any other language that has a rich resource of translation data without extra cost. The method also has the potential to be extended to other NLP tasks, such as question answering. We thoroughly tested the new method on four languages, and it is most effective for Chinese. We also discussed the impact of the parallel data domain on NER transfer performance and found that a combination of parallel corpora from different domains yields the best average results. We further verified the contribution of the pseudo-labeled parallel data through an ablation study. In the future, we will further improve the alignment model precision and explore alternative ways of transfer, such as self-teaching instead of direct fine-tuning. We are also interested in seeing how the proposed approach generalizes to cross-lingual transfer for other NLP tasks.

References

Jian Ni, Georgiana Dinu, and Radu Florian. 2017. Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1470-1480. Association for Computational Linguistics.

Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya, and Eric Fosler-Lussier. 2017. Cross-lingual transfer learning for POS tagging without cross-lingual resources. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2832-2838, Copenhagen, Denmark. Association for Computational Linguistics.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475-2485, Brussels, Belgium. Association for Computational Linguistics.
Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440-8451, Online. Association for Computational Linguistics.

Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A. Smith, and Jaime Carbonell. 2018. Neural cross-lingual named entity recognition with minimal resources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 369-379, Brussels, Belgium. Association for Computational Linguistics.

Jian Ni and Radu Florian. 2019. Neural cross-lingual relation extraction based on bilingual word embedding mapping. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 399-409, Hong Kong, China. Association for Computational Linguistics.

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002.

Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833-844, Hong Kong, China. Association for Computational Linguistics.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Alankar Jain, Bhargavi Paranjape, and Zachary C. Lipton. 2019. Entity projection via machine translation for cross-lingual NER. In Proceedings of EMNLP, pages 1083-1092.

Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2019. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. arXiv preprint arXiv:1907.05791.

M Saiful Bari, Shafiq Joty, and Prathyusha Jwalapuram. 2019. Zero-resource cross-lingual named entity recognition. arXiv preprint arXiv:1911.09812.

Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations Parallel Corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, held in cooperation with HLT-NAACL 2003, Edmonton, Canada, May 31 - June 1, 2003, pages 142-147.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the Tenth Machine Translation Summit, pages 79-86, Phuket, Thailand. AAMT.

Željko Agić and Ivan Vulić. 2019. JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204-3210, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin. 2018. Multilingual BERT README document.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Maud Ehrmann, Marco Turchi, and Ralf Steinberger. 2011. Building a multilingual named entity-annotated corpus using annotation projection. In Proceedings of Recent Advances in Natural Language Processing, pages 118-124. http://aclweb.org/anthology/R11-1017.

Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. 2015. Polyglot-NER: Massive multilingual named entity recognition. In Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, British Columbia, Canada. SIAM. https://doi.org/10.1137/1.9781611974010.66.

Chen-Tse Tsai, Stephen Mayhew, and Dan Roth. 2016. Cross-lingual named entity recognition via wikification. In Proceedings of CoNLL, pages 219-228.

Stephen Mayhew, Chen-Tse Tsai, and Dan Roth. 2017. Cheap translation for cross-lingual named entity recognition. In Proceedings of EMNLP, pages 2526-2535.

Manaal Faruqui and Shankar Kumar. 2015. Multilingual open relation extraction using cross-lingual projection. In Proceedings of NAACL-HLT, pages 1351-1356.

Qianhui Wu, Zijia Lin, Guoxin Wang, Hui Chen, Börje F. Karlsson, Biqing Huang, and Chin-Yew Lin. 2020. Enhanced meta-learning for cross-lingual named entity recognition with minimal resources. In Proceedings of AAAI.

Taesun Moon, Parul Awasthy, Jian Ni, and Radu Florian. 2019. Towards lingua franca named entity recognition with BERT. arXiv preprint arXiv:1912.01389.

Qianhui Wu, Zijia Lin, Börje F. Karlsson, Jian-Guang Lou, and Biqing Huang. 2020. Single-/multi-source cross-lingual NER via teacher-student learning on unlabeled data in target language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Qianhui Wu, Zijia Lin, Börje F. Karlsson, Biqing Huang, and Jian-Guang Lou. 2020. UniTrans: Unifying model transfer and data transfer for cross-lingual named entity recognition with unlabeled data. In Proceedings of IJCAI 2020.

Gina-Anne Levow. 2006. The third international Chinese language processing bakeoff: Word segmentation and named entity recognition. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing.

Nanyun Peng and Mark Dredze. 2015. Named entity recognition for Chinese social media with jointly trained embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28-39.