Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering
Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering

Arij Riabi‡∗  Thomas Scialom⋆∗  Rachel Keraron⋆  Benoît Sagot‡  Djamé Seddah‡  Jacopo Staiano⋆
‡ Inria Paris, Sorbonne Université, CNRS, LIP6, F-75005 Paris, France
⋆ reciTAL, Paris, France
{thomas,rachel,jacopo}@recital.ai  {arij.riabi,benoit.sagot,djame.seddah}@inria.fr

arXiv:2010.12643v1 [cs.CL] 23 Oct 2020

Abstract

Coupled with the availability of large scale datasets, deep learning architectures have enabled rapid progress on the Question Answering task. However, most of those datasets are in English, and the performance of state-of-the-art multilingual models is significantly lower when evaluated on non-English data. Due to high data collection costs, it is not realistic to obtain annotated data for each language one desires to support. We propose a method to improve Cross-lingual Question Answering performance without requiring additional annotated data, leveraging Question Generation models to produce synthetic samples in a cross-lingual fashion. We show that the proposed method allows us to significantly outperform baselines trained on English data only. We report a new state-of-the-art on four multilingual datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).

∗ Equal contribution. The work of Arij Riabi was partly carried out while she was working at reciTAL.

1 Introduction

Question Answering is a fast growing research field, aiming to improve the capabilities of machines to read and understand documents. Significant progress has recently been enabled by the use of large pre-trained language models (Devlin et al., 2019; Raffel et al., 2020), which reach human-level performance on several publicly available benchmarks, such as SQuAD (Rajpurkar et al., 2016) and NewsQA (Trischler et al., 2017).

Given that the majority of large scale Question Answering (QA) datasets are in English (Hermann et al., 2015; Rajpurkar et al., 2016; Choi et al., 2018), the need to develop QA systems targeting other languages is currently being addressed with two cross-lingual QA datasets: XQuAD (Artetxe et al., 2020) and MLQA (Lewis et al., 2020), covering respectively 10 and 7 languages. Due to the cost of annotation, both are limited to an evaluation set only. They are comparable to the validation set of the original SQuAD (see more details in Section 3.3). In both datasets, each paragraph is paired with questions in various languages. This enables the evaluation of models in an experimental scenario in which the input context and question can be in two different languages. This scenario has practical applications, e.g. querying a set of documents in various languages.
Performing this cross-lingual task is complex and remains challenging for current models when only English training data is assumed. Transfer results are shown to rank behind training-language performance (Artetxe et al., 2020; Lewis et al., 2020). In other words, multilingual models fine-tuned only on English data are found to perform better for English than for other languages.

To the best of our knowledge, no alternative to this simple zero-shot transfer has been proposed so far for this task. In this paper, we propose to generate synthetic data in a cross-lingual fashion, borrowing the idea from monolingual QA research efforts (Duan et al., 2017). On English corpora, generating synthetic questions has been shown to significantly improve the performance of QA models (Golub et al., 2017; Alberti et al., 2019). However, the adaptation of this technique to cross-lingual QA is not straightforward: cross-lingual text generation is a challenging task per se which has not yet been extensively explored, in particular when no multilingual training data is available.

We explore two Question Generation scenarios: i) requiring only SQuAD data; and ii) using a translation tool to obtain translated versions of SQuAD. As expected, the method leveraging a translator performs best. Leveraging such synthetic data, our best model obtains significant improvements on XQuAD and MLQA over the state-of-the-art, for both Exact Match and F1 scores. In addition, we evaluate the QA models on languages not seen during training (even in the synthetic data). On SQuAD-it and PIAF (fr), two Italian and French evaluation sets, we report a new state-of-the-art. This indicates that the proposed method allows capturing better multilingual representations beyond the training languages. Our method paves the way toward multilingual QA domain adaptation, especially for under-resourced languages.
2 Related Work

Question Answering (QA)
QA is the task in which, given a context and a question, a model has to find the answer. The interest in Question Answering goes back a long way: in a 1965 survey, Simmons (1965) reported fifteen implemented English-language question-answering systems. More recently, with the rise of large scale datasets (Hermann et al., 2015) and large pre-trained models (Devlin et al., 2019), performance drastically increased, approaching human level on standard benchmarks (see for instance the SQuAD leaderboard[1]). More challenging evaluation benchmarks have recently been proposed: Dua et al. (2019) released the DROP dataset, for which the annotators are encouraged to annotate adversarial questions; Burchell et al. (2020) released the MSQ dataset, consisting of multi-sentence questions. However, all these works are focused on English. Another popular research direction focuses on the development of multilingual QA models. For this purpose, the first step has been to provide the community with multilingual evaluation sets: Artetxe et al. (2020) and Lewis et al. (2020) concurrently proposed two different evaluation sets which are comparable to the SQuAD development set. Both reach the same conclusion: due to the lack of non-English training data, models do not achieve the same performance in non-English languages as they do in English. To the best of our knowledge, no method has been proposed to fill this gap.

[1] https://rajpurkar.github.io/SQuAD-explorer/

Question Generation
QG can be seen as the dual task of QA: the input is composed of the answer and the paragraph containing it, and the model is trained to generate the question. Proposed by Rus et al. (2010), it has leveraged the development of new QA datasets (Zhou et al., 2017; Scialom et al., 2019). Similar to QA, significant performance improvements have been obtained using pre-trained language models (Dong et al., 2019). Still, due to the lack of multilingual datasets, most previous works have been limited to monolingual text generation. We note the exceptions of Kumar et al. (2019) and Chi et al. (2020), who resorted to multilingual pre-training before fine-tuning on monolingual downstream NLG tasks. However, the quality of the generated questions is reported to be below that of the corresponding English ones.
Question Generation for Question Answering
Data augmentation via synthetic data generation is a well-known technique to improve models' accuracy and generalisation. It has found successful application in several areas, such as time series analysis (Forestier et al., 2017) and computer vision (Buslaev et al., 2020). In the context of QA, generating synthetic questions to complete a dataset has been shown to improve QA performance (Duan et al., 2017; Alberti et al., 2019). So far, all these works have focused on English QA, given the difficulty of generating questions in other languages without available data. This lack of data, and the difficulty of obtaining some, constitutes the main motivation of our work and justifies exploring cost-effective approaches such as data augmentation via the generation of questions.

3 Data

3.1 English Training Data

SQuAD_en
The original SQuAD (Rajpurkar et al., 2016), which we refer to as SQuAD_en for clarity in this paper. It is one of the first, and among the most popular, large scale QA datasets. It contains about 100K question/paragraph/answer triplets in English, annotated via Mechanical Turk.[2]

[2] Two versions of SQuAD have been released: v1.1, used in this work, and v2.0. The latter contains "unanswerable questions" in addition to those from v1.1. We use the former, since the multilingual evaluation datasets, MLQA and XQuAD, do not include unanswerable questions.

QG datasets
Any QA dataset can be reversed into a QG dataset by switching the generation targets from the answers to the questions. In this paper, we use the qg subscript to specify when a dataset is used for QG (e.g. SQuAD_en,qg indicates the English SQuAD data in QG format).
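As an illustration of this reversal, the sketch below converts the official SQuAD v1.1 JSON into (source, target) pairs for QG. The field names follow the public SQuAD v1.1 schema; the answer-separator-context layout of the source string is only a placeholder here, the actual input formatting used in this work is described in Section 4.2.

```python
import json

def squad_to_qg(squad_json_path):
    """Reverse a SQuAD-style dataset into QG examples:
    the input is the answer plus its paragraph, the target is the question."""
    with open(squad_json_path, encoding="utf-8") as f:
        squad = json.load(f)

    qg_examples = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # v1.1: every question is answerable; take the first gold answer
                answer = qa["answers"][0]["text"]
                qg_examples.append({
                    "source": f"{answer} [SEP] {context}",  # illustrative format
                    "target": qa["question"],
                })
    return qg_examples

# e.g. squad_en_qg = squad_to_qg("train-v1.1.json")
```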
3.2 Synthetic Training Sets

SQuAD_trans
A machine-translated version of the SQuAD train set in the seven languages of MLQA, released by the MLQA authors together with their paper.

WikiScrap
We collected 500 Wikipedia articles, following the SQuAD_en protocol, for all the languages present in MLQA. They are not paired with any question or answer. We use them as contexts to generate synthetic multilingual questions, as detailed in Section 4.2. We use project Nayuki's code[3] to parse the top 10K Wikipedia pages according to the PageRank algorithm (Page et al., 1999). We then filter out paragraphs whose character length falls outside the [500, 1500] interval. Articles with fewer than 5 paragraphs are discarded, since they tend to be less developed, of lower quality, or mere redirection pages. Out of the filtered articles, we randomly selected 500 per language.

[3] https://www.nayuki.io/page/computing-wikipedias-internal-pageranks
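A minimal sketch of this article and paragraph filtering, assuming the scraped articles are already available as lists of paragraph strings; whether the article-length check is applied before or after the paragraph-length filter is not specified, so here it is applied after.

```python
import random

MIN_CHARS, MAX_CHARS = 500, 1500      # paragraph character-length interval
MIN_PARAGRAPHS = 5                    # shorter articles are likely stubs/redirects
ARTICLES_PER_LANGUAGE = 500

def filter_articles(articles, seed=0):
    """articles: list of (title, [paragraph, ...]) tuples for one language."""
    kept = []
    for title, paragraphs in articles:
        # keep only paragraphs within the character-length interval
        paragraphs = [p for p in paragraphs if MIN_CHARS <= len(p) <= MAX_CHARS]
        # discard under-developed articles
        if len(paragraphs) >= MIN_PARAGRAPHS:
            kept.append((title, paragraphs))
    random.Random(seed).shuffle(kept)
    return kept[:ARTICLES_PER_LANGUAGE]
```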
3.3 Multilingual Evaluation Sets

XQuAD
(Artetxe et al., 2020) is a human translation of the SQuAD_en development set into 10 languages (Arabic, Chinese, German, Greek, Hindi, Russian, Spanish, Thai, Turkish, and Vietnamese), with 1k QA pairs for each language.

MLQA
(Lewis et al., 2020) is an evaluation dataset in 7 languages (English, Arabic, Chinese, German, Hindi, Spanish, and Vietnamese). The dataset is built from aligned Wikipedia sentences across at least two languages (full alignment between all languages being impossible), with the goal of providing natural rather than translated paragraphs. The QA pairs are manually annotated on the English sentences and then human-translated on the aligned sentences. The dataset contains about 46K aligned QA pairs in total.

Language-specific benchmarks
In addition to the two aforementioned multilingual evaluation sets, we benchmark our models on three language-specific datasets for French, Italian and Korean, as detailed below. We choose these datasets since none of these languages is present in XQuAD or MLQA. Hence, they allow us to evaluate our models in a scenario where the target language is not available during training, even in the synthetic questions.

PIAF
Keraron et al. (2020) provided an evaluation set in French following the SQuAD protocol, containing 3835 examples.

KorQuAD 1.0
The Korean Question Answering Dataset (Lim et al., 2019), a Korean dataset also built following the SQuAD protocol.

SQuAD-it
Derived from SQuAD_en, it was obtained via semi-automatic translation into Italian (Croce et al., 2018).

4 Models

Recent works (Raffel et al., 2020; Lewis et al., 2019) have shown that classification tasks can be framed as a text-to-text problem, achieving state-of-the-art results on established benchmarks such as GLUE (Wang et al., 2018). Accordingly, we employ the same architecture for both the Question Answering and Question Generation tasks. This also allows fairer comparisons for our purposes, by removing differences between QA and QG architectures and their potential impact on the results obtained. In particular, we use a distilled version of XLM-R (Conneau et al., 2020): MiniLM-m (Wang et al., 2020) (see Section 4.3 for further details).

4.1 Baselines

QA_No-synth
Following previous works, we fine-tune the multilingual models on SQuAD_en and consider them as our baselines.

English as Pivot
Leveraging translation models, we consider a second baseline method, which uses English as a pivot. First, both the question in language Lq and the paragraph in language Lp are translated into English. We then invoke the baseline model described above, QA_No-synth, to predict the answer. Finally, the predicted answer is translated back into the target language Lp. We used the Google Translate API.[4]

[4] https://translate.google.com
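For illustration, the English-as-pivot baseline can be sketched as below; `translate` and `qa_model` are placeholders standing in for the translation API and the QA_No-synth model, not the actual interfaces used in this work.

```python
def answer_via_english_pivot(question, context, q_lang, c_lang, translate, qa_model):
    """English-as-pivot baseline: translate inputs to English, answer in English,
    then translate the predicted answer back to the context language.

    translate(text, src, tgt) and qa_model(question, context) are placeholders
    for a translation service and an extractive QA model fine-tuned on English SQuAD.
    """
    question_en = translate(question, src=q_lang, tgt="en")
    context_en = translate(context, src=c_lang, tgt="en")
    answer_en = qa_model(question=question_en, context=context_en)
    # Map the English answer back into the language of the original paragraph
    return translate(answer_en, src="en", tgt=c_lang)
```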
4.2 Synthetic Data Augmentation

In this work, we consider data augmentation via the generation of synthetic questions to improve QA performance. Different training schemes for the question generator are possible, resulting in synthetic data of different quality. Before this work, its impact on the final QA system remained unexplored in the multilingual context.

For all the following experiments, only the synthetic data changes. Given a specific set of synthetic data, we always follow the same two-stage protocol, similar to Alberti et al. (2019): we first train the QA model on the synthetic QA data, then on SQuAD_en. We also tried to train the QA model in one stage, with all the synthetic and human data shuffled together, but observed no improvement over the baseline.

We explored two different synthetic generation modes:

Synth
The QG model is trained on SQuAD_en,qg (i.e., English data only) and the synthetic data are generated on WikiScrap. Under this setup, the only annotated samples this model has access to are those from SQuAD_en.

Synth+trans
The QG model is trained on SQuAD_trans,qg in addition to SQuAD_en,qg. The questions can be in a different language than the context. Hence, the model needs an indication of the language in which it is expected to generate the question. To control the target language, we use a specific prompt per language, defined by a special token ⟨lang_Y⟩ that corresponds to the desired target language Y. Thus, the input is structured as Answer ⟨sep⟩ Context ⟨lang_Y⟩, where ⟨sep⟩ is a special separator token and ⟨lang_Y⟩ indicates to the model in what language the question should be generated. These language prompts offer flexibility on the target language. Similar techniques are used in the literature to control the style of the output (Keskar et al., 2019; Scialom et al., 2020; Chi et al., 2020).
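The following sketch shows one way such language-controlled QG inputs could be built with the Hugging Face tokenizer API. The concrete token strings (⟨sep⟩ and the per-language tokens) and the checkpoint name are illustrative placeholders, not the exact ones used in this work.

```python
from transformers import AutoTokenizer

LANG_TOKENS = {lang: f"<lang_{lang}>" for lang in
               ["en", "es", "de", "ar", "hi", "vi", "zh"]}  # MLQA languages
SEP_TOKEN = "<sep>"

# Tokenizer of the multilingual encoder; the checkpoint name is a placeholder.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
tokenizer.add_special_tokens(
    {"additional_special_tokens": [SEP_TOKEN] + list(LANG_TOKENS.values())}
)
# The QG model's embedding matrix must then be resized, e.g.
# model.resize_token_embeddings(len(tokenizer)); the new rows are randomly initialised.

def build_qg_input(answer, context, target_lang):
    """Structure the QG input as: Answer <sep> Context <lang_Y>."""
    return f"{answer} {SEP_TOKEN} {context} {LANG_TOKENS[target_lang]}"
```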
4.3 Implementation details

For all our experiments we use Multilingual MiniLM v1 (MiniLM-m) (Wang et al., 2020), a 12-layer architecture with a hidden size of 384, distilled from the multilingual XLM-R Base (Conneau et al., 2020). With only 66M parameters, it is an order of magnitude smaller than state-of-the-art architectures such as BERT-large or XLM-large. We used the official Microsoft implementation.[5] For all the experiments, both QG and QA, we trained the model for 5 epochs, using the default hyper-parameters. We used a single NVIDIA 2080 Ti GPU with 11GB of RAM; training times amount to circa 4 and 2 hours for Question Generation and Question Answering, respectively. To evaluate our models, we used the official MLQA evaluation scripts.[6]

[5] Publicly available at https://github.com/microsoft/unilm/tree/master/minilm.
[6] https://github.com/facebookresearch/MLQA/blob/master/mlqa_evaluation_v1.py
5 Results

5.1 Question Generation

We report examples of generated questions in Table 1.

Controlling the Target Language
In the context of multilingual text generation, controlling the target language is not trivial. When a QA model is trained only on English data, at inference time, given a non-English paragraph, it predicts the answer in the input language, as one would expect, since it is an extractive process. Ideally, we would like to observe the same behaviour for a Question Generation model trained only on English data (such as Synth), leveraging the multilingual pre-training. Conversely to QA, QG is a language generation task. Multilingual generation is much more challenging, as the model's decoding ability plays a major role. When a QG model is fine-tuned only on English data (i.e. SQuAD_en), its control of the target language suffers from catastrophic forgetting: the input language does not propagate to the generated text. The questions are generated in English, while still being related to the context. For instance, in Table 1, we observe that the QG_synth model outputs English questions for the paragraphs in Chinese and Spanish. The same phenomenon was reported by Chi et al. (2020).

Cross-language training
To tackle the aforementioned limitation on target-language controllability (i.e. to enable generation in languages other than English), multilingual data is needed. We can leverage the translated version of the dataset to add the needed non-English examples. As detailed in Section 4.2, we simply use a specific prompt that corresponds to the target language (with N different prompts corresponding to the N languages present in the dataset). In Table 1, we show how QG_synth+trans can generate questions in the same language as the input. Further, the questions seem much more relevant, coherent and fluent compared to those produced by QG_synth: for the Spanish paragraph, the question is well formed and focused on the input answer; the Chinese one (see the last row of Table 1 for QG_synth+trans) is perfectly written.

Paragraph (EN): Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver's Executive Vice President of Football Operations and General Manager.
  Answer: Broncos
  QG_synth: What team did John Elway lead to victory at age 38?
  QG_synth+trans (target language = en): What team did John Elway lead to win in the Super Bowl?

Paragraph (ES): Peyton Manning se convirtió en el primer mariscal de campo de la historia en llevar a dos equipos diferentes a participar en múltiples Super Bowls. Ademas, es con 39 años, el mariscal de campo más longevo de la historia en jugar ese partido. El récord anterior estaba en manos de John Elway —mánager general y actual vicepresidente ejecutivo para operaciones futbolísticas de Denver— que condujo a los Broncos a la victoria en la Super Bowl XXXIII a los 38 años de edad.
  Answer: Broncos
  QG_synth: Where did Peyton Manning condujo?
  QG_synth+trans (target language = es): Qué equipo ganó el récord anterior? (Which team won the previous record?)
  QG_synth+trans (target language = en): What team did Menning win in the Super Bowl?

Paragraph (ZH): 培顿·曼宁成为史上首位带领两支不同球队多次进入超级碗的四分卫。他也以39岁高龄参加超级碗而成为史上年龄最大的四分卫。过去的记录是由约翰·埃尔维保持的,他在38岁时带领野马队赢得第33届超级碗,目前担任丹佛的橄榄球运营执行副总裁兼总经理
  Answer: 野马队 (Broncos)
  QG_synth: What is the name for the name that the name is used?
  QG_synth+trans (target language = zh): 约翰·埃尔维在13岁时带领哪支球队赢得第33届超级碗? (Which team did John Elway lead to win the 33rd Super Bowl at the age of 13?)
  QG_synth+trans (target language = en): What team won the 33th Super Bowl?

Table 1: Examples of questions generated by the different models on an XQuAD paragraph in different languages. For QG_synth+trans, we report the outputs given two target languages: the language of the context, and English.
5.2 Question Answering

We report the main results of our experiments on XQuAD and MLQA in Table 2. The scores correspond to the average over all the possible combinations of languages (de-de, de-ar, etc.).

Model                             #Params   Trans.   XQuAD (F1 / EM)   MLQA (F1 / EM)
MiniLM (Wang et al., 2020)        66M       No       42.2 / 29.5       38.4 / 26.0
XLM (Hu et al., 2020)             340M      No       68.4 / 53.0       65.4 / 47.9
English as Pivot                  66M       Yes      36.1 / 23.0       -
MiniLM+synth                      66M       No       44.8 / 33.1       39.8 / 27.5
MiniLM+synth+trans                66M       Yes      63.3 / 49.1       56.1 / 41.4
XLM+synth+trans                   340M      Yes      74.3 / 59.2       65.3 / 49.2

Table 2: Results (F1 / EM) of the different QA models on XQuAD and MLQA.

English as Pivot
Using English as a pivot does not lead to good results. This may be due to the evaluation metrics being based on n-gram similarity. For extractive QA, the F1 and EM metrics measure the overlap between the predicted answer and the ground truth. Therefore, meaningful answers worded differently are penalized, a situation that is likely to occur because of the back-translation mechanism. Automatic evaluation is thus challenging for this setup, and suffers from difficulties similar to those encountered in text generation (Sulem et al., 2018). As an additional downside, it requires multiple translations at inference time. For these reasons, we decided not to explore this approach further.
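To make the overlap-based metrics concrete, here is a sketch of the standard SQuAD-style token-level F1 (the official MLQA script extends this with language-specific normalisation, omitted here): an answer that comes back from the translation round trip with different wording shares few tokens with the gold span and is therefore penalised.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace
    (essentially the SQuAD v1.1 answer normalisation)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1_score(prediction, ground_truth):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A back-translated prediction such as "the Denver Broncos" vs. gold "Broncos"
# still gets partial credit, but an answer worded entirely differently after the
# translation round trip scores 0.
print(f1_score("the Denver Broncos", "Broncos"))  # ~0.67
```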
Synthetic without translation (+synth)
Compared to the MiniLM baseline, we observe a small increase in performance for MiniLM+synth (the Exact Match increases from 29.5 to 33.1 on XQuAD and from 26.0 to 27.5 on MLQA). During the self-supervised pre-training stage, the model was exposed to multilingual inputs. Yet, for a given input, the target language was always consistent. Hence, the model was never exposed to different languages within the same input text. The synthetic inputs are composed of questions mostly in English (see examples in Table 1) while the contexts can be in any language. Therefore, the QA model is exposed for the first time to a cross-lingual scenario. We hypothesise that such a cross-lingual ability is not innate for a default multilingual model: exposing the model to this scenario allows it to develop this ability and contributes to improving its performance.

Synthetic with translation (+synth+trans)
For MiniLM+synth+trans, we obtain a much larger improvement over its baseline, MiniLM, compared to MiniLM+synth, on both MLQA and XQuAD. This confirms the intuition developed in the paragraph above: independently of the multilingual capacity of the model, a cross-lingual ability is developed when the two input components are not exclusively written in the same language. In Section 5.3, we discuss this phenomenon more in depth.

5.3 Discussion

Cross Lingual Generalisation
To explore the models' effectiveness in dealing with cross-lingual inputs, we report in Figure 1 the performance of our MiniLM+synth+trans setup, varying the number of samples and the languages present in the synthetic data. The abscissa x corresponds to the progressively increasing number of synthetic samples used; x = 0 corresponds to the baseline MiniLM+trans, where the model has access only to the original English data from SQuAD_en. We explore two sampling strategies for the synthetic examples:

1. All Languages corresponds to sampling the examples from any of the different languages.

2. Conversely, for Not All Languages, we progressively add the different languages: for x = 125K, all 125K synthetic samples are on Spanish inputs; then German is added in addition to Spanish; then Hindi; finally, with Arabic, all the MLQA languages are present.

[Figure 1 (plot not reproduced): three panels; x-axis: number of synthetic questions; y-axes: F1 score (left, middle) and relative improvement in % (right).]

Figure 1: Left: F1 score on MLQA for models trained with different amounts of synthetic data in two setups: for All Languages, the synthetic questions are sampled from all 5 different MLQA languages; for Not All Languages, the synthetic questions are sampled progressively from only one language, then two, ..., up to all 5 for the last point, which corresponds to All Languages. Middle: same as on the left, but evaluated on XQuAD. Right: the relative variation in performance for these models.

In Figure 1, we observe that the performance for All Languages increases largely at the beginning, then remains mostly stable. Conversely, we observe a gradual improvement for Not All Languages, as more languages are present in the training. This indicates that when all the languages are present in the synthetic data, the model immediately obtains a cross-lingual ability. However, it seems that even with only one language pair present, the model is able to develop a cross-lingual ability that benefits, in part, other languages: at the right-hand side of Figure 1, we can see that most of the improvement happens given only one cross-lingual language pair (i.e. English and Spanish).
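As an illustration of the two sampling strategies above, a minimal sketch; the helper names are assumptions, and the language progression follows the one described in the text.

```python
import random

def sample_all_languages(pools, budget, seed=0):
    """All Languages: draw `budget` synthetic examples from the union of the
    per-language pools. `pools` maps a language code to a list of examples."""
    union = [example for pool in pools.values() for example in pool]
    return random.Random(seed).sample(union, budget)

def sample_not_all_languages(pools, budget, order=("es", "de", "hi", "ar")):
    """Not All Languages: fill the budget language by language, so early points
    contain a single language and later points progressively add new ones
    (Spanish, then German, then Hindi, then Arabic, as described above)."""
    selected = []
    for lang in order:
        if len(selected) >= budget:
            break
        selected.extend(pools[lang][: budget - len(selected)])
    return selected
```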
Unseen Languages
To measure the benefit of our approach on unseen languages (i.e. languages not present in the synthetic data from MLQA/XQuAD), we test our models on three QA evaluation sets: PIAF (fr), KorQuAD and SQuAD-it (see Section 3.3). The results, reported in Table 3, are consistent with the previous experiments on MLQA and XQuAD. Our MiniLM+synth+trans model improves by more than 4 Exact Match points over its baseline, while XLM+synth+trans obtains a new state-of-the-art. Notably, our multilingual XLM+synth+trans outperforms CamemBERT on PIAF, while the latter is a purely monolingual, in-domain language model.
Model                             #Params   PIAF (fr)     KorQuAD       SQuAD-it
MiniLM                            66M       58.9 / 34.3   53.3 / 40.5   72.0 / 57.7
BERT-m (Croce et al., 2018)       110M      -             -             74.1 / 62.5
CamemBERT (Martin et al., 2020)   340M      68.9 / -      -             -
MiniLM+synth                      66M       58.6 / 34.5   52.1 / 39.0   71.3 / 58.0
MiniLM+synth+trans                66M       63.9 / 40.6   60.0 / 48.8   74.5 / 62.0
XLM+synth+trans                   340M      72.1 / 47.1   63.0 / 52.8   80.4 / 67.6

Table 3: Zero-shot results (F1 / EM) on PIAF, KorQuAD and SQuAD-it for our different QA models, compared to various baselines. Note that CamemBERT actually corresponds to a French version of RoBERTa, an architecture widely outperforming BERT.

Hidden distillation effect
The relative improvement of our best synthetic configuration, +synth+trans, over the baseline is above 60% in EM for MiniLM (from 29.5 to 49.5 on XQuAD and from 26.0 to 41.4 on MLQA). It is much larger than for XLM, where the EM improved by +11.7% on XQuAD and +2.71% on MLQA. This indicates that XLM contains more cross-lingual transfer ability than MiniLM. Since the latter is a distilled version of the former, we hypothesize that such abilities may have been lost during the distillation process. Such a loss of generalisation can be difficult to identify, and opens questions for future work.

On target language control for text generation
When relying on the translated versions of SQuAD, the target language for generating synthetic questions can easily be controlled, and this results in fluent and relevant questions in the various languages (see Table 1). However, one limitation of this approach is that synthetic questions can only be generated in the languages that were available during training: the ⟨lang_Y⟩ prompts are special tokens that are randomly initialised when fine-tuning QG on SQuAD_trans,qg. Before fine-tuning, they are not semantically related to the name of the corresponding language ("English", "Español", etc.) and, thus, the learned representation for the ⟨lang_Y⟩ tokens is limited to the languages present in the training set. To the best of our knowledge, no method so far allows generalizing this target control to unseen languages. It would be valuable, for instance, to be able to generate synthetic data in Korean, French and Italian without having to translate the entire SQuAD_en dataset into these three languages in order to then fine-tune the QG model.

To this purpose, we report, alas as a negative result, the following attempt: instead of controlling the target language with a special, randomly initialised token, we used a token semantically related to the language word: "English", "Español" for Spanish, or "中文" for Chinese. The intuition is that the model might adopt the correct language at inference, even for a target language unseen during training.[8] A similar intuition has been explored in GPT-2 (Radford et al., 2019): the authors report an improvement for summarization when the input text is followed by "TL;DR" (i.e. Too Long; Didn't Read). At inference time, we evaluated this approach on French with the prompt language=Français. Unfortunately, the model did not succeed in generating text in French.

Controlling the target language in the context of multilingual text generation remains underexplored, and progress in this direction could have direct applications to improve this work, and beyond.
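A sketch of the two prompting variants compared in this negative result; the input layout mirrors the one of Section 4.2, and the token strings are illustrative rather than the exact ones used.

```python
def qg_input_special_token(answer, context, lang_code):
    """Synth+trans variant: a randomly initialised special token per language."""
    return f"{answer} <sep> {context} <lang_{lang_code}>"

def qg_input_semantic_prompt(answer, context, language_name):
    """Negative-result variant: prompt with the natural-language name of the
    target language, e.g. language=Français, hoping the model extrapolates
    to languages unseen during QG fine-tuning."""
    return f"{answer} <sep> {context} language={language_name}"
```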
6 Conclusion

In this work, we presented a method to generate synthetic QA datasets in a multilingual fashion and showed how QA models can benefit from it, reporting large improvements over the baseline. Our method contributes to filling the gap between English and other languages. Moreover, it has been shown to generalize even to languages not present in the synthetic corpus (e.g. French, Italian, Korean).

In future work, we plan to investigate whether the proposed data augmentation method can be applied to other multilingual tasks, such as classification. We will also experiment more in depth with different strategies to control the target language of a model and to extrapolate to unseen ones. We believe that this difficulty to extrapolate might highlight an important limitation of current language models.

[8] By unseen during training, we mean not present in the QG dataset; obviously, the language should have been present in the first self-supervised stage.
Acknowledgments

Djamé Seddah was partly funded by the ANR project PARSITI (ANR-16-CE33-0021). Arij Riabi was partly funded by Benoît Sagot's chair in the PRAIRIE institute, as part of the French national agency ANR "Investissements d'avenir" programme (ANR-19-P3IA-0001).

References

Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. 2019. Synthetic QA corpora generation with roundtrip consistency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6168-6173, Florence, Italy. Association for Computational Linguistics.

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623-4637, Online. Association for Computational Linguistics.

Laurie Burchell, Jie Chi, Tom Hosking, Nina Markl, and Bonnie Webber. 2020. Querent intent in multi-sentence questions. arXiv preprint arXiv:2010.08980.

Alexander Buslaev, Vladimir I. Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail Druzhinin, and Alexandr A. Kalinin. 2020. Albumentations: Fast and flexible image augmentations. Information, 11(2):125.

Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. 2020. Cross-lingual natural language generation via pre-training. In AAAI, pages 7570-7577.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174-2184, Brussels, Belgium. Association for Computational Linguistics.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440-8451, Online. Association for Computational Linguistics.

Danilo Croce, Alexandra Zelenanska, and Roberto Basili. 2018. Neural learning for question answering in Italian. In AI*IA 2018 - Advances in Artificial Intelligence, pages 389-402, Cham. Springer International Publishing.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13063-13075.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368-2378, Minneapolis, Minnesota. Association for Computational Linguistics.

Nan Duan, Duyu Tang, Peng Chen, and Ming Zhou. 2017. Question generation for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 866-874, Copenhagen, Denmark. Association for Computational Linguistics.

Germain Forestier, François Petitjean, Hoang Anh Dau, Geoffrey I. Webb, and Eamonn Keogh. 2017. Generating synthetic time series to augment sparse datasets. In 2017 IEEE International Conference on Data Mining (ICDM), pages 865-870. IEEE.

David Golub, Po-Sen Huang, Xiaodong He, and Li Deng. 2017. Two-stage synthesis networks for transfer learning in machine comprehension. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 835-844, Copenhagen, Denmark. Association for Computational Linguistics.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693-1701.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.

Rachel Keraron, Guillaume Lancrenon, Mathilde Bras, Frédéric Allary, Gilles Moyse, Thomas Scialom, Edmundo-Pavel Soriano-Morales, and Jacopo Staiano. 2020. Project PIAF: Building a native French question-answering dataset. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5481-5490, Marseille, France. European Language Resources Association.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Vishwajeet Kumar, Nitish Joshi, Arijit Mukherjee, Ganesh Ramakrishnan, and Preethi Jyothi. 2019. Cross-lingual training for automatic question generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4863-4872, Florence, Italy. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2020. MLQA: Evaluating cross-lingual extractive question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7315-7330, Online. Association for Computational Linguistics.

Seungyoung Lim, Myungji Kim, and Jooyoul Lee. 2019. KorQuAD1.0: Korean QA dataset for machine reading comprehension. arXiv preprint arXiv:1909.07005.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: A tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203-7219, Online. Association for Computational Linguistics.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392, Austin, Texas. Association for Computational Linguistics.

Vasile Rus, Brendan Wyse, Paul Piwek, Mihai Lintean, Svetlana Stoyanchev, and Christian Moldovan. 2010. The first question generation shared task evaluation challenge. In Proceedings of the 6th International Natural Language Generation Conference.

Thomas Scialom, Benjamin Piwowarski, and Jacopo Staiano. 2019. Self-attention architectures for answer-agnostic neural question generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6027-6032, Florence, Italy. Association for Computational Linguistics.

Thomas Scialom, Serra Sinem Tekiroglu, Jacopo Staiano, and Marco Guerini. 2020. Toward stance-based personas for opinionated dialogues. arXiv preprint arXiv:2010.03369.

Robert F. Simmons. 1965. Answering English questions by computer: A survey. Communications of the ACM, 8(1):53-70.

Elior Sulem, Omri Abend, and Ari Rappoport. 2018. BLEU is not suitable for the evaluation of text simplification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 738-744, Brussels, Belgium. Association for Computational Linguistics.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191-200, Vancouver, Canada. Association for Computational Linguistics.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353-355, Brussels, Belgium. Association for Computational Linguistics.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957.

Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, and Ming Zhou. 2017. Neural question generation from text: A preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pages 662-671. Springer.