FQuAD2.0: French Question Answering and knowing that you know nothing

Quentin Heinrich, Gautier Viaud, Wacim Belblidia
Illuin Technology, Paris, France
{quentin, gautier, wacim}@illuin.tech

arXiv:2109.13209v1 [cs.CL] 27 Sep 2021

Abstract

Question Answering, including Reading Comprehension, is one of the NLP research areas that has seen significant scientific breakthroughs over the past few years, thanks to the concomitant advances in Language Modeling. Most of these breakthroughs, however, are centered on the English language. In 2020, as a first strong initiative to bridge the gap to the French language, Illuin Technology introduced FQuAD1.1, a French native Reading Comprehension dataset composed of 60,000+ question-and-answer samples extracted from Wikipedia articles. Nonetheless, Question Answering models trained on this dataset have a major drawback: they are not able to predict when a given question has no answer in the paragraph of interest, and therefore make unreliable predictions in various industrial use cases. In the present work, we introduce FQuAD2.0, which extends FQuAD with 17,000+ unanswerable questions, annotated adversarially in order to be similar to answerable ones. This new dataset, comprising a total of almost 80,000 questions, makes it possible to train French Question Answering models able to distinguish unanswerable questions from answerable ones. We benchmark several models with this dataset: our best model, a fine-tuned CamemBERT-large, achieves an F1 score of 82.3% on this classification task, and an F1 score of 83% on the Reading Comprehension task.

1 Introduction

Question Answering (QA) is a central task in Natural Language Understanding (NLU), with numerous industrial applications such as searching for information in large corpora, extracting information from conversations, or form filling. Within this domain, Reading Comprehension has gained a lot of traction in the past years thanks to two main factors. First, the release of numerous datasets such as SQuAD1.1 (Rajpurkar et al., 2016), SQuAD2.0 (Rajpurkar et al., 2018), BoolQ (Clark et al., 2019), CoQA (Reddy et al., 2018), and Natural Questions (Kwiatkowski et al., 2019), to cite only a few. Second, the progress in Language Modeling brought by the introduction of transformer models (Vaswani et al., 2017), which leverage self-supervised training on very large text corpora followed by fine-tuning on a downstream task (Devlin et al., 2018). This process has now become a de facto standard for most Natural Language Processing (NLP) tasks and has also advanced the state of the art for most of them, including Question Answering.

Whilst most of these recent transformer models are English models, French language models have also been released, in particular CamemBERT (Martin et al., 2019) and FlauBERT (Le et al., 2019), as well as multilingual models such as mBERT (Pires et al., 2019) or XLM-RoBERTa (Conneau et al., 2019). These models fostered the state of the art in NLP for the French language, allowing it to benefit from this large-scale transfer learning mechanism. However, native French resources for Question Answering remain scarcer than English resources. Nonetheless, in 2020, FQuAD1.1 was introduced by Illuin Technology (d'Hoffschmidt et al., 2020). With 60,000+ question-answer pairs, it enabled the development of QA models surpassing human performance on the Reading Comprehension task.

Although the introduction of this resource was a major leap forward for French QA, the obtained models suffered from an important weakness: they were trained to consistently find an answer to the specified question by reading the associated context. In real-life applications, however, it is often the case that asked questions do not have an answer in the associated context. For example, let us imagine that we are building a system designed to automatically fill a form with relevant
information from property advertisements. We could be interested in the type of property, its surface area and its number of rooms. By asking questions such as "How many rooms does the property have?" to a QA model, we would be able to extract this information. But it would also be important that our model is able to predict when such a question does not find an answer in the provided text advertisement, as this situation often arises.

As FQuAD1.1 contains solely answerable questions, the models trained on this dataset did not learn to determine when a question is unanswerable given the associated context. To overcome this difficulty, we extended FQuAD with unanswerable questions, annotated adversarially in order to be close to answerable ones.

Our contributions are as follows:

• We introduce the FQuAD2.0 dataset, which extends FQuAD1.1 with 17,000+ unanswerable questions, hand-crafted to be difficult to distinguish from answerable questions, making it the first French adversarial Question Answering dataset, with a grand total of almost 80,000 questions.

• We evaluate how models benefit from being trained on adversarial questions to learn when questions are unanswerable. To do so, we fine-tune CamemBERT models of varying sizes (large, base) on the training set of FQuAD2.0 and evaluate them on the development set of FQuAD2.0. We take interest both in the ability of a model to distinguish unanswerable questions from answerable ones, and in a possible performance drop in the precision of answers provided for answerable questions, due to the addition of unanswerable questions during training. We also study the impact of the number of adversarial questions used and obtain learning curves for each model.

• By using both the FQuAD2.0 and SQuAD2.0 datasets, we study how multilingual models fine-tuned solely on question-answer pairs of a single language (English) perform in another language (French). We also take interest in the performance of such models trained on both French and English datasets.

2 Related work

In the past few years, several initiatives emerged to promote Reading Comprehension in French. With both the release of large-scale Reading Comprehension datasets in English such as SQuAD (Rajpurkar et al., 2016, 2018), and the drastic improvement of Neural Machine Translation with the emergence of the attention mechanism within model architectures (Bahdanau et al., 2016; Vaswani et al., 2017), the most natural path was at first to machine-translate such English datasets to French. This is what works such as Asai et al. (2018) or Kabbadj (2018) experimented with. However, translating a Reading Comprehension dataset presents inherent difficulties. Indeed, in some cases, the context translation can reformulate the answer such that it is not possible to match it to the answer translation, making the sample unusable. Specific translation methods to mitigate this difficulty were for example proposed in Asai et al. (2018) or Carrino et al. (2019a). Such translated datasets then enable the training of Question Answering models with French data. However, d'Hoffschmidt et al. (2020) demonstrate that the use of French native datasets such as FQuAD1.1 brings far better models. In addition to FQuAD, another French native Question Answering dataset has been released: PiAF (Rachel et al., 2020). This smaller dataset, complementary to FQuAD1.1, contains 3,835 question-answer pairs in its 1.0 version and up to 9,225 question-answer pairs in its 1.2 version.

We described in Section 1 the interest of adversarial Question Answering datasets for some use cases. This concept was first introduced in Rajpurkar et al. (2018) with the presentation of SQuAD2.0, an English dataset extending SQuAD1.1 (Rajpurkar et al., 2016) with over 50,000 unanswerable questions. The purpose of SQuAD2.0 is similar to FQuAD2.0's, and the work presented here takes some of its roots in its English counterpart. To the best of our knowledge, there is no dataset for adversarial Question Answering other than SQuAD2.0 and FQuAD2.0, even though there exist several translated versions of SQuAD2.0 in Spanish (Carrino et al., 2019b), Swedish (Okazawa, 2021), Polish (https://bit.ly/2ZqLLgb) and Dutch (https://bit.ly/3CHFyee), among others. This makes FQuAD2.0 the first non-English native dataset for adversarial QA.

Nonetheless, numerous large-scale Question Answering datasets exist apart from FQuAD2.0 and SQuAD2.0, focusing on different Question Answering paradigms.
Some take interest in QA within conversations: CoQA (Reddy et al., 2018) highlights the difficulties of answering interconnected questions that appear in a conversation, QuAC (Choi et al., 2018) provides questions asked during an Information Seeking Dialog, and ShARC (Saeidi et al., 2018) studies real-world scenarios where follow-up questions are asked to obtain further information before answering the initial question. Others take interest in more complex forms of reasoning: HotpotQA (Yang et al., 2018) contains questions that require reasoning over multiple documents, and DROP (Dua et al., 2019) introduces questions requiring discrete reasoning types such as addition, comparison or counting. MASH-QA (Zhu et al., 2020) takes interest in questions that have multiple and non-consecutive answers within a document. BoolQ (Clark et al., 2019) highlights the surprising difficulty of answering yes/no questions, while RACE (Lai et al., 2017) studies multiple-choice Question Answering. The last paradigm we would like to mention is Open Domain Question Answering, where a system is asked to find the answer to a given question within a large corpus. This task is tackled with datasets such as Natural Questions (Kwiatkowski et al., 2019) or TriviaQA (Joshi et al., 2017).

Another line of research we find very interesting is a model's ability to learn to jointly tackle several Question Answering tasks. Micheli et al. (2021) trained RoBERTa models (Liu et al., 2019) on both BoolQ (Clark et al., 2019), a boolean Question Answering dataset, and SQuAD2.0. To enable the model to be multi-task, a slightly modified RoBERTa architecture is presented. Another way to obtain multi-task Question Answering models is to use an autoregressive model such as GPT-3 (Brown et al., 2020) or T5 (Raffel et al., 2020). When provided with a context and a question, these text-to-text models directly output an answer in the form of a sequence of tokens using their decoder. This naturally enables such approaches to generalize beyond Reading Comprehension. For example, a boolean question can simply be treated by a text-to-text model by outputting "yes" or "no". Khashabi et al. (2020) leveraged this concept even further by training a T5 model to tackle four different Question Answering formats: Extractive, Abstractive, Multiple Choice, and Boolean. Their model then performs well across 20 different QA datasets. We believe that such models for the French language will soon appear, building on French datasets such as FQuAD2.0.

3 Dataset collection & analysis

3.1 Annotation process

FQuAD2.0 is an extension of FQuAD1.1 (d'Hoffschmidt et al., 2020). This extension consists in the addition of unanswerable questions. These questions are hand-crafted in an adversarial manner in order to be difficult to distinguish from answerable ones. To achieve this goal we gave precise guidelines to the annotators:

• An adversarial question must be relevant to the context paragraph by addressing a topic also addressed in the context paragraph.

• An adversarial question should be designed in the following way: ask an answerable question on the paragraph, and apply to it a transformation such as an entity swap, a negation or something else that renders the question unanswerable.

The articles and paragraphs used in the train, development and test sets of FQuAD1.1 and FQuAD2.0 are exactly the same. An annotator is presented with a paragraph and the already existing answerable questions collected for this paragraph for FQuAD1.1. The annotator is then asked to forge at least 4 adversarial questions, while spending up to 7 minutes per paragraph. A total of 17,765 adversarial questions were collected over 3,100 paragraphs. As FQuAD contains 14,908 paragraphs in total, unanswerable questions were not annotated for every paragraph, nor for every article. In order to have reliable evaluations on the development and test sets for this new task, we chose to annotate, in proportion, a large amount of adversarial questions in these two sets. They contain in total around 42% adversarial questions, while the train set contains 16% adversarial questions. More statistics can be found in Table 1.

We used the Étiquette annotation platform (https://etiquette.illuin.tech/) developed by Illuin Technology. It has a dedicated interface to annotate the Question Answering task. Unanswerable questions can be annotated by indicating that the answer is an empty string. A screenshot of the platform is displayed in Figure 1.

Figure 1: The interface used to collect the question-answer pairs for FQuAD. During the annotation process for FQuAD2.0, an annotator can see a paragraph and the associated answerable questions that were already collected for FQuAD1.1.
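The paper does not spell out the distributed file format; the sketch below assumes that FQuAD2.0 is serialized in the SQuAD2.0-style JSON layout with an "is_impossible" flag, which is consistent with the empty-string annotation described above. The filename is hypothetical.

```python
# Hedged sketch: reading a FQuAD2.0-style file, assuming it uses the SQuAD2.0
# JSON layout with an "is_impossible" flag (the paper only specifies that
# unanswerable questions carry an empty answer). The filename is hypothetical.
import json

with open("fquad2.0_train.json", encoding="utf-8") as f:
    data = json.load(f)["data"]

answerable, unanswerable = 0, 0
for article in data:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            # An adversarial question has no answer span in its context paragraph.
            if qa.get("is_impossible", len(qa["answers"]) == 0):
                unanswerable += 1
            else:
                answerable += 1

print(f"answerable: {answerable}, unanswerable: {unanswerable}")
```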
A total of 18 French students contributed to the annotation of the dataset. They were hired in collaboration with the Junior Enterprise of CentraleSupélec (https://juniorcs.fr/en/). To limit the bias introduced by an annotator's own style of forging adversarial questions, each annotator only contributed to a given subset: train, development or test.

3.2 Statistics

To maximize the number of available questions for fine-tuning experiments, while keeping a sufficiently large set for evaluation, we decided to merge the train and test sets of FQuAD2.0 into a bigger training set, and to keep the development set intact for evaluating the obtained models. The new training set contains a total of 13,591 unanswerable questions. Main statistics for FQuAD1.1 and FQuAD2.0 are presented in Table 1.

Table 1: Dataset statistics for FQuAD1.1 and FQuAD2.0.

                              FQuAD1.1                    FQuAD2.0
                        Train     Dev     Test      Train     Dev     Test
Articles                  271      30       25        271      30       25
Paragraphs             12,123   1,387    1,398     12,123   1,387    1,398
Answerable questions   50,741   5,668    5,594     50,741   5,668    5,594
Unanswerable questions      0       0        0      9,481   4,174    4,110
Total questions        50,741   5,668    5,594     60,222   9,842    9,704

3.3 Challenges raised by adversarial questions

To understand what the different types of adversarial questions collected are, we propose a segmentation of the challenges raised by adversarial questions in FQuAD2.0. To do so, we randomly sampled 102 questions from the newly annotated questions of the FQuAD2.0 development set and manually inspected them to identify the challenges they pose. Then, we sorted these questions according to the identified categories, in order to estimate the proportion of each category within the total dataset. Table 2 presents this analysis, where 5 main categories have been identified.

Table 2: Categories of adversarial questions and their respective proportion in a FQuAD2.0 sample of 102 questions. In the original table, bold words are the plausible answers or discriminative terms within the question, and colored terms are co-references between question and context.

Antonym (21.6%) — Use of negation or antonym to make the question adversarial.
  Question: Quels mamifères ne sont pas présents ?
  Context: [...] Le parc abrite aussi de nombreux grands mammifères comme des ours noirs, des grizzlys, [...]

Entity Swap (24.5%) — A name, a number or a date has been modified so that the question becomes adversarial.
  Question: Quelle est la couleur traditionnelle de la ville de Paris ?
  Context: [...] La livrée des rames est personnalisée, associant le vert jade traditionnel de la RATP à divers visuels symboliques de la ville de Paris.

Ambiguity (17.6%) — A tiny precision or imprecision in the question makes the plausible answer in the context incorrect.
  Question: Quelle est la dernière station de la ligne ?
  Context: [...] La ligne se dirige vers l'est en position axiale jusqu'à la station Balard [...]

Out-of-context (6.9%) — While some concepts of the question are discussed in the context, at least one key concept of the question is not mentioned in the context.
  Question: Quelle était la profession de Nicolas Bachelier ?
  Context: Les projets les plus réalistes sont présentés au roi au XVIe siècle. Un premier projet est présenté par Nicolas Bachelier en 1539 aux États de Languedoc, puis un second en 1598 par Pierre Reneau, et enfin un troisième projet proposé par Bernard Arribat de Béziers en 1617 [...]

Semantic Similarity (29.4%) — All concepts of the question are mentioned in the context, while the question remains unanswered in the context.
  Question: Quel est le nom du troisième volet de la saga ?
  Context: [...] Le fait que Solo soit plongé dans la carbonite constitue en outre une alternative pour les scénaristes si Harrison Ford refuse de jouer dans le troisième volet de la saga. En effet, George Lucas n'est pas assuré que sa vedette accepte de reprendre à nouveau le rôle après son succès dans Les Aventuriers de l'arche perdue.

4 Evaluation metrics

To evaluate the predictions of a model on the FQuAD2.0 dataset, we mainly use the Exact Match (EM) and F1 score metrics, which are defined exactly as in Rajpurkar et al. (2018), with the required adaptations regarding stop words for the French language as explained in d'Hoffschmidt et al. (2020). In a nutshell, EM measures the percentage of predictions matching exactly one of the ground truth answers, while F1 computes the average overlap between the predicted tokens and the ground truth answer.

To extend these metrics to unanswerable questions, unanswerable questions are simply considered as answerable questions whose ground truth answer is an empty string.
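As an illustration only, here is a minimal sketch of these metrics in the spirit of the SQuAD2.0 evaluation script: texts are normalized, token overlap is measured, and an unanswerable question is scored against the empty string. The exact French stop-word handling used by the official FQuAD evaluation is an assumption here.

```python
# Hedged sketch of EM and F1 with empty-string handling for unanswerable
# questions. The French article list is an assumption, not the official one.
import re
from collections import Counter

FRENCH_ARTICLES = {"le", "la", "les", "l", "du", "des", "un", "une", "de", "d"}

def normalize(text: str) -> list[str]:
    text = re.sub(r"[^\w\s]", " ", text.lower())   # lowercase, strip punctuation
    return [t for t in text.split() if t not in FRENCH_ARTICLES]

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_toks, gold_toks = normalize(prediction), normalize(gold)
    if not pred_toks or not gold_toks:
        # Unanswerable questions: both sides must be empty to score 1.
        return float(pred_toks == gold_toks)
    common = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred_toks), common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

# A correctly detected unanswerable question scores 1 on both metrics.
print(exact_match("", ""), f1_score("", ""))  # 1.0 1.0
```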
One may be interested in evaluating, on the one hand, the ability of a model to extract the correct answers to answerable questions, and on the other hand its ability to determine if a question is unanswerable given the context. To do so, we introduce two other metrics:

• F1_has_ans: the average F1 score, question-wise, as defined above, but limited to answerable questions;

• NoAns_F1: the F1 score of the classification problem consisting in determining if a question is unanswerable. It is the harmonic mean of the precision (NoAns_P) and recall (NoAns_R) for this classification problem, the no-answer class being considered as the positive class.

We must emphasize that NoAns_F1 is computed as a whole on the entirety of the FQuAD2.0 development set, as a classification problem, while F1_has_ans is computed question-wise and is an average of the individual scores for each question. We also want to point out that the global F1 score is in no way the weighted average of F1_has_ans and NoAns_F1.
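A small sketch of the NoAns metrics follows, under the convention that a model predicts "no answer" by returning an empty string; this is an illustration, not the official evaluation code.

```python
# Hedged sketch of NoAns_P, NoAns_R and NoAns_F1: precision, recall and F1 of
# the binary "is this question unanswerable?" decision, with the no-answer
# class as the positive class. Inputs are per-question booleans.
def no_ans_scores(pred_no_ans: list[bool], gold_no_ans: list[bool]):
    tp = sum(p and g for p, g in zip(pred_no_ans, gold_no_ans))
    fp = sum(p and not g for p, g in zip(pred_no_ans, gold_no_ans))
    fn = sum(not p and g for p, g in zip(pred_no_ans, gold_no_ans))
    precision = tp / (tp + fp) if tp + fp else 0.0   # NoAns_P
    recall = tp / (tp + fn) if tp + fn else 0.0      # NoAns_R
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # NoAns_F1 (harmonic mean)
    return precision, recall, f1

# A model signals "no answer" whenever its predicted answer string is empty.
preds = ["", "le vert jade", "", "1539"]
golds = ["", "", "Balard", "1539"]
print(no_ans_scores([p == "" for p in preds], [g == "" for g in golds]))
```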
In order to evaluate the ability of a Question Answering model to learn when a question is unanswerable, we carried out various fine-tuning experiments using the FQuAD2.0 dataset. These experiments are split into the following sections: French Monolingual Experiments and Multilingual Experiments.

5 French monolingual experiments

The goal of these experiments is two-fold. First, we want to obtain strong baselines on the FQuAD2.0 dataset. Second, we want to analyze how the performance of the fine-tuned models evolves with respect to the quantity of unanswerable questions available at training time.

5.1 Baselines

To fulfill our first goal, we choose to fine-tune CamemBERT models, because they are the best performing models on several French NER, NLI and Question Answering benchmarks (Martin et al., 2019; d'Hoffschmidt et al., 2020). We could also have chosen FlauBERT models (Le et al., 2019), but d'Hoffschmidt et al. (2020) tend to show that, at equal size, CamemBERT models outperform FlauBERT models on the Question Answering task, hence our choice of CamemBERT models. We benchmark two different model sizes: CamemBERT-large (24 layers, 1024 hidden dimensions, 12 attention heads, 340M parameters) and CamemBERT-base (12 layers, 768 hidden dimensions, 12 attention heads, 110M parameters).

The fine-tuning procedure used is identical to the one described in Devlin et al. (2018), and an implementation can be found in HuggingFace's Transformers library (Wolf et al., 2019). All models were fine-tuned for 3 epochs, with a warmup ratio of 6%, a batch size of 16 and a learning rate of 1.5e-5. The optimizer used is AdamW with its default parameters. All experiments were carried out on a single Nvidia V100 16 GB GPU. Whenever necessary, gradient accumulation was used to train with batch sizes not fitting within the GPU memory. The results obtained on the FQuAD2.0 development set for the different metrics are presented in Table 3.

Table 3: Baseline results on the FQuAD2.0 validation set, training being performed on the expanded training set containing 13,591 unanswerable questions.

Model             Dataset     EM    F1    F1_has_ans  NoAns_F1  NoAns_P  NoAns_R
CamemBERT-base    FQuAD2.0   63.3  68.7     82.5        62.1      82.0     49.9
CamemBERT-large   FQuAD2.0   78.0  83.0     90.1        82.3      93.6     73.4

These first results allow us to draw the following conclusions:

• One can see that the best trained model, CamemBERT-large, obtains a rather high score of 82.3% for the NoAns_F1 metric, while keeping a high score of 90.1% for the F1_has_ans metric. It confirms that it is possible for a pre-trained French Language Model to learn to determine with high precision when a French question is unanswerable, while extracting the correct answer in most cases when a question is answerable.

• As observed in d'Hoffschmidt et al. (2020) or Micheli et al. (2020), Question Answering seems to be a complex task for a small (base, small) fine-tuned Language Model to solve, and hence the obtained performance is highly dependent on model size, bigger models performing much better than smaller ones. It appears that for Adversarial Question Answering this observation is even more pronounced, with CamemBERT-large scoring a 20.2% absolute improvement in the NoAns_F1 metric compared to CamemBERT-base.
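For reference, the fine-tuning recipe described in Section 5.1 can be sketched as follows with HuggingFace Transformers; the checkpoint names, the dataset variables and the preprocessing pipeline are assumptions, not the authors' released script.

```python
# Minimal sketch (not the authors' exact training script) of the fine-tuning
# configuration reported in Section 5.1, expressed with HuggingFace Transformers.
# Preprocessing is assumed to follow the standard SQuAD-style recipe: features
# with input_ids, attention_mask, start_positions, end_positions, where
# unanswerable questions point both positions at the CLS token, as in SQuAD2.0.
from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "camembert-base"  # assumed checkpoint; the large variant is analogous
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="camembert-fquad2.0",
    num_train_epochs=3,              # 3 epochs
    per_device_train_batch_size=16,  # batch size 16 (gradient accumulation if needed)
    learning_rate=1.5e-5,            # AdamW with its default parameters otherwise
    warmup_ratio=0.06,               # 6% warm-up
)

# train_dataset / eval_dataset are assumed to be the preprocessed FQuAD2.0 splits:
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset,
#                   eval_dataset=eval_dataset, tokenizer=tokenizer)
# trainer.train()
```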
5.2 Comparison with FQuAD1.1 scores

Whilst the models presented in the previous subsection clearly learned both to extract accurate answers to answerable questions and to determine when a question is unanswerable, one may also be interested in whether these models extract answers as accurate as those of similar models fine-tuned solely on FQuAD1.1, i.e., only on answerable questions. To investigate this, we present in Table 4 a comparison of our models of interest in two different set-ups: when fine-tuned solely on FQuAD1.1 and when fine-tuned on the entirety of FQuAD2.0. All evaluations are on the FQuAD1.1 dev set. By dataset construction, the F1 score on the FQuAD1.1 dev set is strictly equivalent to F1_has_ans on the FQuAD2.0 dev set. Results for fine-tuning on FQuAD1.1 are extracted from d'Hoffschmidt et al. (2020).

Table 4: Comparison of scores obtained on the FQuAD1.1 dev set for models trained on FQuAD1.1 or FQuAD2.0.

                      FQuAD1.1        FQuAD2.0
Model                 EM      F1      EM      F1
CamemBERT-base        78.1    88.1    73.1    82.5
CamemBERT-large       82.4    91.8    81.3    90.1

With the addition of unanswerable questions during fine-tuning, the model is encouraged to predict that some questions are unanswerable. And as, for every model, NoAns_P is strictly lower than 100%, there are answerable questions in the dev set for which models tend to wrongly predict that they are unanswerable. For these questions, the predicted answer is the empty string instead of the expected answer. Hence, we can expect a decrease of the F1_has_ans metric in comparison to the set-up where a model is fine-tuned solely on FQuAD1.1.

This assumption is confirmed in Table 4, with a shrinking gap as model size grows. Indeed, F1_has_ans is only 1.7 absolute points lower for CamemBERT-large trained on FQuAD2.0 compared to the same model fine-tuned solely on FQuAD1.1. For CamemBERT-base, the gap grows to 5.6 points. This gap evolution also follows the evolution of the NoAns_P metric, which is equal to 82% for CamemBERT-base and 93.5% for CamemBERT-large.

5.3 Learning curves

To get a better grasp of how many adversarial questions are needed for a model to learn to determine when a question is unanswerable, we conduct several fine-tuning experiments with an increasing number of adversarial questions used for training. For every training run, all answerable questions of the training set of FQuAD2.0 (i.e., the training set of FQuAD1.1) are kept, and unanswerable questions are progressively added to the training set with increments of 2,500 questions. We conduct such experiments for the two model architectures, CamemBERT-base and CamemBERT-large. The results are displayed in Figure 2.

Figure 2: Evolution of NoAns_F1 and F1_has_ans for CamemBERT models depending on the number of unanswerable questions in the training dataset. [Line plot; x-axis: number of adversarial samples (×10^3, from 0 to 15), y-axis: score on the FQuAD2.0 dev set (%), with one curve per metric (NoAns_F1, F1_has_ans) and per model (CamemBERT-large, CamemBERT-base).]

From these experiments, we observe the following:

• The CamemBERT-large model needs relatively few adversarial examples before achieving decent performance. Indeed, the model trained with 5k adversarial questions achieves 88% of the performance of the best model trained with 13.6k adversarial questions, which is 2.7 times more unanswerable questions.

• The slope of the CamemBERT-base learning curve is steeper than for CamemBERT-large. For example, the CamemBERT-base model trained with 5k adversarial questions achieves only 66% of the performance of the best CamemBERT-base model trained with 13.6k adversarial questions. We conclude that the value brought by additional data is more important for smaller models than for bigger ones. However, we also observe that the CamemBERT-large model trained with 2.5k adversarial questions performs on par with the CamemBERT-base model trained with 12.5k adversarial questions (5 times more data).

• Whatever the model, the learning curve has not flattened yet, which means that both architectures would benefit from more adversarial training samples. In order to do so, one would need to annotate further adversarial questions, which we leave for future work.
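The sampling protocol behind these learning curves is simple to reproduce; here is a hedged sketch, reusing the SQuAD2.0-style records assumed in Section 3 (the paper does not publish its sampling code).

```python
# Hedged sketch of the learning-curve protocol: keep every answerable training
# question and add unanswerable ones in increments of 2,500. Field names follow
# the SQuAD2.0-style layout assumed earlier; the random seed is arbitrary.
import random

def build_learning_curve_sets(qas, increment=2500, seed=0):
    """qas: list of question dicts carrying at least an 'is_impossible' boolean."""
    answerable = [q for q in qas if not q["is_impossible"]]
    unanswerable = [q for q in qas if q["is_impossible"]]
    random.Random(seed).shuffle(unanswerable)
    subsets = []
    for k in range(0, len(unanswerable) + 1, increment):
        subsets.append(answerable + unanswerable[:k])
    return subsets

# Each returned subset is then used for one fine-tuning run, as in Section 5.3.
```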
5.4 Baseline performances by question category

We presented in Section 3.3 a detailed analysis of the different challenges that FQuAD2.0 adversarial questions provide. To understand how well the trained baseline CamemBERT models perform on each of these challenges, we present in Table 6 evaluation results on each of these categories. As the evaluation is made solely on adversarial questions, the chosen metric is the recall of the NoAns task: NoAns_R.

Table 6: Baseline models' recall (NoAns_R, in %) on the NoAns task for each category of adversarial questions.

Category              CamemBERT-base   CamemBERT-large
Antonym                     36               68
Entity Swap                 36               86
Ambiguity                   50               56
Out-of-context              43               71
Semantic Similarity         37               70

6 Multilingual experiments

The previous experiments focus on fine-tuning French Language Models with the FQuAD2.0 dataset. However, one could ask the following questions:

• Could a multilingual Language Model fine-tuned solely on English Question Answering datasets compete against such a model, thanks to the existence of several large-scale English Question Answering datasets?

• Does the combination of French and English Question Answering datasets during training make a multilingual Language Model better than a monolingual Language Model on this Question Answering task?

We use FQuAD2.0 and SQuAD2.0 as the French and English reference datasets, respectively, for these experiments. We benchmark two multilingual models: XLM-RoBERTa-base and XLM-RoBERTa-large (Conneau et al., 2019), which are comparable in size to CamemBERT-base and CamemBERT-large, respectively. The experimental set-up and parameters (relative to model sizes) are identical to the ones described in Section 5. Fine-tuning experiments are summarized in Table 5.

Table 5: Results for multilingual experiments with FQuAD2.0 and SQuAD2.0. In the original table, the best scores on the FQuAD2.0 development set for both model sizes are highlighted in bold.

Model             Training dataset       Test dataset   EM    F1    F1_has_ans  NoAns_F1
XLM-R-base        SQuAD2.0               FQuAD2.0      56.0  62.4     75.9        56.2
XLM-R-base        SQuAD2.0 + FQuAD2.0    FQuAD2.0      64.4  69.6     78.4        66.4
CamemBERT-base    FQuAD2.0               FQuAD2.0      63.3  68.7     82.5        62.1
CamemBERT-base    FQuAD2.0*              FQuAD2.0*     60.5  66.1     83.5        56.4
RoBERTa-base      SQuAD2.0*              SQuAD2.0*     69.7  73.3     85.3        73.4
XLM-R-large       SQuAD2.0               FQuAD2.0      67.3  73.4     87.8        68.1
XLM-R-large       SQuAD2.0 + FQuAD2.0    FQuAD2.0      76.8  82.1     87.2        81.9
CamemBERT-large   FQuAD2.0               FQuAD2.0      78.0  83.0     90.1        82.3

One can make the following observations from these results:

• Results in the zero-shot setting are promising. We call "zero-shot setting" the FQuAD2.0 evaluations of models trained solely on SQuAD2.0, because these models were not trained with any question-answer pair in French. For example, XLM-R-large in the zero-shot setting reaches better performance on the FQuAD2.0 dataset than CamemBERT-base trained on FQuAD2.0. Nevertheless, this observation must be put into perspective by recalling that the SQuAD2.0 training set includes 43.5k adversarial questions, hence 3.2 times more than FQuAD2.0. Relying on the learning curves presented in Section 5, one can suppose that a CamemBERT-base trained with 43.5k French adversarial questions would have substantially better performance than our current best CamemBERT-base model.

• For both model sizes, the CamemBERT model reaches better performance, by a substantial margin, than the corresponding XLM-R model evaluated in the zero-shot setting. One can then conclude that, with 13.6k adversarial questions, we are beyond the point where training a French monolingual model on French question-answer pairs brings better results than using a multilingual model trained solely on English question-answer pairs.

• By combining the FQuAD2.0 and SQuAD2.0 training sets, XLM-R-base performs slightly better than CamemBERT-base trained with FQuAD2.0, while for large models CamemBERT is slightly better. We believe this is another demonstration of the ability of bigger language models to perform well in small data regimes, while smaller language models need large-scale datasets to reach their potential. Hence, in a setting with low computing resources and low training data availability, it seems more interesting to rely on multilingual models leveraging the greater availability of training data in English.

• As a matter of comparison with its English counterpart SQuAD2.0, and in order to assess the relative difficulty of each dataset, we also performed experiments on subsets of SQuAD2.0 and FQuAD2.0, respectively denoted SQuAD2.0* and FQuAD2.0*, each with a training set of 50,000 answerable questions and 10,000 unanswerable ones. We trained a RoBERTa-base model and a CamemBERT-base model with the same experimental set-ups. The scores obtained for all metrics are significantly better for the English set-up than for the French one, notably with increases of 9.2% for the EM score and 6.2% for the overall F1 score. Besides, there is a 17% increase for the NoAns_F1 score for the English set-up. These results clearly indicate that the overall task for FQuAD2.0 is much harder than for SQuAD2.0 in terms of both the difficulty and the ambiguity of questions. In particular, the results for NoAns_F1 indicate that it is much harder for the French model to detect whether a question is answerable.
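To make the zero-shot setting concrete, the sketch below queries an XLM-R model fine-tuned on SQuAD2.0 only with a French question-context pair; the checkpoint path is a placeholder, not the authors' model.

```python
# Hedged sketch of zero-shot inference: an extractive QA model fine-tuned on
# SQuAD2.0 only is applied to French data. Replace the placeholder path with
# an actual SQuAD2.0-fine-tuned XLM-R checkpoint.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="path/to/xlm-roberta-large-finetuned-squad2",  # placeholder checkpoint
)

context = ("La livrée des rames est personnalisée, associant le vert jade "
           "traditionnel de la RATP à divers visuels symboliques de la ville de Paris.")
question = "Quelle est la couleur traditionnelle de la ville de Paris ?"

# handle_impossible_answer lets the pipeline return an empty answer when the
# no-answer score beats every candidate span, as required for FQuAD2.0.
print(qa(question=question, context=context, handle_impossible_answer=True))
```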
7 Conclusion & future work

In this paper, we introduced FQuAD2.0, a QA dataset with both answerable questions (coming from FQuAD1.1) and 17,000+ newly annotated unanswerable questions, for a total of almost 80,000 questions. To the best of our knowledge, this is the first French (and, perhaps most importantly, non-English) adversarial Question Answering dataset.

We trained various baseline models using CamemBERT architectures. Our best model, a fine-tuned CamemBERT-large, reaches an 83% F1 score and an 82.3% NoAns_F1 score, the latter measuring its ability to distinguish answerable questions from unanswerable ones. The study of learning curves with respect to the number of samples used for training such models shows that our baseline models would benefit from additional unanswerable questions. In the future, we plan to collect additional samples to expand FQuAD2.0. For comparison, its English cousin SQuAD2.0 (Rajpurkar et al., 2018) contains 53,775 unanswerable questions. Such a large-scale dataset would of course enable the acquisition of even better models than the ones presented in Sections 5 and 6. As far as data collection is concerned, we could also collect additional answers for each unanswerable question. By following the same procedure as in d'Hoffschmidt et al. (2020), this would allow for the computation of human performance, measuring the inherent difficulty of the challenge provided by FQuAD2.0.

On top of the monolingual French experiments, we conducted various multilingual experiments demonstrating the relevance of multilingual models fine-tuned on English resources for use in other languages when very few or no resources are available in the target language. Nevertheless, we also showed the superiority of a monolingual approach on the target language using a dataset such as FQuAD2.0 (as this was in our case both economically and practically feasible). Besides, we also exhibited that FQuAD2.0 is probably a harder adversarial Question Answering dataset than its English counterpart SQuAD2.0, as similar models perform better in the same conditions in English than in French.

Although the performances that we obtained on the FQuAD2.0 dataset are very good, the evaluation of the resulting models on other datasets is of crucial importance. In real-life industrial use cases, the contexts and the questions asked differ from those present in FQuAD2.0: how can we perform efficient domain transfer on such datasets? This will be further evaluated in future iterations of this work.

Another topic of interest in the context of industrial use cases is the inference time of such large models. The best model we obtained is a CamemBERT-large, but in some real-life applications where GPUs are unavailable, or when we must handle a large number of requests in a short amount of time, we cannot afford the inference times that come with such large models. The use of models smaller than CamemBERT-large or CamemBERT-base naturally comes to mind, such as LePetit (Micheli et al., 2020), also denoted CamemBERT-small. To overcome this limitation, we could also use model compression techniques such as pruning (McCarley et al., 2019; Sanh et al., 2020), distillation (Hinton et al., 2015; Jiao et al., 2019; Sun et al., 2020) or quantization (Kim et al., 2021; Shen et al., 2019). We performed preliminary tests that seem very promising; they should be investigated further in the future.
Acknowledgments

We are very grateful to Robert Vesoul, CEO of Illuin Technology and Co-Director of CentraleSupélec's Digital Innovation Chair, for enabling and funding this project while seeing it through.

We warmly thank Martin d'Hoffschmidt, who launched this initiative within Illuin Technology and contributed greatly to the first steps of this journey. Our thanks also go to Inès Multrier for her contribution and insights on the multilingual experiments.

We would also like to thank Enguerran Henniart, Lead Product Manager of Étiquette, for his assistance and technical support during the annotation campaign.

Finally, we extend our thanks to the whole Illuin Technology team for their innovative mindsets and incentives, as well as their constructive feedback.

References

Akari Asai, Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2018. Multilingual extractive reading comprehension by runtime machine translation. CoRR, abs/1809.03275.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2016. Neural machine translation by jointly learning to align and translate.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Casimiro Pio Carrino, Marta Ruiz Costa-jussà, and José A. R. Fonollosa. 2019a. Automatic Spanish translation of the SQuAD dataset for multilingual question answering. ArXiv, abs/1912.05200.

Casimiro Pio Carrino, Marta R. Costa-jussà, and José A. R. Fonollosa. 2019b. Automatic Spanish translation of the SQuAD dataset for multilingual question answering.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. CoRR, abs/1808.07036.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924-2936, Minneapolis, Minnesota. Association for Computational Linguistics.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Martin d'Hoffschmidt, Wacim Belblidia, Quentin Heinrich, Tom Brendlé, and Maxime Vidal. 2020. FQuAD: French question answering dataset. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1193-1208, Online. Association for Computational Linguistics.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs.

Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601-1611, Vancouver, Canada. Association for Computational Linguistics.

Ali Kabbadj. 2018. Something new in French text mining and information extraction (universal chatbot): Largest Q&A French training dataset (110 000+).

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UNIFIEDQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1896-1907, Online. Association for Computational Linguistics.

Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. 2021. I-BERT: Integer-only BERT quantization.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453-466.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard H. Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. CoRR, abs/1704.04683.

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, and Didier Schwab. 2019. FlauBERT: Unsupervised language model pre-training for French.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. 2019. CamemBERT: a tasty French language model. arXiv e-prints, arXiv:1911.03894.

J. S. McCarley, Rishav Chakravarti, and Avirup Sil. 2019. Structured pruning of a BERT-based question answering model.

Vincent Micheli, Martin d'Hoffschmidt, and François Fleuret. 2020. On the importance of pre-training data volume for compact language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7853-7858, Online. Association for Computational Linguistics.

Vincent Micheli, Quentin Heinrich, François Fleuret, and Wacim Belblidia. 2021. Structural analysis of an all-purpose question answering model.

Susumu Okazawa. 2021. Swedish translation of SQuAD2.0. https://github.com/susumu2357/SQuAD_v2_sv.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? CoRR, abs/1906.01502.

Keraron Rachel, Lancrenon Guillaume, Bras Mathilde, Allary Frédéric, Moyse Gilles, Scialom Thomas, Soriano-Morales Edmundo-Pavel, and Jacopo Staiano. 2020. Project PIAF: Building a native French question-answering dataset. In Proceedings of the 12th Conference on Language Resources and Evaluation. The International Conference on Language Resources and Evaluation.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. CoRR, abs/1806.03822.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392, Austin, Texas. Association for Computational Linguistics.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. CoQA: A conversational question answering challenge. CoRR, abs/1808.07042.

Marzieh Saeidi, Max Bartolo, Patrick Lewis, Sameer Singh, Tim Rocktäschel, Mike Sheldon, Guillaume Bouchard, and Sebastian Riedel. 2018. Interpretation of natural language rules in conversational machine reading.

Victor Sanh, Thomas Wolf, and Alexander M. Rush. 2020. Movement pruning: Adaptive sparsity by fine-tuning.

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2019. Q-BERT: Hessian based ultra low precision quantization of BERT. CoRR, abs/1909.05840.

Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 2020. MobileBERT: A compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2158-2170, Online. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ming Zhu, Aman Ahuja, Da-Cheng Juan, Wei Wei, and Chandan K. Reddy. 2020. Question answering with long multiple-span answers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3840-3849, Online. Association for Computational Linguistics.