Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
Sebastian Gehrmann, Elizabeth Clark, Thibault Sellam
Google Research, New York, NY
{gehrmann, eaclark, tsellam}@google.com

Abstract

Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted. This issue has become more urgent, since neural NLG models have improved to the point where they can often no longer be distinguished based on the surface-level features that older metrics rely on. This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years. We summarize, categorize, and discuss how researchers have been addressing these issues and what their findings mean for the current state of model evaluations. Building on those insights, we lay out a long-term vision for NLG evaluation and propose concrete steps for researchers to improve their evaluation processes. Finally, we analyze 66 NLG papers from recent NLP conferences in how well they already follow these suggestions and identify which areas require more drastic changes to the status quo.

1 Introduction

There are many issues with the evaluation of models that generate natural language. For example, datasets are often constructed in a way that prevents measuring tail effects of robustness, and they almost exclusively cover English. Most automated metrics measure only similarity between model output and references instead of fine-grained quality aspects (and even that poorly). Human evaluations have a high variance and, due to insufficient documentation, rarely produce replicable results.

These issues have become more urgent as the nature of models that generate language has changed without significant changes to how they are being evaluated. While evaluation methods can capture surface-level improvements in text generated by state-of-the-art models (such as increased fluency) to some extent, they are ill-suited to detect issues with the content of model outputs, for example if they are not attributable to input information. These ineffective evaluations lead to overestimates of model capabilities. Deeper analyses uncover that popular models fail even at simple tasks by taking shortcuts, overfitting, hallucinating, and not being in accordance with their communicative goals.
Identifying these shortcomings, many recent papers critique evaluation techniques or propose new ones. But almost none of the suggestions are followed or new techniques used. There is an incentive mismatch between conducting high-quality evaluations and publishing new models or modeling techniques. While general-purpose evaluation techniques could lower the barrier of entry for incorporating evaluation advances into model development, their development requires resources that are hard to come by, including model outputs on validation and test sets or large quantities of human assessments of such outputs. Moreover, some issues, like the refinement of datasets, require iterative processes where many researchers collaborate. All this leads to a circular dependency where evaluations of generation models can be improved only if generation models use better evaluations.

We find that there is a systemic difference between selecting the best model and characterizing how good this model really is. Current evaluation techniques focus on the first, while the second is required to detect crucial issues. More emphasis needs to be put on measuring and reporting model limitations, rather than focusing on producing the highest performance numbers. To that end, this paper surveys analyses and critiques of evaluation approaches (sections 3 and 4) and of commonly used NLG datasets (section 5). Drawing on their insights, we describe how researchers developing modeling techniques can help to improve and subsequently benefit from better evaluations with methods available today (section 6). Expanding on existing work on model documentation and formal evaluation processes (Mitchell et al., 2019; Ribeiro et al., 2020), we propose releasing evaluation reports which focus on demonstrating NLG model shortcomings using evaluation suites. These reports should apply a complementary set of automatic metrics, include rigorous human evaluations, and be accompanied by data releases that allow for re-analysis with improved metrics.
[Figure 1 (diagram; component labels: Train & Test Data, Model Hyperparameters, Automatic Metrics, Human Judgements, Qualitative Analysis, Quantitative Analysis, Evaluation, All Others, Model Developer). Caption: Even though the evaluation pipeline of a model is complex, with many steps and potential missteps that get "funneled" into the final results, it is often seen as a black box with the purpose of generating numbers that demonstrate superiority over competing approaches. We argue that more attention should be paid to the evaluation process and that the reporting of evaluation results should focus on the characteristics and limitations of a model.]

In an analysis of 66 recent EMNLP, INLG, and ACL papers along 29 dimensions related to our suggestions (section 7), we find that the first steps toward an improved evaluation are already frequently taken at an average rate of 27%. The analysis uncovers the dimensions that require more drastic changes in the NLG community. For example, 84% of papers already report results on multiple datasets and more than 28% point out issues in them, but we found only a single paper that contributed to the dataset documentation, leaving future researchers to re-identify those issues. We further highlight typical unsupported claims and a need for more consistent data release practices. Following the suggestions and results, we discuss how incorporating the suggestions can improve evaluation research, how the suggestions differ from similar ones made for NLU, and how better metrics can benefit model development itself (section 8).

2 Background

While "natural language generation" used to have a very narrow scope,1 today it is used broadly to refer to the production of natural language in any context, and NLG tasks include summarization, machine translation, paraphrasing, and story generation. For the purpose of this survey, we follow this broader definition, but focus on conditional generation tasks. We define conditional NLG tasks as those in which a machine learning model can be trained to maximize a conditional probability p(y|x), where y is natural language and x is an input that can be structured data or natural language and which provides information about what should be generated.2 The evaluation of conditionally generated text typically involves a comparison to the input and/or a reference text, neither of which is available in an unconditional generation setting. The scope of this survey thus includes tasks such as machine translation, summarization, and data-to-text generation, but excludes language modeling.

1 Reiter and Dale (1997) define NLG as the process of producing text from structured data and thus, text-to-text or unconditional generation tasks would not count as NLG.
2 We omit multimodal tasks like image captioning or speech-to-text, as well as those with non-textual output like sign language or audio, from the scope of this survey since those tasks require vastly different evaluation processes.
In addition, we require in-scope NLG tasks to metrics, how these metrics are typically evaluated, have an explicit communicative goal, which needs what issues are being found, and how newly intro- to be expressed while also planning the content and duced metrics may overcome these issues in the structure of the text and actualizing it in fluent and future. Since not all evaluation strategies are be- error-free language (Gehrmann, 2020).3 All these ing applied to all metrics and not all metrics are aspects need to be captured in the NLG evaluation, applied to all possible generation tasks, we can making it much more challenging than evaluating only provide an incomplete insight into the met- other NLP tasks. For an introduction to NLG be- ric×task×evaluation method space. Since there yond this survey, we point readers to the overview currently exists no “perfect” metric, we will not by Gatt and Krahmer (2018) for a deeper discussion conclude with explicit metric recommendations of NLG tasks, and to the survey by Celikyilmaz but rather try and extract successful metric design et al. (2020) of the evaluation approaches and sta- principles alongside a family of evaluations that tistical methods that are discussed in Sections 3-4. together may provide a more complete characteri- Evaluation approaches for generated text have zation of a model’s performance. traditionally been categorized as intrinsic or ex- trinsic (Jones and Galliers, 1995). Intrinsic ap- 3.1 The Status Quo proaches evaluate a text by itself, whereas extrinsic approaches measure how it affects people perform- Almost all commonly used generation metrics are ing a given task. Intrinsic evaluations include as- reference-based: a system output o is compared sessments by human ratings and by automatic met- to one or multiple human-produced references, rics which have gained popularity with the advent {r1 , . . . , rn }. System outputs that are more similar of statistical NLG (Langkilde and Knight, 1998), to the references are deemed better. However, there which led to the standardization of tasks. While have been many strategies to measure the similarity. some work exists that aims to standardize extrin- The most popular evaluation metrics, BLEU (Pap- sic evaluations (e.g., Mani et al., 1999; Gehrmann ineni et al., 2002) and ROUGE (Lin, 2004), along et al., 2019a), the design space is much larger. As many others, measure the lexical overlap between a result, intrinsic approaches dominate academic o and r in terms of precision and recall of n-grams. publications; Gkatzia and Mahamood (2015) found Variants and parameters control tokenization, stem- that about 75% of published NLG systems rely ming, or balancing of precision and recall. With the on intrinsic evaluations with the fraction increas- advent of deep learning, metrics were introduced ing.4 Since we survey widely used approaches, we that measure the distributional similarity instead mostly cover intrinsic evaluations, but stress the that rely on various ways to measure the distance importance of task-specific extrinsic evaluations. between two distributed token and sequence rep- As pointed out by Reiter and Belz (2009a), the resentations. 
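As a concrete illustration of the overlap-based design described above, the following minimal Python sketch computes n-gram precision, recall, and F1 between a system output and a single reference. It is a simplified, ROUGE-N-style calculation, not the official BLEU or ROUGE implementation: real packages add tokenization options, stemming, brevity penalties, and multi-reference handling, and the function names here are ours.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_overlap(output: str, reference: str, n: int = 2):
    """Simplified ROUGE-N-style precision/recall/F1 via clipped n-gram overlap.
    Whitespace tokenization only; no stemming, stopwords, or multi-reference support."""
    out_counts = ngrams(output.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    overlap = sum((out_counts & ref_counts).values())  # each n-gram matched at most once
    precision = overlap / max(sum(out_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

print(ngram_overlap("the cat sat on the mat", "a cat was sitting on the mat", n=2))
```

Even this toy version exposes the parameters that matter: the tokenizer, the n-gram order, and whether precision, recall, or F1 is reported all change the score, which foreshadows the reproducibility issues discussed in section 3.2.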
Notable examples from this class of evaluation meta-evaluations we draw on are most metrics are the word mover distance (Kusner et al., commonly conducted on summarization and ma- 2015), which relies on non-contextual word em- chine translation (MT), but that there is an implicit beddings, and BERT-S CORE (Zhang et al., 2020), assumption that findings translate to other tasks. To which aggregates cosine distances between repre- avoid this issue, we note the task for each study, sented tokens in a sequence, among others (Zhao but, due to a lack of prior findings, are not able to et al., 2019; Clark et al., 2019; Kane et al., 2020; cover every NLG task. Taking a cautious approach, Colombo et al., 2021, inter alia). A related class we make the worst-case assumption that modes of of automatic evaluation are statistical approaches, failure likely transfer across tasks. which focus on the distributions, rather than rep- resentations, produced by a model. Saggion et al. 3 Challenges of Automatic Evaluation (2010) first demonstrated that distributional differ- ences between references and model-outputs can In this section, we provide an overview of common be used as a scoring mechanism. Gehrmann et al. design principles of (intrinsic) automatic evaluation (2019b) showed that these differences exist even 3 This requirement excludes most question-answering tasks for large pretrained models, a fact that was used since they require generating spans or otherwise non-fluent by Zellers et al. (2019) to train a classifier that de- sequences of text. 4 Informally surveying recent *CL papers suggests a num- tects generated text. Hashimoto et al. (2019) used ber of 90% or higher. the same foundation to combine human and auto-
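To make the distributional-similarity idea concrete, the sketch below re-implements the greedy token-matching at the core of BERT-SCORE-style metrics: contextual token embeddings are compared with cosine similarity and each reference token is matched to its closest candidate token. This is a schematic illustration only, not the official bert_score package (which adds importance weighting, baseline rescaling, and precision/F1 variants); the choice of the "bert-base-uncased" encoder is an assumption made for the example.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any BERT-style encoder works for this sketch; "bert-base-uncased" is illustrative.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embeddings(text: str) -> torch.Tensor:
    """Contextual embeddings for each token, with [CLS] and [SEP] stripped."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    return hidden[1:-1]

def greedy_similarity(candidate: str, reference: str) -> float:
    """Recall-style score: each reference token is greedily matched to its most
    similar candidate token and the cosine similarities are averaged."""
    cand = torch.nn.functional.normalize(token_embeddings(candidate), dim=-1)
    ref = torch.nn.functional.normalize(token_embeddings(reference), dim=-1)
    sim = ref @ cand.T  # (ref_len, cand_len) cosine similarities
    return sim.max(dim=1).values.mean().item()

print(greedy_similarity("the cat sat on the mat", "a cat was sitting on the mat"))
```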
matic evaluation in capturing the trade-off between generate questions (Chen et al., 2018; Wang et al., sampling diverse outputs and achieving the high- 2020; Durmus et al., 2020; Scialom et al., 2021; Re- est possible quality. Pillutla et al. (2021) expand buffel et al., 2021; Honovich et al., 2021; Deutsch on these insights and a framework by Djolonga et al., 2021a, inter alia). et al. (2020) to compare the human- and model- This overview already points to the first issue distributions by measuring the extent to which they with the state of metrics research: the metrics listed diverge. An alternative approach by Thompson and above, except those targeting machine translation, Post (2020) uses the probabilities of each model- are designed to work only on English. A notable ex- generated token under a paraphrasing model that ception is a study by Briakou et al. (2021) which as- uses the human reference as input. sesses different learned metrics for formality trans- Utilizing existing corpora of human quality judg- fer and uses multilingual pre-trained models such ments of generated text, learned metrics are clas- as XLM-R (Conneau et al., 2020). While auto- sifiers that emulate these judgments. Some metrics matic metrics are well-studied, the barrier of entry move beyond reference-based evaluation and in- to developing non-English models is growing. stead provide quality estimation scores between an input i and output o. The first metric of this kind 3.2 Similarity to References is a Red Herring was CLASSY, a logistic regression model for sum- marization evaluation (Rankel et al., 2012). Newer Many automatic metrics rely on the assumption metrics rely on pretrained models, are trained on that NLG systems outputs that are more similar to more human ratings, and introduce initialization the reference(s) are better, a property commonly and and pretraining schemes (Sellam et al., 2020; referred to as “human-likeness” in the NLG liter- Rei et al., 2020; Pu et al., 2021; Wegmann and ature (see, e.g., Belz and Gatt (2008)). While the Nguyen, 2021, inter alia), or focus on specific as- ability to reproduce a reference text sounds like pects like the faithfulness of generated text (e.g., natural evidence of success, relying entirely on Kryscinski et al., 2020; Aralikatte et al., 2021). it for evaluation is misleading—a caveat pointed Many of these metrics rely on artificially intro- out by many evaluation researchers. For instance, duced errors, but Cao et al. (2020) find that moving Belz and Gatt (2008) investigate the correlation from artificial to real error detection is challenging, between lexical overlap metrics (such as BLEU an issue that Zeng et al. (2021) aim to address by and ROUGE) and various measures of success in using adversarial examples instead. a Referring Expression Generation context. They The metrics mentioned so far operate on text find that “a system’s ability to produce human-like directly, but there has also been a long history outputs may be completely unrelated to its effect of metrics that generate and use intermedi- on human task-performance.” ate structures. These include accuracy of parse One reason for this discrepancy is that similarity- trees (Bangalore et al., 2000), overlap between based evaluations reward surface similarity at the “basic elements” (Hovy et al., 2005),5 automati- expense of meaning and may be “fooled” by cally constructed content units (Tauchmann and similar-looking, yet semantically different, out- Mieskes, 2020) using the Pyramid framework puts. 
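The question-answering metrics referenced in this section share a common recipe: ask the same questions of the source (or reference) and of the generated text, then compare the answers. The sketch below illustrates that recipe with an off-the-shelf extractive QA model; it is not a re-implementation of any specific published metric. The questions are hand-written here, whereas real QA metrics generate them automatically, and the "distilbert-base-cased-distilled-squad" checkpoint is an illustrative assumption.

```python
from collections import Counter
from transformers import pipeline

# Any extractive QA model can be used; this SQuAD-tuned checkpoint is illustrative.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def token_f1(a: str, b: str) -> float:
    """Token-level F1 between two short answer strings (as in SQuAD evaluation)."""
    a_tok, b_tok = a.lower().split(), b.lower().split()
    common = sum((Counter(a_tok) & Counter(b_tok)).values())
    if common == 0:
        return 0.0
    p, r = common / len(a_tok), common / len(b_tok)
    return 2 * p * r / (p + r)

def qa_consistency(source: str, summary: str, questions) -> float:
    """Answer each question against the source and the summary and compare answers."""
    scores = []
    for q in questions:
        ans_src = qa(question=q, context=source)["answer"]
        ans_sum = qa(question=q, context=summary)["answer"]
        scores.append(token_f1(ans_src, ans_sum))
    return sum(scores) / len(scores)

source = "Maria Lopez, the former mayor of Springfield, opened a new library on Monday."
summary = "Former Springfield mayor Maria Lopez opened a library."
print(qa_consistency(source, summary, ["Who opened the library?", "What did she open?"]))
```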
NLG tasks have an extensive output space by Nenkova and Passonneau (2004), dependency which cannot be captured through a limited num- parses (Pratapa et al., 2021), or sequence align- ber of references and, a comparison to references ment (Deng et al., 2021). A special case of in- becomes less reliable the more “open-ended” a termediate structures that recently gained pop- task is. For that reason, ROUGE underperforms ularity are question-answering metrics that as- on non-extractive summaries (Dorr et al., 2005). sess information-equivalence. Similar to the faith- The problem is especially poignant when the ref- fulness classifiers above, these aim to measure erences themselves are flawed. As Dhingra et al. whether generated text contains the same informa- (2019) show, using BLEU and ROUGE is prob- tion as a source or reference. Instantiations of these lematic with many table-to-text datasets, because metrics may blank out entities (Eyal et al., 2019; there is a mismatch between the information con- Xie et al., 2021; Scialom et al., 2019), or fully veyed by the reference texts and that of the input 5 table. As a result, model outputs that contain sim- ROUGE is a special case of this where basic elements are fixed size n-grams, but other basic element metrics like ilar unsupported information are rewarded by the PARENT (Dhingra et al., 2019) only focus on content words. metric. Similarly, Freitag et al. (2020) show that
BLEU, METEOR, and BERTS CORE may fail to corpus (CNNDM, Hermann et al., 2015; Nallapati reward good translations when the reference text et al., 2016) along two measures of content quality contains artifacts such as “translationese”. (relevance of the content, and faithfulness) and two One may wonder whether the problem still ex- of linguistic quality (on the sentence- and summary- ists with learnt or embedding-based metrics, since level) using raters from Mechanical Turk. Consis- a more flexible notion of similarity should enable tent with previous findings, they find that ROUGE metrics to be less reliant on surface-level features or does not significantly correlate with either of them. text artifacts in references. However, this argument Extending the annotations by three expert judg- assumes that the set of reference appropriately cov- ments per data point and extending the analysis to ers the target domain, and that the metric is flexible more metrics, Fabbri et al. (2021) find similarly enough to “generalize” from an incomplete set of low correlations without significant performance examples. The current empirical evidence for this improvements of distributional over lexical sim- is negative —in section 3.4 we will present several ilarity metrics. Comparing correlations of these studies that show that even current metrics break metrics across shared tasks from the Text Analysis down with simple adversarial examples (Sai et al., Conferences (TAC) and CNN/DM and using a dif- 2021; Kaster et al., 2021). ferent annotation scheme, Bhandari et al. (2020b) corroborate the very low segment-level correla- How to Interpret Similarity-Based Metrics? tions and also find that that no distributional metric If similarity to the reference is a flawed proxy for outperforms ROUGE. Reanalyzing the data and quality, what do automatic metrics tell us? This addressing issues in the statistical tests, Deutsch question can be investigated empirically by mea- et al. (2021b) come to the same conclusion about suring the correlation between metric scores and ROUGE, but note the insights should be care- human annotations. In a survey of such studies by fully assessed since the data selection strategy for Reiter (2018) focused on BLEU, he concludes that annotations, coupled with large confidence inter- it is useful as a diagnostic tool during the develop- vals, can lead to false results. Beyond summariza- ment of MT systems, but not for other tasks and that tion, Novikova et al. (2017a) note similarly poor is should not be used at the segment level. More segment-level correlations for data-to-text datasets. recently, Kocmi et al. (2021) assess how well auto- All this shows that it is unclear what the results matic metrics compute pairwise rankings for MT of embedding-based and lexical metrics represent, systems, and recommend using a combination of and it is questionable whether the numbers they overlap-based and pretraining-based metrics, con- produce can be trusted outside a few cases such as firming the previous findings that metrics may be MT systems ranking. To better understand their used to rank MT models at the system-level. limitations and opportunities, we need large-scale Several authors have tried to introduce finer- corpora of high-quality human annotations, which grained quality criteria, and attempted to under- do not yet exist for most NLG tasks. stand which quality dimensions are captured by automatic metrics that measure the similarity to The Myth of the Single Reliable Number If references. 
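The meta-evaluation studies discussed here mostly reduce to computing correlations between metric scores and human ratings at two granularities: per segment and per system. A minimal sketch of both computations follows; the data is randomly generated for illustration, and the choice of coefficients follows common WMT practice (Kendall's tau at the segment level, Pearson or Spearman over system averages).

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

# Toy data: metric scores and human ratings for 5 systems on the same 20 segments.
rng = np.random.default_rng(0)
metric = rng.random((5, 20))                          # metric[i, j]: system i, segment j
human = metric + rng.normal(0.0, 0.3, size=(5, 20))   # noisy "human" ratings for illustration

# Segment-level: pool all (metric, human) pairs across systems and segments.
seg_tau, _ = kendalltau(metric.ravel(), human.ravel())

# System-level: correlate per-system averages, as in WMT-style system rankings.
sys_pearson, _ = pearsonr(metric.mean(axis=1), human.mean(axis=1))
sys_spearman, _ = spearmanr(metric.mean(axis=1), human.mean(axis=1))

print(f"segment-level Kendall tau: {seg_tau:.2f}")
print(f"system-level Pearson r: {sys_pearson:.2f}, Spearman rho: {sys_spearman:.2f}")
```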
In most cases, there is inconclusive human-likeness should not be used as proxy mea- evidence. For instance, Reiter and Belz (2009b) sure for quality of generated text, what should be find that these metrics may approximate language used instead? Analyzing DUC 2004 data (Over and quality, although with only weak evidence, and Yen, 2004), where human raters annotated the lan- that they do not measure content quality at all. In guage quality and the coverage of a summary, i.e., contrast, Stent et al. (2005) evaluate metrics on re- how well it covered the meaning of the source, Gra- structured sentences, showing that lexical-overlap ham (2015) found that there was almost no correla- based metrics do measure similarity in meaning, tion between the two measures. However, language but fail at measuring syntactic correctness. The in- quality was a precondition for achieving high cover- consistency between studies and use cases suggests age, leading to a complex relationship between the that overlap-based metrics likely measure neither, two. The lack of correlation between language and which is confirmed by later studies. content quality was also noted by Pitler et al. (2010) In a more recent study, Kryscinski et al. (2019) 5- who find correlations between some evaluation cat- way annotated system outputs on 100 samples from egories. These insights, combined with the lack the test set of the CNN-Dailymail summarization of strong correlations, suggests that a single num-
ber, as produced by almost all automatic metrics, shows that evaluating factual truth is (perhaps un- cannot fully characterize an NLG system. Similar surprisingly) a complex, ill-defined, and unsolved points are made by Deutsch and Roth (2021) who task. Additionally complicating this problem is that show that many similarity metrics capture the over- artificially introduced errors rarely match errors of lap in topics between two summaries much better real summarization models, which means that met- than the overlap in their information. rics trained on synthetic errors may not generalize to real systems (Goyal and Durrett, 2021). Faithfulness is Not Single Dimensional Either Researchers have studied the validity of faithful- An aspect of quality mentioned above and which ness metrics for other NLG tasks as well. For table- permeates all of NLG is faithfulness, and much to-text, Thomson and Reiter (2020) report the per- recent work has focused on this aspect for abstrac- formance of an information extraction-based met- tive summarization. Maynez et al. (2020) state that ric (Wiseman et al., 2017) given different types of a model is not faithful if it hallucinates, that is, it errors, and highlights typically problematic cases adds information that is not present in the source such as errors with names and numbers which are document. They define multiple categories of hal- not detected by the metric. Taking all these points lucinations: Intrinsic hallucinations misrepresent into consideration, we conclude that there is no con- facts in the input, for example turning a “former sensus on how best decompose and measure faith- London mayoral candidate” into a “former London fulness and that even the best current approaches mayor”. Extrinsic hallucinations ignore the input are typically flawed. However, we can also see a altogether, for example generating “President Sara” clear benefit to measuring specific aspects of output in the example above. Not all hallucinations are quality and thus encourage metric designers to stop problematic—an extrinsic hallucination can be fac- treating output quality and in particular faithfulness tual, and may, in fact, be desirable depending on like a one-dimensional problem. the use case. For system evaluation, it is therefore important to be able to discern between hallucina- Parameter Choices and Reproducibility De- tions of different types, which cannot be done by spite these findings, most publications still use only producing a single number. a single metric to demonstrate improvements over Maynez et al. demonstrate that similarity met- prior systems. For example, 100% of papers in- rics fail to measure faithfulness. The same failure troducing new summarization models at *CL con- is observed by Pagnoni et al. (2021) who introduce ferences in 2021 use ROUGE and 69% use only and collect annotations for an alternative typology ROUGE. It thus warrants a deeper look into how of factual errors which involves fine-grained cate- ROUGE and other metrics are used. gories such as Coreference Error and Out of Arti- The most commonly reported ROUGE con- cle Error. In an alternative approach to measuring figurations are the F1 scores of ROUGE-1, -2, correlations with human judgments, Gabriel et al. and -L. This choice was initially popularized by (2021) inject factual errors in reference summaries, Rush et al. 
(2015), who picked a subset of the op- and checks whether system rankings produced by tions used in DUC 2004 which also included 3, metrics correlate with the “level of factuality” of 4, and LW (Over and Yen, 2004). However, this the transformed sentences, among other proper- choice was not empirically motivated, and from ties like a metric’s value range and generalization. DUC 2005 onwards, the recall scores of ROUGE- They also identify that standard evaluation met- 2 and ROUGE-SU4 were even used instead (Dang, rics (e.g., ROUGE-L and ROUGE-1) oftentimes 2006).6 On top of the disconnect between the past fail at capturing factuality, but identify question- and present choices, both of them are actually sub- answering metrics as promising, somewhat contra- optimal. Rankel et al. (2013) find that rarely used dicting Maynez et al.. Similarly, Chen et al. (2021) configurations of ROUGE are outperforming com- analyze mispredictions on a set of previously an- monly used one, and in an investigation of all 192 notated summarization corpora (Kryscinski et al., ROUGE configurations, Graham (2015) find that 2020; Wang et al., 2020; Falke et al., 2019; Maynez none of them outperformed BLEU and that best et al., 2020). The study identifies common error performance was achieved with the precision vari- types (e.g., “Numerical inference”) and constructs 6 Note though that DUC 2005 evaluated query-focused an adversarial test set with rule-based transforma- summarization instead of sentence compression which was tions. The diversity of approaches in the literature the task studied by Rush et al. (2015).
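The error-injection studies above (e.g., Gabriel et al., 2021; Chen et al., 2021) rely on rule-based corruptions of reference texts. The sketch below shows what such a corruption test can look like for an arbitrary higher-is-better metric; the two transformation rules are simplified illustrations, not the exact rules from the cited papers, and, as Goyal and Durrett (2021) caution, synthetic errors like these may not match the errors real systems make.

```python
import random
import re

def swap_number(text: str, rng: random.Random) -> str:
    """Introduce a 'numerical' error by changing one number in the text."""
    numbers = re.findall(r"\d+", text)
    if not numbers:
        return text
    target = rng.choice(numbers)
    return text.replace(target, str(int(target) + rng.randint(1, 9)), 1)

def swap_entity(text: str, entities, rng: random.Random) -> str:
    """Introduce an 'entity' error by replacing one known entity with another."""
    present = [e for e in entities if e in text]
    if not present or len(entities) < 2:
        return text
    old = rng.choice(present)
    new = rng.choice([e for e in entities if e != old])
    return text.replace(old, new, 1)

def corruption_test(metric, source: str, reference: str, entities, seed: int = 0):
    """Check whether a higher-is-better metric(source, hypothesis) ranks the intact
    reference above rule-corrupted variants of it."""
    rng = random.Random(seed)
    clean = metric(source, reference)
    results = []
    for corrupt in (swap_number(reference, rng), swap_entity(reference, entities, rng)):
        score = metric(source, corrupt)
        results.append((corrupt, score, score < clean))
    return results

# Trivial word-overlap "metric" standing in for a real faithfulness metric.
overlap = lambda src, hyp: len(set(src.split()) & set(hyp.split())) / len(set(hyp.split()))
print(corruption_test(overlap,
                      source="The plant produced 120 cars in Leipzig last year.",
                      reference="The plant produced 120 cars in Leipzig.",
                      entities=["Leipzig", "Dresden"]))
```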
ant of ROUGE-2. The studies by Kryscinski et al. incorrect comparisons or inflated scores. (2019) and Fabbri et al. (2021) evaluate the F1- variants of multiple ROUGE versions and confirm 3.3 Do Benchmarks Help? the suboptimal setting. They find that ROUGE-1, -2, and -L perform strictly worse than ROUGE-3, To develop reliable metrics, it may be helpful to -4, and -WE-1 across multiple rating dimensions. develop benchmarks to collect large-scale anno- tated evaluation data, which may then be used to Beyond using a suboptimal setup, additional train better metrics. This has been the approach parameters are often unclear; the most popular in MT for over 15 years (Koehn and Monz, 2006), Python implementation, for example, uses a dif- with metrics shared tasks organized as part of the ferent list of stopwords compared to the original yearly WMT workshop/conference. They have PERL script,7 but implementation details are rarely led to improved human annotation processes and specified. That means that not only do we rely on a metrics evaluation approaches, as well as almost metric that consistently underperforms others, we all the learned metrics listed in section 3.1. As are not even using it correctly or in a replicable part of these shared tasks, Macháček and Bojar manner. Beyond versioning issues, ROUGE was (2014) and Stanojević et al. (2015) used non-expert initially designed to evaluate English text, and it crowdworkers to perform a 5-way comparisons be- thus uses whitespace tokenization, and and English tween systems. However, they point out that 5-way stemmer and stoplist. Yet, it is commonly applied comparisons are challenging to interpret as pair- to other languages without mentions of the exact wise comparisons, which is required to compute changes to get it to run. segment-level Kendall-Tau correlations. Similar issues exist in modern frameworks as well, especially those that utilize pretrained mod- Addressing this issue, Bojar et al. (2016) ex- els (Liao et al., 2021). For example, BERT- perimented with three measuring techniques: the S CORE (Zhang et al., 2020) is reported in many original 5-way ranking, direct assessments (DA) recent summarization publications, but the term where outputs are evaluated by themselves, and BERT-S CORE refers to the methodology instead HUME, a method which aggregates scores for se- of underlying model. To combat the confusion mantic units. After promising results, Bojar et al. between model versions, the library produces a (2017) only used DA on a 0-100 scale and HUME. unique hash, inspired by the S ACRE BLEU frame- To compute correlations, DA annotations were con- work (Post, 2018). Yet, these hashes are often not verted into relative rankings, called DARR. The reported or aggregated in incomparable ways.8 following year also abandoned HUME and fully relied on DA (Ma et al., 2018), and embedding- Another example of an often unreported de- based metrics started strongly outperforming other sign choice is how to use single-reference met- metrics. The 2019 shared task introduced a qual- rics in multi-reference setups. While ROUGE ex- ity estimation task in accordance with the DA data plicitly describes how to use it in multi-reference collection technique, illustrating how the human tasks,9 most neural metrics do not. For example, evaluation techniques can influence the design of BLEURT (Sellam et al., 2020) only suggests tak- metrics (Ma et al., 2019). 
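Because configuration choices change ROUGE scores, the exact variant, aggregation (precision, recall, or F1), and preprocessing should be reported. The snippet below uses one widely used Python implementation, the rouge_score package, to show how stemming and the choice of reported statistic shift the numbers; other implementations (e.g., the original PERL script or various wrappers) may default to different stopword lists and tokenizers, which is exactly why the configuration needs to be stated.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

reference = "the quick brown fox jumped over the lazy dog"
prediction = "a quick brown fox leaped over a lazy dog"

# Variant list, stemming, and which statistic is reported are all part of the metric
# definition and should be stated when scores are published.
for use_stemmer in (False, True):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=use_stemmer)
    for name, score in scorer.score(reference, prediction).items():
        print(f"stemmer={use_stemmer} {name}: "
              f"P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
```

Reporting the package, its version, and these parameters serves the same purpose as a SACREBLEU signature.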
ing the max of multiple scores without discussing tradeoffs compared to computing the mean.10 All However, as metrics and systems improved fur- these evaluation parameters can have a drastic in- ther, the DA annotations proved insufficient to iden- fluence over the validity of scores and can lead to tify a “best” metric (Mathur et al., 2020), which led to another major change to the methodology (Fre- 7 The package can be found here. Anecdotally, wrappers itag et al., 2021b). The latest evaluations thus fol- around the original implementation can lead to changes of lowed the suggestion by Freitag et al. (2021a) to use more than 0.5 points. 8 For example, Papers With Code for WMT 2014 en-de Multidimensional Quality Metrics (MQM, Lom- compares models on S ACRE BLEU score without hashes. mel et al., 2014), a fine-grained expert-based an- 9 The multi-reference version of ROUGE represents a very notation approach. The results demonstrate that generous upper bound in which results can only improve by adding a reference, never decrease, which can have other DA is unreliable for high-quality translations, of- negative implications. Moreover, not all implementations may ten mistakenly ranking human translations lower use the originally recommended method. than system outputs whereas human translations 10 The alternative approach can be seen on the leaderboard of the ToTTo dataset (Parikh et al., 2020) where the mean of are correctly identified as better than system out- multiple BLEURT scores is reported. puts in MQM. Surprisingly, metrics correlate much
better with MQM, even those trained on the DA This section gives an overview of various re- annotations. search efforts that seek to evaluate automatic met- Does this mean that focusing on DA was wrong? rics experimentally, with each focusing on a spe- No, without many years of (suboptimal) data col- cific aspect of the metric, such as its sensitivity to lection, we would not have learned metrics, and we sequence length or to lexical overlap between the would not know whether DA worked for MT. How- candidate and the reference. ever, the progression also teaches the lesson that Perturbation Analysis and Surrogate Models benchmarks may lead the field down the wrong One common methodology is to apply methods path. A similar argument by Hirschman (1998) from the interpretability literature to understand critiques that benchmark evaluations only take a what metrics focus on. In one such study, Kaster narrow approach and states that evaluation is in- et al. (2021) measure to what extent several BERT- trinsically a cost-benefit trade-off. They further based metrics correlate with a simple linear model argue that we should weigh the divergent needs based on hand-crafted features. They find that of stakeholders when designing evaluations, sim- these metrics are sensitive to lexical overlap de- ilar to Ethayarajh and Jurafsky (2020), who ar- spite the fact that the initial motivation for distri- gue that not everyone may derive the same utility butional similarity metrics was the over-reliance from an improvement on a leaderboard. Scott and on lexical overlap of BLEU and ROUGE. The Moore (2007) warn that NLG evaluation shared authors craft adversarial examples, and show that tasks could harm the field, since they may amplify metrics can be fooled by lexically similar, non- issues with the data and that benchmarks may lead paraphrase sentences. To the same end, Sai et al. to people to ignore external evaluations, and put (2021) conduct a correlation analysis after applying too much emphasis on metrics that do not measure 34 perturbations that test the metrics’ sensitivity what we think they measure, both of which also to task-specific criteria (e.g., jumbling word order, happened. We thus can conclude that benchmarks introducing spelling errors for fluency, or chang- are necessary, but that they need to be self-critical ing numbers for correctness) using the Checklist and explore different evaluation approaches.11 method (Ribeiro et al., 2020). The results of this analysis, which covers 18 criteria across six tasks, 3.4 Auditing and Interpreting Metrics indicate that trained metrics tend to do better, but As seen through the WMT metrics shared tasks, tuning towards overall quality across task is a poor machine learning-based metrics are promising, but practice, leading to metrics that evaluate no indi- a common criticism is that they are not transparent; vidual aspect correctly. Sai et al. further report it is often unclear how they operate internally and that even metrics that score highly are not entirely whether they can deliver high performance consis- robust to simple perturbations, calling for a more tently across domains, tasks, and systems. Metric widespread use of this type of analysis. developers typically report agreement with human Aside from lexical overlap, another aspect of ratings on specific test subsets filtered on the prop- text that has been shown to confound metrics is erty of interest, or they measure the change in a length. 
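Checklist-style perturbation tests of the kind run by Sai et al. (2021) can be scripted as simple unit tests against any metric. The sketch below applies two illustrative perturbations (jumbled word order and an injected typo) to a candidate and reports how much a higher-is-better metric drops; the perturbation functions are ours and much cruder than the 34 transformations used in the cited study.

```python
import random

def jumble_words(text: str, rng: random.Random) -> str:
    """Shuffle word order, which should hurt any fluency-sensitive metric."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)

def add_typo(text: str, rng: random.Random) -> str:
    """Duplicate one non-space character to simulate a spelling error."""
    positions = [i for i, ch in enumerate(text) if not ch.isspace()]
    i = rng.choice(positions)
    return text[:i] + text[i] + text[i:]

def perturbation_report(metric, reference: str, candidate: str, seed: int = 0):
    """Report how much a higher-is-better metric(reference, candidate) drops under
    simple Checklist-style perturbations of the candidate."""
    rng = random.Random(seed)
    base = metric(reference, candidate)
    drops = {name: base - metric(reference, fn(candidate, rng))
             for name, fn in [("jumbled word order", jumble_words), ("spelling error", add_typo)]}
    return base, drops

# A bag-of-words baseline shows no drop for jumbled word order, illustrating the blind spot.
unigram_f1 = lambda ref, cand: (2 * len(set(ref.split()) & set(cand.split()))
                                / (len(set(ref.split())) + len(set(cand.split()))))
print(perturbation_report(unigram_f1, "the cat sat on the mat", "the cat sat on a mat"))
```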
During the DUC summarization tasks, sys- metric’s value when perturbing a reference (e.g., tems were restricted to a strict number of output by shuffling words). The idea to write tests for bytes and thus were compared at a given length. metrics, rather than reporting corpus-wide corre- This is no longer the case in modern datasets, but lations, may partly be traced back to Lin and Och Sun et al. (2019) show that this can have dire con- (2004), who pose that metrics should always rank sequences. Specifically, up to a certain length, one a human-produced reference first when compared can “cheat” ROUGE scores by simply generating to multiple system outputs and thus measure how longer outputs. Even when the longer outputs are far the reference deviates from the first spot.12 qualitatively worse, scores increase. 11 We also note that, in addition to DUC/TAC, there has Impact of the Systems’ Quality As models im- been a long history of shared tasks in the NLG community addressing a much more diverse set of tasks starting with prove, so should metrics. Yet, many metrics are referring expression generation (Gatt et al., 2008), but which tuned or benchmarked using previously published have also covered tasks such as summarization (Syed et al., system outputs, which cannot be representative of 2019) and data-to-text generation (Dusek et al., 2020). 12 As we discuss later, this strong assumption is rarely met the current and future state-of-the-art. As a re- for NLG datasets. sult of this, Peyrard (2019) find that summariza-
tion metrics with previously reported high correla- non-English ones, may be used to train future met- tions with humans disagree with one another when rics, feeding the positive feedback loop that ties tasked to compare high quality summaries, reveal- metrics, models, and human evaluation. ing their fragility. Bhandari et al. (2020a) revis- its this conclusion, demonstrating that metrics dis- 4 Challenges of Human Evaluation agree whenever the quality range is narrow, regard- The work presented in the previous section con- less of whether the summaries are good or bad. cludes human evaluation is a necessary component Bhandari et al. (2020b) also highlight that previ- of model evaluations since we cannot trust auto- ously published studies of metrics would yield dif- matic metrics. This conclusion is reached by treat- ferent conclusions with more recent datasets and ing human evaluation annotations as the ground top scoring systems, and that the relative perfor- truth to which automatic metrics are compared, mance of metrics vary a lot across datasets. These and human annotations are also used as training studies show that it is still unclear how metrics gen- corpora for automatic metrics. We thus rely on hu- eralize across time, systems, and datasets and the man evaluations and often treat them as a panacea evaluation of such qualities is complicated due to that reveals the ultimate truth about NLG system the cost of collecting human annotations, the low performance. Yet there are deep-running issues diversity of existing datasets, and the impossibility with how human evaluations are conducted, which to to access future systems. affect these system analyses, metric evaluations, and newly developed metrics. 3.5 Takeaways for Metric Developers Since BLEU was introduced, dozens of papers 4.1 What is Measured? have shown that automatic metrics have poor corre- While some work asks evaluators to rate the overall lations with human judgments of quality (in addi- quality of generated text, it is more common to tion to those cited above, see, e.g., Callison-Burch collect evaluations for specific dimensions of text et al. (2006)). We challenge the premise that such quality. However, there is little consensus on which a correlation would be desirable, because quality is dimensions to evaluate. a vastly under-defined property. Instead, we make In the human evaluations analyzed in Howcroft the case for multi-dimensional evaluation. This is et al. (2020)’s study of 165 NLG papers, generated already common in human evaluations; researchers text was evaluated along 204 dimensions of quality, often collect evaluations for several aspects of a which they mapped to 71 distinct criteria. Some of generated text’s quality (e.g., in MT, rating both the these criteria are hierarchical, e.g., grammaticality fluency and adequacy of a translated text). Since a and spelling fall under the more general correct- single number cannot give an accurate depiction of ness of surface form criterion. There are also cases system’s performance, we call for the development where researchers apply the same text quality di- of metrics with a smaller, but better defined scopes. mension differently. For example, Howcroft et al. Another aspect that does require more attention (2020) found that what researchers called fluency is robustness. 
Meta-evaluation studies have shown could actually be divided into 15 different criteria, that metrics can behave vastly differently on dif- depending on how the term was defined and used ferent datasets and when tasked to evaluate differ- in the context of the task. ent NLG systems. Furthermore, multiple studies The disparities in how text quality dimensions demonstrate that automatic metrics easily break are applied and defined in human evaluations com- when the input is subject to simple perturbations. plicate comparisons across efforts and benchmark- This shows that there is major headroom for im- ing improvements over previous work. This prob- provement: the metrics should be narrower in the lem is exacerbated by the lack of human evaluation phenomenon they try to capture, but broader in the details in NLG papers. Of the 478 quality evalu- input domain on which they perform well. ation questions studied by Howcroft et al. (2020), Given the results reported on existing bench- over 50% did not define the criterion they were marks, we support the view that human evalu- evaluating for (279 out of 478), 65% did not re- ation remains an essential component of perfor- port the exact question they gave the evaluators mance analysis, complementary to automatic met- (311/478), and 20% did not even name the crite- rics. In addition, collected annotations, especially rion being evaluated (98/478). To promote more
standardized human evaluations, some researchers the relative quality of the generation models also have proposed detailed definitions and methodolo- makes a difference, showing significant differences gies for human evaluation for a specific task and/or between older annotations and newly collected hu- dimension of text quality. For example, Thomson man judgments for better models.13 They show and Reiter (2020) propose a methodology for eval- that automatic metrics trained on annotations of uating accuracy for data-to-text generation tasks, text generated from older models do not always and Rashkin et al. (2021) define a framework for perform as well when evaluating state-of-the-art evaluating whether generated text is attributable to generated text. Another confounder, which we identified sources. point out in section 3, is the correlation between While general or vague evaluation criteria can dimensions that should not be correlated. Dusek lower the reproducibility and lead to low agreement et al. (2020) demonstrate that the correlation can between evaluators, well-specified human evalua- be avoided by running different annotation tasks in tion comes at a cost. For example, the human eval- parallel, but this leads to a much higher cost to the uation protocol used in the accuracy shared task at evaluators. INLG 2021 (Reiter and Thomson, 2020; Thomson Measurement instruments van der Lee et al. and Reiter, 2020) produced high inter-annotator (2021) find that Likert scales were the most popular agreement, but Thomson and Reiter (2021) re- method for rating generated text, used in 56% of ported that each 300-word text took an annotator studies (82/147). However, Belz and Kow (2010) 20-30 minutes to evaluate and the annotation cost argue that rating scales like those used in direct for a single generated text was about US$30. How- assessments (i.e., evaluating a generated text alone, ever, this detailed human evaluation protocol cap- without referencing other candidates) have many tured error categories that the automatic metrics issues: they are unintuitive, agreement numbers were unable to detect. are low, and most statistical measures are inappro- priate for ordinal data. They find that these issues 4.2 How is it Measured? can be addressed to some extent by switching to Previous work indicates that the way questions are preferential judgments. Kiritchenko and Moham- framed, the types of text that are being evaluated, mad (2017) demonstrated that best-worst scaling and the measurement instruments can affect the (asking evaluators to choose the best and the worst results of human evaluations. Schoch et al. (2020) items in a set) is an efficient and reliable method discuss the role cognitive biases can play in the way for collecting annotations, and this approach has researchers elicit human evaluations, such as using been used to collect comparative evaluations of gen- positive or negative framing (e.g., How much more erated text (e.g., Liu and Lapata, 2019; Amplayo fluent is sentence A vs. sentence B?), including text et al., 2021). artifacts or study design details that reveal the re- Belz and Kow (2011) further compare continu- searchers’ hypothesis, and framing instructions and ous and discrete rating scales and found that both questions around a model’s known strengths and lead to similar results, but raters preferred contin- weaknesses. Choi and Pak (2005) provide a longer uous scales, consistent with prior findings (Svens- catalogue covering 48 of these biases. 
However, if son, 2000).14 Contrary to these findings, Bojar et al. researchers do not report the details of their stud- (2016) and Novikova et al. (2018) compare direct ies, no one can judge whether any of these biases assessments and relative rankings and find that the would apply; surveys of NLG papers find as few rankings produced were very similar, but Novikova as 35% (Howcroft et al., 2020) and 16% (Schoch et al. conclude that relative rankings are best when et al., 2020) of papers share the questions used in combined with magnitude estimates. They also their human evaluations. find that collecting judgments in separate tasks Aspects of the texts themselves may also un- decorrelates different evaluation criteria, albeit at a duly affect the evaluators’ judgments. For example, higher cost since multiple tasks have to be run. Sun et al. (2019) find that several dimensions of 13 However, this finding may be confounded by the collec- summary quality (e.g., informativeness) are corre- tion approach as well (Shapira et al., 2019). 14 lated with the summary’s length and thus suggest One potential caveat is that these studies were conducted before the wide availability of crowdsourcing platforms and normalizing for summary length when evaluating are thus conducted with small cohorts of raters who have a these criteria. Bhandari et al. (2020b) find that different motivation.
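For reference, best-worst scaling judgments are commonly aggregated with a simple counting rule: each item's score is the number of times it was chosen as best minus the number of times it was chosen as worst, divided by how often it was shown. A minimal sketch of this aggregation, with invented system names and judgments, follows; more elaborate inference schemes exist, but this counting estimate is the one typically used in NLP studies.

```python
from collections import Counter

def best_worst_scores(annotations):
    """Aggregate best-worst scaling annotations into per-item scores in [-1, 1].
    Each annotation is a tuple (items_shown, best_item, worst_item);
    score = (# times best - # times worst) / # times shown."""
    best, worst, shown = Counter(), Counter(), Counter()
    for items, b, w in annotations:
        shown.update(items)
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}

annotations = [  # invented judgments over four hypothetical systems
    (["sys_A", "sys_B", "sys_C", "sys_D"], "sys_A", "sys_D"),
    (["sys_A", "sys_B", "sys_C", "sys_D"], "sys_B", "sys_D"),
    (["sys_A", "sys_B", "sys_C", "sys_D"], "sys_A", "sys_C"),
]
print(sorted(best_worst_scores(annotations).items(), key=lambda kv: -kv[1]))
```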
4.3 Statistical Significance Aside from the parameters of the study, there are also confounding factors in the evaluation of the Human evaluations present yet another issue: how annotation quality itself. To demonstrate that the to measure the significance of human evaluation annotations are of sufficient quality, reporting inter- results? van der Lee et al. (2021)’s survey finds that annotator agreement is the most common method. only 23% of NLG papers report statistical analyses However, Amidei et al. (2019a) survey 10 years to determine the significance of their results, and of annotation agreement measures and show that only 13% explicitly state their hypotheses. almost all studies fail reliability tests. They argue One challenge when testing for significance in that a substantial amount of the variability cannot human evaluation results is small sample sizes; and should not be eliminated since evaluation of given that the median number of generated texts generated text is intrinsically subjective and relies in a human evaluation is 100 items (van der Lee on many different factors including rater experi- et al., 2021), most typical experimental designs for ence, motivation, knowledge, or education. As a human rating studies will be underpowered to de- remedy, they suggest using additional correlation tect small model differences. This problem is not measures alongside kappa statistics. specific to NLG. Card et al. (2020) analyze popular NLP datasets and find that they are not adequately 4.4 Who is Measuring? powered (e.g., a typical MT test set of 2000 sen- tences would have approximately 75% power to In many human evaluations, a small number of detect differences of 1 BLEU point). Howcroft and evaluators judge the generated text. 39% of papers Rieser (2021) demonstrate that treating ordinal data in van der Lee et al. (2021)’s survey use between 1– as interval data makes tests even more underpow- 5 evaluators. However, it is becoming increasingly ered, which is what most papers do when analyzing common to collect judgments from a large num- rating and Likert scales (68 out of 85 recent papers, ber of evaluators using crowdsourcing platforms according to Amidei et al. (2019b)). Significance like Amazon Mechanical Turk (MTurk), Appen, thresholds are not always adjusted when running Prolific Academic, and Upwork. multiple significance tests (e.g., Bonferroni correc- In particular, MTurk has a long history in tion), increasing the likelihood of false positives NLP with early claims stating that a small num- (van der Lee et al., 2019). ber of crowdworkers can replace a single expert Improvements in NLG models also make detect- rater (Snow et al., 2008). Similar claims were ing statistically significant differences more chal- made in other communities, stating that, while not lenging. Text generated by high quality models as high-quality, overall data quality can actually may differ less often or in more subtle ways, which be improved by having more redundant annota- requires more human judgments to detect. Wei and tions (Sheng et al., 2008). However, later studies Jia (2021) show that the requirement for more judg- find that this point is actually a lot more nuanced. ments can quickly becomes prohibitive: to detect Some dimensions of text quality may be easier a difference of 1 point on a 1-100 scale in WMT, than others to rate with crowdsourced evaluators we need 10,000 perfect annotator judgments. As instead of experts. 
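To make the power problem tangible, the sketch below runs a crude bootstrap comparison of two systems that differ by one point on a 1-100 scale and then asks how many ratings a standard two-sample design would need to detect that difference with 80% power. The numbers (200 items, a standard deviation of 15) are illustrative assumptions, not the WMT figures analyzed by Wei and Jia (2021), and statsmodels is used for the power calculation.

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(0)
# Toy human ratings (1-100 scale) for two systems on the same 200 items.
ratings_a = rng.normal(72, 15, 200)
ratings_b = rng.normal(73, 15, 200)  # a true difference of 1 point

# (a) Bootstrap over items: how often does the resampled mean difference fall at or below 0?
diffs = ratings_b - ratings_a
boot_means = [rng.choice(diffs, size=len(diffs), replace=True).mean() for _ in range(10_000)]
p_value = np.mean(np.array(boot_means) <= 0) * 2  # crude two-sided bootstrap p-value
print(f"bootstrap p-value for a 1-point difference with 200 items: {p_value:.3f}")

# (b) Power analysis: items needed to detect a 1-point difference (sd=15) with 80% power.
effect_size = 1 / 15  # Cohen's d for a 1-point difference with sd 15
n = TTestIndPower().solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
print(f"items per system for 80% power: {n:.0f}")  # on the order of several thousand
```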
Gillick and Liu (2010) find that a result, they suggest that automatic metrics may MTurk judges were better at measuring generated actually be more reliable than human annotations summaries’ linguistic quality than their content or if the annotations are insufficiently powered. The overall quality and had a much higher correlation number of required annotations can potentially be between linguistic and overall quality than experts. decreased by not uniformly sampling examples to Clark et al. (2021) find MTurk evaluators are more annotate and instead biasing the sampling toward likely to base judgments of generated text on the those where models differ. However, this process text’s form rather than its content. In their work on can lead to artificially high correlation of the re- German summarization evaluation, Iskender et al. sults with automatic metrics, which could overstate (2020) find that non-redundancy and usefulness are their effectiveness and the quality of human anno- very hard to assess using crowdworkers and suggest tations (Deutsch et al., 2021b). Moreover, since that experts should be used for them, while crowd- NLG models may only differ in very few exam- workers are suitable for other dimensions of text ples, statistical analyses should also handle ties as quality as long as results are carefully interpreted. discussed by Dras (2015) for pairwise rankings. Analyzing DUC annotations between 2001 and
2004, Harman and Over (2004) find that averaged is payment; does the low-pay, small-batch format human ratings can yield meaningful insights, but of crowdsourcing actually provide evaluators with a also note that there is very high variance both fair wage? Fort et al. (2011) discuss the low wages within and between human raters and that it is MTurk workers receive, along with concerns about unclear whether the source of the variance is in- data quality issues that the platform incentivizes. trinsic to the humans or the models. This variance These concerns are not unique to MTurk; Schmidt may be even higher in crowdsourcing scenarios (2013) argues that there are ethical concerns across compared to expert raters. Karpinska et al. (2021) crowdsourcing platforms, regardless of how they report that running the same MTurk evaluation on incentivize workers. Shmueli et al. (2021) cover different days of the week can vary enough to pro- a broader set of ethical considerations for crowd- duce different results. When analyzing evaluations sourcing work, including potential psychological of MT systems, Freitag et al. (2021a) find that harms, exposing sensitive information about work- agreement between ratings produced by linguists ers, and breaching workers’ anonymity. Despite and those from crowdworkers can be extremely these concerns, Shmueli et al. report that only 14 low. In fact, they find that automatic metrics can out of 703 NLP papers that used crowdsourcing have higher agreement with high-quality anno- mention IRB review. tations than human crowdworkers. Some tasks like multi-document summarization are especially 4.5 Subjectivity and User Satisfaction challenging and time-consuming for people to eval- Most of the human evaluations in this section are uate. Observations like these have led to work intrinsic evaluations, asking evaluators to rate the proposing evaluation methods that combine the ad- quality of the generated text. However, the more vantages of human and automatic evaluation (e.g., valuable question is answered with extrinsic eval- Hashimoto et al., 2019; Zhang and Bansal, 2021). uation: how well does the generated text serve The increasing quality of generated text has led its intended purpose? These evaluations measure some researchers to move away from crowdsourc- how useful a text generation model is and indicate ing platforms. For example, expert evaluators like whether real world users would be satisfied with English teachers (Karpinska et al., 2021) or trained, the generated texts. Evaluations focused on intrin- in-person evaluators (Ippolito et al., 2020) were sic qualities of the text fail to capture dimensions needed to distinguish between human-authored of NLG systems that practitioners care about, e.g., text and text generated by today’s generation mod- how trustworthy a generated text is or how well it els (an evaluation most commonly found in dia- performs in human-in-the-loop settings.15 logue generation). Similarly, Freitag et al. (2021a) Another related aspect that is rarely considered demonstrate that non-expert annotations often in human evaluations is the subjectivity of text lead to mistaken claims of super-human model evaluation. People may value certain text quali- performance, when expert annotators correctly ties more highly than others or be working from identify issues in the generated texts. a different point of reference. 
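When agreement is reported, Amidei et al. (2019a) recommend pairing kappa statistics with correlation measures, since two raters can rank outputs consistently while using the scale differently. A small sketch with invented Likert ratings is shown below; it uses scikit-learn's quadratic-weighted Cohen's kappa and a Spearman correlation, and also prints per-rater means to surface systematic offsets rather than treating all disagreement as noise.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Toy 1-5 Likert ratings from two raters over the same 12 generated texts.
rater_1 = np.array([5, 4, 4, 3, 5, 2, 4, 3, 5, 4, 2, 3])
rater_2 = np.array([4, 4, 3, 3, 5, 1, 4, 2, 4, 4, 2, 2])

# Chance-corrected categorical agreement (weighted kappa respects the ordinal scale).
kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")
# Rank correlation, as suggested alongside kappa for subjective NLG judgments.
rho, _ = spearmanr(rater_1, rater_2)

print(f"quadratic-weighted kappa: {kappa:.2f}, Spearman rho: {rho:.2f}")
print(f"per-rater means: {rater_1.mean():.2f} vs {rater_2.mean():.2f} (systematic offset)")
```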
Even the more “ob- jective” aspects of text quality, like grammatical It is unclear whether these issues are specific to correctness, may depend on the evaluators’ dialect, the fact that non-expert annotators are being used, the perceived formality of the text, the context or or if these issues may be overcome by improving style of the generated text, etc. Disagreement in the quality of the study and the working condition evaluators’ ratings does not always indicate eval- of raters. Investigating the use of MTurk for NLP, uator error; rather it may be a signal that there is Huynh et al. (2021) find that about 25% of studies more complexity to the text or dimension of qual- have technical issues, 28% have flawed, vague, or ity. While it has been shown that increasing the insufficient instructions, and 26% of study creators number of annotations per example can decrease were rated as having poor communication. Notably, the overall bias (Artstein and Poesio, 2009), this they also find that 35% of requesters pay poorly or finding assumes that the population of annotators is very badly according to MTurk raters. To that end, somehow representative of the whole world. Prab- many have questioned whether the treatment eval- hakaran et al. (2021) find that aggregating annota- uators receive and the structure of crowdsourcing platforms provide ethical working conditions for 15 See, for example, Ehud Reiter’s summary of a panel on evaluators. The most basic of these considerations NLG in industry at INLG 2021.
tor responses results in under-representation of gue that choosing to evaluate on a dataset reinforces groups of annotators’ opinions, and they recom- design decisions taken during its construction and mend releasing annotator-level annotations and col- focuses the evaluation on the specific distributions lecting annotators’ socio-demographic information represented in the data. to prevent the exclusion of minority perspectives. Collectively, the research community could se- We thus should be careful of results such as those lect for a more diverse language representation and that suggest excluding data with low agreement decide to replace older flawed datasets by newly de- scores with other annotators (Owczarzak et al., veloped ones. Unfortunately, the collective choices 2012), unless we know the source of the disagree- also reinforce suboptimal design decisions. Analyz- ment is not subjectivity. Even well-established ing a sample of 20 papers that proposed summariza- NLG tasks have aspects of subjectivity that are tion approaches in 2021, we find 27 datasets that usually ignored. For example, the goal of a sum- models were being evaluated on. The most popular marization task is to generate the important points ones, CNN/DM and XSum (Narayan et al., 2018), from a document, but Kryscinski et al. (2019) find were used five and four times respectively, despite that when annotators select which sentences in a their issues, which we explore in section 5.2. Ad- document are the most important to include in a ditionally, only two of the 27 datasets were non- summary, the majority of evaluators only agree on English, despite much recent work that introduces an average of 0.6 sentences per document. multilingual summarization corpora (Giannakopou- While the majority of evaluation criteria is by los et al., 2015; Scialom et al., 2020; Ladhak et al., definition subjective, there is an opportunity for 2020; Hasan et al., 2021; Perez-Beltrachini and hybrid approaches with the help of standardized Lapata, 2021). measures (van der Lee et al., 2021). One such These findings lead to three questions. First, dimension that could be useful for tasks like sim- how can we as a research field measure summa- plification is the readability of text, which could be rization improvements on disjoint datasets? How measured using scales such as the ones proposed can we claim that we are making progress if we by Kincaid et al. (1975) or Ambati et al. (2016). only focus on a single language? And, given the van der Lee et al. point out that the relationship significant issues with popular benchmark datasets, between these objective measures and subjective what do improvements even mean? Throughout readability assessments is not currently being stud- this section, we analyze typical design choices dur- ied, although a strong objective measure could lead ing NLG data construction and how they influence to a higher degree of standardization. Similarly, insights derived from evaluations.16 one can imagine human-in-the-loop approaches for measuring faithfulness that focus on claims that 5.1 Representation in Performance Numbers are challenging to verify using only automatic ap- Dataset creation is a value-laden process, yet those proaches, enabling the collection of a much larger values are rarely made explicit (Hutchinson et al., quantity of judgments. 2021). The choices of dataset creators have signif- icant impact, for example on who is represented 5 Challenges with Datasets in the data and on the language(s) of a dataset. 
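Readability scales like the one proposed by Kincaid et al. (1975) are straightforward to compute automatically, which is what makes them attractive as a standardized, objective complement to subjective ratings. The sketch below implements the Flesch-Kincaid grade level with a rough vowel-group syllable heuristic; production use would rely on a dedicated library and a proper syllable counter.

```python
import re

def count_syllables(word: str) -> int:
    """Very rough syllable heuristic: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level (Kincaid et al., 1975):
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

print(flesch_kincaid_grade("The cat sat on the mat. It was warm."))
```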
Joshi et al. (2020) assess the language diversity A component mostly kept apart from evaluation in NLP, showing that very few languages beyond analyses is the data, even though NLG tasks are English are being studied, regardless of the num- embodied through datasets; for example, claims ber of their speakers. A similar argument can be about performance on CNN/DM may be used as a made for dialects; focusing on African American proxy for performance on all summarization tasks. Vernacular English (AAVE), Blodgett et al. (2020) Issues with datasets are widely studied in the gen- describe multiple studies showing a drop in per- eral machine learning literature which we heavily formance on popular NLU tasks when applied to draw on in this section, with anecdotal evidence for 16 We point to Paullada et al. (2020) for a more in-depth NLG tasks when available. In a recent survey of survey of general issues in data creation, including those of datasets and benchmarks in machine learning, Liao benchmarking and data maintenance practices, to Bender et al. et al. (2021) point out that the lack of differentiation (2021) for a survey issues of using large web-scraped datasets, and to Luccioni and Viviano (2021) and Dodge et al. (2021) between tasks and datasets that aim to capture them for analyses of such large-scale web-scraped corpora and their can lead to harmful over-generalization. They ar- representational, legal, consent, and PII issues.