When Combating Hype, Proceed with Caution

Samuel R. Bowman
New York University
bowman@nyu.edu

Abstract

In an effort to avoid reinforcing widespread hype about the capabilities of state-of-the-art language technology, researchers have developed practices in framing and citation that serve to deemphasize the field's successes. Though well-meaning, these practices often yield misleading or even false claims about the limits of our best technology. This is a problem, and it may be more serious than it looks: It limits our ability to mitigate short-term harms from NLP deployments and it limits our ability to prepare for the potentially enormous impacts of more distant future advances. This paper urges researchers to be careful about these claims and suggests some research directions and communication strategies that will make it easier to avoid or rebut them.

Figure 1: Hype is a problem. The opposite of hype isn't necessarily better.

1 Introduction

Over the last few years, natural language processing has seen a wave of surprising negative results overturning previously-reported success stories about what our models can do and showing that widely-used models are surprisingly brittle (Jia and Liang, 2017; Niven and Kao, 2019; McCoy et al., 2019). This shows that many of our standard practices for evaluation and reporting often lead to unrealistically positive initial claims about what we can do. The resulting hype and overclaiming, whether intentional or not, is a problem. Hype and overclaiming can encourage the reckless deployment of NLP systems in high-stakes settings where they can do significant harm. They also threaten the health and credibility of NLP as a research field, and thereby threaten our ability to influence applied stakeholders or attract funding.

Fortunately, these results have led to a surge of research and writing that proposes more thorough and cautious practices for the evaluation of model ability (Ribeiro et al., 2020; Gardner et al., 2020; Kiela et al., 2021; Bowman and Dahl, 2021). While we have only a limited ability to control the public narrative taking place through industry PR and the media, there's reason to be hopeful that we researchers are getting much better at avoiding the worst forms of overconfidence about our systems.

Less fortunately, these discouraging results seem to have produced a trend toward pessimism about model performance that is often ungrounded from real empirical results. This leaves room for the research community's consensus about a model's capabilities to fall short of its actual capabilities.

This is more dangerous than it might seem. It risks our credibility and thereby limits our ability to influence stakeholders in cases where our current systems are doing real harm. It also limits our ability to accurately forecast and plan for the impacts that may result from the deployment of more capable systems in the future. If we can truly reach near-human-level performance on many of the core problems of NLP, we should expect these impacts to be enormous, and potentially catastrophic if not planned for.

I call this issue underclaiming, for lack of a better term.[1] In this paper, I lay out four types of underclaiming, focusing especially on writing and citation practices in NLP. I then argue that this
is a problem: It hurts the short-term interests of NLP researchers, it limits our ability to mitigate the current real-world harms of language technology deployments, and it limits our ability to adequately prepare for longer-term harms involving economic disruption or catastrophic alignment failures. I close by sketching some ways of reducing the prevalence of this kind of underclaiming, including straightforward best practices in writing and evaluation, a proposed new rule of thumb for writing and reviewing, improvements to tooling for analysis and benchmarking, and more ambitious research agendas in model performance forecasting and test set design.

[1] While overclaiming generally refers to overstating the effectiveness of one's own methods or ideas, the phenomenon that I call underclaiming often involves downplaying the effectiveness of preexisting methods or ideas.

2 Underclaiming: Case Studies

This paper addresses the possibility that in our effort to avoid overconfidence, we have started to err in the opposite direction: Published work increasingly claims or implies that state-of-the-art systems are less capable than they actually are. This has taken several forms, including misleading presentations of valid negative results from weak or dated baseline models, misleading claims about the limits of what is conceptually possible with machine learning, and misleading reporting of results on adversarially collected data.

2.1 Negative Results with Weak or Dated Baselines

In spite of many surprises and setbacks, NLP research seems to have made genuine progress on many problems over the last few years. In light of this, discussions about the limitations of systems from past years don't straightforwardly apply to present systems. The first two cases that I present involve failures to contextualize claims about the failures of weaker past systems:

Adversarial Examples for SQuAD

Jia and Liang (2017) published one of the first prominent demonstrations of adversarial failures in neural-network-based NLP, showing that a simple algorithm could automatically augment examples from the SQuAD benchmark (Rajpurkar et al., 2016) in a way that would fool many state-of-the-art systems, but not humans. This work prompted a wave of much-needed analysis and a corresponding lowering of expectations about the effectiveness of neural network methods.

Figure 2: Jia and Liang (2017) remains widely cited according to Google Scholar. The original work pointed out major unexpected limitations in neural networks trained from scratch on the SQuAD reading comprehension task. However, many of these citing works make the groundless implication that modern pretrained systems—developed more recently than 2017—show these same limitations.

However, citations to this work have since taken on a strange role in the literature. In particular, the results in this paper predate the development of modern pretraining methods in NLP (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019). The best systems studied in Jia and Liang have more than twice the error rate of current state-of-the-art systems, and while I am not aware of any results from current state-of-the-art systems on this data, results from 2019 systems suggest that we are making substantial progress (Table 1). We have no reason to expect, then, that the failures documented in this work are quantitatively or qualitatively similar to the failures of current systems.

Model               Year  SQuAD  AS  AOS
ReasoNet Ensemble   2017     81  39   50
BERT-Base           2018     87  64   72
XLNet-Base          2019     89  69   77

Table 1: F1 results for the best-performing SQuAD model studied by Jia and Liang—ReasoNet (Shen et al., 2017)—and for the newer BERT and XLNet models (Devlin et al., 2019; Yang et al., 2019) reported by Zhou et al. (2020), on the original SQuAD development set and the two Jia and Liang adversarial evaluation sets (AS and AOS). While I am not aware of results from more recent models on this data, progress through 2019 had already cut error rates in half.
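One way to read the caption's "cut error rates in half" claim is to treat 100 − F1 as a rough error rate (a back-of-the-envelope reading of the table, not a calculation reported in the cited papers): on the AS set, ReasoNet's 100 − 39 = 61 drops to XLNet-Base's 100 − 69 = 31, and on the AOS set, 100 − 50 = 50 drops to 100 − 77 = 23, roughly a halving in both cases.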
However, papers that cite these results often present them with no discussion of the model under study, yielding misleading implications. For example, the award-winning work of Linzen (2020) uses the Jia and Liang result to justify this claim:

    [F]or current deep learning systems: when tested on cases sampled from a distribution that differs from the one they were trained on, their behavior is unpredictable and inconsistent with that of humans

The chief concern in this context is the claim that this failure applies to current deep learning systems in general, and the corresponding unjustified implication that these failures are a fundamental or defining feature of neural network language models. Looking only to highly-cited works from the last two years that cite Jia and Liang, similar statements can be found in Xu et al. (2020), Zhang et al. (2020), and others.

The Long Shadow of BERT

While the case of Jia and Liang is especially striking since it deals with models that predate pretraining entirely, a similar effect is much more common in a subtler form: Most analysis papers that identify limitations of a system come out well after the system description paper that claims the initial (typically positive) results. BERT, first released in Fall 2018, has been a major locus for this kind of analysis work, and continues to be a focus of such work long after its release. Looking to a random sample of ten papers from the NAACL 2021 analysis track that study existing systems,[2] none of them analyze models that have come out since summer 2019, and five only study BERT, representing a median lag of nearly three years from the release of a model to the publication of the relevant analysis.

[2] Papers studying only BERT: White et al. (2021); Slobodkin et al. (2021); Bian et al. (2021); Cao et al. (2021); Pezeshkpour et al. (2021). Papers studying other models predating Fall 2019: Wallace et al. (2021); Hall Maudslay and Cotterell (2021); Hollenstein et al. (2021); Bitton et al. (2021); Du et al. (2021).

This trend has consequences for the conclusions that one would draw from a broad review of the recent literature on some problem: A review of that literature will contrast the successes of the best current systems against the weaknesses of the best systems from an earlier period. In many cases, these weaknesses will be so severe as to challenge the credibility of these successes if they are not properly recognized as belonging to different model generations.

This analysis work is nonetheless often valuable and these long timelines are sometimes justifiable: Good analysis work takes time, and researchers doing analysis work often have an incentive to focus on older models to ensure that they can reproduce previously observed effects. Even so, this three-year lag makes it easy to seriously misjudge our progress.

The BERT-only results, though, represent a clear missed opportunity: There exist newer models like RoBERTa and DeBERTa (Liu et al., 2019; He et al., 2020) which follow nearly identical APIs and architectures to BERT, such that it should generally be possible to reuse any BERT-oriented analysis method on these newer models without modification. In many cases, these newer models are different enough in their performance that we should expect analyzing them to yield different conclusions: For example, BERT performs slightly worse than chance on the few-shot Winograd Schema commonsense reasoning test set in SuperGLUE (Levesque et al., 2011; Wang et al., 2019), while the newer DeBERTa reaches a near-perfect 96% accuracy.
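To make the "nearly identical APIs" point concrete, the sketch below shows the usual pattern with the HuggingFace transformers library: the same analysis routine can be pointed at BERT, RoBERTa, or DeBERTa by changing only the checkpoint name. The checkpoints, the probe sentence, and the placeholder analysis are illustrative choices of mine, not details taken from the works cited above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Swapping the checkpoint name is typically the only change needed to rerun
# a BERT-oriented analysis on a newer model of the same family.
CHECKPOINTS = ["bert-base-uncased", "roberta-base", "microsoft/deberta-base"]

def analyze(checkpoint: str, sentence: str):
    """Stand-in for a real analysis method: encode one sentence and report
    the shape of the final-layer representations."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    with torch.no_grad():
        outputs = model(**tokenizer(sentence, return_tensors="pt"))
    return tuple(outputs.last_hidden_state.shape)

for name in CHECKPOINTS:
    print(name, analyze(name, "The trophy didn't fit in the suitcase because it was too big."))
```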
2.2 Strong Claims about Limits to Understanding

The influential work of Bender and Koller (2020) is centered on the argument that:

    [T]he language modeling task, because it only uses form as training data, cannot in principle lead to learning of meaning.

The proof of this claim is straightforward and convincing under some (but not all) mainstream definitions of the word meaning in the context of NLP: If meaning deals with the relationship between language and some external nonlinguistic reality, then a system that can only interact with the world through language cannot manipulate meanings.

This argument does not on its own make any prediction about the behavior of these models on tasks that take place entirely through the medium of language. On the definitions of meaning that must be assumed for this argument, tasks like summarization or translation can in principle be solved without reference to meaning: Under this definition, a translation model would be acting without reference to meaning even if it has a rich, structured internal model of the world, and it interprets sentences with reference to that model when translating: As long as that model of the world was developed solely using language, and does not otherwise ground out to perception or some other nonlinguistic modality, no meaning has been used. See Merrill et al. (2021) for some limits on how closely such a model can correspond to the real world and Bommasani et al. (2021, §2.6.3) for further discussion of the implications of Bender and Koller's arguments for NLP.

In addition, this argument does not justify any strong prediction about the behavior of models which are trained primarily, but not exclusively, on a language modeling objective, as with models that are fine-tuned to produce non-textual outputs like labels, or models which are trained in a multimodal language-and-vision regime.

While this core claim is sound and important, public discussion of the paper has often repeated the claim in ways that imply stronger conclusions about model behavior. Utama et al. (2020), for example, write that:

    Researchers have recently studied more closely the success of large fine-tuned LMs in many NLU tasks and found that models are simply better in leveraging biased patterns instead of capturing a better notion of language understanding for the intended task (Bender and Koller, 2020).

In another vein, Jang and Lukasiewicz (2021) make the straightforward claim that

    Bender and Koller (2020) show that it is impossible to learn the meaning of language by only leveraging the form of sentences.

but they then use that claim to motivate a new regularization technique for language models, which does nothing to change the fact that they are trained on form alone. In this context, it is hard to avoid the inference that Bender and Koller must have shown something specific and contingent about recent language models. In fact, they make a more general point about the nature of meaning that applies equally to any purely form-based language model, no matter what objective it uses or how well it performs.

Similar claims can be found in many other citing works (Utama et al., 2020; van Noord et al., 2020; Hovy and Yang, 2021; Sayers et al., 2021; Peti-Stantić et al., 2021; Jang and Lukasiewicz, 2021). While Bender and Koller raise important points for discussion, these strong inferences in citing works are misleading and potentially harmful.

2.3 Adversarially Collected Data

Adversarially collected test sets (Bartolo et al., 2020; Nie et al., 2020; Kiela et al., 2021)—or test sets composed of examples that some target system gets wrong—have recently become a popular tool in the evaluation of NLP systems. Datasets of this kind are crowdsourced in a setting where an example-writer can interact with a model (or ensemble) in real time and is asked to explore possible ways of writing examples until they find an example on which the model fails. Writers are generally paid or rewarded only for these failure cases, and the test section(s) of the final released dataset will generally consist exclusively of these failure cases.

This process produces difficult test sets, and it can be a useful tool in understanding the limits of existing training sets or models (Williams et al., 2020). However, the constraint that a specified system must fail on the test examples makes it difficult to infer much from absolute measures of test-set performance. In particular, as long as a model makes any errors at all on some possible inputs, then we expect it to be possible to construct an adversarial test set against it, and we expect it to achieve zero test accuracy on that test set. We can further infer that any models that are sufficiently similar to the adversary should also perform very poorly on this test set, regardless of their ability. Neither of these observations would tell us anything non-trivial about the actual abilities of the models.

What's more, in many NLU data collection efforts, a significant fraction of cases of annotator disagreement represents subjective judgment calls, rather than annotator errors (Pavlick and Kwiatkowski, 2019). This means that even a perfectly careful and perfectly well-qualified human annotator should be expected to disagree with the majority judgment on some examples, and will thereby be coded as having made errors. It is therefore possible to create an adversarial test set for which our perfect human annotator would reliably achieve 0% accuracy.
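The filtering argument above can be made concrete with a toy simulation. This is an illustration of the selection effect, not a reconstruction of any particular dataset's collection pipeline:

```python
import random

random.seed(0)

# Each item has a latent difficulty in [0, 1); a model answers correctly when
# its skill exceeds that difficulty, so two models of similar skill share most
# of their errors.
def make_model(skill):
    return lambda difficulty: difficulty < skill

items = [random.random() for _ in range(100_000)]
adversary = make_model(skill=0.90)  # the model used during data collection
similar = make_model(skill=0.93)    # a closely related, slightly stronger model

# Adversarial filtering: keep only the items the adversary gets wrong.
adversarial_test_set = [d for d in items if not adversary(d)]

def accuracy(model, data):
    return sum(model(d) for d in data) / len(data)

print(accuracy(adversary, items))                 # about 0.90 on the unfiltered pool
print(accuracy(adversary, adversarial_test_set))  # exactly 0.00, by construction
print(accuracy(similar, items))                   # about 0.93 on the unfiltered pool
print(accuracy(similar, adversarial_test_set))    # about 0.30, despite its real ability
```

The absolute numbers on the filtered set are driven by the filtering step and by how correlated the evaluated model's errors are with the adversary's, which is the sense in which they tell us little about the models' underlying abilities.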
Adversarially collected datasets are increasingly commonly used as test sets in standard experimental paradigms, and these caveats about the interpretation of results are not always clear when numbers are presented. I focus on Nie et al. (2020) as a case study: Sampling papers that cite this work, it is easy to find references that do not mention the adversarial design of the data and that make claims that are hard to justify:[3] Talmor et al. (2020) use the results from Nie et al. to claim that "LMs do not take into account the presence of negation in sentences", and Hidey et al. (2020) use them to justify the claim that "examples for numerical reasoning and lexical inference have been shown to be difficult." Liu et al. (2020) similarly use absolute results on the adversary models to back up the trivial but easily-misread claim that BERT-style models "may still suffer catastrophic failures in adversarial scenarios." McClelland et al. (2019) use absolute results on Nie et al.'s ANLI data to make similar claims about the limitations of GPT-3 (Brown et al., 2020).

[3] I focus here on claims about the absolute performance level of models. Whether adversarially collected test sets are appropriate for comparing the relative effectiveness of models is a largely orthogonal issue (Bowman and Dahl, 2021; Kaushik et al., 2021).

3 A Word on Hype

The previous section has laid out some ways in which the mainstream NLP research community often makes unjustifiable claims about the limitations of state-of-the-art methods. These claims do not make the opposite phenomenon, hype, any less real or any less harmful.

While hype is likely most severe in industry PR and in the media,[4] it is nonetheless still prevalent in the research literature. In one especially clear example, a prominent paper claiming human parity in machine translation performance (Hassan et al., 2018) severely overstates what has been accomplished relative to commonsense intuitions about what a human-level translation system would do (Toral et al., 2018; Läubli et al., 2018; Zhang and Toral, 2019; Graham et al., 2020).

[4] Consider the 2017 Huffington Post headline "Facebook Shuts Down AI Robot After It Creates Its Own Language."

I do not aim to argue that overclaiming or hype is acceptable or safe. However, unlike the underclaiming that this paper focuses on, hype is receiving substantial attention, and we have some established tools and norms within peer review to combat it. Finally, combating hype should be fully compatible with the goals laid out in this paper, and broad-based efforts to improve our practices in evaluation, analysis, writing, and forecasting should help reduce both underclaiming and hype.

4 Why Underclaiming is Harmful

Research papers are generally most useful when they're true and informative, and a research field that allows misleading claims to go unchallenged is likely to lose credibility with serious funders, reporters, or industry stakeholders. This is the most obvious reason that we should be concerned about underclaiming, but it is not the whole story.

This loss of insight and loss of credibility can seriously challenge our ability to anticipate, understand, and manage the impacts of deploying NLP systems. This is especially true of impacts that are contingent on NLP technologies actually working well, which we should expect will become more substantial as time goes on.

4.1 Present-Day Impact Mitigation

The deployment of modern NLP systems has had significant positive and negative impacts on the world. Technical researchers in NLP have an ethical obligation to inform (and if necessary, pressure) stakeholders about how to avoid or mitigate the negative impacts while realizing the positive ones.

Most prominently, most applied NLP models show serious biases with respect to legally protected attributes like race and gender (Bolukbasi et al., 2016; Rudinger et al., 2018; Nangia et al., 2020; Bender et al., 2021). We have no reliable mechanisms to mitigate these biases and no reason to believe that they will be satisfactorily resolved with larger scale. Worse, it is not clear that even superhuman levels of fairness on some measures would be satisfactory: Fairness norms can conflict with one another, and in some cases, a machine decision-maker will be given more trust and deference than a human decision-maker would in the same situation (see, e.g., Rudin et al., 2020; Fazelpour and Lipton, 2020). We are standing on shaky moral and political grounds when we deploy present systems in high-impact settings, but they are being widely deployed anyway (e.g., Dastin, 2018; Nayak, 2019; Dansby et al., 2020). Beyond bias, similar present-day concerns can be seen around issues involving minority languages and dialects, deceptive design, and the concentration of power (Joshi et al., 2020; Bender et al., 2021; Kenton et al., 2021, §3.3).

Persuading the operators of deployed systems to take these issues seriously, and to mitigate harms or scale back deployments when necessary, will be difficult. Intuitively, researchers concerned about
these harms may find it appealing to emphasize the limitations of models in the hope that this will discourage the deployment of harmful systems. This is appropriate when done carefully, and could be a way in which the field's increasing focus on model analysis could reduce some of these harms.

However, this kind of strategic underclaiming can backfire if done sloppily: Models are often both useful and harmful, especially when the operator of the system is not the one being harmed. If the operator of some deployed system sees firsthand that the system is effective for their purposes, they have little reason to trust researchers who argue that that same system does not understand language, or who argue something similarly broad and negative. They will then be unlikely to listen to those researchers' further claims that such a system is harmful, even if those further claims are accurate.

4.2 Longer-Term Risks

We can reasonably expect NLP systems to improve over the coming decades. Even if intellectual progress from research were to slow, the dropping price of compute should allow us to continue to reap the benefits of larger-scale training (Kaplan et al., 2020; Brown et al., 2020). This improvement in capabilities is likely to amplify both the harms and benefits of language technology.

We have good reason to expect that this further progress in NLP will lead to upheavals in areas like education, medicine, law, and the service sector more broadly, as well as making mass surveillance and misinformation campaigns far more effective, and opening up additional new use cases that will be hard for us to foresee (Brundage et al., 2018; Tamkin et al., 2021; Bommasani et al., 2021). One can reasonably expect that the positive and negative impacts of these upheavals will far exceed the impacts that our technologies have produced to date. In turn, NLP researchers who want to ensure that their career has a net-positive impact on the world should be concerned with these longer-term developments.

It will be difficult to do the necessary technical, social, and governance work to prepare for these advances if we do not have a clear picture of our current capabilities, and it will be difficult to convince outside stakeholders to act appropriately to mitigate these risks if we don't acknowledge that we have made, and are making, real progress toward effective language technology.

Looking somewhat further into the future, a substantial community of philosophers, economists, and general ML researchers are concerned that highly-capable AI systems, of the kind that could plausibly be developed through existing ML research paradigms, are catastrophically dangerous by default (Bostrom, 2012; Critch and Krueger, 2020; Christian, 2020; Ord, 2020; Russell and Norvig, 2020). Expert forecasts suggest that this could take place within a few decades (Grace et al., 2018). If these hypotheses hold, and if we are poorly prepared for these developments, the worst-case outcomes could be catastrophic, even threatening the continued existence of human civilization on some views.

Investments in research into these potential catastrophic risks from advanced machine learning systems have become substantial: Funding from one charitable foundation alone has totaled over $75M USD.[5] Concerns about risks from AI have also been a major stated motivation for a significant fraction of the work from DeepMind and OpenAI, which both have access to even greater amounts of private funding.

[5] https://www.openphilanthropy.org/giving/grants

Spurred on in particular by the shift in emergent capabilities from GPT-2 to GPT-3, the attention of these AI risk researchers has also been increasingly centered on language models and similar multimodal models (Irving et al., 2018; Stiennon et al., 2020; Hendrycks et al., 2020; Kenton et al., 2021; Wu et al., 2021; Bommasani et al., 2021, §4.9). Despite the scale of this research, and its recent shift of focus toward language models, there has been little interaction between the AI risk research community and mainstream academic NLP to date.

To the extent that these concerns are valid, they represent an urgent call for reprioritization within NLP research to favor safety-relevant areas like interpretability, control, and evaluation over scaling, and to push for better oversight and regulation of large-scale research (Dafoe, 2018): Even a small risk of a globally significant catastrophe warrants a dramatic response.

On the other hand, to the extent that these concerns are unfounded or are built on misunderstandings about the possible trajectories of ML research, it would be quite valuable to correct this misunderstanding. Correcting the record could redirect these resources and, more significantly, reduce the
risk of popular or regulatory pressure that limits large-scale ML research so severely or clumsily as to snuff out the positive potential of NLP technologies.

5 Why Worry?

Because these more speculative concerns about AI risk are rarely discussed in the NLP literature, I will here offer a brief overview of that work. Concerns around dangers from advanced artificial intelligence tend to focus on four clusters of hypotheses:

Powerful Unaccountable Organizations

Highly-capable AI is likely to lead to highly-profitable applications. Highly-capable AI should also be able to displace human labor in technical fields to a large extent, increasing the relative value of capital over labor, and making it easier for the leaders of profitable organizations to take unilateral unpopular actions. In the longer term, highly-capable AI may also contribute to the effectiveness of persuasion campaigns, further insulating these organizations from political or popular pressure. Combined with significant further technological progress, these forces could conspire to make the companies or governments that first produce highly-capable AI almost entirely unaccountable, allowing their decisions to play a major role in the trajectory of humanity as a whole (Ord, 2020).

Alignment and Robustness

Even if a system is deployed by an actor with good intentions and appropriate oversight, good outcomes are not guaranteed. As AI systems become more capable, it becomes more likely that they will be trusted to make high-stakes decisions autonomously. In these cases, it becomes crucial that they behave in ways that we would endorse, even when they are pushed into unfamiliar new situations. This requires both that the systems be optimized for the right objectives and that the systems actually internalize and generalize those objectives correctly.

Specifying and using safe objectives, such that aggressively optimizing them does not produce catastrophic outcomes, is difficult (Critch and Krueger, 2020). Human preferences are complex, making the problem of specifying an objective that rules out unintended bad behavior non-trivial. Goodhart's law[6] means that many objectives that serve as good proxies for what we want in familiar situations can break down in new situations. Further, optimizing complex systems is difficult, such that even a genuinely safe objective, if optimized powerfully by a system with an imperfect learned representation of the objective, could produce dangerous failure modes (Hubinger et al., 2019).

Finally, we should expect highly-capable AI systems to be useful in the short term, giving potential users a strong incentive to deploy them as soon as they are affordable, even if their safety is not guaranteed. This means that it is not enough that it simply be possible for us to develop safe systems; it is additionally necessary that it be nearly as easy and nearly as affordable as developing unsafe systems (Irving et al., 2018).

Risks May Be Difficult to Spot

Human-level capabilities are likely to emerge first from large machine learning models that, like modern neural networks, are not directly interpretable. This means that it may be difficult to spot ways in which a model is unsafe or forecast ways in which its behavior might change in novel settings (Critch and Krueger, 2020).

Instrumentally-Convergent Subgoals

The instrumental convergence hypothesis holds that systems that are optimizing for benign objectives, once they become sufficiently capable, have a predictable reason to take on dangerous subgoals—like accumulating large amounts of computational, economic, or political power—to maximize the odds that their primary objectives are achieved (Bostrom, 2003; Omohundro, 2008; Bostrom, 2012).[7] Even with merely near-human-like levels of performance, the ability of computational models to be copied and accelerated gives them considerable leeway to act in un-human-like ways. Systems that interact with humans only through text, or systems whose goals are circumscribed to a well-defined task like question answering, are not exempt from this concern (Armstrong et al., 2012).

[6] In the formulation of Strathern (1997): "when a measure becomes a target, it ceases to be a good measure."

[7] This is exemplified by the thought experiment of the paperclip maximizer, which points out that a machine tasked with manufacturing as many paperclips as possible, if sufficiently capable, can be expected to turn nearly all matter on earth into paperclips. While this vision of a single system acting alone on a trivial objective is unrealistic, it demonstrates the key hypothesis that almost any reasonable-sounding goal starts to conflict with basic human needs if a sufficiently capable system pursues it single-mindedly.
So What?

None of these hypotheses is simple to prove, but as far as I am aware, all have resisted straightforward attempts at falsification. All four are potentially applicable to neural network-based models and to models which operate primarily through language.

While the nascent field of AI alignment has proposed some mechanisms by which we might mitigate these risks, work in this area is still largely exploratory, with no clear research agenda in place to ensure that powerful ML models will be safe (Hadfield-Menell et al., 2016; Irving et al., 2018; Critch and Krueger, 2020; Kenton et al., 2021). If these hypotheses hold, significant further work is needed to ensure a safe long-term outcome. This will be difficult to achieve without a clear accounting of the abilities and limitations of current and plausible near-future systems. In particular, we will need enough foresight to be able to see substantial progress of this kind coming well in advance, to avoid the complacency that comes with the perception that worrying about impacts from powerful AI is like worrying about "overpopulation on Mars" (Garling, 2015, quoting Andrew Ng).

6 Ways to Do Better

The core issue in this paper is one of sloppy communication about results. The most straightforward step that we can take to remedy underclaiming is to simply use the same practices that we already use to avoid overclaiming: The peer-review process already polices overclaiming to a significant extent, and most researchers have learned to be careful about overclaiming in their writing. We should apply high standards of evidence to our own empirical claims and those of others, both in peer-reviewed venues and in more informal scientific communication, even when those claims are negative and cloaked in a frame of individual or field-level modesty.

In addition, there are specific best practices or research directions that can help make these mistakes harder to make:

A New Rule?

In light of the issues with negative results on older models discussed in Section 2.1, it could be productive to introduce a new heuristic when reviewing or evaluating papers. The Bender Rule (Bender, 2019) asks authors to explicitly name the natural languages that they study, so as to avoid misleading claims about language or languages on the basis of evidence from one language (usually English) alone. In that spirit, I propose another rule of thumb:

    When describing the failure of a machine learning model on some empirical evaluation, make it clear
    i. what kind of model has failed,
    ii. whether the model is significantly less capable than the current state of the art in the domain, and
    iii. whether the evaluation was deliberately set up to trick that model or another model like it.
    The latter two points can be left unstated when they do not apply.

While a corresponding rule could be helpful in the context of results describing the success of a machine learning system on some evaluation, the asymmetry here is intentional: Successes are likely to be deliberately replicated from one generation of models to the next, creating a trend for a positive finding about older systems to remain true of newer systems in most cases. The opposite is true of model failures.

Better Evaluation

The rise of underclaiming can likely be attributed, at least in large part, to the ineffectiveness of current evaluation practices in many areas of NLP. If encouraging numbers on widely-used benchmarks are frequently met with disappointing subsequent evidence, suggesting that good evaluation numbers don't translate to effective systems, it is rational to treat new encouraging results with extreme skepticism.

Better benchmarks and evaluation practices could help mitigate this by providing a firmer ground on which to make positive claims about system capacities.[8] In practice, research into more effective crowdsourcing and benchmark design (Kiela et al., 2021; Bowman and Dahl, 2021; Nangia et al., 2021; Gehrmann et al., 2021) and research into better statistical reporting and publication norms (Dodge et al., 2019; Card et al., 2020; Rogers and Augenstein, 2020; van Miltenburg et al., 2021) seem especially high-impact under this lens.

[8] Though Raji et al. (2021) point out ways in which better benchmarking alone is unlikely to be fully satisfactory.
Better Analysis

We can help address the time-lag issue discussed in Section 2.1 by building tooling to make it easier to adapt existing analysis techniques to new models seamlessly. Leaderboards that integrate conventional benchmarking with analysis can be especially helpful by making this largely automatic (Wang et al., 2018; Dua et al., 2019; Gehrmann et al., 2021; Ma et al., 2021).

More broadly, careful work on model robustness and interpretation, targeted at capable models, will be valuable in helping to forecast and mitigate the worst risks from future systems.

Better Forecasting

Scaling-law results in NLP (Hestness et al., 2017; Kaplan et al., 2020; Brown et al., 2020; Zhang et al., 2021) offer the promise that we can predict the performance of future larger-scale machine learning models on at least some metrics. This line of work is still nascent, and successes to date have largely focused on loss values rather than more interpretable measures of capability. Further developing these methods, as well as others that allow us to better forecast near-future progress, should be helpful. Better forecasting will provide a useful way to sanity-check future claims (DellaVigna et al., 2019) and will help improve the responsiveness of model analysis by enabling us to prepare analysis methods and datasets that anticipate future capabilities.
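To illustrate the kind of extrapolation this line of work aims at, the sketch below fits a saturating power law of the form L(N) = a * N^(-b) + c to loss measurements at several model sizes and extrapolates it to larger ones. The data here are synthetic stand-ins generated inside the script; real scaling-law studies fit curves of this general shape to measured losses, and extrapolations of this kind serve as a sanity check on claims, not a guarantee of downstream capability.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic "observed" losses at a few parameter counts, generated from a
# made-up trend plus noise; they stand in for real measurements purely for
# illustration.
sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
rng = np.random.default_rng(0)
losses = 0.5 + 2.0 * sizes ** -0.07 + rng.normal(0.0, 0.01, sizes.size)

def power_law(n, a, b, c):
    # Loss as a function of parameter count: a * n**-b + c, saturating at c.
    return a * n ** -b + c

params, _ = curve_fit(power_law, sizes, losses, p0=[2.0, 0.1, 0.5], maxfev=10_000)

# Extrapolate the fitted curve to model sizes larger than any we "trained".
for n in [1e10, 1e11]:
    print(f"predicted loss at {n:.0e} parameters: {power_law(n, *params):.3f}")
```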
7 Additional Related Work

While much of this paper discusses the state of the NLP literature, there are some related works that warrant emphasis as starting points for further reading: Bender and Koller (2020), Bender et al. (2021), and Raji et al. (2021) discuss the role of hype in driving bad outcomes from the development of language technology. Jin et al. (2021) and Rogers (2021) offer broader discussion of how to ensure that the net impact of near-future NLP deployments on the world is positive. Looking to the longer term, Bommasani et al. (2021, §4.9) provides an introduction to the AI risk and AI alignment literature from a perspective that emphasizes NLP and language. Welty et al. (2019), Linzen (2020), Ribeiro et al. (2020), Raji et al. (2021), Bowman and Dahl (2021), and Dehghani et al. (2021), among many others, discuss the challenges involved in designing evaluations that yield trustworthy and accurate depictions of the capabilities of ML models.

8 Conclusion

Like many research fields that have a tight connection to a consumer products industry, NLP has long struggled to avoid hype and inflated expectations about the capabilities of state-of-the-art technology. This remains a serious issue. However, this paper argues that our attempts to avoid hype have begun to overshoot: Instead of merely correcting overly optimistic claims about our capabilities, we have in some cases started to replace them with overly pessimistic claims.

Making misleading claims is generally a bad sign for the health and credibility of a scientific field, and the stakes are high: NLP technologies are implicated in a range of serious real-world harms, and plausible future elaborations of these technologies are potentially much more dangerous still. Our ability to mitigate existing harms will depend on our ability to make reliably credible claims about the limitations of our systems. Our ability to mitigate future harms will depend on our ability to accurately anticipate, recognize, agree upon, and report upon emerging capabilities. Both of these goals are seriously hampered by claims that current technologies are less capable than they in fact are.

Better evaluation, better tooling for model analysis, and better mechanisms for technical forecasting should all contribute to making these pessimistic claims easier to avoid or debunk. However, this problem is ultimately one of scientific communication, and to solve it fully, we will need to use the tools and norms of science to better police false or misleading claims.

Acknowledgements

The ideas and arguments in this work were developed through conversation (live or on Twitter) with more researchers than I can name, including Emiel van Miltenburg, Deb Raji, Paul Christiano, Geoffrey Irving, Rohin Shah, Jacob Steinhardt, Catherine Olsson, Nick Beckstead, Alex Tamkin, Daniel Dewey, Alex Ray, and Robin Jia. Anna Rogers, Owain Evans, Alex Wang, Jacob Steinhardt, Jason Phang, Jared Kaplan, Alex Cristia, Tal Linzen, and Jonathan Uesato provided feedback on drafts. Any errors or dubious rhetoric are my own.

This project has benefited from financial support by Eric and Wendy Schmidt (made by recommendation of the Schmidt Futures program) and Apple. This material is based upon work supported by the National Science Foundation under Grant Nos. 1922658 and 2046556. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not nec-
essarily reflect the views of the National Science Nick Bostrom. 2003. Ethical issues in advanced arti- Foundation. ficial intelligence. Science fiction and philosophy: from time travel to superintelligence, pages 277– 284. References Nick Bostrom. 2012. The superintelligent will: Moti- Stuart Armstrong, Anders Sandberg, and Nick vation and instrumental rationality in advanced arti- Bostrom. 2012. Thinking inside the box: Control- ficial agents. Minds and Machines, 22(2):71–85. ling and using an oracle AI. Minds and Machines, Samuel R. Bowman and George Dahl. 2021. What will 22(4):299–324. it take to fix benchmarking in natural language un- derstanding? In Proceedings of the 2021 Confer- Max Bartolo, Alastair Roberts, Johannes Welbl, Sebas- ence of the North American Chapter of the Associ- tian Riedel, and Pontus Stenetorp. 2020. Beat the ation for Computational Linguistics: Human Lan- AI: Investigating adversarial human annotation for guage Technologies, pages 4843–4855, Online. As- reading comprehension. Transactions of the Associ- sociation for Computational Linguistics. ation for Computational Linguistics, 8:662–678. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Emily Bender. 2019. The #benderrule: On naming the Subbiah, Jared D Kaplan, Prafulla Dhariwal, languages we study and why it matters. The Gradi- Arvind Neelakantan, Pranav Shyam, Girish Sastry, ent. Amanda Askell, Sandhini Agarwal, Ariel Herbert- Emily M. Bender, Timnit Gebru, Angelina McMillan- Voss, Gretchen Krueger, Tom Henighan, Rewon Major, and Shmargaret Shmitchell. 2021. On the Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, dangers of stochastic parrots: Can language models Clemens Winter, Chris Hesse, Mark Chen, Eric be too big? [parrot emoji]. In Proceedings of the Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, 2021 ACM Conference on Fairness, Accountability, Jack Clark, Christopher Berner, Sam McCandlish, and Transparency, FAccT ’21, page 610–623, New Alec Radford, Ilya Sutskever, and Dario Amodei. York, NY, USA. Association for Computing Machin- 2020. Language models are few-shot learners. In ery. Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Emily M. Bender and Alexander Koller. 2020. Climb- Inc. ing towards NLU: On meaning, form, and under- standing in the age of data. In Proceedings of the Miles Brundage, Shahar Avin, Jack Clark, Helen Toner, 58th Annual Meeting of the Association for Compu- Peter Eckersley, Ben Garfinkel, Allan Dafoe, Paul tational Linguistics, pages 5185–5198, Online. As- Scharre, Thomas Zeitzoff, Bobby Filar, et al. 2018. sociation for Computational Linguistics. The malicious use of artificial intelligence: Fore- casting, prevention, and mitigation. arXiv preprint Yuchen Bian, Jiaji Huang, Xingyu Cai, Jiahong Yuan, 1802.07228. and Kenneth Church. 2021. On attention redun- dancy: A comprehensive study. In Proceedings of Steven Cao, Victor Sanh, and Alexander Rush. 2021. the 2021 Conference of the North American Chap- Low-complexity probing via finding subnetworks. ter of the Association for Computational Linguistics: In Proceedings of the 2021 Conference of the North Human Language Technologies, pages 930–945, On- American Chapter of the Association for Computa- line. Association for Computational Linguistics. tional Linguistics: Human Language Technologies, pages 960–966, Online. Association for Computa- Yonatan Bitton, Gabriel Stanovsky, Roy Schwartz, and tional Linguistics. Michael Elhadad. 2021. 
Automatic generation of Dallas Card, Peter Henderson, Urvashi Khandelwal, contrast sets from scene graphs: Probing the compo- Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. sitional consistency of GQA. In Proceedings of the With little power comes great responsibility. In 2021 Conference of the North American Chapter of Proceedings of the 2020 Conference on Empirical the Association for Computational Linguistics: Hu- Methods in Natural Language Processing (EMNLP), man Language Technologies, pages 94–105, Online. pages 9263–9274, Online. Association for Computa- Association for Computational Linguistics. tional Linguistics. Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Brian Christian. 2020. The Alignment Problem: Ma- Venkatesh Saligrama, and Adam T Kalai. 2016. chine Learning and Human Values. WW Norton & Man is to computer programmer as woman is to Company. homemaker? debiasing word embeddings. Ad- vances in neural information processing systems, Andrew Critch and David Krueger. 2020. AI re- 29:4349–4357. search considerations for human existential safety (ARCHES). arXiv preprint 2006.04948. Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Allan Dafoe. 2018. AI governance: a research Bernstein, Jeannette Bohg, Antoine Bosselut, Emma agenda. Governance of AI Program, Future of Hu- Brunskill, et al. 2021. On the opportunities and risks manity Institute, University of Oxford: Oxford, UK, of foundation models. arXiv preprint 2108.07258. 1442:1443.
Ryan Dansby, Han Fang, Hao Ma, Chris Moghbel, Singh, Noah A. Smith, Sanjay Subramanian, Reut Umut Ozertem, Xiaochang Peng, Ves Stoyanov, Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. Sinong Wang, Fan Yang, and Kevin Zhang. 2020. 2020. Evaluating models’ local decision boundaries AI advances to better detect hate speech. Facebook via contrast sets. In Findings of the Association AI Blog. for Computational Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for Computational Jeffrey Dastin. 2018. Amazon scraps secret AI recruit- Linguistics. ing tool that showed bias against women. Reuters. Caleb Garling. 2015. Andrew Ng: Why ‘deep learning’ Mostafa Dehghani, Yi Tay, Alexey A Gritsenko, Zhe is a mandate for humans, not just machines. Reuters. Zhao, Neil Houlsby, Fernando Diaz, Donald Met- zler, and Oriol Vinyals. 2021. The benchmark lot- Sebastian Gehrmann, Tosin Adewumi, Karmanya Ag- tery. arXiv preprint 2107.07002. garwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Stefano DellaVigna, Devin Pope, and Eva Vivalt. Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D 2019. Predict science to improve science. Science, Dhole, et al. 2021. The GEM benchmark: Natu- 366(6464):428–429. ral language generation, its evaluation and metrics. arXiv preprint 2102.01672. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Katja Grace, John Salvatier, Allan Dafoe, Baobao deep bidirectional transformers for language under- Zhang, and Owain Evans. 2018. When will ai ex- standing. In Proceedings of the 2019 Conference ceed human performance? evidence from ai experts. of the North American Chapter of the Association Journal of Artificial Intelligence Research, 62:729– for Computational Linguistics: Human Language 754. Technologies, Volume 1 (Long and Short Papers), Yvette Graham, Barry Haddow, and Philipp Koehn. pages 4171–4186, Minneapolis, Minnesota. Associ- 2020. Statistical power and translationese in ma- ation for Computational Linguistics. chine translation evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Language Processing (EMNLP), pages 72–81, On- Schwartz, and Noah A. Smith. 2019. Show your line. Association for Computational Linguistics. work: Improved reporting of experimental results. In Proceedings of the 2019 Conference on Empirical Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, Methods in Natural Language Processing and the and Anca Dragan. 2016. Cooperative inverse rein- 9th International Joint Conference on Natural Lan- forcement learning. Advances in neural information guage Processing (EMNLP-IJCNLP), pages 2185– processing systems, 29:3909–3917. 2194, Hong Kong, China. Association for Computa- tional Linguistics. Rowan Hall Maudslay and Ryan Cotterell. 2021. Do syntactic probes probe syntax? experiments with Mengnan Du, Varun Manjunatha, Rajiv Jain, Ruchi jabberwocky probing. In Proceedings of the 2021 Deshpande, Franck Dernoncourt, Jiuxiang Gu, Tong Conference of the North American Chapter of the Sun, and Xia Hu. 2021. Towards interpreting and Association for Computational Linguistics: Human mitigating shortcut learning behavior of NLU mod- Language Technologies, pages 124–131, Online. As- els. In Proceedings of the 2021 Conference of the sociation for Computational Linguistics. 
North American Chapter of the Association for Com- putational Linguistics: Human Language Technolo- Hany Hassan, Anthony Aue, Chang Chen, Vishal gies, pages 915–929, Online. Association for Com- Chowdhary, Jonathan Clark, Christian Feder- putational Linguistics. mann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving hu- Dheeru Dua, Ananth Gottumukkala, Alon Talmor, man parity on automatic chinese to english news Sameer Singh, and Matt Gardner. 2019. Orb: An translation. arXiv preprint 1803.05567. open reading benchmark for comprehensive evalua- Pengcheng He, Xiaodong Liu, Jianfeng Gao, and tion of machine reading comprehension. In EMNLP Weizhu Chen. 2020. DeBERTa: Decoding- 2019 MRQA Workshop, page 147. enhanced BERT with disentangled attention. arXiv preprint 2006.03654. Sina Fazelpour and Zachary C Lipton. 2020. Algorith- mic fairness from a non-ideal perspective. In Pro- Dan Hendrycks, Collin Burns, Steven Basart, Andrew ceedings of the AAAI/ACM Conference on AI, Ethics, Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. and Society, pages 57–63. 2020. Aligning AI with shared human values. In Proceedings of ICLR. Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Joel Hestness, Sharan Narang, Newsha Ardalani, Gre- Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, gory Diamos, Heewoo Jun, Hassan Kianinejad, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Md Patwary, Mostofa Ali, Yang Yang, and Yanqi Daniel Khashabi, Kevin Lin, Jiangming Liu, Nel- Zhou. 2017. Deep learning scaling is predictable, son F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer empirically. arXiv preprint 1712.00409.
Christopher Hidey, Tuhin Chakrabarty, Tariq Alhindi, Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Siddharth Varia, Kriste Krstovski, Mona Diab, and 2020. Scaling laws for neural language models. Smaranda Muresan. 2020. DeSePtion: Dual se- arXiv preprint 2001.08361. quence prediction and adversarial examples for im- proved fact-checking. In Proceedings of the 58th An- Divyansh Kaushik, Douwe Kiela, Zachary C Lipton, nual Meeting of the Association for Computational and Wen-tau Yih. 2021. On the efficacy of adversar- Linguistics, pages 8593–8606, Online. Association ial data collection for question answering: Results for Computational Linguistics. from a large-scale randomized study. arXiv preprint 2106.00872. Nora Hollenstein, Federico Pirovano, Ce Zhang, Lena Jäger, and Lisa Beinborn. 2021. Multilingual lan- Zachary Kenton, Tom Everitt, Laura Weidinger, Ia- guage models predict human reading behavior. In son Gabriel, Vladimir Mikulik, and Geoffrey Irving. Proceedings of the 2021 Conference of the North 2021. Alignment of language agents. arXiv preprint American Chapter of the Association for Computa- 2103.14659. tional Linguistics: Human Language Technologies, pages 106–123, Online. Association for Computa- Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh tional Linguistics. Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vid- gen, Grusha Prasad, Amanpreet Singh, Pratik Ring- Dirk Hovy and Diyi Yang. 2021. The importance of shia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, modeling social factors of language: Theory and Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mo- practice. In Proceedings of the 2021 Conference of hit Bansal, Christopher Potts, and Adina Williams. the North American Chapter of the Association for 2021. Dynabench: Rethinking benchmarking in Computational Linguistics: Human Language Tech- NLP. In Proceedings of the 2021 Conference of nologies, pages 588–602, Online. Association for the North American Chapter of the Association for Computational Linguistics. Computational Linguistics: Human Language Tech- nologies, pages 4110–4124, Online. Association for Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Computational Linguistics. Joar Skalse, and Scott Garrabrant. 2019. Risks from learned optimization in advanced machine learning Samuel Läubli, Rico Sennrich, and Martin Volk. 2018. systems. arXiv preprint 1906.01820. Has machine translation achieved human parity? a case for document-level evaluation. In Proceed- Geoffrey Irving, Paul Christiano, and Dario Amodei. ings of the 2018 Conference on Empirical Methods 2018. AI safety via debate. arXiv preprint in Natural Language Processing, pages 4791–4796, 1805.00899. Brussels, Belgium. Association for Computational Linguistics. Myeongjun Jang and Thomas Lukasiewicz. 2021. NoiER: An approach for training more reliable fine- Hector J Levesque, Ernest Davis, and Leora Morgen- tuned downstream task models. arXiv preprint stern. 2011. The Winograd schema challenge. In 2110.02054. AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47. Robin Jia and Percy Liang. 2017. Adversarial exam- ples for evaluating reading comprehension systems. Tal Linzen. 2020. How can we accelerate progress to- In Proceedings of the 2017 Conference on Empiri- wards human-like linguistic generalization? In Pro- cal Methods in Natural Language Processing, pages ceedings of the 58th Annual Meeting of the Asso- 2021–2031, Copenhagen, Denmark. 
Association for ciation for Computational Linguistics, pages 5210– Computational Linguistics. 5217, Online. Association for Computational Lin- guistics. Zhijing Jin, Geeticka Chauhan, Brian Tse, Mrinmaya Sachan, and Rada Mihalcea. 2021. How good is Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu NLP? a sober look at NLP tasks through the lens Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. of social impact. In Findings of the Association 2020. Adversarial training for large neural language for Computational Linguistics: ACL-IJCNLP 2021, models. arXiv preprint 2004.08994. pages 3099–3113, Online. Association for Computa- tional Linguistics. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Luke Zettlemoyer, and Veselin Stoyanov. 2019. Bali, and Monojit Choudhury. 2020. The state and RoBERTa: A robustly optimized BERT pretraining fate of linguistic diversity and inclusion in the NLP approach. arXiv preprint 1907.11692. world. In Proceedings of the 58th Annual Meet- ing of the Association for Computational Linguistics, Zhiyi Ma, Kawin Ethayarajh, Tristan Thrush, Somya pages 6282–6293, Online. Association for Computa- Jain, Ledell Wu, Robin Jia, Christopher Potts, Ad- tional Linguistics. ina Williams, and Douwe Kiela. 2021. Dynaboard: An evaluation-as-a-service platform for holistic Jared Kaplan, Sam McCandlish, Tom Henighan, next-generation benchmarking. arXiv preprint Tom B Brown, Benjamin Chess, Rewon Child, Scott 2106.06052.
James L McClelland, Felix Hill, Maja Rudolph, Jason Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent Baldridge, and Hinrich Schütze. 2019. Extending disagreements in human textual inferences. Transac- machine language models toward human-level lan- tions of the Association for Computational Linguis- guage understanding. arXiv preprint 1912.05877. tics, 7:677–694. Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Right for the wrong reasons: Diagnosing syntactic Gardner, Christopher Clark, Kenton Lee, and Luke heuristics in natural language inference. In Proceed- Zettlemoyer. 2018. Deep contextualized word rep- ings of the 57th Annual Meeting of the Association resentations. In Proceedings of the 2018 Confer- for Computational Linguistics, pages 3428–3448, ence of the North American Chapter of the Associ- Florence, Italy. Association for Computational Lin- ation for Computational Linguistics: Human Lan- guistics. guage Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics. William Merrill, Yoav Goldberg, Roy Schwartz, and Noah A. Smith. 2021. Provable Limitations of Ac- Anita Peti-Stantić, Maja And̄el, Vedrana Gnjidić, Gor- quiring Meaning from Ungrounded Form: What dana Keresteš, Nikola Ljubešić, Irina Masnikosa, Will Future Language Models Understand? Trans- Mirjana Tonković, Jelena Tušek, Jana Willer-Gold, actions of the Association for Computational Lin- and Mateusz-Milan Stanojević. 2021. The croatian guistics, 9:1047–1060. psycholinguistic database: estimates for 6000 nouns, verbs, adjectives and adverbs. Behavior Research Nikita Nangia, Saku Sugawara, Harsh Trivedi, Alex Methods, pages 1–18. Warstadt, Clara Vania, and Samuel R. Bowman. 2021. What ingredients make for an effective crowd- Pouya Pezeshkpour, Sarthak Jain, Byron Wallace, and sourcing protocol for difficult NLU data collection Sameer Singh. 2021. An empirical comparison of in- tasks? In Proceedings of the 59th Annual Meet- stance attribution methods for NLP. In Proceedings ing of the Association for Computational Linguistics of the 2021 Conference of the North American Chap- and the 11th International Joint Conference on Nat- ter of the Association for Computational Linguistics: ural Language Processing (Volume 1: Long Papers), Human Language Technologies, pages 967–975, On- pages 1221–1235, Online. Association for Computa- line. Association for Computational Linguistics. tional Linguistics. Alec Radford, Karthik Narasimhan, Tim Salimans, and Nikita Nangia, Clara Vania, Rasika Bhalerao, and Ilya Sutskever. 2018. Improving language under- Samuel R. Bowman. 2020. CrowS-pairs: A chal- standing by generative pre-training. Unpublished lenge dataset for measuring social biases in masked ms. available through a link at https://blog. language models. In Proceedings of the 2020 Con- openai.com/language-unsupervised/. ference on Empirical Methods in Natural Language Inioluwa Deborah Raji, Emily Denton, Emily M Processing (EMNLP), pages 1953–1967, Online. As- Bender, Alex Hanna, and Amandalynne Paullada. sociation for Computational Linguistics. 2021. Ai and the everything in the whole wide world benchmark. In Proceedings of The Pandu Nayak. 2019. Understanding searches better ML-Retrospectives, Surveys & Meta-Analyses @ than ever before. The Keyword blog. NeurIPS 2020 Workshop. Yixin Nie, Adina Williams, Emily Dinan, Mohit Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Bansal, Jason Weston, and Douwe Kiela. 
2020. Ad- Percy Liang. 2016. SQuAD: 100,000+ questions for versarial NLI: A new benchmark for natural lan- machine comprehension of text. In Proceedings of guage understanding. In Proceedings of the 58th An- the 2016 Conference on Empirical Methods in Natu- nual Meeting of the Association for Computational ral Language Processing, pages 2383–2392, Austin, Linguistics, pages 4885–4901, Online. Association Texas. Association for Computational Linguistics. for Computational Linguistics. Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Be- Timothy Niven and Hung-Yu Kao. 2019. Probing neu- havioral testing of NLP models with CheckList. In ral network comprehension of natural language ar- Proceedings of the 58th Annual Meeting of the Asso- guments. In Proceedings of the 57th Annual Meet- ciation for Computational Linguistics, pages 4902– ing of the Association for Computational Linguis- 4912, Online. Association for Computational Lin- tics, pages 4658–4664, Florence, Italy. Association guistics. for Computational Linguistics. Anna Rogers. 2021. Changing the world by changing Stephen M Omohundro. 2008. The basic AI drives. In the data. In Proceedings of the 59th Annual Meet- Artificial General Intelligence 2008, pages 483–492. ing of the Association for Computational Linguistics IOS Press. and the 11th International Joint Conference on Nat- ural Language Processing (Volume 1: Long Papers), Toby Ord. 2020. The precipice: existential risk and the pages 2182–2194, Online. Association for Computa- future of humanity. Hachette Books. tional Linguistics.