Square One Bias in NLP: Towards a Multi-Dimensional Exploration of the Research Manifold
Sebastian Ruder* (Google Research) ruder@google.com
Ivan Vulić* (University of Cambridge) iv250@cam.ac.uk
Anders Søgaard* (University of Copenhagen) soegaard@di.ku.dk

* The authors contributed equally to this work.

Abstract

The prototypical NLP experiment trains a standard architecture on labeled English data and optimizes for accuracy, without accounting for other dimensions such as fairness, interpretability, or computational efficiency. We show through a manual classification of recent NLP research papers that this is indeed the case and refer to it as the square one experimental setup. We observe that NLP research often goes beyond the square one setup, e.g., focusing not only on accuracy, but also on fairness or interpretability, but typically only along a single dimension. Most work targeting multilinguality, for example, considers only accuracy; most work on fairness or interpretability considers only English; and so on. Such one-dimensionality of most research means we are only exploring a fraction of the NLP research search space. We provide historical and recent examples of how the square one bias has led researchers to draw false conclusions or make unwise choices, point to promising yet unexplored directions on the research manifold, and make practical recommendations to enable more multi-dimensional research. We open-source the results of our annotations to enable further analysis.[1]

[1] github.com/google-research/url-nlp

Figure 1: Visualization of contributions of ACL 2021 oral papers along 4 dimensions: multilinguality, fairness and bias, efficiency, and interpretability (indicated by color). Most work is clustered around the SQUARE ONE or along a single dimension.

1 Introduction

Our categorization of objects, say screwdrivers or NLP experiments, is heavily biased by early prototypes (Sherman, 1985; Das-Smaal, 1990). If the first 10 screwdrivers we see are red and for hexagon socket screws, this will bias what features we learn to associate with screwdrivers. Likewise, if the first 10 NLP experiments we see or conduct are in sentiment analysis, this will likely also bias how we think of NLP experiments in the future.

In this position paper, we postulate that we can meaningfully talk about the prototypical NLP experiment, and that the existence of such an experimental prototype steers and biases the research dynamics in our community. We will refer to this prototype as NLP's SQUARE ONE—and to the bias that follows from it, as the SQUARE ONE BIAS. We argue this bias manifests in a particular way: Since research is a creative endeavor, and researchers aim to push the research horizon, most research papers in NLP go beyond this prototype, but only along a single dimension at a time. Such dimensions might include multilinguality, efficiency, fairness, and interpretability, among others. The effect of the SQUARE ONE BIAS is to baseline novel research contributions, rewarding work that differs from the prototype in a concise, one-dimensional way.

We present several examples of this effect in practice. For instance, analyzing the contributions of ACL 2021 papers along 4 dimensions, we observe that most work is either clustered around the SQUARE ONE or makes a contribution along a single dimension (see Figure 1). Multilingual work typically disregards efficiency, fairness, and interpretability.
Work on efficient NLP typically only performs evaluations on English datasets, and disregards fairness and interpretability. Fairness and interpretability work is also mostly limited to English, and tends to disregard efficiency concerns.

We argue that the SQUARE ONE BIAS has several negative effects, most of which amount to the study of one of the above dimensions being biased by ignoring the others. Specifically, by focusing only on exploring the edges of the manifold, we are not able to identify the non-linear interactions between different research dimensions. We highlight several examples of such interactions in Section 3. Overall, we encourage a focus on combining multiple dimensions on the research manifold in future NLP research, and delve deeper into studying their (linear and non-linear) interactions.

Contributions. We first establish that we can meaningfully talk about the prototypical NLP experiment, through a series of annotation experiments and surveys. This prototype amounts to applying a standard architecture to an English dataset and optimizing for accuracy or F1. We discuss the impact of this prototype on our research community, and the bias it introduces. We then discuss the negative effects of this bias. We also list work that has taken steps to overcome the bias. Finally, we highlight blind spots and unexplored research directions and make practical recommendations, aiming to inspire the community towards conducting more 'multi-dimensional' research (see Figure 1).

2 Finding the Square One

In order to determine the existence and nature of a SQUARE ONE, we assess contemporary research in NLP along a number of different dimensions.

Dimensions. We identify potential themes in NLP research by reviewing the Call for Papers, publication statistics by area, and paper titles of recent NLP conferences. We focus on general dimensions that are not tied to a particular task and are applicable to any NLP application.[2] We furthermore focus on dimensions that are represented in a reasonable fraction of NLP papers (at least 5% of ACL 2021 oral papers).[3] Our final selection focuses on 4 dimensions along which papers may make research contributions: multilinguality, fairness and bias, efficiency, and interpretability. Compared to prior work that annotates the values of ML research papers (Birhane et al., 2021), we are not concerned with a paper's motivation but whether its practical contributions constitute a meaningful departure from the SQUARE ONE. For each paper, we annotate whether it makes a contribution along each dimension as well as the languages and metrics it employs for evaluation. We provide the detailed annotation guidelines in Appendix A.1.

ACL 2021 Oral Papers. We annotate the 461 papers that were presented orally at ACL 2021, a representative cross-section of the 779 papers accepted to the main conference. The general statistics from our classification of ACL 2021 papers are presented in Table 1. In addition, we highlight the statistics for the conference areas (tracks) corresponding to 3 of the 4 dimensions,[4] as well as for the top 5 areas with the most papers. We show statistics for the remaining areas in Appendix A.2. We additionally visualize their distribution in Figure 1. Overall, almost 70% of papers evaluate only on English, clearly highlighting a lack of language diversity in NLP (Bender, 2011; Joshi et al., 2020). Almost 40% of papers only evaluate using accuracy and/or F1, foregoing metrics that may shed light on other aspects of model behavior. 56.6% of papers do not study any of the four major dimensions that we investigated. We refer to this standard experimental setup—evaluating only on English and optimizing for accuracy or another performance metric without considering other dimensions—as the SQUARE ONE.

[2] For instance, we do not consider multimodality, as a task or model is inherently multimodal or not.
[3] Privacy, interactivity, and other emerging research areas are excluded based on this criterion.
[4] Unlike EACL 2021, NAACL-HLT 2021 and EMNLP 2021, ACL 2021 had no area associated with efficiency. To compensate for this, we annotated the 20 oral papers of the "Efficient Models in NLP" track at EMNLP 2021 (see Appendix A.3).
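To make the aggregation step concrete, the sketch below shows how per-paper annotations of the kind described above (languages, metrics, and dimension flags) could be tabulated into Table-1-style fractions. It is a minimal illustration under assumed field names and toy records; it is not the schema of the released annotations.

```python
# Minimal sketch: aggregating hypothetical per-paper annotations into
# Table-1-style statistics. The record fields and example entries below
# are illustrative only, not the released annotation format.

DIMENSIONS = ["multilinguality", "fairness", "efficiency", "interpretability"]

papers = [
    # languages evaluated on, evaluation metrics, dimensions contributed to
    {"languages": ["en"], "metrics": ["accuracy"], "dimensions": []},
    {"languages": ["en", "de", "sw"], "metrics": ["BLEU"], "dimensions": ["multilinguality"]},
    {"languages": ["en"], "metrics": ["accuracy", "inference time"],
     "dimensions": ["efficiency", "interpretability"]},
]

def fraction(predicate, records):
    """Share of records for which the predicate holds."""
    return sum(predicate(r) for r in records) / len(records)

stats = {
    "English only": fraction(lambda p: p["languages"] == ["en"], papers),
    "Accuracy/F1 only": fraction(lambda p: set(p["metrics"]) <= {"accuracy", "F1"}, papers),
    "No dimension (Square One)": fraction(lambda p: not p["dimensions"], papers),
    ">1 dimension": fraction(lambda p: len(p["dimensions"]) > 1, papers),
}
for dim in DIMENSIONS:
    stats[dim] = fraction(lambda p, d=dim: d in p["dimensions"], papers)

for name, value in stats.items():
    print(f"{name}: {value:.1%}")
```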
Area | # papers | English | Accuracy / F1 | Multilinguality | Fairness and bias | Efficiency | Interpretability | >1 dimension
ACL 2021 oral papers | 461 | 69.4% | 38.8% | 13.9% | 6.3% | 17.8% | 11.7% | 6.1%
MT and Multilinguality | 58 | 0.0% | 15.5% | 56.9% | 5.2% | 19.0% | 6.9% | 13.8%
Interpretability and Analysis | 18 | 88.9% | 27.8% | 5.6% | 0.0% | 5.6% | 66.7% | 5.6%
Ethics in NLP | 6 | 83.3% | 0.0% | 0.0% | 100.0% | 0.0% | 0.0% | 0.0%
Dialog and Interactive Systems | 42 | 90.5% | 21.4% | 0.0% | 9.5% | 23.8% | 2.4% | 2.4%
Machine Learning for NLP | 42 | 66.7% | 40.5% | 19.0% | 4.8% | 50.0% | 4.8% | 9.5%
Information Extraction | 36 | 80.6% | 91.7% | 8.3% | 0.0% | 25.0% | 5.6% | 8.3%
Resources and Evaluation | 35 | 77.1% | 42.9% | 5.7% | 8.6% | 5.7% | 14.3% | 5.7%
NLP Applications | 30 | 73.3% | 43.3% | 0.0% | 10.0% | 20.0% | 10.0% | 0.0%

Table 1: The number of ACL 2021 oral papers (top row) and of papers in each area (bottom rows) as well as the fractions that only evaluate on English, only use accuracy / F1, make contributions along one of four dimensions, and make contributions along more than a single dimension (from left to right).

Regarding work that moves from the SQUARE ONE, most papers make a contribution in terms of efficiency, followed by multilinguality. However, most papers that evaluate on multiple languages are part of the corresponding MT and Multilinguality track. Despite being an area receiving increasing attention (Blodgett et al., 2020), only 6.3% of papers evaluate the bias or fairness of a method. Overall, only 6.1% of papers make a contribution along two or more of these dimensions. Among these, joint contributions on both multilinguality and efficiency are the most common (see Figure 1). In fact, 22 of the 26 two-or-more-dimensional papers focus on efficiency, and 17 of these on the combination of multilinguality and efficiency. This means less than 1% of the ACL 2021 papers consider combinations of (two or more of) multilinguality, fairness and interpretability. We find this surprising, given these topics are considered among the most popular topics in the field.

Some areas have particularly concerning statistics. A large majority of research work in dialog (90.5%), summarization (91.7%), sentiment analysis (100%), and language grounding (100%) is done only on English; however, ways of expressing sentiment (Volkova et al., 2013; Yang and Eisenstein, 2017; Vilares et al., 2018) and visually grounded reasoning (Liu et al., 2021a; Yin et al., 2021) do vary across languages and cultures. Systems in the top tracks tend to evaluate efficiency, but in general do not consider fairness or interpretability of the proposed methods. Even the creation of new resources and evaluation sets (cf. Resources and Evaluation in Table 1) seems to be directed towards rewarding and enabling SQUARE ONE experiments; favoring English (77.1%), and with modest efforts on other dimensions. Notably, we only identified a single paper that considers three dimensions (Renduchintala et al., 2021). This paper considers gender bias (Fairness) in relation to speed-quality (Efficiency) trade-offs in multilingual machine translation (Multilinguality). Finally, we observe that best-paper award winning papers are not more likely to consider more than one of the four dimensions. Only 1 in 8 papers did; the best paper (Xu et al., 2021), like most two-dimensional ACL 2021 papers, considered multilinguality and efficiency.

Test-of-Time Award Recipients. Current papers provide us with a snapshot of actual current research practices, but the one-dimensionality of the best paper award winning papers at ACL 2021 suggests the SQUARE ONE BIAS also biases what we value in research, i.e., our perception of ideal research practices. This can also be seen in the papers that have received the ACL Test-of-Time Award in the last two years (Table 2). Seven in eight papers included empirical evaluations performed exclusively on English data. Six papers were exclusively concerned with optimizing for accuracy or F1.

Year | Paper | Language | Metric
1995 | Grosz et al. (1995) | English | n/a
1995 | Yarowsky (1995) | English | acc.
1996 | Berger et al. (1996) | English | acc.
1996 | Carletta (1996) | n/a | n/a
2010 | Baroni and Lenci (2010) | English | acc.
2010 | Turian et al. (2010) | English | F1
2011 | Taboada et al. (2011) | English | acc.
2011 | Ott et al. (2011) | English | acc./F1

Table 2: Test-of-Time Award 2021–22 papers.

Blackbox NLP Papers. Finally, we check if more multi-dimensional papers were presented at a workshop devoted to one of the above dimensions. The rationale is that if everyone at a workshop already explores one of these dimensions, including another may be a way to have an edge over other submissions. Unfortunately, this does not seem to be the case. We manually annotated the first 10 papers in the Blackbox NLP 2021 program[5] that were available as pre-prints at the time of submission. Of the 10 papers, only one included more than one dimension (Abdullah et al., 2021). This number aligns well with the overall statistics of ACL 2021 (6.1%). All the other Blackbox NLP papers only considered interpretability for English.

[5] https://blackboxnlp.github.io/
3 Square One Bias: Examples

In the following, we highlight both historical and recent examples touching on different aspects of research in NLP that illustrate how the gravitational attraction of the SQUARE ONE has led researchers to draw false conclusions, unconsciously steer standard research practices, or make unwise choices.

Architectural Biases. One pervasive bias in our models regards morphology. Many of our models were not designed with morphology in mind, arguably because of the poor/limited morphology of English. Traditional n-gram language models, for example, have been shown to perform much worse on languages with elaborate morphology due to data sparsity problems (Khudanpur, 2006; Bender, 2011; Gerz et al., 2018). Such models were nevertheless more commonly used than more linguistically informed alternatives such as factored language models (Bilmes and Kirchhoff, 2003) that represent words as sets of features. Word embeddings have been widely used, in part because pre-trained embeddings covered a large part of the English vocabulary. However, word embeddings are not useful for tasks that require access to morphemes, e.g., semantic tasks in morphologically rich languages (Avraham and Goldberg, 2017). While studies have demonstrated the ability of word embeddings to capture linguistic information in English, it remains unclear whether they capture the information needed for processing morphologically rich languages (Tsarfaty et al., 2020). A bias against morphologically rich languages is also apparent in our tokenization algorithms. Subword tokenization performs poorly on languages with reduplication (Vania and Lopez, 2017), while byte pair encoding does not align well with morphology (Bostrom and Durrett, 2020). Consequently, languages with productive morphological systems also are disadvantaged when shared 'language-universal' tokenizers are used in current large-scale multilingual language models (Ács, 2019; Rust et al., 2021) without any further vocabulary adaptation (Wang et al., 2020; Pfeiffer et al., 2021).

Another bias in our models relates to word order. In order for n-gram models to capture inter-word dependencies, words need to appear in the n-gram window. This will occur more frequently in languages with relatively fixed word order compared to languages with relatively free word order (Bender, 2011). Word embedding approaches such as skip-gram (Mikolov et al., 2013) adhere to the same window-based approach and thus have similar weaknesses for languages with relatively free word order. LSTMs are also sensitive to word order and perform worse on agreement prediction in Basque, which is both morphologically richer and has a relatively free word order (Ravfogel et al., 2018) compared to English (Linzen et al., 2016). They have also been shown to transfer worse to distant languages for dependency parsing compared to self-attention models (Ahmad et al., 2019). Such biases concerning word order are not only inherent in our models but also in our algorithms. A recent unsupervised parsing algorithm (Shen et al., 2018) has been shown to be biased towards right-branching structures and consequently performs better in right-branching languages like English (Dyer et al., 2019). While the recent generation of self-attention based architectures can be seen as inherently order-agnostic, recent methods focusing on making attention more efficient (Tay et al., 2020) introduce new biases into the models. Specifically, models that reduce the global attention to a local sliding window around the token (Liu et al., 2018; Child et al., 2019; Zaheer et al., 2020) may incur similar limitations as their n-gram and word embedding-based predecessors, performing worse on languages with relatively free word order.[6]

[6] An older work of Khudanpur (2006) argues that free word order is less of a problem as local order within phrases is relatively stable. However, it remains to be seen to what degree this affects current models.
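As a minimal illustration of the locality restriction discussed above, the following sketch builds a sliding-window attention mask: each position may only attend to tokens within a fixed radius, so dependencies between words that end up far apart in the surface order fall outside the window, much as they fall outside an n-gram context. This is a simplified, hypothetical construction, not the masking code of any particular efficient-attention model.

```python
# Sketch: a local sliding-window attention mask. mask[i][j] == 1 means
# position i may attend to position j; everything outside the window is
# masked out, analogous to an n-gram context of limited size.
def sliding_window_mask(seq_len: int, window: int) -> list[list[int]]:
    return [
        [1 if abs(i - j) <= window else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

# A head word and its dependent that are 5 positions apart cannot interact
# directly under a window of radius 2 -- the kind of long-range, order-flexible
# dependency that full self-attention handles trivially.
mask = sliding_window_mask(seq_len=8, window=2)
print(mask[0][5])  # 0: position 0 cannot attend to position 5
print(mask[0][2])  # 1: position 2 is inside the local window
```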
The singular focus on maximizing a performance metric such as accuracy introduces a bias towards models that are expressive enough to fit a given distribution well. Such models are typically black-box and learn highly non-linear relations that are generally not interpretable. Interpretability is generally studied in papers focusing exclusively on this topic; a recent example is BERTology (Rogers et al., 2020). Studies proposing more interpretable methods typically build on state-of-the-art methods (Weiss et al., 2018) and much work focuses on leveraging components such as attention for interpretability, which have not been designed with that goal in mind (Serrano and Smith, 2019; Wiegreffe and Pinter, 2019). As a result, researchers eschew directions focusing on models that are intrinsically more interpretable such as generalized additive models (Hastie and Tibshirani, 2017) and their extensions (Chang et al., 2021; Agarwal et al., 2021) but which have so far not been shown to match the performance of state-of-the-art methods.
As most datasets on which models are evaluated focus on sentences or short documents, state-of-the-art methods restrict their input size to around 512 tokens (Devlin et al., 2019) and leverage methods that are inefficient when scaling to longer documents. This has led to the emergence of a wide range of more efficient models (Tay et al., 2020), which, however, are rarely used as baseline methods in NLP. Similarly, the standard pretrain-fine-tune paradigm (Ruder et al., 2019) requires separate model copies to be stored for each task, and thus restricts work on multi-domain, multi-task, multi-lingual, multi-subpopulation methods that is enabled by more efficient and less resource-intensive (Schwartz et al., 2020) fine-tuning methods (Houlsby et al., 2019; Pfeiffer et al., 2020).

In sum, (what we typically consider as) standard baselines and state-of-the-art architectures favor languages with some characteristics over others and are optimized only for performance, which in turn propagates the SQUARE ONE BIAS: If researchers study aspects such as multilinguality, efficiency, fairness or interpretability, they are likely to do so with and for commonly used architectures (i.e., often termed 'standard architectures'), in order to reduce (too) many degrees of freedom in their empirical research. This is in many ways a sensible choice in order to maximize perceived relevance—and thereby, impact. However, as a result, multilinguality, efficiency, fairness, interpretability, and other research areas inherit the same biases, which typically slip under the radar.

Annotation Biases. Many NLP tasks can be cast differently and formulated in multiple ways, and differences may result in different annotation styles. Sentiment, for example, can be annotated at the document, sentence or word level (Socher et al., 2013). In machine comprehension, answers are sometimes assumed to be continuous, but Zhu et al. (2020) annotate discontinuous spans. In dependency parsing, different annotation guidelines can lead to very different downstream performance (Elming et al., 2013). How we annotate for a task may interact in complex ways with dimensions such as multilinguality, efficiency, fairness, and interpretability. The Universal Dependencies project (Nivre et al., 2020) is motivated by the observation that not all dependency formalisms are easily applicable to all languages. Aligning guidelines across languages has enabled researchers to ask interesting questions, but such attempts may limit the analysis of outlier languages (Croft et al., 2017). Other examples of annotation guidelines interacting with the above dimensions exist: Slight nuances in how annotation guidelines are formulated can lead to severe model biases (Hansen and Søgaard, 2021a) and hurt model fairness. In interpretability, we can use feature attribution methods and word-level annotations to evaluate interpretability methods applied to sequence classifiers (Rei and Søgaard, 2018), but we cannot directly use feature attribution methods to obtain rationales for sequence labelers. Annotation biases can also stem from the characteristics of the annotators, including their domain experience (McAuley and Leskovec, 2013), demographics (Jørgensen and Søgaard, 2021), or educational level (Al Kuwatly et al., 2020).

Annotation biases form an integral part of the SQUARE ONE BIAS: In NLP experiments, we commonly rely on the same pools of annotators, e.g., computer science students, professional linguists, or MTurk contributors. Sometimes these biases percolate through reuse of resources, e.g., through human or machine translation into new languages. Examples of such recycled resources include the ones introduced by Conneau et al. (2018) and Kassner et al. (2021), among others. Even when such translation-based resources resonate with syntax and semantics of the target language, and are fluent and natural, they still suffer from translation artefacts: they are often target-language surface realizations of source-language-based conceptual thinking (Majewska et al., 2022). As a consequence, evaluations of cross-lingual transfer models on such data typically overestimate their performance as properties such as word order and even the choice of lexical units are inherently biased by the source language (Vanmassenhove et al., 2021). Put simply, the choice of the data creation protocol, e.g., translation-based versus data collection directly in the target language (Clark et al., 2020), can yield profound differences in model performance for some groups, or may have serious impact on the interpretability or computational efficiency (e.g., sample efficiency) of our models.

Selection Biases. For many years, the English Penn Treebank (Marcus et al., 1994) was an integral part of the SQUARE ONE of NLP. This corpus consists entirely of newswire, i.e., articles and editorials from the Wall Street Journal, and arguably amplified the (existing) bias toward news articles. Since news articles tend to reflect a particular set of linguistic conventions, have a certain length, and are written by certain demographics, the bias toward news articles had an impact on the linguistic phenomena studied in NLP (Judge et al., 2006), led to under-representation of challenges with handling longer documents (Beltagy et al., 2021), and had impact on early papers in fairness (Hovy and Søgaard, 2015).
Note how such a bias may interact in non-linear ways with efficiency, i.e., efficient methods for shorter documents need not be efficient for longer ones, or fairness, i.e., what mitigates gender biases in news articles need not mitigate gender biases in product reviews.

Protocol Biases. In the prototypical NLP experiment, the dataset is in the English language. As a consequence, it is also standard protocol in multilingual NLP to use English as a source language in zero-shot cross-lingual transfer (Hu et al., 2020). In practice, there are generally better source languages than English (Ponti et al., 2018; Lin et al., 2019; Turc et al., 2021), and results are heavily biased by the common choice of English. For instance, effectiveness and efficiency of few-shot learning can be impacted by the choice of the source language (Pfeiffer et al., 2021; Zhao et al., 2021). English also dominates language pairs in machine translation, leading to lower performance for non-English translation directions (Fan et al., 2020), which are particularly important in multilingual societies. Again, such biases may interact in non-trivial ways with dimensions explored in NLP research: It is not inconceivable that there is an algorithm A that is more fair, interpretable or efficient than algorithm B on, say, English-to-Czech transfer or translation, but not on German-to-Czech or French-to-Czech.
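To make the protocol point concrete, the sketch below contrasts the default protocol (always transferring from English) with simply sweeping over candidate source languages. The function `train_and_evaluate` is a hypothetical stand-in for any zero-shot cross-lingual transfer experiment, not an API from a specific library.

```python
# Sketch: instead of hard-coding English as the transfer source (the Square One
# protocol), sweep over candidate source languages and keep the best per target.
def train_and_evaluate(source: str, target: str) -> float:
    """Hypothetical placeholder: train on `source`, evaluate zero-shot on `target`."""
    raise NotImplementedError  # task-specific experiment goes here

def best_source(target: str, candidates: list[str]) -> tuple[str, float]:
    scores = {src: train_and_evaluate(src, target)
              for src in candidates if src != target}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Square One protocol:   score = train_and_evaluate("en", "eu")
# Alternative protocol:  source, score = best_source("eu", ["en", "de", "tr", "fi", "es"])
```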
Organizational Biases. The above architectural, annotation, selection and protocol biases follow from the SQUARE ONE BIAS, but they also conserve the SQUARE ONE. If our go-to architectures, resources, and experimental setups are tailored to some languages over others, some objectives over others, and some research paradigms over others, it is considerably more work to explore new sets of languages, new objectives, or new protocols. The organizational biases we discuss below may also reinforce the SQUARE ONE BIAS.

The organization of our conferences and reviewing processes perpetuates certain biases. In particular, both during reviewing and for later presentation at conferences, papers are organized in areas. Upon submission, a paper is assigned to a single area. Reviewers are recruited for their expertise in a specific area, which they are associated with. Such a reviewing system incentivizes papers that make contributions to the chosen area, in order to appeal to the reviewers of this area, and implicitly penalizes papers that make contributions along multiple dimensions, as reviewers unfamiliar with the related areas may not appreciate their inter-disciplinary or inter-areal magnitude or value. Even new initiatives that seek to improve reviewing such as ARR[7] adhere to this area structure[8] and thus further the SQUARE ONE BIAS. A reviewing system that allows papers to be associated with multiple dimensions of research and that assigns reviewers with complementary expertise—similar to TACL[9]—would ameliorate this situation. Once a paper is accepted, presentations at conferences are organized by areas, limiting audiences in most cases to members of said area and thereby reducing the cross-pollination of ideas.[10]

Unexplored Areas of the Research Manifold. The discussed biases, which seem to originate from the SQUARE ONE BIAS, leave areas of the research manifold unexplored. Character-based language models are often reported to perform well for morphologically rich languages or on non-canonical text (Ma et al., 2020), but little is known about their fairness properties, and attribution-based interpretability methods have not been developed for such models. Annotation biases that stem from annotator demographics have been studied for English POS tagging (Hovy and Søgaard, 2015) or English summarization (Jørgensen and Søgaard, 2021), for example, but there has been very little research on such biases for other languages. While linguistic differences among genders are shared among some languages, genders differ in very different ways between other languages, e.g., Spanish and Swedish (Johannsen et al., 2015). We discuss important unexplored areas of the research manifold in §5, but first we briefly survey existing, multi-dimensional work, i.e., the counter-examples to our claim that NLP research is biased to one-dimensional extensions of the square one.

[7] aclrollingreview.org/
[8] www.2022.aclweb.org/callpapers
[9] transacl.org/index.php/tacl
[10] Another previously pervasive organizational bias, which is now fortunately being institutionally mitigated within the *ACL community through dedicated mentoring programs and improved reviewing guidelines, concerned penalizing research papers for their non-native writing style, where it was frequently suggested to the authors whose native language is not English to 'have their paper proofread by a native speaker'. As one hidden consequence, this attitude might have set a higher bar for the native speakers of minor and endangered languages working on such languages to put their research problems in the spotlight, that way also implicitly hindering more work of the entire community on these languages.
4 Counter-Examples

Most of the exceptions to our thesis about the 'one-dimensionality' of NLP research, in our classification of ACL 2021 Oral Papers, came from studies of efficiency in a multilingual context. Another example of this is Ahia et al. (2021), who show that for low-resource languages, weight pruning hurts performance on tail phenomena, but improves robustness to out-of-distribution shifts—this is not observed in the SQUARE ONE (high-resource) regime. There are also studies of fairness in a multilingual context. Huang et al. (2020), for example, show significant differences in social bias for multilingual hate speech systems across different languages. Zhao et al. (2020) study gender bias in multilingual word embeddings and cross-lingual transfer. González et al. (2020) also study gender bias, but by relying on reflexive pronominal constructions that do not exist in the English language; this is a good example of research that would not have been possible taking SQUARE ONE as our point of departure. Dayanik and Padó (2021) study adversarial debiasing in the context of a multilingual corpus and show some mitigation methods are more effective for some languages than others. Nozza (2021) studies multilingual toxicity classification and finds that models misinterpret non-hateful language-specific taboo interjections as hate speech in some languages. There has been much less work on other combinations of these dimensions, e.g., fairness and efficiency. Hansen and Søgaard (2021b) show that weight pruning has disparate effects on performance across demographics and that the min-max difference in group disparities is negatively correlated with model size. Renduchintala et al. (2021) observe that techniques to make inference more efficient, e.g., greedy search, quantization, or shallow decoder models, have a small impact on performance, but dramatically amplify gender bias. In a rare study of fairness and interpretability, Vig et al. (2020) propose a methodology to interpret which parts of a model are causally implicated in its behavior. They apply this methodology to analyze gender bias in pre-trained Transformers, finding that gender bias effects are sparse and concentrated in small parts of the network.

5 Blind Spots

We identified several under-explored areas on the research manifold. The common theme is a lack of studies of how dimensions such as multilinguality, fairness, efficiency, and interpretability interact. We now summarize some open problems that we believe are particularly important to address: (i) While recent work has begun to study the trade-off between efficiency and fairness, this interaction remains largely unexplored, especially outside of the empirical risk minimization regime; (ii) fairness and interpretability interact in potentially many ways, i.e., interpretability techniques may affect the fairness of the underlying models (Agarwal, 2021), but rationales may also, for example, be biased toward certain demographics in how they are presented (Feng and Boyd-Graber, 2018; González et al., 2021); (iii) finally, multilinguality and interpretability seem heavily underexplored. While there exist resources for English for evaluating interpretability methods against gold-standard human annotations, there are, to the best of our knowledge, no such resources for other languages.[11]

6 Contributing Factors

We finally highlight possible factors that may contribute to the SQUARE ONE BIAS.

Biases in NLP Education. We hypothesize that early exposure to predominantly English-centric experiment settings and tasks using a single performance metric may potentially propagate further to more advanced NLP research. To investigate to what extent this may be the case, we created a short questionnaire, which we sent to a geographically diverse set of teachers, including first authors from the last Teaching NLP workshop (Jurgens et al., 2021), asking about the first experiment that they presented in their NLP 101 course. We received 71 responses in total. Our first question was: The last time you taught an introductory NLP course, what was the first task you introduced the students to, or that they had to implement a model for? The relative majority of respondents (31.9%) said sentiment analysis, while 10.1% indicated topic classification.[12] More importantly, we also asked them about the language of the data used in the experiment, and what metric they optimized for.

[11] We again note that there are other possible dimensions, not studied in this work, that can expose more blind spots: e.g., fairness and multi-modality, multilinguality and privacy.
[12] The remaining responses included NER, language modeling, language identification, hate speech detection, etc.
More than three quarters of respondents reported that they used English language training and evaluation data and more than three quarters of the respondents asked the students to optimize for accuracy or F1. The choice of using English language datasets is particularly interesting in contrast to the native languages of the teachers and their students: In around two thirds of the classes, most students shared an L1 language that was not English; and less than a quarter of the teachers were L1 English speakers themselves. We extend this analysis to prototypical NLP experiments in undergraduate and graduate research based on five exemplary NLP textbooks, spanning 20 years (see Table 3). We observe that they, like the teachers in our survey, take the same point of departure: an English-language experiment where we use supervised learning techniques to optimize for a standard performance metric, e.g., perplexity or error. We note an important difference, however: While the first four books largely ignore issues relating to fairness, interpretability, and efficiency, the most recent NLP textbook in Table 3 (Eisenstein, 2019) discusses efficiency (briefly) and fairness (more thoroughly).

Year | Book | Language | Task
1999 | Manning and Schütze (1999) | English-French | Alignment
2009 | Jurafsky and Martin (2009) | English | LM
2009 | Bird et al. (2009) | English | Name cl.
2013 | Søgaard (2013) | English | Doc. cl.
2019 | Eisenstein (2019) | English | Doc. cl.

Table 3: First experiments in NLP textbooks. The objective across all books is optimizing for performance (AER, perplexity, or accuracy), rather than fairness, interpretability or efficiency.

Overall, we believe that teachers and educational materials should engage as early as possible with the multiple dimensions of NLP in order to sensitize researchers regarding these topics at the start of their careers.

Commercial Factors. For commercially focused NLP, there is an incentive to focus on settings with many users, such as major languages with many speakers. Similarly, as long as users do not mind using highly accurate black-box systems, researchers working on real-world applications can often afford to ignore dimensions such as interpretability and fairness.

Momentum of the Status Quo. The SQUARE ONE is well supported by existing infrastructure, resources, baselines, and experimental results. Any work that seeks to depart from the standard setting has to work harder, not only to build systems and resources in order to establish comparability with existing work but also needs to argue convincingly the importance of such work. We provide practical recommendations in the next section on how we can facilitate such research as a community.

7 Discussion

Is SQUARE ONE BIAS not the Flipside of Scientific Protocol? One potential argument for a community-wide SQUARE ONE BIAS is that when studying the impact of some technique t, say a novel regularization term, we want to compare some system with and without t, i.e., control for all other factors. To maximize impact and ease workload, it makes sense at first sight to stick to a system and experimental protocol that is familiar or well-studied. Always returning to the SQUARE ONE is a way to control for all other factors and relate new findings to known territory. The reason why this is only seemingly a good idea, however, is that the factors we study in NLP research may be non-linearly related. The fact that t makes for a positive net contribution under one set of circumstances does not imply that it would do so under different circumstances. This is illustrated most clearly by the research surveyed in §3. Ideally, we thus want to study the impact of t under as many circumstances as possible, but in the absence of resources to do so, it is a better (collective) search strategy to apply t to a random set of circumstances (within the space of relevant circumstances, of course).

Comment on Meta-Research. This paper can be seen in the line of other meta-research (Davis, 1971; Lakatos, 1976; Weber, 2006; Bloom et al., 2020) that seeks to analyze research practices and whether a scientific field is heading in the right direction. Within the NLP community, much of such recent discussion has focused on the nature of leaderboards and the practice of benchmarking (Ethayarajh and Jurafsky, 2020; Ma et al., 2021).

Should Each Paper Aim to Cover All Dimensions? We believe that a researcher should aspire to cover as many dimensions as possible with their research. Considering the dimensions of research encourages us to think more holistically about our research and its final impact. It may also accelerate progress as follow-up work will already be able to build on the insights of multi-dimensional analyses of new methods. It will also promote the cross-pollination of ideas, which will no longer be confined to their own sub-areas. While such multi-dimensional research may be cumbersome at the moment, we believe with the proper incentives and support, we can make it much more accessible.
Practical Recommendations. What can we do to incentivize and facilitate multi-dimensional research? i) Currently, most NLP models are evaluated by one or two performance metrics, but we believe dimensions such as fairness, efficiency, and interpretability need to become integral criteria for model evaluation, in line with recent proposals of more user-centric leaderboards (Ethayarajh and Jurafsky, 2020; Ma et al., 2021). This requires new tools, e.g., to evaluate environmental impact (Henderson et al., 2020), as well as new benchmarks, e.g., to evaluate fairness (Koh et al., 2021) or efficiency (Liu et al., 2021b). ii) We believe separate conference tracks (areas) lead to unfortunate silo effects and inhibit multi-dimensional research. Rather, we imagine conference submissions could provide a checklist with dimensions along which they make contributions, similar to the reproducibility checklist. Reviewers can be assigned based on their expertise corresponding to different dimensions. iii) Finally, we recommend awareness of research prototypes and encourage reviewers and chairs to prioritize research that departs from prototypes in multiple dimensions, in order to explore new areas of the research manifold.

8 Conclusion

We identified the prototypical NLP experiment through annotation experiments and surveys. We highlighted the associated SQUARE ONE BIAS, which encourages research to go beyond the prototype in a single dimension. We discussed the problems resulting from this bias, by studying the area statistics of a recent NLP conference as well as by discussing historic and recent examples. We finally pointed to under-explored research directions and made practical recommendations to inspire more multi-dimensional research in NLP.

Acknowledgments

Ivan Vulić is funded by the ERC PoC Grant MultiConvAI (no. 957356) and a research donation from Huawei. Anders Søgaard is sponsored by the Innovation Fund Denmark and a Google Focused Research Award. We thank Jacob Eisenstein for valuable feedback on a draft of this paper and the suggestion of the term 'square one'.

References

Badr Abdullah, Iuliia Zaitova, Tania Avgustinova, Bernd Möbius, and Dietrich Klakow. 2021. How familiar does that sound? Cross-lingual representational similarity analysis of acoustic word embeddings. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 407–419.

Judit Ács. 2019. Exploring BERT's Vocabulary. Blog Post.

Rishabh Agarwal, Levi Melnick, Nicholas Frosst, Xuezhou Zhang, Ben Lengerich, Rich Caruana, and Geoffrey Hinton. 2021. Neural additive models: Interpretable machine learning with neural nets. In Proceedings of NeurIPS 2021.

Sushant Agarwal. 2021. Trade-offs between fairness and interpretability in machine learning. In Proceedings of the IJCAI 2021 Workshop on AI for Social Good.

Orevaoghene Ahia, Julia Kreutzer, and Sara Hooker. 2021. The low-resource double bind: An empirical study of pruning for low-resource machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3316–3333.

Wasi Ahmad, Zhisong Zhang, Xuezhe Ma, Eduard Hovy, Kai-Wei Chang, and Nanyun Peng. 2019. On difficulties of cross-lingual transfer with order differences: A case study on dependency parsing. In Proceedings of NAACL-HLT 2019, pages 2440–2452.

Hala Al Kuwatly, Maximilian Wich, and Georg Groh. 2020. Identifying and measuring annotator bias based on annotators' demographic characteristics. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 184–190.

Oded Avraham and Yoav Goldberg. 2017. The interplay of semantics and morphology in word embeddings. In Proceedings of EACL 2017, pages 422–426.
Marco Baroni and Alessandro Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

Iz Beltagy, Arman Cohan, Hannaneh Hajishirzi, Sewon Min, and Matthew E. Peters. 2021. Beyond paragraphs: NLP for long sequences. In Proceedings of NAACL-HLT 2021: Tutorials, pages 20–24.

Emily M. Bender. 2011. On achieving and evaluating language-independence in NLP. Linguistic Issues in Language Technology, 6(3):1–26.
Adam L. Berger, Stephen A. Della Pietra, and Vin- Edith A. Das-Smaal. 1990. Biases in categorization. cent J. Della Pietra. 1996. A maximum entropy volume 68 of Advances in Psychology, pages 349– approach to natural language processing. Compu- 386. North-Holland. tational Linguistics, 22(1):39–71. Murray S Davis. 1971. That’s interesting! towards Jeff Bilmes and Katrin Kirchhoff. 2003. Factored a phenomenology of sociology and a sociology of language models and generalized parallel backoff. phenomenology. Philosophy of the social sciences, In Companion Volume of the Proceedings of HLT- 1(2):309–344. NAACL 2003-Short Papers, pages 4–6. Steven Bird, Ewan Klein, and Edward Loper. 2009. Erenay Dayanik and Sebastian Padó. 2021. Disentan- Natural Language Processing with Python: An- gling document topic and author gender in multiple alyzing Text with the Natural Language Toolkit. languages: Lessons for adversarial debiasing. In O’Reilly, Beijing. Proceedings of the Eleventh Workshop on Compu- tational Approaches to Subjectivity, Sentiment and Abeba Birhane, Pratyusha Kalluri, Dallas Card, Social Media Analysis, pages 50–61. William Agnew, Ravit Dotan, and Michelle Bao. 2021. The values encoded in machine learning re- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and search. CoRR, abs/2106.15590. Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Un- Su Lin Blodgett, Solon Barocas, Hal Daumé III, and derstanding. In Proceedings of NAACL-HLT 2019. Hanna M. Wallach. 2020. Language (technology) is power: A critical survey of "bias" in NLP. CoRR, Chris Dyer, Gábor Melis, and Phil Blunsom. 2019. A abs/2005.14050. critical analysis of biased parsers in unsupervised Nicholas Bloom, Charles I Jones, John Van Reenen, parsing. CoRR, abs/1909.09428. and Michael Webb. 2020. Are ideas getting harder to find? American Economic Review, 110(4):1104– Jacob Eisenstein. 2019. Introduction to Natural Lan- 44. guage Processing. Adaptive Computation and Ma- chine Learning series. MIT Press. Kaj Bostrom and Greg Durrett. 2020. Byte pair encod- ing is suboptimal for language model pretraining. In Jakob Elming, Anders Johannsen, Sigrid Klerke, Findings of the Association for Computational Lin- Emanuele Lapponi, Hector Martinez Alonso, and guistics: EMNLP 2020, pages 4617–4624. Anders Søgaard. 2013. Down-stream effects of tree-to-dependency conversions. In Proceedings of Jean Carletta. 1996. Assessing agreement on classifica- NAACL-HLT 2013, pages 617–626. tion tasks: The kappa statistic. Computational Lin- guistics, 22(2):249–254. Kawin Ethayarajh and Dan Jurafsky. 2020. Utility is in Chun-Hao Chang, Sarah Tan, Ben Lengerich, Anna the eye of the user: A critique of NLP leaderboards. Goldenberg, and Rich Caruana. 2021. How inter- In Proceedings of EMNLP 2020, pages 4846–4853. pretable and trustworthy are GAMs? In Proceed- ings of KDD 2021, pages 95–105. Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Man- Rewon Child, Scott Gray, Alec Radford, and Ilya deep Baines, Onur Celebi, Guillaume Wenzek, Sutskever. 2019. Generating long sequences with Vishrav Chaudhary, Naman Goyal, Tom Birch, Vi- sparse Transformers. CoRR, abs/1904.10509. taliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin. 2020. Beyond Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan English-Centric Multilingual Machine Translation. Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and arXiv preprint arXiv:2010.11125. Jennimaria Palomaki. 2020. 
TyDi QA: A bench- mark for information-seeking question answering in Shi Feng and Jordan L. Boyd-Graber. 2018. What can typologically diverse languages. Transactions of the AI do for me: Evaluating machine learning interpre- Association for Computational Linguistics, 8:454– tations in cooperative play. CoRR, abs/1810.09648. 470. Alexis Conneau, Ruty Rinott, Guillaume Lample, Ad- Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi ina Williams, Samuel Bowman, Holger Schwenk, Reichart, and Anna Korhonen. 2018. On the rela- and Veselin Stoyanov. 2018. XNLI: Evaluating tion between linguistic typology and (limitations of) cross-lingual sentence representations. In Proceed- multilingual language modeling. In Proceedings of ings of EMNLP 2018, pages 2475–2485. EMNLP 2018, pages 316–327. William Croft, Dawn Nordquist, Katherine Looney, Ana Valeria González, Maria Barrett, Rasmus Hvin- and Michael Regan. 2017. Linguistic typology gelby, Kellie Webster, and Anders Søgaard. 2020. meets universal dependencies. In Proceedings of the Type B reflexivization as an unambiguous testbed 15th International Workshop on Treebanks and Lin- for multilingual multi-task gender bias. In Proceed- guistic Theories (TLT15), pages 63–75. ings of EMNLP 2020, pages 2637–2648. 2349
Ana Valeria González, Anna Rogers, and Anders Sø- John Judge, Aoife Cahill, and Josef van Genabith. 2006. gaard. 2021. On the interaction of belief bias and ex- QuestionBank: Creating a corpus of parse-annotated planations. In Findings of the Association for Com- questions. In Proceedings of the 21st International putational Linguistics: ACL-IJCNLP 2021, pages Conference on Computational Linguistics and 44th 2930–2942. Annual Meeting of the Association for Computa- tional Linguistics, pages 497–504. Barbara J. Grosz, Aravind K. Joshi, and Scott Wein- stein. 1995. Centering: A framework for model- Dan Jurafsky and James H. Martin. 2009. Speech and ing the local coherence of discourse. Computational language processing : an introduction to natural Linguistics, 21(2):203–225. language processing, computational linguistics, and speech recognition. Pearson Prentice Hall, Upper Victor Petrén Bach Hansen and Anders Søgaard. 2021a. Saddle River, N.J. Guideline bias in Wizard-of-Oz dialogues. In Pro- ceedings of the 1st Workshop on Benchmarking: David Jurgens, Varada Kolhatkar, Lucy Li, Margot Past, Present and Future, pages 8–14. Mieskes, and Ted Pedersen, editors. 2021. Proceed- Victor Petrén Bach Hansen and Anders Søgaard. 2021b. ings of the Fifth Workshop on Teaching NLP. Is the lottery fair? evaluating winning tickets across demographics. In Findings of the Association Nora Kassner, Philipp Dufter, and Hinrich Schütze. for Computational Linguistics: ACL-IJCNLP 2021, 2021. Multilingual LAMA: Investigating knowl- pages 3214–3224. edge in multilingual pretrained language models. In Proceedings of EACL 2021, pages 3250–3258. Trevor J. Hastie and Robert J. Tibshirani. 2017. Gener- alized additive models. Routledge. Sanjeev P Khudanpur. 2006. Multilingual language modeling. Multilingual Speech Processing, page Peter Henderson, Jieru Hu, Joshua Romoff, Emma 169. Brunskill, Dan Jurafsky, and Joelle Pineau. 2020. Towards the systematic reporting of the energy and Pang Wei Koh, Shiori Sagawa, Henrik Mark- carbon footprints of machine learning. Journal of lund, Sang Michael Xie, Marvin Zhang, Akshay Machine Learning Research, 21(248):1–43. Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Eti- Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, enne David, Ian Stavness, Wei Guo, Berton A. Earn- Bruna Morrone, Quentin De Laroussilhe, Andrea shaw, Imran S. Haque, Sara Beery, Jure Leskovec, Gesmundo, Mona Attariyan, and Sylvain Gelly. Anshul Kundaje, Emma Pierson, Sergey Levine, 2019. Parameter-efficient transfer learning for NLP. Chelsea Finn, and Percy Liang. 2021. WILDS: A In Proceedings of ICML 2019, pages 2790–2799. benchmark of in-the-wild distribution shifts. In Pro- Dirk Hovy and Anders Søgaard. 2015. Tagging perfor- ceedings of ICML 2021. mance correlates with author age. In Proceedings of ACL-IJCNLP 2015, pages 483–488. Imre Lakatos. 1976. Falsification and the methodology of scientific research programmes. In Can theories Junjie Hu, Sebastian Ruder, Aditya Siddhant, Gra- be refuted?, pages 205–259. Springer. ham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A Massively Multilingual Multi- Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, task Benchmark for Evaluating Cross-lingual Gener- Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, alization. In Proceedings of ICML 2020. Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neu- Xiaolei Huang, Linzi Xing, Franck Dernoncourt, and big. 2019. 
Choosing Transfer Languages for Cross- Michael J. Paul. 2020. Multilingual Twitter cor- Lingual Learning. In Proceedings of ACL 2019. pus and baselines for evaluating demographic bias in hate speech recognition. In Proceedings of LREC Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2020, pages 1440–1448. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Anders Johannsen, Dirk Hovy, and Anders Søgaard. Association for Computational Linguistics, 4:521– 2015. Cross-lingual syntactic variation over age and 535. gender. In Proceedings of CoNLL 2015, pages 103– 112. Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Anna Jørgensen and Anders Søgaard. 2021. Evaluation Ponti, Siva Reddy, Nigel Collier, and Desmond El- of summarization systems across gender, age, and liott. 2021a. Visually Grounded Reasoning across race. In Proceedings of the Third Workshop on New Languages and Cultures. In Proceedings of EMNLP Frontiers in Summarization, pages 51–56. 2021. Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Bali, and Monojit Choudhury. 2020. The State and Goodrich, Ryan Sepassi, Łukasz Kaiser, and Noam Fate of Linguistic Diversity and Inclusion in the Shazeer. 2018. Generating Wikipedia by Summariz- NLP World. In Proceedings of ACL 2020. ing Long Sequences. In Proceedings of ICLR 2018. 2350
Xiangyang Liu, Tianxiang Sun, Junliang He, Lingling Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Se- Wu, Xinyu Zhang, Hao Jiang, Zhao Cao, Xuanjing bastian Ruder. 2020. MAD-X: An Adapter-Based Huang, and Xipeng Qiu. 2021b. Towards efficient Framework for Multi-Task Cross-Lingual Transfer. nlp: A standard evaluation and a strong baseline. In Proceedings of EMNLP 2020, pages 7654–7673. arXiv preprint arXiv:2110.07038. Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebas- Wentao Ma, Yiming Cui, Chenglei Si, Ting Liu, Shi- tian Ruder. 2021. UNKs everywhere: Adapting mul- jin Wang, and Guoping Hu. 2020. CharBERT: tilingual language models to new scripts. In Pro- Character-aware pre-trained language model. In ceedings of EMNLP 2021. Proceedings of the 28th International Conference on Computational Linguistics, pages 39–50, Barcelona, Edoardo Maria Ponti, Roi Reichart, Anna Korhonen, Spain (Online). International Committee on Compu- and Ivan Vulić. 2018. Isomorphic transfer of syntac- tational Linguistics. tic structures in cross-lingual NLP. In Proceedings of ACL 2018, pages 1531–1542. Zhiyi Ma, Kawin Ethayarajh, Tristan Thrush, Somya Shauli Ravfogel, Yoav Goldberg, and Francis Tyers. Jain, Ledell Wu, Robin Jia, Christopher Potts, Ad- 2018. Can LSTM learn to capture agreement? the ina Williams, and Douwe Kiela. 2021. Dynaboard: case of Basque. In Proceedings of the 2018 EMNLP An evaluation-as-a-service platform for holistic next- Workshop BlackboxNLP: Analyzing and Interpreting generation benchmarking. CoRR, abs/2106.06052. Neural Networks for NLP, pages 98–107. Olga Majewska, Evgeniia Razumovskaia, Marek Rei and Anders Søgaard. 2018. Zero-shot se- Edoardo Maria Ponti, Ivan Vulic, and Anna quence labeling: Transferring knowledge from sen- Korhonen. 2022. Cross-lingual dialogue dataset tences to tokens. In Proceedings of NAACL-HLT creation via outline-based generation. CoRR, 2018, pages 293–302. abs/2201.13405. Adithya Renduchintala, Denise Diaz, Kenneth Christopher D. Manning and Hinrich Schütze. 1999. Heafield, Xian Li, and Mona Diab. 2021. Gender Foundations of Statistical Natural Language Pro- bias amplification during speed-quality optimization cessing. The MIT Press, Cambridge, Mas- in neural machine translation. In Proceedings of sachusetts. ACL-IJCNLP 2021, pages 99–109. Mitchell Marcus, Grace Kim, Mary Ann Anna Rogers, Olga Kovaleva, and Anna Rumshisky. Marcinkiewicz, Robert MacIntyre, Ann Bies, 2020. A primer in BERTology: What we know Mark Ferguson, Karen Katz, and Britta Schasberger. about how BERT works. Transactions of the Associ- 1994. The Penn Treebank: Annotating predicate ar- ation for Computational Linguistics, 8:842–866. gument structure. In Human Language Technology: Proceedings of a Workshop. Sebastian Ruder, Matthew E Peters, Swabha Swayamdipta, and Thomas Wolf. 2019. Trans- Julian John McAuley and Jure Leskovec. 2013. From fer learning in natural language processing. In amateurs to connoisseurs: Modeling the evolution of Proceedings of the 2019 Conference of the North user expertise through online reviews. In Proceed- American Chapter of the Association for Computa- ings of WWW 2013, page 897–908. tional Linguistics: Tutorials, pages 15–18. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Dean. 2013. Distributed Representations of Words Ruder, and Iryna Gurevych. 2021. How Good is and Phrases and their Compositionality. In Proceed- Your Tokenizer? On the Monolingual Performance ings of NeurIPS 2013. 
of Multilingual Language Models. In Proceedings of ACL-IJCNLP 2021, pages 3118–3135. Joakim Nivre, Marie-Catherine de Marneffe, Filip Gin- Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren ter, Jan Hajič, Christopher D. Manning, Sampo Etzioni. 2020. Green AI. Communications of the Pyysalo, Sebastian Schuster, Francis Tyers, and ACM, 63(12):54–63. Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Sofia Serrano and Noah A. Smith. 2019. Is attention Proceedings of LREC 2020, pages 4034–4043. interpretable? In Proceedings of ACL 2019, pages 2931–2951. Debora Nozza. 2021. Exposing the Limits of Zero-shot Cross-lingual Hate Speech Detection. In Proceed- Yikang Shen, Zhouhan Lin, Chin-wei Huang, and ings of ACL 2021, pages 907–914. Aaron Courville. 2018. Neural Language Modeling by Jointly Learning Syntax and Lexicon. In Pro- Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T. Han- ceedings of ICLR 2018. cock. 2011. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of ACL Tracy Sherman. 1985. Categorization skills in infants. 2011, pages 309–319. Child Development, 56(6):1561–73. 2351