Square One Bias in NLP: Towards a Multi-Dimensional Exploration of the Research Manifold

              Sebastian Ruder, Ivan Vulić, Anders Søgaard
              Google Research, University of Cambridge, University of Copenhagen
            ruder@google.com                    iv250@cam.ac.uk                       soegaard@di.ku.dk

The prototypical NLP experiment trains a standard architecture on labeled English data and optimizes for accuracy, without accounting for other dimensions such as fairness, interpretability, or computational efficiency. We show through a manual classification of recent NLP research papers that this is indeed the case and refer to it as the square one experimental setup. We observe that NLP research often goes beyond the square one setup, e.g., focusing not only on accuracy, but also on fairness or interpretability, but typically only along a single dimension. Most work targeting multilinguality, for example, considers only accuracy; most work on fairness or interpretability considers only English; and so on. Such one-dimensionality of most research means we are only exploring a fraction of the NLP research search space. We provide historical and recent examples of how the square one bias has led researchers to draw false conclusions or make unwise choices, point to promising yet unexplored directions on the research manifold, and make practical recommendations to enable more multi-dimensional research. We open-source the results of our annotations to enable further analysis.¹

The authors contributed equally to this work.
¹ github.com/google-research/url-nlp

Figure 1: Visualization of contributions of ACL 2021 oral papers along 4 dimensions: multilinguality, fairness and bias, efficiency, and interpretability (indicated by color). Most work is clustered around the SQUARE ONE or along a single dimension.

1 Introduction

Our categorization of objects, say screwdrivers or NLP experiments, is heavily biased by early prototypes (Sherman, 1985; Das-Smaal, 1990). If the first 10 screwdrivers we see are red and for hexagon socket screws, this will bias what features we learn to associate with screwdrivers. Likewise, if the first 10 NLP experiments we see or conduct are in sentiment analysis, this will likely also bias how we think of NLP experiments in the future.

In this position paper, we postulate that we can meaningfully talk about the prototypical NLP experiment, and that the existence of such an experimental prototype steers and biases the research dynamics in our community. We will refer to this prototype as NLP’s SQUARE ONE, and to the bias that follows from it as the SQUARE ONE BIAS. We argue this bias manifests in a particular way: Since research is a creative endeavor, and researchers aim to push the research horizon, most research papers in NLP go beyond this prototype, but only along a single dimension at a time. Such dimensions might include multilinguality, efficiency, fairness, and interpretability, among others. The effect of the SQUARE ONE BIAS is to baseline novel research contributions, rewarding work that differs from the prototype in a concise, one-dimensional way.

We present several examples of this effect in practice. For instance, analyzing the contributions of ACL 2021 papers along 4 dimensions, we observe that most work is either clustered around the SQUARE ONE or makes a contribution along a single dimension (see Figure 1). Multilingual work typically disregards efficiency, fairness, and
interpretability. Work on efficient NLP typically only performs evaluations on English datasets, and disregards fairness and interpretability. Fairness and interpretability work is also mostly limited to English, and tends to disregard efficiency concerns.

We argue that the SQUARE ONE BIAS has several negative effects, most of which amount to the study of one of the above dimensions being biased by ignoring the others. Specifically, by focusing only on exploring the edges of the manifold, we are not able to identify the non-linear interactions between different research dimensions. We highlight several examples of such interactions in Section 3. Overall, we encourage future NLP research to combine multiple dimensions on the research manifold and to delve deeper into studying their (linear and non-linear) interactions.

Contributions. We first establish that we can meaningfully talk about the prototypical NLP experiment, through a series of annotation experiments and surveys. This prototype amounts to applying a standard architecture to an English dataset and optimizing for accuracy or F1. We discuss the impact of this prototype on our research community, and the bias it introduces. We then discuss the negative effects of this bias. We also list work that has taken steps to overcome the bias. Finally, we highlight blind spots and unexplored research directions and make practical recommendations, aiming to inspire the community towards conducting more ‘multi-dimensional’ research (see Figure 1).

2 Finding the Square One

In order to determine the existence and nature of a SQUARE ONE, we assess contemporary research in NLP along a number of different dimensions.

Dimensions. We identify potential themes in NLP research by reviewing the Call for Papers, publication statistics by area, and paper titles of recent NLP conferences. We focus on general dimensions that are not tied to a particular task and are applicable to any NLP application.² We furthermore focus on dimensions that are represented in a reasonable fraction of NLP papers (at least 5% of ACL 2021 oral papers).³ Our final selection focuses on 4 dimensions along which papers may make research contributions: multilinguality, fairness and bias, efficiency, and interpretability. Compared to prior work that annotates the values of ML research papers (Birhane et al., 2021), we are not concerned with a paper’s motivation but with whether its practical contributions constitute a meaningful departure from the SQUARE ONE. For each paper, we annotate whether it makes a contribution along each dimension as well as the languages and metrics it employs for evaluation. We provide the detailed annotation guidelines in Appendix A.1.

² For instance, we do not consider multimodality, as a task or model is inherently multimodal or not.
³ Privacy, interactivity, and other emerging research areas are excluded based on this criterion.

ACL 2021 Oral Papers. We annotate the 461 papers that were presented orally at ACL 2021, a representative cross-section of the 779 papers accepted to the main conference. The general statistics from our classification of ACL 2021 papers are presented in Table 1. In addition, we highlight the statistics for the conference areas (tracks) corresponding to 3 of the 4 dimensions⁴, as well as for the top 5 areas with the most papers. We show statistics for the remaining areas in Appendix A.2. We additionally visualize their distribution in Figure 1. Overall, almost 70% of papers evaluate only on English, clearly highlighting a lack of language diversity in NLP (Bender, 2011; Joshi et al., 2020). Almost 40% of papers only evaluate using accuracy and/or F1, forgoing metrics that may shed light on other aspects of model behavior. 56.6% of papers do not study any of the four major dimensions that we investigated. We refer to this standard experimental setup (evaluating only on English and optimizing for accuracy or another performance metric without considering other dimensions) as the SQUARE ONE.

⁴ Unlike EACL 2021, NAACL-HLT 2021 and EMNLP 2021, ACL 2021 had no area associated with efficiency. To compensate for this, we annotated the 20 oral papers of the “Efficient Models in NLP” track at EMNLP 2021 (see Appendix A.3).

Regarding work that moves away from the SQUARE ONE, most papers make a contribution in terms of efficiency, followed by multilinguality. However, most papers that evaluate on multiple languages are part of the corresponding MT and Multilinguality track. Despite being an area receiving increasing attention (Blodgett et al., 2020), only 6.3% of papers evaluate the bias or fairness of a method. Overall, only 6.1% of papers make a contribution along two or more of these dimensions. Among these, joint contributions on both multilinguality and efficiency are the most common (see Figure 1). In fact, 22 of the 26 two-or-more-dimensional papers focus on efficiency, and 17 of these on the combination

Area                             # papers   English   Accuracy / F1   Multilinguality    Fairness and bias   Efficiency   Interpretability   >1 dimension
 ACL 2021 oral papers               461      69.4%        38.8%            13.9%               6.3%            17.8%           11.7%             6.1%
 MT and Multilinguality             58        0.0%        15.5%            56.9%               5.2%            19.0%           6.9%              13.8%
 Interpretability and Analysis      18       88.9%        27.8%             5.6%               0.0%             5.6%           66.7%              5.6%
 Ethics in NLP                       6       83.3%         0.0%            0.0%               100.0%           0.0%            0.0%              0.0%
 Dialog and Interactive Systems     42       90.5%        21.4%             0.0%                9.5%           23.8%           2.4%              2.4%
 Machine Learning for NLP           42       66.7%        40.5%            19.0%               4.8%            50.0%           4.8%              9.5%
 Information Extraction             36       80.6%        91.7%             8.3%                0.0%           25.0%           5.6%              8.3%
 Resources and Evaluation           35       77.1%        42.9%             5.7%                8.6%            5.7%           14.3%             5.7%
 NLP Applications                   30       73.3%        43.3%             0.0%               10.0%           20.0%           10.0%             0.0%

Table 1: The number of ACL 2021 oral papers (top row) and of papers in each area (bottom rows) as well as the
fractions that only evaluate on English, only use accuracy / F1, make contributions along one of four dimensions,
and make contributions along more than a single dimension (from left to right).
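
As an illustration of how the columns of Table 1 relate to the per-paper annotations (contribution along each dimension, evaluation languages, evaluation metrics), the following is a minimal Python sketch. The `PaperAnnotation` schema, its field names, the language and metric codes, and the toy papers are all assumptions made for illustration; they are not the format of the released annotations, and the toy numbers are not results from the paper.

```python
from dataclasses import dataclass

# The four dimensions annotated in Section 2 (identifier spellings are assumed).
DIMENSIONS = ("multilinguality", "fairness_bias", "efficiency", "interpretability")


@dataclass(frozen=True)
class PaperAnnotation:
    """One annotated paper. Hypothetical schema, for illustration only."""
    languages: frozenset                  # evaluation languages, e.g. {"en"}
    metrics: frozenset                    # evaluation metrics, e.g. {"accuracy"}
    contributions: frozenset = frozenset()  # subset of DIMENSIONS


def summarize(papers):
    """Compute Table-1-style percentages for a list of PaperAnnotation objects."""
    n = len(papers)
    if n == 0:
        raise ValueError("need at least one annotated paper")

    def pct(count):
        return round(100.0 * count / n, 1)

    acc_f1 = {"accuracy", "f1"}
    stats = {
        "n_papers": n,
        # Papers that evaluate on English and nothing else.
        "english_only": pct(sum(p.languages == {"en"} for p in papers)),
        # Papers whose metrics are a non-empty subset of {accuracy, F1}.
        "accuracy_f1_only": pct(
            sum(bool(p.metrics) and p.metrics <= acc_f1 for p in papers)
        ),
        # Papers contributing along two or more dimensions.
        "more_than_one_dimension": pct(sum(len(p.contributions) > 1 for p in papers)),
        # Papers contributing along none of the four dimensions.
        "no_dimension": pct(sum(not p.contributions for p in papers)),
    }
    for dim in DIMENSIONS:
        stats[dim] = pct(sum(dim in p.contributions for p in papers))
    return stats


# Toy, made-up annotations (not real papers).
toy_papers = [
    PaperAnnotation(frozenset({"en"}), frozenset({"accuracy"})),
    PaperAnnotation(frozenset({"en"}), frozenset({"f1"}), frozenset({"efficiency"})),
    PaperAnnotation(frozenset({"en", "de"}), frozenset({"bleu"}),
                    frozenset({"multilinguality", "efficiency"})),
    PaperAnnotation(frozenset({"en"}), frozenset({"accuracy", "f1"}),
                    frozenset({"interpretability"})),
]
stats = summarize(toy_papers)
```

Keeping the contributions as a set per paper, rather than one label per paper, is what makes the ">1 dimension" column computable at all: a single-label scheme would hide exactly the multi-dimensional papers the analysis is looking for.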

of multilinguality and efficiency. This means less than 1% of the ACL 2021 papers consider combinations of (two or more of) multilinguality, fairness and interpretability. We find this surprising, given these topics are considered among the most popular topics in the field.

Some areas have particularly concerning statistics. A large majority of research work in dialog (90.5%), summarization (91.7%), sentiment analysis (100%), and language grounding (100%) is done only on English; however, ways of expressing sentiment (Volkova et al., 2013; Yang and Eisenstein, 2017; Vilares et al., 2018) and visually grounded reasoning (Liu et al., 2021a; Yin et al., 2021) do vary across languages and cultures. Systems in the top tracks tend to evaluate efficiency, but in general do not consider fairness or interpretability of the proposed methods. Even the creation of new resources and evaluation sets (cf. Resources and Evaluation in Table 1) seems to be directed towards rewarding and enabling SQUARE ONE experiments, favoring English (77.1%) and making only modest efforts on other dimensions. Notably, we only identified a single paper that considers three dimensions (Renduchintala et al., 2021). This paper considers gender bias (Fairness) in relation to speed-quality (Efficiency) trade-offs in multilingual machine translation (Multilinguality). Finally, we observe that best-paper award winning papers are not more likely to consider more than one of the four dimensions. Only 1 in 8 papers did; the best paper (Xu et al., 2021), like most two-dimensional ACL 2021 papers, considered multilinguality and efficiency.

Test-of-Time Award Recipients. Current papers provide us with a snapshot of actual current research practices, but the one-dimensionality of the best paper award winning papers at ACL 2021 suggests the SQUARE ONE BIAS also biases what we value in research, i.e., our perception of ideal research practices. This can also be seen in the papers that have received the ACL Test-of-Time Award in the last two years (Table 2). Seven in eight papers included empirical evaluations performed exclusively on English data. Six papers were exclusively concerned with optimizing for accuracy or F1.

Year   Paper                      Language   Metric
1995   Grosz et al. (1995)        English    n/a
1995   Yarowsky (1995)            English    acc.
1996   Berger et al. (1996)       English    acc.
1996   Carletta (1996)            n/a        n/a
2010   Baroni and Lenci (2010)    English    acc.
2010   Turian et al. (2010)       English    F1
2011   Taboada et al. (2011)      English    acc.
2011   Ott et al. (2011)          English    acc./F1

Table 2: Test-of-Time Award 2021-22 papers

Blackbox NLP Papers. Finally, we check if more multi-dimensional papers were presented at a workshop devoted to one of the above dimensions. The rationale is that if everyone at a workshop already explores one of these dimensions, including another may be a way to have an edge over other submissions. Unfortunately, this does not seem to be the case. We manually annotated the first 10 papers in the Blackbox NLP 2021 program⁵ that were available as pre-prints at the time of submission. Of the 10 papers, only one included more than one dimension (Abdullah et al., 2021). This number aligns well with the overall statistics of ACL 2021 (6.1%). All the other Blackbox NLP papers only considered interpretability for English.

⁵ https://blackboxnlp.github.io/

3 Square One Bias: Examples

In the following, we highlight both historical and recent examples touching on different aspects of research in NLP that illustrate how the gravitational

attraction of the SQUARE ONE has led researchers to draw false conclusions, unconsciously steer standard research practices, or make unwise choices.

Architectural Biases. One pervasive bias in our models regards morphology. Many of our models were not designed with morphology in mind, arguably because of the poor/limited morphology of English. Traditional n-gram language models, for example, have been shown to perform much worse on languages with elaborate morphology due to data sparsity problems (Khudanpur, 2006; Bender, 2011; Gerz et al., 2018). Such models were nevertheless more commonly used than more linguistically informed alternatives such as factored language models (Bilmes and Kirchhoff, 2003) that represent words as sets of features. Word embeddings have been widely used, in part because pre-trained embeddings covered a large part of the English vocabulary. However, word embeddings are not useful for tasks that require access to morphemes, e.g., semantic tasks in morphologically rich languages (Avraham and Goldberg, 2017).

While studies have demonstrated the ability of word embeddings to capture linguistic information in English, it remains unclear whether they capture the information needed for processing morphologically rich languages (Tsarfaty et al., 2020). A bias against morphologically rich languages is also apparent in our tokenization algorithms. Subword tokenization performs poorly on languages with reduplication (Vania and Lopez, 2017), while byte pair encoding does not align well with morphology (Bostrom and Durrett, 2020). Consequently, languages with productive morphological systems are also disadvantaged when shared ‘language-universal’ tokenizers are used in current large-scale multilingual language models (Ács, 2019; Rust et al., 2021) without any further vocabulary adaptation (Wang et al., 2020; Pfeiffer et al., 2021).

Another bias in our models relates to word order. In order for n-gram models to capture inter-word dependencies, words need to appear in the n-gram window. This will occur more frequently in languages with relatively fixed word order compared to languages with relatively free word order (Bender, 2011). Word embedding approaches such as skip-gram (Mikolov et al., 2013) adhere to the same window-based approach and thus have similar weaknesses for languages with relatively free word order. LSTMs are also sensitive to word order and perform worse on agreement prediction in Basque, which is both morphologically richer and has a relatively free word order (Ravfogel et al., 2018) compared to English (Linzen et al., 2016). They have also been shown to transfer worse to distant languages for dependency parsing compared to self-attention models (Ahmad et al., 2019). Such biases concerning word order are not only inherent in our models but also in our algorithms. A recent unsupervised parsing algorithm (Shen et al., 2018) has been shown to be biased towards right-branching structures and consequently performs better in right-branching languages like English (Dyer et al., 2019). While the recent generation of self-attention based architectures can be seen as inherently order-agnostic, recent methods focusing on making attention more efficient (Tay et al., 2020) introduce new biases into the models. Specifically, models that reduce the global attention to a local sliding window around the token (Liu et al., 2018; Child et al., 2019; Zaheer et al., 2020) may incur similar limitations as their n-gram and word embedding-based predecessors, performing worse on languages with relatively free word order.⁶

⁶ An older work of Khudanpur (2006) argues that free word order is less of a problem as local order within phrases is relatively stable. However, it remains to be seen to what degree this affects current models.

The singular focus on maximizing a performance metric such as accuracy introduces a bias towards models that are expressive enough to fit a given distribution well. Such models are typically black-box and learn highly non-linear relations that are generally not interpretable. Interpretability is generally studied in papers focusing exclusively on this topic; a recent example is BERTology (Rogers et al., 2020). Studies proposing more interpretable methods typically build on state-of-the-art methods (Weiss et al., 2018) and much work focuses on leveraging components such as attention for interpretability, which have not been designed with that goal in mind (Serrano and Smith, 2019; Wiegreffe and Pinter, 2019). As a result, researchers eschew directions focusing on models that are intrinsically more interpretable, such as generalized additive models (Hastie and Tibshirani, 2017) and their extensions (Chang et al., 2021; Agarwal et al., 2021), but which have so far not been shown to match the performance of state-of-the-art methods.

As most datasets on which models are evaluated focus on sentences or short documents, state-of-the-art methods restrict their input size to around 512 tokens (Devlin et al., 2019) and leverage methods that are inefficient when scaling to longer documents. This has led to the emergence of a wide range of more efficient models (Tay et al., 2020), which, however, are rarely used as baseline methods in NLP. Similarly, the standard pretrain-fine-tune paradigm (Ruder et al., 2019) requires separate model copies to be stored for each task, and thus restricts work on multi-domain, multi-task, multi-lingual, multi-subpopulation methods that is enabled by more efficient and less resource-intensive (Schwartz et al., 2020) fine-tuning methods (Houlsby et al., 2019; Pfeiffer et al., 2020).

In sum, (what we typically consider as) standard baselines and state-of-the-art architectures favor languages with some characteristics over others and are optimized only for performance, which in turn propagates the SQUARE ONE BIAS: If researchers study aspects such as multilinguality, efficiency, fairness or interpretability, they are likely to do so with and for commonly used architectures (i.e., often termed ‘standard architectures’), in order to reduce (too) many degrees of freedom in their empirical research. This is in many ways a sensible choice in order to maximize perceived relevance and, thereby, impact. However, as a result, multilinguality, efficiency, fairness, interpretability, and other research areas inherit the same biases, which typically slip under the radar.

Annotation Biases. Many NLP tasks can be cast differently and formulated in multiple ways, and differences may result in different annotation styles. Sentiment, for example, can be annotated at the document, sentence or word level (Socher et al., 2013). In machine comprehension, answers are sometimes assumed to be continuous, but Zhu et al. (2020) annotate discontinuous spans. In dependency parsing, different annotation guidelines can lead to very different downstream performance (Elming et al., 2013). How we annotate for a task may interact in complex ways with dimensions such as multilinguality, efficiency, fairness, and interpretability. The Universal Dependencies project (Nivre et al., 2020) is motivated by the observation that not all dependency formalisms are easily applicable to all languages. Aligning guidelines across languages has enabled researchers to ask interesting questions, but such attempts may limit the analysis of outlier languages (Croft et al., 2017).

Other examples of annotation guidelines interacting with the above dimensions exist: Slight nuances in how annotation guidelines are formulated can lead to severe model biases (Hansen and Søgaard, 2021a) and hurt model fairness. In interpretability, we can use feature attribution methods and word-level annotations to evaluate interpretability methods applied to sequence classifiers (Rei and Søgaard, 2018), but we cannot directly use feature attribution methods to obtain rationales for sequence labelers. Annotation biases can also stem from the characteristics of the annotators, including their domain experience (McAuley and Leskovec, 2013), demographics (Jørgensen and Søgaard, 2021), or educational level (Al Kuwatly et al., 2020).

Annotation biases form an integral part of the SQUARE ONE BIAS: In NLP experiments, we commonly rely on the same pools of annotators, e.g., computer science students, professional linguists, or MTurk contributors. Sometimes these biases percolate through reuse of resources, e.g., through human or machine translation into new languages. Examples of such recycled resources include the ones introduced by Conneau et al. (2018) and Kassner et al. (2021), among others. Even when such translation-based resources resonate with syntax and semantics of the target language, and are fluent and natural, they still suffer from translation artefacts: they are often target-language surface realizations of source-language-based conceptual thinking (Majewska et al., 2022). As a consequence, evaluations of cross-lingual transfer models on such data typically overestimate their performance as properties such as word order and even the choice of lexical units are inherently biased by the source language (Vanmassenhove et al., 2021). Put simply, the choice of the data creation protocol, e.g., translation-based versus data collection directly in the target language (Clark et al., 2020), can yield profound differences in model performance for some groups, or may have serious impact on the interpretability or computational efficiency (e.g., sample efficiency) of our models.

Selection Biases. For many years, the English Penn Treebank (Marcus et al., 1994) was an integral part of the SQUARE ONE of NLP. This corpus consists entirely of newswire, i.e., articles and editorials from the Wall Street Journal, and arguably amplified the (existing) bias toward news articles. Since news articles tend to reflect a particular set of linguistic conventions, have a certain length, and are written by certain demographics, the bias toward news articles had an impact on the linguistic phenomena studied in NLP (Judge et al., 2006), led
to under-representation of challenges with handling longer documents (Beltagy et al., 2021), and had an impact on early papers in fairness (Hovy and Søgaard, 2015). Note how such a bias may interact in non-linear ways with efficiency, i.e., efficient methods for shorter documents need not be efficient for longer ones, or fairness, i.e., what mitigates gender biases in news articles need not mitigate gender biases in product reviews.

Protocol Biases. In the prototypical NLP experiment, the dataset is in the English language. As a consequence, it is also standard protocol in multilingual NLP to use English as a source language in zero-shot cross-lingual transfer (Hu et al., 2020). In practice, there are generally better source languages than English (Ponti et al., 2018; Lin et al., 2019; Turc et al., 2021), and results are heavily biased by the common choice of English. For instance, effectiveness and efficiency of few-shot learning can be impacted by the choice of the source language (Pfeiffer et al., 2021; Zhao et al., 2021). English also dominates language pairs in machine translation, leading to lower performance for non-English translation directions (Fan et al., 2020), which are particularly important in multilingual societies. Again, such biases may interact in non-trivial ways with dimensions explored in NLP research: It is not inconceivable that there is an algorithm A that is more fair, interpretable or efficient than algorithm B on, say, English-to-Czech transfer or translation, but not on German-to-Czech or French-to-Czech.

Organizational Biases. The above architectural, annotation, selection and protocol biases follow from the SQUARE ONE BIAS, but they also conserve the SQUARE ONE. If our go-to architectures,

papers that make contributions to the chosen area, in order to appeal to the reviewers of this area and implicitly penalizes papers that make contributions along multiple dimensions, as reviewers unfamiliar with the related areas may not appreciate their inter-disciplinary or inter-areal magnitude or value. Even new initiatives that seek to improve reviewing such as ARR⁷ adhere to this area structure⁸ and thus further the SQUARE ONE BIAS. A reviewing system that allows papers to be associated with multiple dimensions of research and that assigns reviewers with complementary expertise, similar to TACL⁹, would ameliorate this situation. Once a paper is accepted, presentations at conferences are organized by areas, limiting audiences in most cases to members of said area and thereby reducing the cross-pollination of ideas.¹⁰

Unexplored Areas of the Research Manifold. The discussed biases, which seem to originate from the SQUARE ONE BIAS, leave areas of the research manifold unexplored. Character-based language models are often reported to perform well for morphologically rich languages or on non-canonical text (Ma et al., 2020), but little is known about their fairness properties, and attribution-based interpretability methods have not been developed for such models. Annotation biases that stem from annotator demographics have been studied for English POS tagging (Hovy and Søgaard, 2015) or English summarization (Jørgensen and Søgaard, 2021), for example, but there has been very little research on such biases for other languages. While linguistic differences among genders are shared among some languages, genders differ in very different ways between other languages, e.g., Spanish and Swedish (Johannsen et al., 2015). We dis-
resources, and experimental setups are tailored to       cuss important unexplored areas of the research
some languages over others, some objectives over         manifold in §5, but first we briefly survey existing,
others, and some research paradigms over others,         multi-dimensional work, i.e., the counter-examples
it is considerably more work to explore new sets of         7
languages, new objectives, or new protocols. The            8
organizational biases we discuss below may also              9
reinforce the S QUARE O NE B IAS.                              Another previously pervasive organizational bias, which
                                                         is now fortunately being institutionally mitigated within the
    The organization of our conferences and review-      *ACL community through dedicated mentoring programs and
ing processes perpetuates certain biases. In par-        improved reviewing guidelines, concerned penalizing research
ticular, both during reviewing and for later pre-        papers for their non-native writing style, where it was fre-
                                                         quently suggested to the authors whose native language is not
sentation at conferences, papers are organized in        English to ‘have their paper proofread by a native speaker’. As
areas. Upon submission, a paper is assigned to           one hidden consequence, this attitude might have set a higher
a single area. Reviewers are recruited for their         bar for the native speakers of minor and endangered languages
                                                         working on such languages to put their research problems in
expertise in a specific area, which they are associ-     the spotlight, that way also implicitly hindering more work of
ated with. Such a reviewing system incentivizes          the entire community on these languages.

to our claim that NLP research is biased to one-dimensional extensions of the square one.

4   Counter-Examples

Most of the exceptions to our thesis about the ‘one-dimensionality’ of NLP research, in our classification of ACL 2021 Oral Papers, came from studies of efficiency in a multilingual context. Another example of this is Ahia et al. (2021), who show that for low-resource languages, weight pruning hurts performance on tail phenomena, but improves robustness to out-of-distribution shifts; this is not observed in the SQUARE ONE (high-resource) regime. There are also studies of fairness in a multilingual context. Huang et al. (2020), for example, show significant differences in social bias for multilingual hate speech systems across different languages. Zhao et al. (2020) study gender bias in multilingual word embeddings and cross-lingual transfer. González et al. (2020) also study gender bias, but by relying on reflexive pronominal constructions that do not exist in the English language; this is a good example of research that would not have been possible taking SQUARE ONE as our point of departure. Dayanik and Padó (2021) study adversarial debiasing in the context of a multilingual corpus and show some mitigation methods are more effective for some languages than for others. Nozza (2021) studies multilingual toxicity classification and finds that models misinterpret non-hateful language-specific taboo interjections as hate speech in some languages. There has been much less work on other combinations of these dimensions, e.g., fairness and efficiency. Hansen and Søgaard (2021b) show that weight pruning has disparate effects on performance across demographics and that the min-max difference in group disparities is negatively correlated with model size. Renduchintala et al. (2021) observe that techniques to make inference more efficient, e.g., greedy search, quantization, or shallow decoder models, have a small impact on performance, but dramatically amplify gender bias. In a rare study of fairness and interpretability, Vig et al. (2020) propose a methodology to interpret which parts of a model are causally implicated in its behavior. They apply this methodology to analyze gender bias in pre-trained Transformers, finding that gender bias effects are sparse and concentrated in small parts of the network.

5   Blind Spots

We identified several under-explored areas on the research manifold. The common theme is a lack of studies of how dimensions such as multilinguality, fairness, efficiency, and interpretability interact. We now summarize some open problems that we believe are particularly important to address: (i) While recent work has begun to study the trade-off between efficiency and fairness, this interaction remains largely unexplored, especially outside of the empirical risk minimization regime; (ii) fairness and interpretability interact in potentially many ways, i.e., interpretability techniques may affect the fairness of the underlying models (Agarwal, 2021), but rationales may also, for example, be biased toward certain demographics in how they are presented (Feng and Boyd-Graber, 2018; González et al., 2021); (iii) finally, multilinguality and interpretability seem heavily underexplored. While there exist resources for English for evaluating interpretability methods against gold-standard human annotations, there are, to the best of our knowledge, no such resources for other languages.[11]

[11] We again note that there are other possible dimensions, not studied in this work, that can expose more blind spots: e.g., fairness and multi-modality, multilinguality and privacy.

6   Contributing Factors

We finally highlight possible factors that may contribute to the SQUARE ONE BIAS.

Biases in NLP Education. We hypothesize that early exposure to predominantly English-centric experiment settings and tasks using a single performance metric may propagate further to more advanced NLP research. To investigate to what extent this may be the case, we created a short questionnaire, which we sent to a geographically diverse set of teachers, including first authors from the last Teaching NLP workshop (Jurgens et al., 2021), asking about the first experiment that they presented in their NLP 101 course. We received 71 responses in total. Our first question was: “The last time you taught an introductory NLP course, what was the first task you introduced the students to, or that they had to implement a model for?” The relative majority of respondents (31.9%) said sentiment analysis, while 10.1% indicated topic classification (the remaining responses included NER, language modeling, language identification, hate speech detection, etc.). More importantly, we also asked them about the language of the data used in the

experiment, and what metric they optimized for. More than three quarters of respondents reported that they used English-language training and evaluation data and more than three quarters of the respondents asked the students to optimize for accuracy or F1. The choice of using English-language datasets is particularly interesting in contrast to the native languages of the teachers and their students: In around two thirds of the classes, most students shared an L1 language that was not English; and fewer than a quarter of the teachers were L1 English speakers themselves. We extend this analysis to prototypical NLP experiments in undergraduate and graduate research based on five exemplary NLP textbooks, spanning 20 years (see Table 3). We observe that they, like the teachers in our survey, take the same point of departure: an English-language experiment where we use supervised learning techniques to optimize for a standard performance metric, e.g., perplexity or error. We note an important difference, however: While the first four books largely ignore issues relating to fairness, interpretability, and efficiency, the most recent NLP textbook in Table 3 (Eisenstein, 2019) discusses efficiency (briefly) and fairness (more thoroughly). Overall, we believe that teachers and educational materials should engage as early as possible with the multiple dimensions of NLP in order to sensitize researchers regarding these topics at the start of their careers.

Year   Book                         Language         Task
1999   Manning and Schütze (1999)   English-French   Alignment
2009   Jurafsky and Martin (2009)   English          LM
2009   Bird et al. (2009)           English          Name cl.
2013   Søgaard (2013)               English          Doc. cl.
2019   Eisenstein (2019)            English          Doc. cl.

Table 3: First experiments in NLP textbooks. The objective across all books is optimizing for performance (AER, perplexity, or accuracy), rather than fairness, interpretability or efficiency.

Commercial Factors. For commercially focused NLP, there is an incentive to focus on settings with many users, such as major languages with many speakers. Similarly, as long as users do not mind using highly accurate black-box systems, researchers working on real-world applications can often afford to ignore dimensions such as interpretability and fairness.

Momentum of the Status Quo. The SQUARE ONE is well supported by existing infrastructure, resources, baselines, and experimental results. Any work that seeks to depart from the standard setting has to work harder, not only to build systems and resources in order to establish comparability with existing work but also to argue convincingly for the importance of such work. We provide practical recommendations in the next section on how we can facilitate such research as a community.

7   Discussion

Is SQUARE ONE BIAS not the Flipside of Scientific Protocol? One potential argument for a community-wide SQUARE ONE BIAS is that when studying the impact of some technique t, say a novel regularization term, we want to compare some system with and without t, i.e., control for all other factors. To maximize impact and ease workload, it makes sense at first sight to stick to a system and experimental protocol that is familiar or well-studied. Always returning to the SQUARE ONE is a way to control for all other factors and relate new findings to known territory. The reason why this is only seemingly a good idea, however, is that the factors we study in NLP research may be non-linearly related. The fact that t makes for a positive net contribution under one set of circumstances does not imply that it would do so under different circumstances. This is illustrated most clearly by the research surveyed in §3. Ideally, we thus want to study the impact of t under as many circumstances as possible, but in the absence of resources to do so, it is a better (collective) search strategy to apply t to a random set of circumstances (within the space of relevant circumstances, of course).

Comment on Meta-Research. This paper can be seen in the line of other meta-research (Davis, 1971; Lakatos, 1976; Weber, 2006; Bloom et al., 2020) that seeks to analyze research practices and whether a scientific field is heading in the right direction. Within the NLP community, much of such recent discussion has focused on the nature of leaderboards and the practice of benchmarking (Ethayarajh and Jurafsky, 2020; Ma et al., 2021).

Should Each Paper Aim to Cover All Dimensions? We believe that a researcher should aspire to cover as many dimensions as possible with their research. Considering the dimensions of research encourages us to think more holistically about our research and its final impact. It may also accelerate progress as follow-up work will already be able to build on the insights of multi-dimensional analyses of new methods. It will also promote the
cross-pollination of ideas, which will no longer be     valuable feedback on a draft of this paper and the
confined to their own sub-areas. While such multi-      suggestion of the term ‘square one’.
dimensional research may be cumbersome at the
moment, we believe with the proper incentives and
