Mind Your Inflections!
Improving NLP for Non-Standard English with Base-Inflection Encoding

Samson Tan (Salesforce Research; National University of Singapore)
Shafiq Joty (Salesforce Research; Nanyang Technological University)
Lav R. Varshney (Salesforce Research)
Min-Yen Kan (National University of Singapore)
{samson.tan,sjoty,lvarshney}@salesforce.com, kanmy@comp.nus.edu.sg

arXiv:2004.14870v1 [cs.CL] 30 Apr 2020

Abstract

Morphological inflection is a process of word formation where base words are modified to express different grammatical categories such as tense, case, voice, person, or number. World Englishes, such as Colloquial Singapore English (CSE) and African American Vernacular English (AAVE), differ from Standard English dialects in inflection use. Although comprehension by human readers is usually unimpaired by non-standard inflection use, NLP systems are not so robust. We introduce a new Base-Inflection Encoding of English text that is achieved by combining linguistic and statistical techniques. Fine-tuning pre-trained NLP models for downstream tasks under this novel encoding achieves robustness to non-standard inflection use while maintaining performance on Standard English examples. Models using this encoding also generalize better to non-standard dialects without explicit training. We suggest metrics to evaluate tokenizers, and extensive model-independent analyses demonstrate the efficacy of the encoding when used together with data-driven subword tokenizers.

1 Introduction

Large-scale neural models have proven successful at a wide range of natural language processing (NLP) tasks but are susceptible to amplifying discrimination against minority linguistic communities (Hovy and Spruit, 2016; Tan et al., 2020) due to selection bias in the training data and model overamplification (Shah et al., 2019).

Most datasets implicitly assume a distribution of perfect Standard English speakers, but this does not accurately reflect the majority of the global English-speaking population, who are either second language or non-standard dialect speakers (Crystal, 2003; Eberhard et al., 2019). This is particularly concerning because these World Englishes differ at the lexical, morphological, and syntactic levels (Kachru et al., 2009); over-sensitivity to these local variations predisposes English NLP systems to discriminate against speakers of World Englishes by either misunderstanding or misinterpreting them (Hern, 2017; Tatman, 2017).

In particular, Tan et al. (2020) recently show that current question-answering and machine translation systems are overly sensitive to non-standard inflection use, which is a common feature of dialects like Colloquial Singapore English (CSE) and African American Vernacular English (AAVE) [1: Examples in Appendix A of Tan et al. (2020)]. Since humans are able to internally correct for or ignore non-standard inflection use (Foster and Wigglesworth, 2016), we should expect NLP systems to be equally robust.

Existing work in adversarial robustness for NLP primarily focuses on adversarial training methods (Belinkov and Bisk, 2018; Ribeiro et al., 2018; Tan et al., 2020) or classifying and correcting adversarial examples (Zhou et al., 2019a). However, this effectively increases the size of the training dataset by including adversarial examples or training a new model to identify and correct perturbations, thereby significantly increasing the overall computational cost of creating hardened models.

These approaches also operate only on either the raw text or the model and ignore tokenization, which transforms raw text into a form that the neural network can learn from. Here we introduce a new representation for word tokens that separates base and inflection to improve NLP robustness, in a sense making grammatical parsing explicit (Pāṇini, c. 500 BCE) and potentially changing the statistical properties of tokens (Zipf, 1965, p. 255).

Many NLP systems presently use a combination of a whitespace and punctuation tokenizer followed by a data-driven subword tokenizer like byte pair encoding (BPE) (Sennrich et al., 2016) [2: SentencePiece treats whitespace as just another character].
However, a purely data-driven approach may fail to find the optimal encoding, both in terms of vocabulary efficiency and cross-dialectal generalization, thereby making the neural model more vulnerable to inflectional perturbations. As such, we:

• Propose Base-Inflection Encoding (BITE), which uses morphological information to help the data-driven tokenizer use its vocabulary efficiently and generate robust token sequences. In contrast to morphological and subword segmentation techniques like BPE and Morfessor (Creutz and Lagus, 2002), we reduce inflected forms to their base forms before re-injecting the inflection information into the encoded sequence via special inflection symbols. This approach gracefully handles the canonicalization of words with ablaut while allowing the original sentence to be easily reconstructed.

• Demonstrate BITE's effectiveness at hardening the NLP system to non-standard inflection use while preserving performance on Standard English examples. Crucially, simply fine-tuning the pre-trained model for the downstream task after adding BITE is sufficient. Unlike adversarial fine-tuning, our method does not increase the dataset size and is more environment-friendly.

• Show that BERT (Devlin et al., 2019) is less perplexed by non-standard dialectal data when equipped with BITE, which implies BITE helps the model generalize better across dialects.

• Conduct extensive model-independent analyses to demonstrate BITE's efficacy when used in tandem with data-driven subword tokenizers like BPE, WordPiece (Schuster and Nakajima, 2012), and the unigram language model (LM) (Kudo, 2018). Since there is little prior work on quantitatively evaluating the efficacy of subword encoding schemes, we propose a number of metrics to operationalize and evaluate the vocabulary efficiency and semantic capacity of an encoding scheme. Our evaluation measures are generic and can be used to evaluate any tokenizer.

2 Related Work

Subword tokenization. Before neural models can learn, raw text must first be encoded into symbols with the help of a fixed-size vocabulary. Early models represented each word as a single symbol in the vocabulary (Bengio et al., 2001; Collobert et al., 2011), and uncommon words were represented by an unknown symbol. However, such a representation is unable to adequately deal with words absent from the training vocabulary. Therefore, subword representations like WordPiece (Schuster and Nakajima, 2012) and byte pair encoding (Sennrich et al., 2016) were proposed to encode out-of-vocabulary (OOV) words by segmenting them into subwords and encoding each subword as a separate symbol. This way, less information is lost in the encoding process since OOV words are approximated as a combination of subwords in the vocabulary. Wang et al. (2019) propose to reduce vocabulary sizes by operating on bytes instead of characters.

To make subword regularization more tractable, Kudo (2018) proposed an alternative method of building a subword vocabulary by reducing an initially oversized vocabulary down to the required size with the aid of a unigram language model, as opposed to incrementally building a vocabulary as in WordPiece and BPE variants.

Adversarial robustness in NLP. To harden NLP systems against adversarial examples, existing work largely uses adversarial training (Goodfellow et al., 2015; Jia and Liang, 2017; Ebrahimi et al., 2018; Belinkov and Bisk, 2018; Ribeiro et al., 2018; Iyyer et al., 2018; Cheng et al., 2019). However, this generally involves retraining the model with the adversarial data, which is computationally expensive and time-consuming. Tan et al. (2020) showed that simply fine-tuning the trained model for a single epoch on appropriately generated adversarial training data is sufficient to harden it against inflectional adversaries. Instead of adversarial training, Zhou et al. (2019b) propose using a BERT-based model to detect adversarial examples and recover the original examples. Jia et al. (2019) and Huang et al. (2019) use Interval Bound Propagation to train provably robust models.

3 Linguistically-Grounded Tokenization

In data-driven subword encoding schemes like BPE, the goal is to improve the model's ability to approximate the semantics of an unknown word by encoding words as subwords.
Dataset                    Condition   WordPiece (WP)   BITE + WP   BITE + WP (+1 epoch)   WP + Adv. FT
SQuAD 2.0 Ans. Qns (F1)    Clean            74.58          74.50           75.46               79.07
                           Adv.             61.37          71.33           72.56               72.21
SQuAD 2.0 All Qns (F1)     Clean            72.75          72.71           73.69               74.45
                           Adv.             59.32          69.23           70.66               68.23
MultiNLI (Acc)             Clean            83.44          83.01           82.21               83.86
                           Adv.             58.70          76.11           81.05               83.87
MultiNLI-MM (Acc)          Clean            83.59          83.50           83.36               83.86
                           Adv.             59.75          76.64           81.04               75.77

Table 1: BERTbase results on the clean and adversarial MultiNLI and SQuAD 2.0 datasets. We compare WordPiece+BITE to both WordPiece alone and WordPiece with one epoch of adversarial fine-tuning. To ensure a fair comparison with adversarial fine-tuning, we trained the BITE-equipped model for an extra epoch on clean data (third results column).
Although the fully data-driven nature of such methods makes them language-independent, it forces them to rely only on the statistics of the surface forms when transforming words into subwords, since they do not make any language-specific morphological assumptions. To illustrate, the past tenses of go, take, and keep are the inflected forms went, took, and kept, respectively, which have little to no overlap with their base forms and each other even though they share the same tense. These six surface forms would likely have no subwords in common in the vocabulary, thereby putting the burden of learning both the relation between base forms and inflected forms and the relation between inflections for the same tense on the model. Additionally, since vocabularies are fixed before model training, such an encoding does not optimally use a limited vocabulary.

Even when inflections do not exhibit ablaut and there is a significant overlap between the base and inflected forms, e.g., the -ed and -d suffixes, there is no guarantee that the suffix will be encoded as a separate subword and that the base forms and suffixes will be consistently represented. To illustrate, encoding danced as {dance, d} and dancing as {danc, ing} results in two different "base forms" for the same word, dance. This again places the burden of learning that the two "base forms" mean the same thing on the model and makes inefficient use of a limited vocabulary. When encoded in conjunction with another inflected form like entered, which should be encoded as {enter, ed}, this encoding scheme also produces two different subwords for the same type of inflection (-ed vs. -d). As in the first example, the burden of learning that the two suffixes correspond to the same tense is transferred to the learning model.

A possible solution is to instead encode danced as {danc, ed} and dancing as {danc, ing}, but there is no guarantee that a data-driven encoding scheme will learn this pattern without some language-specific linguistic supervision. In addition, this unnecessarily splits up the base form into two subwords, danc and e; the latter contains no extra semantic or grammatical information yet increases the tokenized sequence length. Although individually minor, encoding many base words in this manner increases the computational cost for any encoder or decoder network.

Finally, although it is theoretically possible to force a data-driven tokenizer to segment inflected forms into morphologically logical subwords by limiting the vocabulary size, many inflected forms are represented as individual symbols at common vocabulary sizes (30–40k). We found that the BERTbase WordPiece tokenizer and BPE [3: Trained on Wikipedia+BookCorpus (1M) with a vocabulary size of 30k symbols] encoded each of the above examples as single symbols.

To address these issues, we propose the Base-Inflection Encoding framework (or BITE), which encodes the base form and inflection of content words separately. Similar to how existing subword encoding schemes improve the model's ability to approximate the semantics of out-of-vocabulary words with in-vocabulary subwords, BITE helps
the model handle out-of-distribution inflection usage better by keeping a content word's base form consistent even when its inflected form drastically changes. This distributional deviation could manifest as adversarial examples, such as those generated by MORPHEUS (Tan et al., 2020), or sentences produced by non-standard English (L2 or dialect) speakers. As a result, BITE provides adversarial robustness to the model.

3.1 Base-Inflection Encoding

Given an input sentence S = [w1, . . . , wN] where wi is the ith word, BITE generates a sequence of tokens S′ = [w′1, . . . , w′N] such that w′i = [BASE(wi), INFLECT(wi)], where BASE(wi) is the base form of the word and INFLECT(wi) is the inflection (grammatical category) of the word (Algorithm 1). If wi is not inflected, INFLECT(wi) is NULL and excluded from the final sequence of tokens to reduce the neural network's computational cost. In our implementation, we use Penn Treebank tags to represent inflections. For example, "Jack jumped the hurdles" would be encoded as

    [Jack, jump, VBD, the, hurdle, NNS].

Algorithm 1 Base-Inflection Encoding (BITE)
Require: Input sentence S = [w1, . . . , wN]
Ensure: Tokenized sequence T
    T ← [ ]
    for i = 1, . . . , N do
        if POS(wi) ∈ {NOUN, VERB, ADJ} then
            base ← GetLemma(wi)
            inflection ← GetInflection(wi)
            T ← T + [base, inflection]
        else
            T ← T + [wi]
        end if
    end for
    return T
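To make Algorithm 1 concrete, here is a minimal runnable sketch in Python. It uses NLTK's Penn Treebank POS tagger and WordNet lemmatizer as stand-ins for POS, GetLemma, and GetInflection; the tag inventory, lowercasing, and library choice are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of Algorithm 1 (BITE). Assumes NLTK as a stand-in for the POS
    # tagger and lemmatizer; requires nltk.download("averaged_perceptron_tagger")
    # and nltk.download("wordnet") beforehand. Case handling is simplified.
    import nltk
    from nltk.stem import WordNetLemmatizer

    LEMMATIZER = WordNetLemmatizer()
    # Penn Treebank tags that mark inflected nouns, verbs, and adjectives.
    INFLECTION_TAGS = {"NNS", "NNPS", "VBD", "VBG", "VBN", "VBP", "VBZ", "JJR", "JJS"}
    PTB_TO_WORDNET = {"N": "n", "V": "v", "J": "a"}

    def bite_encode(words):
        """Encode a pre-tokenized sentence as base forms plus inflection symbols."""
        tokens = []
        for word, tag in nltk.pos_tag(words):
            if tag in INFLECTION_TAGS:
                base = LEMMATIZER.lemmatize(word.lower(), PTB_TO_WORDNET[tag[0]])
                tokens.extend([base, tag])   # re-inject the inflection as a special symbol
            else:
                tokens.append(word)          # uninflected words pass through unchanged
        return tokens

    print(bite_encode(["Jack", "jumped", "the", "hurdles"]))
    # ['Jack', 'jump', 'VBD', 'the', 'hurdle', 'NNS']

Decoding simply re-inflects each base form using the inflection symbol that follows it, which is what makes the encoding lossless.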
By lemmatizing each word to obtain the base form instead of segmenting it as in most data-driven encoding schemes, BITE ensures this base form is consistent for all inflected forms of a word, unlike a subword produced by segmentation, which can only contain characters available in the original word. For example, BASE(took), BASE(taking), and BASE(taken) all correspond to the same base form, take, even though it is significantly orthographically different from took.

Similarly, encoding all inflections of the same grammatical category (e.g., verb-past-tense) in a canonical form should help the model to learn each inflection's grammatical role more quickly. This is because the model does not need to first learn that the same grammatical category can manifest in orthographically different forms.

Finally and most importantly, this encoding process is informationally lossless, and we can easily reconstruct the original sentence using the base forms and grammatical information preserved by the inflection tokens.

3.2 Compatibility with Data-Driven Methods

Although BITE has the numerous advantages outlined above, it suffers from the same weakness as regular word-level tokenization schemes when used alone: a limited ability to handle out-of-vocabulary words. Hence, we designed BITE to be a general framework that seamlessly incorporates existing data-driven schemes to take advantage of their proven ability to handle OOV words.

To achieve this, a whitespace/punctuation-based pre-tokenizer is first used to transform the input into a sequence of words and punctuation characters, as is common in neural machine translation. Next, BITE is applied, and the resulting sequence is converted into a sequence of integers by a data-driven encoding scheme (e.g., BPE). In our experiments, we use BITE in this manner and refer to the combined tokenizer as "BITE+D", where D refers to the data-driven encoding scheme.
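To illustrate the composition described above, the sketch below chains the pre-tokenizer, the bite_encode function from the earlier sketch, and a pre-trained WordPiece tokenizer from the Hugging Face transformers library. The choice of BertTokenizer and the registration of inflection symbols as added tokens are assumptions for illustration, not the paper's exact pipeline.

    # Sketch of the BITE+D pipeline with D = WordPiece (via Hugging Face transformers).
    from transformers import BertTokenizer

    wp = BertTokenizer.from_pretrained("bert-base-cased")
    # In practice the inflection symbols would be added to the vocabulary so that
    # WordPiece does not split them; shown here only for illustration.
    wp.add_tokens(["NNS", "NNPS", "VBD", "VBG", "VBN", "VBP", "VBZ", "JJR", "JJS"])

    def bite_plus_wordpiece(sentence):
        words = wp.basic_tokenizer.tokenize(sentence)   # whitespace/punctuation pre-tokenizer
        bite_tokens = bite_encode(words)                # base forms + inflection symbols (see the sketch in 3.1)
        subwords = []
        for tok in bite_tokens:
            subwords.extend(wp.tokenize(tok))           # data-driven subword segmentation
        return wp.convert_tokens_to_ids(subwords)

    print(bite_plus_wordpiece("Jack jumped the hurdles."))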
4 Model-based Experiments

In this section, we demonstrate the effectiveness of BITE using the pre-trained cased BERTbase model (Devlin et al., 2019) implemented by Wolf et al. (2019). We do not replace WordPiece but instead incorporate it into the BITE framework as described in §3.2. There are both advantages and disadvantages to this approach, which we will discuss in the next section.

4.1 Adversarial Robustness

We evaluate BITE on question answering and natural language understanding datasets, SQuAD 2.0 (Rajpurkar et al., 2018) and MultiNLI (Williams et al., 2018), respectively. Following Tan et al. (2020), we report F1 scores on both the full SQuAD 2.0 dataset and only the answerable questions. In addition, we also report scores for the out-of-domain development set (MultiNLI-MM).

For the experiments in this subsection, we use MORPHEUS (Tan et al., 2020) to generate adversarial examples that resemble second language English speaker (L2) sentences. These adversarial examples, while synthetic, can be thought of as "worst-case" examples of L2 sentences for the model under test. This is done separately for each BERTbase model.

BITE vs. BITE-less. First, we demonstrate the effectiveness of BITE at making the model robust to inflectional adversaries. After fine-tuning two separate BERTbase models (one with BITE, the other with standard WordPiece) on SQuAD 2.0 and MultiNLI with Wolf et al. (2019)'s default hyperparameters, we generate adversarial examples for them using MORPHEUS. From Table 1, we observe that the BITE-equipped model not only achieves similar performance (±0.5) on clean data, but is also significantly more robust to inflectional adversaries (10 points for SQuAD, 17 points for MultiNLI).

BITE vs. Adversarial Fine-tuning. Next, we compare BITE to adversarial fine-tuning (Tan et al., 2020), an economical variation of adversarial training (Goodfellow et al., 2015) for making models robust to inflectional perturbations. In adversarial fine-tuning, an adversarial training set is generated by randomly sampling inflectional adversaries k times from the adversarial distribution found by MORPHEUS and adding them to the original training set. Rather than retraining the model on this adversarial training set, the previously trained model is simply trained for one extra epoch. In this experiment, we follow the above methodology and adversarially fine-tune the WordPiece-only BERTbase for one epoch with k set to 4. To ensure a fair comparison, we also train the BITE-equipped BERTbase on the same training set for an extra epoch.

From Table 1, we observe that BITE is often more effective than adversarial fine-tuning at making the model more robust against inflectional adversaries, and in some cases (SQuAD 2.0 All and MNLI-MM) even without needing the additional epoch of training.

However, the adversarially fine-tuned model consistently achieves better performance on clean data. This is likely because, even though adversarial fine-tuning requires only a single epoch of extra training, the process of generating the training set increases its size by a factor of k and therefore the computational cost. In contrast, BITE requires no extra training and is much more economical.

Adversarial fine-tuning's poorer performance on the out-of-domain adversarial data (MultiNLI-MM) compared to the in-domain data hints at a possible weakness: it is less effective at inducing model robustness when the adversarial example is from an out-of-domain distribution.

BITE, on the other hand, performs equally well on both in- and out-of-domain data, demonstrating its applicability to practical scenarios where the training and test domains may not match.

4.2 Dialectal Variation

Apart from second languages, dialects are another common source of non-standard inflection use. However, there is a dearth of task-specific datasets in English dialects like AAVE and CSE. Therefore, in this section's experiments, we use the model's perplexity on monodialectal corpora as a proxy for its performance on downstream tasks in the corresponding dialect. The perplexity reflects the pre-trained model's generalization ability on the dialectal datasets.

4.2.1 Corpora

For AAVE, we use the Corpus of Regional African American Language (CORAAL) (Kendall and Farrington, 2018), which comprises transcriptions of interviews with African Americans born between 1891 and 2005. For our evaluation, only the interviewee's speech was used. In addition, we strip all in-line glosses and annotations from the transcriptions before dropping all lines with fewer than three words. After pre-processing, this corpus consists of slightly under 50k lines of text (1,144,803 tokens, 17,324 word types).

To obtain a CSE corpus, we scrape the Infotech Clinics section of the Hardware Zone Forums (https://forums.hardwarezone.com.sg/), a forum frequented mainly by Singaporeans and where CSE is commonly used. Similar pre-processing to the AAVE data yields a 2.2 million line corpus (45,803,898 tokens, 253,326 word types).

4.2.2 Experimental Setup

In this experiment, we take the same pre-trained BERTbase model and fine-tune two separate variants (with and without BITE) on Wikipedia and BookCorpus (Zhu et al., 2015) using the masked language modeling (MLM) loss without the next sentence prediction (NSP) loss. We fine-tune for one epoch on increasingly large subsets of the dataset, since this has been shown to be more effective than doing the same number of gradient updates on a fixed subset (Raffel et al., 2019).

Next, we evaluate model perplexities on the AAVE and CSE corpora, which we consider to be drawn from dialectal distributions that differ from the Standard English training data. Since calculating the "masked perplexity" requires randomly masking a certain percentage of tokens for prediction, we also experiment with repeating the masking for each sentence multiple times before averaging the perplexity. However, we find no significant difference between doing the calculation once or five times; the random effects likely cancel out due to the large size of our corpora.
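The masked perplexity used as the evaluation proxy can be computed roughly as follows. This is a sketch of one plausible implementation with Hugging Face transformers; the 15% masking rate, per-sentence aggregation, and special-token handling are illustrative assumptions rather than the paper's exact setup.

    # Sketch of a masked ("pseudo") perplexity computation for a monodialectal corpus.
    # The masking rate and single-sentence batching are illustrative choices.
    import math, random, torch
    from transformers import BertForMaskedLM, BertTokenizer

    tok = BertTokenizer.from_pretrained("bert-base-cased")
    mlm = BertForMaskedLM.from_pretrained("bert-base-cased").eval()

    def masked_perplexity(sentences, mask_prob=0.15, seed=0):
        rng, losses = random.Random(seed), []
        for sent in sentences:
            ids = tok(sent, return_tensors="pt", truncation=True)["input_ids"]
            picks = [i for i in range(1, ids.size(1) - 1) if rng.random() < mask_prob]
            if not picks:
                continue
            labels = torch.full_like(ids, -100)      # -100 positions are ignored by the MLM loss
            labels[0, picks] = ids[0, picks]         # predict only the masked positions
            ids[0, picks] = tok.mask_token_id
            with torch.no_grad():
                losses.append(mlm(input_ids=ids, labels=labels).loss.item())
        return math.exp(sum(losses) / len(losses))   # exponentiated mean masked-LM loss

For the BITE-equipped variant, the same procedure would be applied to the BITE+WordPiece token sequences instead.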
[Figure 1: Perplexity of BERTbase with and without BITE (WordPiece only vs. WordPiece + BITE) as a function of the number of MLM training examples, on (a) Colloquial Singapore English (forum threads) and (b) African American Vernacular English (CORAAL). To allow the model to adapt to the new inflection tokens added to the vocabulary, we perform masked language model (MLM) training for one epoch on increasingly large subsets of Wikipedia+BookCorpus. We observe a large initial perplexity for the WordPiece+BITE model as it has not fully adapted to the new inflection tokens; this stabilizes to around half of the WordPiece-only model's perplexity as we increase the training dataset size.]

4.2.3 Results

From Figure 1, we observe that the BITE-equipped model initially has a much higher perplexity, before converging to around 50% of the standard model's perplexity as the model adapts to the presence of the new inflection tokens (e.g., VBD, NNS, etc.). Crucially, the models are not trained on dialectal corpora, which demonstrates the effectiveness of BITE at helping models better generalize to dialectal distributions after a short adaptation phase.

CSE vs. AAVE. Astute readers might notice that there is a large difference in perplexity between the two dialectal corpora, even for the same tokenizer combination. There are two possible explanations. The first is that CSE, being an amalgam of English, Chinese dialects, Malay, etc., differs significantly from Standard English not only morphologically, but also in word ordering (e.g., topic prominence) (Tongue, 1974). In addition, numerous loan words and discourse particles not found in Standard English, like lah, lor and hor, are commonplace in CSE (Leimgruber, 2009). AAVE, however, generally shares the same word ordering as Standard English due to its largely English origins (Poplack, 2000) and is less different linguistically (compared to CSE vs. Standard English). These differences between AAVE and CSE are likely explanations for the significant differences in perplexity.

Another possible explanation is that the BookCorpus may contain examples of AAVE since the BookCorpus' source, Smashwords, also publishes African American fiction. We believe the reason for the difference is a mixture of these two factors.

5 Model-Independent Analyses

Although wrapping an existing pretrained subword tokenizer and model like WordPiece and BERTbase allows us to quickly reap the benefits of BITE without the computational overhead of pretraining them from scratch, such an approach likely does not make use of BITE's full potential. This is due to the compressing effect that BITE has on the lexical space. Therefore, in this section, we analyze WordPiece, BPE, and unigram language model (Kudo, 2018) subword tokenizers that are trained from scratch with and without BITE, using the tokenizers (github.com/huggingface/tokenizers) and sentencepiece (github.com/google/sentencepiece) libraries. Through our experiments, we aim to answer the following two questions:

• Does BITE help the data-driven tokenizer use its vocabulary more efficiently (§5.1)?

• How does BITE improve adversarial robustness (§5.2)?

We draw one million examples from English Wikipedia and BookCorpus for use as our training set. Unless otherwise mentioned, we use another 5k examples as our test set. Although SentencePiece can directly process raw text, we pre-tokenize the raw text with the BertPreTokenizer from the tokenizers library before encoding it, for ease of comparison across the three encoding schemes. For practical applications, users should be able to apply SentencePiece's method of handling whitespace characters instead of the BertPreTokenizer with no issues.
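For reference, this is one way the from-scratch training could look with the tokenizers library; the corpus file name, trainer settings, and the whitespace-split feeding of bite_encode are assumptions made for illustration, not the authors' exact setup.

    # Sketch: train WordPiece from scratch, with and without BITE pre-encoding.
    # File name, vocabulary size, and trainer settings are illustrative.
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    def train_wordpiece(lines, vocab_size=30_000):
        tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))
        tok.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
        trainer = trainers.WordPieceTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
        tok.train_from_iterator(lines, trainer)
        return tok

    corpus = "wiki_books_1m.txt"   # hypothetical dump of the 1M training examples
    plain = train_wordpiece(open(corpus, encoding="utf-8"))
    # For BITE+WordPiece, re-encode each line into base forms + inflection symbols first
    # (bite_encode from the sketch in 3.1; whitespace split used here as a crude pre-tokenizer).
    bite_lines = (" ".join(bite_encode(line.split())) for line in open(corpus, encoding="utf-8"))
    bitey = train_wordpiece(bite_lines)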
[Figure 2: Relative increase in mean tokenized sequence lengths (%) between BITE-less and BITE-equipped tokenizers (BPE, WordPiece, Unigram LM) after training the data-driven subword tokenizers with varying vocabulary sizes; lower is better. Baseline (dotted red line) denotes the percentage of inflected forms in an average sequence; this is equivalent to the increase in sequence length if BITE had no effect on the data-driven tokenizers' encoding efficiency.]

5.1 Vocabulary Efficiency

We may operationalize the question of whether BITE improves vocabulary efficiency in at least three ways. The most straightforward is to measure the sequence-level change: the difference in the tokenized sequence length with and without BITE. Efficient use of a fixed-size vocabulary should result in it comprising symbols that minimize the average tokenized sequence length. An alternative, vocabulary-level perspective is to observe how well a limited vocabulary represents a corpus. Finally, since the base form of a word is sufficient to semantically represent all its inflected forms (Jackson, 2014), the number of unique linguistic lexemes [7: A lexeme is the set of a word's base and inflected forms] that can be represented by a vocabulary should also be a good measure of how much "semantic knowledge" it contains.

Sequence lengths. A possible concern with BITE is that it may significantly increase the length of the tokenized sequence, and hence the computational cost of sequence modeling, since it splits all inflected content words (nouns, verbs, and adjectives) into two tokens. We calculate the percentage of inflected tokens to be 17.89% [8: Note that only content words are subject to inflection]. Therefore, if BITE did not enhance WordPiece's and BPE's encoding efficiency, we should expect a 17.89% increase (i.e., an upper bound) in their mean tokenized sequence length. However, from Figure 2, we see this is not the case, as the relative increase (with and without BITE) in mean sequence length generally stays below 13%, 5% less than the baseline. This demonstrates that BITE helps the data-driven tokenizer make better use of its limited vocabulary.

In addition, we see that the gains are inversely proportional to the vocabulary size. This is likely due to the following reasons. For a given sentence, the corresponding tokenized sequence's length usually decreases as the data-driven tokenizer's vocabulary size increases, since a larger vocabulary allows more short subwords to be merged into longer ones. On the other hand, BITE is vocabulary-independent, which means that the tokenized sequence length is always the same for a given sentence. Therefore, we can expect the same absolute difference to contribute to a larger relative increase as the vocabulary size increases. Additionally, more inflected forms are memorized as the vocabulary size increases, resulting in an average absolute increase of 0.4 tokens per sequence for every additional 10k vocabulary tokens. These two factors together should explain the above phenomenon.

Vocabulary coverage. Next, we look at this question from a vocabulary-level perspective and operationalize vocabulary efficiency as the coverage of a representative corpus by a vocabulary's tokens. We measure coverage by computing the total number of tokens (words and punctuation) in the corpus that are represented in the vocabulary divided by the total number of tokens in the corpus.
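Both measures are straightforward to compute. The sketch below shows one possible implementation; the tokenize callbacks and token streams are supplied by the caller, and the names are illustrative.

    # Sketch of the two vocabulary-efficiency measures used in this subsection.
    from collections import Counter

    def relative_length_increase(sentences, tokenize_plain, tokenize_bite):
        """Relative change (%) in total tokenized length when BITE is added (cf. Figure 2)."""
        plain = sum(len(tokenize_plain(s)) for s in sentences)
        bite = sum(len(tokenize_bite(s)) for s in sentences)
        return 100 * (bite - plain) / plain

    def coverage(corpus_tokens, vocab_size):
        """Share (%) of corpus tokens covered by the vocab_size most frequent tokens (cf. Figure 3)."""
        counts = Counter(corpus_tokens)
        vocab = {tok for tok, _ in counts.most_common(vocab_size)}
        covered = sum(c for tok, c in counts.items() if tok in vocab)
        return 100 * covered / sum(counts.values())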
[Figure 3: Comparison of coverage (%) of the Wikipedia+BookCorpus (1M examples) dataset between BITE and a trivial baseline (word counts), across vocabulary sizes. Since BITE has no fixed vocabulary, we determine the "vocabulary" by taking the N most common tokens.]

Since BITE does not require a vocabulary to be fixed at training time, we simply set the N most frequent tokens (base forms and inflections) to be our vocabulary. We use the N most frequent tokens in the unencoded text as our baseline vocabulary.

From Figure 3, we observe that the BITE vocabulary achieves a higher coverage of the corpus than the baseline, hence demonstrating the efficacy of BITE at improving vocabulary efficiency. Additionally, we note that this advantage is most significant when the vocabulary is between 5–10k tokens. This implies that inflected word forms comprise a large portion of frequently occurring words, which comports with intuition.

Semantic capacity. Since a word's base form is able to semantically represent all its inflected forms (Jackson, 2014), we posit that the number of base forms contained in a vocabulary can be a good proxy for the amount of semantic knowledge it contains. This may be termed the semantic capacity of a vocabulary. In whole-word (non-subword) vocabularies, semantic capacity may be measured by the number of unique lexemes in the vocabulary.

For subword vocabularies like BPE's, however, such a measure is slightly less useful since every English word can be trivially represented given the alphabet and a hyphen. Thus, we propose a generalization of the above definition which penalizes the use of multiple (and unknown) tokens to represent a single base form. Formally,

    f(Ti) = 1 / (|Ti| + ui)   if |Ti| − ui > 0,  and 0 otherwise,        (1)

    SemCapacity(T1, . . . , TN) = Σ_{i=1}^{N} f(Ti),                     (2)

where N is the total number of unique base forms in the evaluation corpus, Ti is the sequence of (subword) tokens obtained from encoding the ith base form, and ui is the number of unknown tokens [9: Usually represented as [UNK] or <unk>] in Ti. While not strictly necessary when comparing vocabularies on the same corpus, normalizing Equation (2) by the number of unique base forms in the corpus may be helpful for cross-corpus comparisons. Further normalizing the resulting quantity by the tokenizer's vocabulary size yields a measure of semantic efficiency.

Although it is possible to extend Equation (1) to cover cases where the encoded sequence contains only unknown tokens, this would unnecessarily complicate the equation. Hence, we define f(Ti) = 0 when there are only unknown tokens in the encoded sequence. We also implicitly define the penalty of each extra unknown token to be double [10: |Ti| contributes the extra count] that of a token in the vocabulary. If necessary, it is trivial to alter the weight of this penalty by introducing a constant λ:

    f(Ti, λ) = 1 / ((|Ti| − ui) + λ·ui)   if |Ti| − ui > 0,  and 0 otherwise.        (3)

To evaluate the semantic capacity of our vocabularies, we use WordNet (Miller, 1995) as our representative "corpus". Specifically, we only consider its single-word lemmas (N = 83,118). From Figure 4, we see that data-driven tokenizers trained with BITE tend to produce vocabularies with higher semantic capacities.

[Figure 4: Semantic capacity of each tokenizer combination's vocabulary (BPE, WordPiece, Unigram LM, each with and without BITE), as computed in Equations (1) and (2), across vocabulary sizes. Higher is better.]

Additionally, we observe that tokenizer combinations incorporating WordPiece or the unigram LM generally outperform the BPE ones. We believe this to be the result of using a language model to inform vocabulary generation. It is logical that a symbol that maximizes a language model's likelihood on the training data is also semantically "denser"; hence prioritizing such symbols produces semantically efficient vocabularies. We leave the in-depth investigation of this relationship to future work.
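Equations (1)-(3) translate directly into a few lines of code. The sketch below is an illustrative implementation in which the encode callback (any subword tokenizer) is supplied by the caller, and the default λ = 2 recovers Equation (1) since (|Ti| − ui) + 2ui = |Ti| + ui.

    # Sketch of the semantic-capacity measure from Equations (1)-(3).
    def f(subword_tokens, unk_token="[UNK]", lam=2.0):
        """Per-base-form score; lam=2 makes each unknown token count double, as in Eq. (1)."""
        t, u = len(subword_tokens), subword_tokens.count(unk_token)
        return 1.0 / ((t - u) + lam * u) if t - u > 0 else 0.0

    def semantic_capacity(base_forms, encode, **kwargs):
        """Equation (2): sum of f over every unique base form in the evaluation corpus."""
        return sum(f(encode(b), **kwargs) for b in base_forms)

    # e.g., semantic_capacity(wordnet_single_word_lemmas, lambda w: plain.encode(w).tokens)
    # where plain is a tokenizer trained as in the earlier sketch.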
BPE only       posit that the process of encoding raw text into
                98                                       BPE + BITE
                                                        WordPiece only
                                                       WordPiece + BITE
                                                                          network-operable symbols may have more impact
                                                       Unigram LM only    on a neural network’s generalization and adversar-
 % Similarity

                                                      Unigram LM + BITE

                96                                                        ial robustness than previously assumed.
                                                                             Hence, we propose to improve the tokenization
                                                                          pipeline by incorporating linguistic information
                94
                                                                          to guide the data-driven tokenizer in learning a
                                                                          more efficient vocabulary and generating token se-
                     0       1         2          3                4      quences that increase the neural network’s robust-
                             Vocabulary Size (symbols)                    ness to non-standard inflection use. We show that
                                                                ·104
                                                                          this improves its generalization to second language
Figure 5: Mean percentage of tokens that are the same                     English and World Englishes without requiring ex-
in the clean and adversarial tokenized sequences.                         plicit training on such data. Since dialectal data
                                                                          is often scarce or even nonexistent (in the case of
                                                                          task-specific labeled datasets), an NLP system’s
5.2 Adversarial Robustness

BITE's ability to make models more robust to inflectional perturbations can be directly attributed to its preservation of a consistent, inflection-independent base form. We demonstrate this by measuring the similarity between the encoded clean and adversarial sentences, using the Ratcliff/Obershelp algorithm (Ratcliff and Metzener, 1988) as implemented by Python's difflib. We use the MultiNLI in-domain development dataset and the MORPHEUS adversaries generated in §4.1 for this experiment.
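To make this measurement concrete, here is a minimal sketch (not the actual evaluation script) that computes the mean Ratcliff/Obershelp similarity between clean and adversarial token sequences with difflib.SequenceMatcher; the tokenize argument and the toy sentence pair below are placeholders for the real tokenizers (BPE, WordPiece, or Unigram LM, with or without BITE) and the MultiNLI/MORPHEUS pairs.

from difflib import SequenceMatcher

def sequence_similarity(clean_tokens, adv_tokens):
    # Ratcliff/Obershelp similarity between two token sequences,
    # as implemented by difflib's SequenceMatcher.ratio().
    return SequenceMatcher(None, clean_tokens, adv_tokens).ratio()

def mean_similarity(sentence_pairs, tokenize):
    # `tokenize` stands in for any tokenizer under comparison and
    # `sentence_pairs` for the clean/perturbed MultiNLI pairs.
    scores = [
        sequence_similarity(tokenize(clean), tokenize(adv))
        for clean, adv in sentence_pairs
    ]
    return sum(scores) / len(scores)

# Toy usage with whitespace "tokenization" and one perturbed pair.
pairs = [("she walks to school", "she walking to school")]
print(mean_similarity(pairs, str.split))

In the actual experiment, this mean similarity is computed per tokenizer and vocabulary size over the MultiNLI development pairs, which is what Figure 5 plots.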
We find that the clean and adversarial sequences tokenized by the BITE-equipped tokenizers were more similar (by 1–2% for WordPiece and 2–2.5% for BPE) than those tokenized by the same tokenizers without BITE (Figure 5). The decrease in similarity for all conditions as the vocabulary size increases is unsurprising: a larger vocabulary generally yields shorter token sequences, so the same number of differing tokens amounts to a larger relative change (for example, two differing tokens out of 40 is a 5% change, but out of 20 it is 10%).

The above results demonstrate that the improved robustness shown in §4.1 can be directly attributed to the separation of each content word's base form from its inflection: the base form remains constant while the inflection varies, preventing large token-level changes.
token-level changes.
6 Conclusion

The tokenization stage of the modern deep learning NLP pipeline has not received as much attention as the modeling stage, with researchers often defaulting to a commonly used subword tokenizer like BPE. Adversarial robustness techniques in NLP also largely focus on augmenting the training data with adversarial examples. However, we posit that the process of encoding raw text into network-operable symbols may have more impact on a neural network's generalization and adversarial robustness than previously assumed.

Hence, we propose to improve the tokenization pipeline by incorporating linguistic information to guide the data-driven tokenizer in learning a more efficient vocabulary and generating token sequences that increase the neural network's robustness to non-standard inflection use. We show that this improves its generalization to second language English and World Englishes without requiring explicit training on such data. Since dialectal data is often scarce or even nonexistent (in the case of task-specific labeled datasets), an NLP system's ability to generalize across dialects in a zero-shot manner is crucial for it to work well for diverse linguistic communities.

Finally, given the effectiveness of the common task framework for spurring progress in NLP (Varshney et al., 2019), we hope to do the same for tokenization. As a first step, we propose to evaluate an encoding scheme's efficacy by measuring its vocabulary efficiency and semantic capacity (which may have interesting connections to information-theoretic limits (Ziv and Lempel, 1978)). We have already shown that Base-Inflection Encoding helps a data-driven tokenizer use its limited vocabulary more efficiently by increasing its semantic capacity when the combination is trained from scratch.

References

Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In 6th International Conference on Learning Representations, Vancouver, BC, Canada.

Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2001. A neural probabilistic language model. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 932–938. MIT Press.

Yong Cheng, Lu Jiang, and Wolfgang Macherey. 2019. Robust neural machine translation with doubly adversarial inputs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4324–4333, Florence, Italy. Association for Computational Linguistics.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(76):2493–2537.
Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 21–30. Association for Computational Linguistics.

David Crystal. 2003. English as a Global Language. Cambridge University Press.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

David M. Eberhard, Gary F. Simons, and Charles D. Fennig, editors. 2019. Ethnologue: Languages of the World, 22 edition. SIL International.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36, Melbourne, Australia. Association for Computational Linguistics.

Pauline Foster and Gillian Wigglesworth. 2016. Capturing accuracy in second language performance: The case for a weighted clause ratio. Annual Review of Applied Linguistics, 36:98–116.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, San Diego, California.

Alex Hern. 2017. Facebook translates 'good morning' into 'attack them', leading to arrest. The Guardian.

Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598, Berlin, Germany. Association for Computational Linguistics.

Po-Sen Huang, Robert Stanforth, Johannes Welbl, Chris Dyer, Dani Yogatama, Sven Gowal, Krishnamurthy Dvijotham, and Pushmeet Kohli. 2019. Achieving verified robustness to symbol substitutions via interval bound propagation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4083–4093, Hong Kong, China. Association for Computational Linguistics.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885, New Orleans, Louisiana. Association for Computational Linguistics.

Howard Jackson. 2014. Words and their Meaning. Routledge.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.

Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019. Certified robustness to adversarial word substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4129–4142, Hong Kong, China. Association for Computational Linguistics.

Braj B. Kachru, Yamuna Kachru, and Cecil Nelson, editors. 2009. The Handbook of World Englishes. Wiley-Blackwell.

Tyler Kendall and Charlie Farrington. 2018. The Corpus of Regional African American Language.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.

Jacob R. E. Leimgruber. 2009. Modelling variation in Singapore English. Ph.D. thesis, Oxford University.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38:39–41.

Pāṇini. c. 500 BCE. Aṣṭādhyāyī.

Shana Poplack. 2000. The English History of African American English. Blackwell, Malden, MA.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, arXiv:1910.10683.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.
John W. Ratcliff and David E. Metzener. 1988. Pattern-matching: The gestalt approach. Dr. Dobb's Journal, 13(7):46.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865, Melbourne, Australia. Association for Computational Linguistics.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152. IEEE.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics.

Deven Shah, H. Andrew Schwartz, and Dirk Hovy. 2019. Predictive biases in natural language processing models: A conceptual framework and overview. arXiv e-prints, arXiv:1912.11078.

Samson Tan, Shafiq Joty, Min-Yen Kan, and Richard Socher. 2020. It's Morphin' Time! Combating linguistic discrimination with inflectional perturbations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Seattle, Washington. Association for Computational Linguistics.

Rachael Tatman. 2017. Gender and dialect bias in YouTube's automatic captions. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 53–59, Valencia, Spain. Association for Computational Linguistics.

Ray K. Tongue. 1974. The English of Singapore and Malaysia. Eastern Universities Press.

Lav R. Varshney, Nitish Shirish Keskar, and Richard Socher. 2019. Pretrained AI models: Performativity, mobility, and change. arXiv e-prints, arXiv:1909.03290.

Changhan Wang, Kyunghyun Cho, and Jiatao Gu. 2019. Neural machine translation with byte-level subwords. arXiv e-prints, arXiv:1909.03341.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv e-prints, arXiv:1910.03771.

Yichao Zhou, Jyun-Yu Jiang, Kai-Wei Chang, and Wei Wang. 2019a. Learning to discriminate perturbations for blocking adversarial attacks in text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4904–4913, Hong Kong, China. Association for Computational Linguistics.

Yichao Zhou, Jyun-Yu Jiang, Kai-Wei Chang, and Wei Wang. 2019b. Learning to discriminate perturbations for blocking adversarial attacks in text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4906–4915, Hong Kong, China. Association for Computational Linguistics.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In The IEEE International Conference on Computer Vision (ICCV).

George K. Zipf. 1965. The Psycho-Biology of Language: An Introduction to Dynamic Philology. MIT Press, Cambridge, MA, USA.

Jacob Ziv and Abraham Lempel. 1978. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, IT-24(5):530–536.