Mind Your Inflections! Improving NLP for Non-Standard English with Base-Inflection Encoding

Samson Tan (Salesforce Research, National University of Singapore)
Shafiq Joty (Salesforce Research, Nanyang Technological University)
Lav R. Varshney (Salesforce Research)
Min-Yen Kan (National University of Singapore)
{samson.tan,sjoty,lvarshney}@salesforce.com, kanmy@comp.nus.edu.sg

arXiv:2004.14870v1 [cs.CL] 30 Apr 2020

Abstract

Morphological inflection is a process of word formation where base words are modified to express different grammatical categories such as tense, case, voice, person, or number. World Englishes, such as Colloquial Singapore English (CSE) and African American Vernacular English (AAVE), differ from Standard English dialects in inflection use. Although comprehension by human readers is usually unimpaired by non-standard inflection use, NLP systems are not so robust. We introduce a new Base-Inflection Encoding of English text that is achieved by combining linguistic and statistical techniques. Fine-tuning pre-trained NLP models for downstream tasks under this novel encoding achieves robustness to non-standard inflection use while maintaining performance on Standard English examples. Models using this encoding also generalize better to non-standard dialects without explicit training. We suggest metrics to evaluate tokenizers and extensive model-independent analyses demonstrate the efficacy of the encoding when used together with data-driven subword tokenizers.

1 Introduction

Large-scale neural models have proven successful at a wide range of natural language processing (NLP) tasks but are susceptible to amplifying discrimination against minority linguistic communities (Hovy and Spruit, 2016; Tan et al., 2020) due to selection bias in the training data and model overamplification (Shah et al., 2019).

Most datasets implicitly assume a distribution of perfect Standard English speakers, but this does not accurately reflect the majority of the global English speaking population that are either second language or non-standard dialect speakers (Crystal, 2003; Eberhard et al., 2019). This is particularly concerning because these World Englishes differ at the lexical, morphological, and syntactic levels (Kachru et al., 2009); over-sensitivity to these local variations predisposes English NLP systems to discriminate against speakers of World Englishes by either misunderstanding or misinterpreting them (Hern, 2017; Tatman, 2017).

In particular, Tan et al. (2020) recently show that current question-answering and machine translation systems are overly sensitive to non-standard inflection use, which is a common feature of dialects like Colloquial Singapore English (CSE) and African American Vernacular English (AAVE).[1] Since humans are able to internally correct for or ignore non-standard inflection use (Foster and Wigglesworth, 2016), we should expect NLP systems to be equally robust.

[1] Examples in Appendix A of Tan et al. (2020).

Existing work in adversarial robustness for NLP primarily focuses on adversarial training methods (Belinkov and Bisk, 2018; Ribeiro et al., 2018; Tan et al., 2020) or classifying and correcting adversarial examples (Zhou et al., 2019a). However, this effectively increases the size of the training dataset by including adversarial examples or training a new model to identify and correct perturbations, thereby significantly increasing the overall computational cost for creating hardened models.

These approaches also only operate on either raw text or the model and ignore tokenization, which transforms raw text into a form that the neural network can learn from. Here we introduce a new representation for word tokens that separates base and inflection to improve NLP robustness, in a sense making grammatical parsing explicit (Pāṇini, c. 500 BCE) and potentially changing the statistical properties of tokens (Zipf, 1965, p. 255).

Many NLP systems presently use a combination of a whitespace and punctuation tokenizer followed by a data-driven subword tokenizer like byte pair encoding (BPE) (Sennrich et al., 2016).[2]

[2] SentencePiece treats whitespace as just another character.
However, a purely data-driven approach may fail to find the optimal encoding, both in terms of vocabulary efficiency and cross-dialectal generalization, thereby making the neural model more vulnerable to inflectional perturbations. As such, we:

• Propose Base-Inflection Encoding (BITE), which uses morphological information to help the data-driven tokenizer use its vocabulary efficiently and generate robust token sequences. In contrast to morphological and subword segmentation techniques like BPE and Morfessor (Creutz and Lagus, 2002), we reduce inflected forms to their base forms before re-injecting the inflection information into the encoded sequence via special inflection symbols. This approach gracefully handles the canonicalization of words with ablaut while allowing the original sentence to be easily reconstructed.

• Demonstrate BITE's effectiveness at hardening the NLP system to non-standard inflection use while preserving performance on Standard English examples. Crucially, simply fine-tuning the pre-trained model for the downstream task after adding BITE is sufficient. Unlike adversarial fine-tuning, our method does not increase the dataset size and is more environment-friendly.

• Show that BERT (Devlin et al., 2019) is less perplexed by non-standard dialectal data when equipped with BITE, which implies BITE helps the model generalize better across dialects.

• Conduct extensive model-independent analyses to demonstrate BITE's efficacy when used in tandem with data-driven subword tokenizers like BPE, WordPiece (Schuster and Nakajima, 2012), and unigram language model (LM) (Kudo, 2018).

Since there is little prior work on quantitatively evaluating the efficacy of subword encoding schemes, we propose a number of metrics to operationalize and evaluate the vocabulary efficiency and semantic capacity of an encoding scheme. Our evaluation measures are generic and can be used to evaluate any tokenizer.

2 Related Work

Subword tokenization. Before neural models can learn, raw text must first be encoded into symbols with the help of a fixed-size vocabulary. Early models represented each word as a single symbol in the vocabulary (Bengio et al., 2001; Collobert et al., 2011) and uncommon words were represented by an unknown symbol. However, such a representation is unable to adequately deal with words absent in the training vocabulary. Therefore, subword representations like WordPiece (Schuster and Nakajima, 2012) and byte pair encoding (Sennrich et al., 2016) were proposed to encode out-of-vocabulary (OOV) words by segmenting them into subwords and encoding each subword as a separate symbol. This way, less information is lost in the encoding process since OOV words are approximated as a combination of subwords in the vocabulary. Wang et al. (2019) propose to reduce vocabulary sizes by operating on bytes instead of characters.

To make subword regularization more tractable, Kudo (2018) proposed an alternative method of building a subword vocabulary by reducing an initially oversized vocabulary down to the required size with the aid of a unigram language model, as opposed to incrementally building a vocabulary like in WordPiece and BPE variants.

Adversarial robustness in NLP. To harden NLP systems against adversarial examples, existing work largely uses adversarial training (Goodfellow et al., 2015; Jia and Liang, 2017; Ebrahimi et al., 2018; Belinkov and Bisk, 2018; Ribeiro et al., 2018; Iyyer et al., 2018; Cheng et al., 2019). However, this generally involves retraining the model with the adversarial data, which is computationally expensive and time-consuming. Tan et al. (2020) showed that simply fine-tuning the trained model for a single epoch on appropriately generated adversarial training data is sufficient to harden it against inflectional adversaries. Instead of adversarial training, Zhou et al. (2019b) propose using a BERT-based model to detect adversarial examples and recover the original examples. Jia et al. (2019) and Huang et al. (2019) use Interval Bound Propagation to train provably robust models.

3 Linguistically-Grounded Tokenization

In data-driven subword encoding schemes like BPE, the goal is to improve the model's ability to approximate the semantics of an unknown word by encoding words as subwords. Although the fully data-driven nature of such methods makes them language-independent, this forces them to rely only on the statistics of the surface forms when transforming words into subwords
since they do not make any language-specific morphological assumptions. To illustrate, the past tenses of go, take, and keep are the inflected forms went, took, and kept, respectively, which have little to no overlap with their base forms or with each other even though they share the same tense. These six surface forms would likely have no subwords in common in the vocabulary, thereby putting the burden of learning both the relation between base forms and inflected forms and the relation between inflections for the same tense on the model. Additionally, since vocabularies are fixed before model training, such an encoding does not optimally use a limited vocabulary.

Even when inflections do not exhibit ablaut and there is a significant overlap between the base and inflected forms, e.g., the -ed and -d suffixes, there is no guarantee that the suffix will be encoded as a separate subword and that the base forms and suffixes will be consistently represented. To illustrate, encoding danced as {dance, d} and dancing as {danc, ing} results in two different "base forms" for the same word, dance. This again places the burden of learning that the two "base forms" mean the same thing on the model and makes inefficient use of a limited vocabulary.

When encoded in conjunction with another inflected form like entered, which should be encoded as {enter, ed}, this encoding scheme also produces two different subwords for the same type of inflection: -ed vs. -d. Like the first example, the burden of learning that the two suffixes correspond to the same tense is transferred to the learning model.

A possible solution is to instead encode danced as {danc, ed} and dancing as {danc, ing}, but there is no guarantee that a data-driven encoding scheme will learn this pattern without some language-specific linguistic supervision. In addition, this unnecessarily splits up the base form into two subwords, danc and e; the latter contains no extra semantic or grammatical information yet increases the tokenized sequence length. Although individually minor, encoding many base words in this manner increases the computational cost for any encoder or decoder network.

Finally, although it is theoretically possible to force a data-driven tokenizer to segment inflected forms into morphologically logical subwords by limiting the vocabulary size, many inflected forms are represented as individual symbols at common vocabulary sizes (30–40k). We found that the BERTbase WordPiece tokenizer and BPE[3] encoded each of the above examples as single symbols.

[3] Trained on Wikipedia+BookCorpus (1M) with a vocabulary size of 30k symbols.

To address these issues, we propose the Base-Inflection Encoding framework (or BITE), which encodes the base form and inflection of content words separately. Similar to how existing subword encoding schemes improve the model's ability to approximate the semantics of out-of-vocabulary words with in-vocabulary subwords, BITE helps the model handle out-of-distribution inflection usage better by keeping a content word's base form consistent even when its inflected form drastically changes. This distributional deviation could manifest as adversarial examples, such as those generated by MORPHEUS (Tan et al., 2020), or as sentences produced by non-standard English (L2 or dialect) speakers. As a result, BITE provides adversarial robustness to the model.
3.1 Base-Inflection Encoding

Given an input sentence S = [w_1, ..., w_N] where w_i is the ith word, BITE generates a sequence of tokens S' = [w'_1, ..., w'_N] such that w'_i = [BASE(w_i), INFLECT(w_i)], where BASE(w_i) is the base form of the word and INFLECT(w_i) is the inflection (grammatical category) of the word (Algorithm 1). If w_i is not inflected, INFLECT(w_i) is NULL and excluded from the final sequence of tokens to reduce the neural network's computational cost. In our implementation, we use Penn Treebank tags to represent inflections. For example, "Jack jumped the hurdles" would be encoded as [Jack, jump, VBD, the, hurdle, NNS].

Algorithm 1: Base-Inflection Encoding (BITE)
Require: Input sentence S = [w_1, ..., w_N]
Ensure: Tokenized sequence T
  T ← [ ]
  for i = 1, ..., N do
      if POS(w_i) ∈ {NOUN, VERB, ADJ} then
          base ← GetLemma(w_i)
          inflection ← GetInflection(w_i)
          T ← T + [base, inflection]
      else
          T ← T + [w_i]
      end if
  end for
  return T

By lemmatizing each word to obtain the base form instead of segmenting it like in most data-driven encoding schemes, BITE ensures this base form is consistent for all inflected forms of a word, unlike a subword produced by segmentation, which can only contain characters available in the original word. For example, BASE(took), BASE(taking), and BASE(taken) all correspond to the same base form, take, even though it is significantly orthographically different from took.

Similarly, encoding all inflections of the same grammatical category (e.g., verb-past-tense) in a canonical form should help the model to learn each inflection's grammatical role more quickly. This is because the model does not need to first learn that the same grammatical category can manifest in orthographically different forms.

Finally and most importantly, this encoding process is informationally lossless and we can easily reconstruct the original sentence using the base forms and grammatical information preserved by the inflection tokens.
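To make Algorithm 1 concrete, a minimal Python sketch follows. It assumes NLTK's Penn Treebank tagger and WordNet lemmatizer as stand-ins for POS, GetLemma, and GetInflection; the paper only specifies that Penn Treebank tags serve as inflection symbols, so the exact tagger, lemmatizer, and tag handling here are illustrative assumptions rather than the authors' implementation.

# Minimal sketch of Algorithm 1 (BITE). Assumes NLTK's PTB tagger and
# WordNet lemmatizer; requires nltk.download("averaged_perceptron_tagger")
# and nltk.download("wordnet").
import nltk
from nltk.stem import WordNetLemmatizer

LEMMATIZER = WordNetLemmatizer()

# PTB tags treated here as content-word tags, mapped to WordNet POS codes.
CONTENT_TAGS = {
    "NN": "n", "NNS": "n",                          # common nouns
    "VB": "v", "VBD": "v", "VBG": "v", "VBN": "v",  # verbs
    "VBP": "v", "VBZ": "v",
    "JJ": "a", "JJR": "a", "JJS": "a",              # adjectives
}
# Tags assumed to already denote the base form: no inflection symbol emitted.
BASE_TAGS = {"NN", "VB", "JJ", "VBP"}

def bite_encode(words):
    """Encode a pre-tokenized sentence as base forms plus inflection symbols."""
    encoded = []
    for word, tag in nltk.pos_tag(words):
        if tag in CONTENT_TAGS:
            base = LEMMATIZER.lemmatize(word, pos=CONTENT_TAGS[tag])
            encoded.append(base)
            if tag not in BASE_TAGS:      # inflected form: re-inject the PTB tag
                encoded.append(tag)
        else:
            encoded.append(word)          # function words, punctuation, etc.
    return encoded

# bite_encode(["Jack", "jumped", "the", "hurdles"])
# -> ["Jack", "jump", "VBD", "the", "hurdle", "NNS"]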
3.2 Compatibility with Data-Driven Methods

Although BITE has the numerous advantages outlined above, it suffers from the same weakness as regular word-level tokenization schemes when used alone: a limited ability to handle out-of-vocabulary words. Hence, we designed BITE to be a general framework that seamlessly incorporates existing data-driven schemes to take advantage of their proven ability to handle OOV words.

To achieve this, a whitespace/punctuation-based pre-tokenizer is first used to transform the input into a sequence of words and punctuation characters, as is common in neural machine translation. Next, BITE is applied and the resulting sequence is converted into a sequence of integers by a data-driven encoding scheme (e.g., BPE). In our experiments, we use BITE in this manner and refer to the combined tokenizer as "BITE+D", where D refers to the data-driven encoding scheme.
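As an illustration of this combined pipeline, the sketch below chains a whitespace/punctuation pre-tokenizer, the bite_encode sketch from Section 3.1, and a trained data-driven tokenizer. BertPreTokenizer comes from the HuggingFace tokenizers library; the assumptions that the subword tokenizer accepts pre-tokenized input (as tokenizers.Tokenizer.encode does) and treats the inflection symbols as atomic vocabulary items are ours, not the paper's.

# Sketch of the combined "BITE + D" tokenizer: pre-tokenize, apply BITE,
# then hand off to a data-driven subword encoder D (BPE, WordPiece, or
# unigram LM).
from tokenizers.pre_tokenizers import BertPreTokenizer

pre_tokenizer = BertPreTokenizer()

def bite_d_encode(text, subword_tokenizer):
    # 1. Whitespace/punctuation pre-tokenization into words and punctuation.
    words = [w for w, _span in pre_tokenizer.pre_tokenize_str(text)]
    # 2. Base-Inflection Encoding (base forms + inflection symbols).
    bite_tokens = bite_encode(words)
    # 3. Data-driven subword encoding into integer IDs; the inflection
    #    symbols are assumed to be in the subword vocabulary.
    return subword_tokenizer.encode(bite_tokens, is_pretokenized=True).ids

# e.g. bite_d_encode("Jack jumped the hurdles", bite_bpe_tokenizer)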
4 Model-based Experiments

In this section, we demonstrate the effectiveness of BITE using the pre-trained cased BERTbase model (Devlin et al., 2019) implemented by Wolf et al. (2019). We do not replace WordPiece but instead incorporate it into the BITE framework as described in §3.2. There are both advantages and disadvantages to this approach, which we will discuss in the next section.

4.1 Adversarial Robustness

We evaluate BITE on question answering and natural language understanding datasets, SQuAD 2.0 (Rajpurkar et al., 2018) and MultiNLI (Williams et al., 2018), respectively. Following Tan et al. (2020), we report F1 scores on both the full SQuAD 2.0 dataset and only the answerable questions. In addition, we also report scores for the out-of-domain development set (MultiNLI-MM).

For the experiments in this subsection, we use MORPHEUS (Tan et al., 2020) to generate adversarial examples that resemble second language English speaker (L2) sentences. These adversarial examples, while synthetic, can be thought of as "worst-case" examples of L2 sentences for the model under test. This is done separately for each BERTbase model.

BITE vs. BITE-less. First, we demonstrate the effectiveness of BITE at making the model robust to inflectional adversaries. After fine-tuning two separate BERTbase models (one with BITE, the other with standard WordPiece) on SQuAD 2.0 and MultiNLI with Wolf et al. (2019)'s default hyperparameters, we generate adversarial examples for them using MORPHEUS. From Table 1, we observe that the BITE-equipped model not only achieves similar performance (±0.5) on clean data, but is significantly more robust to inflectional adversaries (10 points for SQuAD, 17 points for MultiNLI).

Dataset                  Condition   WordPiece (WP)   BITE + WP   BITE + WP (+1 epoch)   WP + Adv. FT.
SQuAD 2.0 Ans. Qns (F1)  Clean       74.58            74.50       75.46                  79.07
                         Adv.        61.37            71.33       72.56                  72.21
SQuAD 2.0 All Qns (F1)   Clean       72.75            72.71       73.69                  74.45
                         Adv.        59.32            69.23       70.66                  68.23
MultiNLI (Acc)           Clean       83.44            83.01       82.21                  83.86
                         Adv.        58.70            76.11       81.05                  83.87
MultiNLI-MM (Acc)        Clean       83.59            83.50       83.36                  83.86
                         Adv.        59.75            76.64       81.04                  75.77

Table 1: BERTbase results on the clean and adversarial MultiNLI and SQuAD 2.0 datasets. We compare WordPiece+BITE to both WordPiece alone and WordPiece with one epoch of adversarial fine-tuning. To ensure a fair comparison with adversarial fine-tuning, we trained the BITE-equipped model for an extra epoch on clean data ("+1 epoch" column).

BITE vs. Adversarial Fine-tuning. Next, we compare BITE to adversarial fine-tuning (Tan et al., 2020), an economical variation of adversarial training (Goodfellow et al., 2015) for making models robust to inflectional perturbations. In adversarial fine-tuning, an adversarial training set is generated by randomly sampling inflectional adversaries k times from the adversarial distribution found by MORPHEUS and adding them to the original training set. Rather than retraining the model on this adversarial training set, the previously trained model is simply trained for one extra epoch. In this experiment, we follow the above methodology and adversarially fine-tune the WordPiece-only BERTbase for one epoch with k set to 4. To ensure a fair comparison, we also train the BITE-equipped BERTbase on the same training set for an extra epoch.

From Table 1, we observe that BITE is often more effective than adversarial fine-tuning at making the model more robust against inflectional adversaries, and in some cases (SQuAD 2.0 All and MNLI-MM) even without needing the additional epoch of training.

However, the adversarially fine-tuned model consistently achieves better performance on clean data. This is likely due to the fact that even though adversarial fine-tuning requires only a single epoch of extra training, the process of generating the training set increases its size by a factor of k and therefore the computational cost. In contrast, BITE requires no extra training and is much more economical.

Adversarial fine-tuning's poorer performance on the out-of-domain adversarial data (MultiNLI-MM) compared to the in-domain data hints at a possible weakness: it is less effective at inducing model robustness when the adversarial example is from an out-of-domain distribution. BITE, on the other hand, performs equally well on both in- and out-of-domain data, demonstrating its applicability to practical scenarios where the training and test domain may not match.

4.2 Dialectal Variation

Apart from second languages, dialects are another common source of non-standard inflection use. However, there is a dearth of task-specific datasets in English dialects like AAVE and CSE. Therefore, in this section's experiments, we use the model's perplexity on monodialectal corpora as a proxy for its performance on downstream tasks in the corresponding dialect. The perplexity reflects the pre-trained model's generalization ability on the dialectal datasets.

4.2.1 Corpora

For AAVE, we use the Corpus of Regional African American Language (CORAAL) (Kendall and Farrington, 2018), which comprises transcriptions of interviews with African Americans born between 1891 and 2005. For our evaluation, only the interviewee's speech was used. In addition, we strip all in-line glosses and annotations from the transcriptions before dropping all lines with less than three words. After pre-processing, this corpus consists of slightly under 50k lines of text (1,144,803 tokens, 17,324 word types).

To obtain a CSE corpus, we scrape the Infotech Clinics section of the Hardware Zone Forums,[4] a forum frequented mainly by Singaporeans and where CSE is commonly used. Similar pre-processing to the AAVE data yields a 2.2 million line corpus (45,803,898 tokens, 253,326 word types).

[4] https://forums.hardwarezone.com.sg/

4.2.2 Experimental Setup

In this experiment, we take the same pre-trained BERTbase model and fine-tune two separate variants (with and without BITE) on Wikipedia and BookCorpus (Zhu et al., 2015) using the masked language modeling (MLM) loss without the next sentence prediction (NSP) loss. We fine-tune for one epoch on increasingly large subsets of the dataset, since this has been shown to be more effective than doing the same number of gradient updates on a fixed subset (Raffel et al., 2019).

Next, we evaluate model perplexities on the AAVE and CSE corpora, which we consider to be from dialectal distributions that differ from the training data, which is considered to be Standard English. Since calculating the "masked perplexity" requires randomly masking a certain percentage of tokens for prediction, we also experiment with doing this for each sentence multiple times before averaging the perplexity. However, we find no significant difference between doing the calculation once or five times; the random effects likely cancel out due to the large size of our corpora.
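The masked-perplexity computation itself is not spelled out in the paper; the sketch below shows one plausible reading under stated assumptions: mask a fixed fraction of non-special tokens (15% here, an assumption), score only the masked positions with the MLM head, and exponentiate the mean cross-entropy over the corpus.

# Rough sketch of a masked (pseudo-)perplexity evaluation for an MLM.
# The 15% masking rate and single pass are assumptions, not the paper's
# stated protocol.
import math
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased").eval()

@torch.no_grad()
def masked_perplexity(lines, mask_prob=0.15):
    total_loss, total_masked = 0.0, 0
    for line in lines:
        enc = tokenizer(line, return_tensors="pt", truncation=True)
        labels = enc["input_ids"].clone()
        # Never mask special tokens such as [CLS] and [SEP].
        special = torch.tensor(
            tokenizer.get_special_tokens_mask(
                labels[0].tolist(), already_has_special_tokens=True
            ),
            dtype=torch.bool,
        )
        mask = (torch.rand(labels.shape[1]) < mask_prob) & ~special
        if not mask.any():
            continue
        enc["input_ids"][0, mask] = tokenizer.mask_token_id
        labels[0, ~mask] = -100               # score only the masked positions
        out = model(**enc, labels=labels)     # mean CE over masked tokens
        total_loss += out.loss.item() * int(mask.sum())
        total_masked += int(mask.sum())
    return math.exp(total_loss / total_masked)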
Figure 1: Perplexity of BERTbase with and without BITE (WordPiece only vs. WordPiece + BITE) on CSE and AAVE corpora, plotted against the number of MLM training examples (×10^6). Panel (a): Colloquial Singapore English (forum threads); panel (b): African American Vernacular English (CORAAL). To allow the model to adapt to the new inflection tokens added to the vocabulary, we perform masked language model (MLM) training for one epoch on increasingly large subsets of Wikipedia+BookCorpus. We observe a large initial perplexity for the WordPiece+BITE model as it has not fully adapted to the new inflection tokens; this stabilizes to around half of the WordPiece-only model's perplexity as we increase the training dataset size.

4.2.3 Results

From Figure 1, we observe that the BITE-equipped model initially has a much higher perplexity, before converging to around 50% of the standard model's perplexity as the model adapts to the presence of the new inflection tokens (e.g., VBD, NNS, etc.). Crucially, the models are not trained on dialectal corpora, which demonstrates the effectiveness of BITE at helping models better generalize to dialectal distributions after a short adaptation phase.

CSE vs. AAVE. Astute readers might notice that there is a large difference in perplexity between the two dialectal corpora, even for the same tokenizer combination. There are two possible explanations. The first is that CSE, being an amalgam of English, Chinese dialects, Malay, etc., differs significantly from Standard English not only morphologically, but also in word ordering (e.g., topic prominence) (Tongue, 1974). In addition, numerous loan words and discourse particles not found in Standard English like lah, lor and hor are commonplace in CSE (Leimgruber, 2009). AAVE, however, generally shares the same word ordering as Standard English due to its largely English origins (Poplack, 2000) and is less different linguistically (compared to CSE vs. Standard English). These differences between AAVE and CSE are likely explanations for the significant differences in perplexity.

Another possible explanation is that the BookCorpus may contain examples of AAVE since the BookCorpus' source, Smashwords, also publishes African American fiction. We believe the reason for the difference is a mixture of these two factors.

5 Model-Independent Analyses

Although wrapping an existing pretrained subword tokenizer and model like WordPiece and BERTbase allows us to quickly reap the benefits of BITE without the computational overhead of pretraining them from scratch, such an approach likely does not make use of BITE's full potential. This is due to the compressing effect that BITE offers on the lexical space. Therefore, in this section, we analyze WordPiece, BPE, and unigram language model (Kudo, 2018) subword tokenizers that are trained from scratch with and without BITE using the tokenizers[5] and sentencepiece[6] libraries. Through our experiments we aim to answer the following two questions:

• Does BITE help the data-driven tokenizer use its vocabulary more efficiently (§5.1)?

• How does BITE improve adversarial robustness (§5.2)?

[5] github.com/huggingface/tokenizers
[6] github.com/google/sentencepiece

We draw one million examples from English Wikipedia and BookCorpus for use as our training set. Unless otherwise mentioned, we use another 5k examples as our test set. Although SentencePiece can directly process raw text, we pre-tokenize the raw text with the BertPreTokenizer from the tokenizers library before encoding them for ease of comparison across the three encoding schemes. For practical applications, users should be able to apply SentencePiece's method of handling whitespace characters instead of the BertPreTokenizer with no issues.
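For concreteness, the sketch below shows how one of these from-scratch tokenizers might be trained on BITE-encoded text with the HuggingFace tokenizers library (BPE shown; the WordPiece and unigram LM trainers are analogous). The trainer settings and special tokens are assumptions; the paper states only that the tokenizers are trained on the 1M-example Wikipedia+BookCorpus subset at several vocabulary sizes. bite_encode is the sketch from Section 3.1.

# Sketch: train a BPE tokenizer from scratch, optionally on BITE-encoded text.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe(corpus_lines, vocab_size, use_bite=False):
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])

    def training_lines():
        for line in corpus_lines:
            if use_bite:
                # Apply BITE before the data-driven tokenizer sees the text.
                words = [w for w, _ in tokenizer.pre_tokenizer.pre_tokenize_str(line)]
                yield " ".join(bite_encode(words))
            else:
                yield line

    tokenizer.train_from_iterator(training_lines(), trainer=trainer)
    return tokenizer

# bpe_plain = train_bpe(wiki_book_lines, vocab_size=30000)
# bpe_bite  = train_bpe(wiki_book_lines, vocab_size=30000, use_bite=True)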
5.1 Vocabulary Efficiency

We may operationalize the question of whether BITE improves vocabulary efficiency in at least three ways. The most straightforward is to measure the sequence-level change: the difference in the tokenized sequence length with and without BITE. Efficient use of a fixed-size vocabulary should result in it comprising symbols that minimize the average tokenized sequence length. An alternative vocabulary-level perspective is to observe how well a limited vocabulary represents a corpus. Finally, since the base form of a word is sufficient to semantically represent all its inflected forms (Jackson, 2014), the number of unique linguistic lexemes[7] that can be represented by a vocabulary should also be a good measure of how much "semantic knowledge" it contains.

[7] A lexeme is the set of a word's base and inflected forms.

Sequence lengths. A possible concern with BITE is that it may significantly increase the length of the tokenized sequence, and hence the computational cost for sequence modeling, since it splits all inflected content words (nouns, verbs, and adjectives) into two tokens. We calculate the percentage of inflected tokens to be 17.89%.[8] Therefore, if BITE did not enhance WordPiece's and BPE's encoding efficiency, we should expect a 17.89% increase (i.e., an upper bound) in their mean tokenized sequence length. However, from Figure 2, we see this is not the case as the relative increase (with and without BITE) in mean sequence length generally stays below 13%, 5% less than the baseline. This demonstrates that BITE helps the data-driven tokenizer make better use of its limited vocabulary.

[8] Note that only content words are subject to inflection.

Figure 2: Relative increase in mean tokenized sequence lengths (%) between BITE-less and BITE-equipped tokenizers after training the data-driven subword tokenizers (BPE, WordPiece, unigram LM) with varying vocabulary sizes; lower is better. Baseline (dotted red line) denotes the percentage of inflected forms in an average sequence; this is equivalent to the increase in sequence length if BITE had no effect on the data-driven tokenizers' encoding efficiency.

In addition, we see that the gains are inversely proportional to the vocabulary size. This is likely due to the following reasons. For a given sentence, the corresponding tokenized sequence's length usually decreases as the data-driven tokenizer's vocabulary size increases, since a larger vocabulary allows more of the smaller subwords to be merged into longer subwords. On the other hand, BITE is vocabulary-independent, which means that the tokenized sequence length is always the same for a given sentence. Therefore, we can expect the same absolute difference to contribute to a larger relative increase as the vocabulary size increases. Additionally, more inflected forms are memorized as the vocabulary size increases, resulting in an average absolute increase of 0.4 tokens per sequence for every additional 10k vocabulary tokens. These two factors together should explain the above phenomenon.

Vocabulary coverage. Next, we look at this question from a vocabulary-level perspective and operationalize vocabulary efficiency as the coverage of a representative corpus by a vocabulary's tokens. We measure coverage by computing the total number of tokens (words and punctuation) in the corpus that are represented in the vocabulary divided by the total number of tokens in the corpus.
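A small sketch of this coverage metric, under the assumption (made explicit in the next paragraph) that the vocabulary is simply the N most frequent tokens of the encoded corpus:

from collections import Counter

def coverage(encoded_corpus_tokens, vocab_size):
    # encoded_corpus_tokens: all tokens of the corpus under the encoding being
    # evaluated (e.g. plain words for the baseline, BITE output for BITE).
    counts = Counter(encoded_corpus_tokens)
    vocab = {tok for tok, _ in counts.most_common(vocab_size)}
    covered = sum(c for tok, c in counts.items() if tok in vocab)
    return covered / sum(counts.values())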
Since BITE does not require a vocabulary to be fixed at training time, we simply set the N most frequent tokens (base forms and inflections) to be our vocabulary. We use the N most frequent tokens in the unencoded text as our baseline vocabulary.

From Figure 3, we observe that the BITE vocabulary achieves a higher coverage of the corpus than the baseline, hence demonstrating the efficacy of BITE at improving vocabulary efficiency. Additionally, we note that this advantage is most significant when the vocabulary is between 5–10k tokens. This implies that inflected word forms comprise a large portion of frequently occurring words, which comports with intuition.

Figure 3: Comparison of coverage (%) of the Wikipedia+BookCorpus (1M examples) dataset between BITE and a trivial baseline (word counts), across vocabulary sizes. Since BITE has no fixed vocabulary, we determine the "vocabulary" by taking the N most common tokens.

Semantic capacity. Since a word's base form is able to semantically represent all its inflected forms (Jackson, 2014), we posit that the number of base forms contained in a vocabulary can be a good proxy for the amount of semantic knowledge it contains. This may be termed the semantic capacity of a vocabulary. In whole-word (non-subword) vocabularies, semantic capacity may be measured by the number of unique lexemes in the vocabulary. For subword vocabularies like BPE's, however, such a measure is slightly less useful since every English word can be trivially represented given the alphabet and a hyphen. Thus, we propose a generalization of the above definition which penalizes the use of multiple (and unknown) tokens to represent a single base form. Formally,

\[
f(T_i) =
\begin{cases}
\dfrac{1}{|T_i| + u_i} & \text{if } |T_i| - u_i > 0 \\
0 & \text{otherwise,}
\end{cases}
\tag{1}
\]

\[
\mathrm{SemCapacity}(T_1, \ldots, T_N) = \sum_{i=1}^{N} f(T_i),
\tag{2}
\]

where N is the total number of unique base forms in the evaluation corpus, T_i is the sequence of (subword) tokens obtained from encoding the ith base form, and u_i is the number of unknown tokens[9] in T_i. While not strictly necessary when comparing vocabularies on the same corpus, normalizing Equation (2) by the number of unique base forms in the corpus may be helpful for cross-corpus comparisons. Further normalizing the resulting quantity by the tokenizer's vocabulary size yields a measure of semantic efficiency.

[9] Usually represented as [UNK] or <unk>.

Although it is possible to extend Equation (1) to cover cases where there are only unknown tokens, this would unnecessarily complicate the equation. Hence, we define f(T_i) = 0 when there are only unknown tokens in the encoded sequence. We also implicitly define the penalty of each extra unknown token to be double[10] that of a token in the vocabulary. If necessary, it is trivial to alter the weight of this penalty by introducing a constant λ:

\[
f(T_i, \lambda) =
\begin{cases}
\dfrac{1}{(|T_i| - u_i) + \lambda u_i} & \text{if } |T_i| - u_i > 0 \\
0 & \text{otherwise.}
\end{cases}
\tag{3}
\]

[10] |T_i| contributes the extra count.

To evaluate the semantic capacity of our vocabularies, we use WordNet (Miller, 1995) as our representative "corpus". Specifically, we only consider its single-word lemmas (N = 83,118). From Figure 4, we see that data-driven tokenizers trained with BITE tend to produce vocabularies with higher semantic capacities.

Figure 4: Semantic capacity of each tokenizer combination's vocabulary (BPE, WordPiece, and unigram LM, each with and without BITE) across vocabulary sizes, as computed in Equations (1) and (2). Higher is better.

Additionally, we observe that tokenizer combinations incorporating WordPiece or unigram LM generally outperform the BPE ones. We believe this to be the result of using a language model to inform vocabulary generation. It is logical that a symbol that maximizes a language model's likelihood on the training data is also semantically "denser", hence prioritizing such symbols produces semantically efficient vocabularies. We leave the in-depth investigation of this relationship to future work.
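The score in Equations (1)–(3) is straightforward to compute; a minimal sketch follows, where encode maps one base form to its (sub)word tokens and lambda_=2 recovers the default double penalty of Equation (1).

def f(tokens, unk_token="[UNK]", lambda_=2.0):
    u = sum(1 for t in tokens if t == unk_token)   # u_i: unknown tokens in T_i
    known = len(tokens) - u                        # |T_i| - u_i
    if known == 0:                                 # only unknown tokens
        return 0.0
    return 1.0 / (known + lambda_ * u)             # Eq. (3); Eq. (1) when lambda_=2

def semantic_capacity(base_forms, encode, unk_token="[UNK]"):
    # Equation (2): sum f over every unique base form in the evaluation
    # corpus, e.g. WordNet's single-word lemmas.
    return sum(f(encode(b), unk_token) for b in base_forms)

# e.g. semantic_capacity(wordnet_lemmas, encode=lambda w: bpe.encode(w).tokens)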
5.2 Adversarial Robustness

BITE's ability to make models more robust to inflectional perturbations can be directly attributed to its preservation of a consistent, inflection-independent base form. We demonstrate this by measuring the similarity between the encoded clean and adversarial sentences, using the Ratcliff/Obershelp algorithm (Ratcliff and Metzener, 1988) as implemented by Python's difflib. We use the MultiNLI in-domain development dataset and the MORPHEUS adversaries generated in §4.1 for this experiment.
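A sketch of this measurement with difflib, which implements a Ratcliff/Obershelp-style matching ratio over arbitrary sequences (here, the token sequences produced by a given tokenizer for the clean and adversarial versions of each example):

import difflib

def mean_similarity(clean_token_seqs, adversarial_token_seqs):
    ratios = [
        difflib.SequenceMatcher(None, clean, adv).ratio()
        for clean, adv in zip(clean_token_seqs, adversarial_token_seqs)
    ]
    return 100.0 * sum(ratios) / len(ratios)   # mean similarity in percent

# e.g. mean_similarity([tok.encode(s).tokens for s in clean_sents],
#                      [tok.encode(s).tokens for s in adv_sents])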
We find that clean and adversarial sequences tokenized by the BITE-equipped tokenizers were more similar (1–2% for WordPiece, 2–2.5% for BPE) than those tokenized by the ones without BITE (Figure 5). The decrease in similarity for all conditions as the vocabulary size increases is unsurprising; a larger vocabulary will generally result in shorter sequences and the same number of differing tokens will be a larger relative change.

Figure 5: Mean percentage of tokens that are the same in the clean and adversarial tokenized sequences, for each tokenizer with and without BITE, across vocabulary sizes.

The above results demonstrate that the improved robustness shown in §4.1 can be directly attributed to the separation of each content word's base form from its inflection and keeping it consistent as the inflection varies, hence mitigating any significant token-level changes.

6 Conclusion

The tokenization stage of the modern deep learning NLP pipeline has not received as much attention as the modeling stage, with researchers often defaulting to a commonly used subword tokenizer like BPE. Adversarial robustness techniques in NLP also largely focus on augmenting the training data with adversarial examples. However, we posit that the process of encoding raw text into network-operable symbols may have more impact on a neural network's generalization and adversarial robustness than previously assumed.

Hence, we propose to improve the tokenization pipeline by incorporating linguistic information to guide the data-driven tokenizer in learning a more efficient vocabulary and generating token sequences that increase the neural network's robustness to non-standard inflection use. We show that this improves its generalization to second language English and World Englishes without requiring explicit training on such data. Since dialectal data is often scarce or even nonexistent (in the case of task-specific labeled datasets), an NLP system's ability to generalize across dialects in a zero-shot manner is crucial for it to work well for diverse linguistic communities.

Finally, given the effectiveness of the common task framework for spurring progress in NLP (Varshney et al., 2019), we hope to do the same for tokenization. As a first step, we propose to evaluate an encoding scheme's efficacy by measuring its vocabulary efficiency and semantic capacity (which may have interesting connections to information-theoretic limits (Ziv and Lempel, 1978)). We have already shown that Base-Inflection Encoding helps a data-driven tokenizer use its limited vocabulary more efficiently by increasing its semantic capacity when the combination is trained from scratch.

References

Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In 6th International Conference on Learning Representations, Vancouver, BC, Canada.

Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2001. A neural probabilistic language model. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 932–938. MIT Press.

Yong Cheng, Lu Jiang, and Wolfgang Macherey. 2019. Robust neural machine translation with doubly adversarial inputs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4324–4333, Florence, Italy. Association for Computational Linguistics.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(76):2493–2537.
Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 21–30. Association for Computational Linguistics.

David Crystal. 2003. English as a Global Language. Cambridge University Press.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

David M. Eberhard, Gary F. Simons, and Charles D. Fennig, editors. 2019. Ethnologue: Languages of the World, 22 edition. SIL International.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36, Melbourne, Australia. Association for Computational Linguistics.

Pauline Foster and Gillian Wigglesworth. 2016. Capturing accuracy in second language performance: The case for a weighted clause ratio. Annual Review of Applied Linguistics, 36:98–116.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, San Diego, California.

Alex Hern. 2017. Facebook translates 'good morning' into 'attack them', leading to arrest. The Guardian.

Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598, Berlin, Germany. Association for Computational Linguistics.

Po-Sen Huang, Robert Stanforth, Johannes Welbl, Chris Dyer, Dani Yogatama, Sven Gowal, Krishnamurthy Dvijotham, and Pushmeet Kohli. 2019. Achieving verified robustness to symbol substitutions via interval bound propagation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4083–4093, Hong Kong, China. Association for Computational Linguistics.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885, New Orleans, Louisiana. Association for Computational Linguistics.

Howard Jackson. 2014. Words and their Meaning. Routledge.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.

Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019. Certified robustness to adversarial word substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4129–4142, Hong Kong, China. Association for Computational Linguistics.

Braj B. Kachru, Yamuna Kachru, and Cecil Nelson, editors. 2009. The Handbook of World Englishes. Wiley-Blackwell.

Tyler Kendall and Charlie Farrington. 2018. The Corpus of Regional African American Language.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.

Jacob R. E. Leimgruber. 2009. Modelling variation in Singapore English. Ph.D. thesis, Oxford University.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38:39–41.

Pāṇini. c. 500 BCE. Aṣṭādhyāyī.

Shana Poplack. 2000. The English History of African American English. Blackwell, Malden, MA.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, arXiv:1910.10683.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

John W. Ratcliff and David E. Metzener. 1988. Pattern matching: The gestalt approach. Dr. Dobb's Journal, 13(7):46.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865, Melbourne, Australia. Association for Computational Linguistics.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152. IEEE.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics.

Deven Shah, H. Andrew Schwartz, and Dirk Hovy. 2019. Predictive biases in natural language processing models: A conceptual framework and overview. arXiv e-prints, arXiv:1912.11078.

Samson Tan, Shafiq Joty, Min-Yen Kan, and Richard Socher. 2020. It's Morphin' Time! Combating linguistic discrimination with inflectional perturbations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Seattle, Washington. Association for Computational Linguistics.

Rachael Tatman. 2017. Gender and dialect bias in YouTube's automatic captions. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 53–59, Valencia, Spain. Association for Computational Linguistics.

Ray K. Tongue. 1974. The English of Singapore and Malaysia. Eastern Universities Press.

Lav R. Varshney, Nitish Shirish Keskar, and Richard Socher. 2019. Pretrained AI models: Performativity, mobility, and change. arXiv e-prints, arXiv:1909.03290.

Changhan Wang, Kyunghyun Cho, and Jiatao Gu. 2019. Neural machine translation with byte-level subwords. arXiv e-prints, arXiv:1909.03341.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv e-prints, arXiv:1910.03771.

Yichao Zhou, Jyun-Yu Jiang, Kai-Wei Chang, and Wei Wang. 2019a. Learning to discriminate perturbations for blocking adversarial attacks in text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4904–4913, Hong Kong, China. Association for Computational Linguistics.

Yichao Zhou, Jyun-Yu Jiang, Kai-Wei Chang, and Wei Wang. 2019b. Learning to discriminate perturbations for blocking adversarial attacks in text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4906–4915, Hong Kong, China. Association for Computational Linguistics.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In The IEEE International Conference on Computer Vision (ICCV).

George K. Zipf. 1965. The Psycho-Biology of Language: An Introduction to Dynamic Philology. MIT Press, Cambridge, MA, USA.

Jacob Ziv and Abraham Lempel. 1978. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, IT-24(5):530–536.