Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding
Samson Tan (Salesforce AI Research, National University of Singapore)
Shafiq Joty (Salesforce AI Research, Nanyang Technological University)
Lav R. Varshney (University of Illinois at Urbana-Champaign, Salesforce AI Research)
Min-Yen Kan (National University of Singapore)
{samson.tan,sjoty}@salesforce.com, kanmy@comp.nus.edu.sg, varshney@illinois.edu

arXiv:2004.14870v2 [cs.CL] 11 Oct 2020

Abstract

Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English. Although comprehension by human readers is usually unimpaired by non-standard inflections, current NLP systems are not yet robust. We propose Base-Inflection Encoding (BITE), a method to tokenize English text by reducing inflected words to their base forms before reinjecting the grammatical information as special symbols. Fine-tuning pretrained NLP models for downstream tasks using our encoding defends against inflectional adversaries while maintaining performance on clean data. Models using BITE generalize better to dialects with non-standard inflections without explicit training, and translation models converge faster when trained with BITE. Finally, we show that our encoding improves the vocabulary efficiency of popular data-driven subword tokenizers. Since there has been no prior work on quantitatively evaluating vocabulary efficiency, we propose metrics to do so.[1]

Figure 1: Base-Inflection Encoding reduces inflected words to their base forms, then reinjects the grammatical information into the sentence as inflection symbols.

1 Introduction

Large-scale neural models have proven successful at a wide range of natural language processing (NLP) tasks but are susceptible to amplifying discrimination against minority linguistic communities (Hovy and Spruit, 2016; Tan et al., 2020) due to selection bias in the training data and model overamplification (Shah et al., 2019).

Most datasets implicitly assume a distribution of error-free Standard English speakers, but this does not accurately reflect the majority of the global English-speaking population, who are either second language (L2) or non-standard dialect speakers (Crystal, 2003; Eberhard et al., 2019). These World Englishes differ at lexical, morphological, and syntactic levels (Kachru et al., 2009); sensitivity to these variations predisposes English NLP systems to discriminate against speakers of World Englishes by either misunderstanding or misinterpreting them (Hern, 2017; Tatman, 2017). Left unchecked, these biases could inadvertently propagate to future models via metrics built around pretrained models, such as BERTScore (Zhang et al., 2020).

In particular, Tan et al. (2020) show that current question answering and machine translation systems are overly sensitive to non-standard inflections, a common feature of dialects such as Colloquial Singapore English (CSE) and African American Vernacular English (AAVE).[2] Since people naturally correct for or ignore non-standard inflection use (Foster and Wigglesworth, 2016), we should expect NLP systems to be equally robust.

Existing work on adversarial robustness for NLP primarily focuses on adversarial training methods (Belinkov and Bisk, 2018; Ribeiro et al., 2018; Tan et al., 2020) or on classifying and correcting adversarial examples (Zhou et al., 2019a). However, this effectively increases the size of the training dataset by including adversarial examples, or requires training a new model to identify and correct perturbations, thereby significantly increasing the overall computational cost of creating robust models.

These approaches also operate only on the raw text or on the model, ignoring tokenization, the operation that transforms raw text into a form that the neural network can learn from.

[1] Code will be available at github.com/salesforce/bite.
[2] Examples in Appendix A.
We introduce a new representation for word tokens that separates base from inflection. This improves both model robustness and vocabulary efficiency by explicitly inducing linguistic structure in the input to the NLP system (Erdmann et al., 2019; Henderson, 2020).

Many extant NLP systems use a combination of a whitespace and punctuation tokenizer followed by a data-driven subword tokenizer such as byte pair encoding (BPE; Sennrich et al. (2016)). However, a purely data-driven approach may fail to find the optimal encoding, both in terms of vocabulary efficiency and cross-dialectal generalization. This could make the neural model more vulnerable to inflectional perturbations. Hence, we:

• Propose Base-InflecTion Encoding (BITE), which uses morphological information to help the data-driven tokenizer use its vocabulary efficiently and generate robust symbol[3] sequences. In contrast to morphological segmentors such as Linguistica (Goldsmith, 2000) and Morfessor (Creutz and Lagus, 2002), we reduce inflected forms to their base forms before reinjecting the inflection information into the encoded sequence as special symbols. This approach gracefully handles the canonicalization of words with nonconcatenative morphology while generally allowing the original sentence to be reconstructed.

• Demonstrate BITE's effectiveness at making neural NLP systems robust to non-standard inflection use while preserving performance on Standard English examples. Crucially, simply fine-tuning the pretrained model for the downstream task after adding BITE is sufficient. Unlike adversarial training, BITE does not enlarge the dataset and is more computationally efficient.

• Show that BITE helps BERT (Devlin et al., 2019) generalize to dialects unseen during training and also helps Transformer-big (Ott et al., 2018) converge faster for the WMT'14 En-De task.

• Propose metrics such as symbol complexity to operationalize and evaluate the vocabulary efficiency of an encoding scheme. Our metrics are generic and can be used to evaluate any tokenizer.

[3] Following Sennrich et al. (2016), we use symbol instead of token to avoid confusion with the unencoded word token.

2 Related Work

Subword tokenization. Before neural models can learn, raw text must first be encoded into symbols with the help of a fixed-size vocabulary. Early models represented each word as a single symbol in the vocabulary (Bengio et al., 2001; Collobert et al., 2011), and uncommon words were represented by an unknown symbol. However, such a representation is unable to adequately deal with words absent from the training vocabulary. Therefore, subword representations like WordPiece (Schuster and Nakajima, 2012) and BPE (Sennrich et al., 2016) were proposed to encode out-of-vocabulary (OOV) words by segmenting them into subwords and encoding each subword as a separate symbol. This way, less information is lost in the encoding process since OOV words are approximated as a combination of subwords in the vocabulary. Wang et al. (2019) reduce vocabulary sizes by operating on bytes instead of characters (as in standard BPE). To make subword regularization more tractable, Kudo (2018) proposed an alternative method of building a subword vocabulary: reducing an initially oversized vocabulary down to the required size with the aid of a unigram language model, as opposed to incrementally building a vocabulary as in WordPiece and BPE variants. However, machine translation systems operating on subwords still have trouble translating rare words from highly-inflected categories (Koehn and Knowles, 2017).

Sadat and Habash (2006), Koehn and Hoang (2007), and Kann and Schütze (2016) propose to improve machine translation and morphological reinflection by encoding morphological features separately, while Sylak-Glassman et al. (2015) propose a schema for inflectional features. Avraham and Goldberg (2017) explore the effect of learning word embeddings from base forms and morphological tags for Hebrew, while Chaudhary et al. (2018) show that representing words as base forms, phonemes, and morphological tags improves cross-lingual transfer for low-resource languages.

Adversarial robustness in NLP. To harden NLP systems against adversarial examples, existing work largely uses adversarial training (Goodfellow et al., 2015; Jia and Liang, 2017; Ebrahimi et al., 2018; Belinkov and Bisk, 2018; Ribeiro et al., 2018; Iyyer et al., 2018; Cheng et al., 2019). However, this generally involves retraining the model with the adversarial data, which is computationally expensive and time-consuming. Tan et al. (2020) showed that simply fine-tuning a trained model for a single epoch on appropriately generated adversarial training data is sufficient to harden the model against inflectional adversaries.
Instead of adversarial training, Piktus et al. (2019) train word embeddings to be robust to misspellings, while Zhou et al. (2019b) propose using a BERT-based model to detect adversaries and recover clean examples. Jia et al. (2019) and Huang et al. (2019) use Interval Bound Propagation to train provably robust pre-Transformer models, while Shi et al. (2020) propose an efficient algorithm for training certifiably robust Transformer architectures.

Summary. Popular subword tokenizers operate on surface forms in a purely data-driven manner. Existing adversarial robustness methods for large-scale Transformers are computationally expensive, while provably robust methods have only been shown to work for pre-Transformer architectures and small-scale Transformers.

Our work uses linguistic information (inflectional morphology) in conjunction with data-driven subword encoding schemes to make large-scale NLP models robust to non-standard inflections and generalize better to L2 and World Englishes, while preserving performance on Standard English. We also show that our method helps existing subword tokenizers use their vocabulary more efficiently.

3 Linguistically-Grounded Tokenization

Data-driven subword tokenizers like BPE improve a model's ability to approximate the semantics of unknown words by splitting them into subwords. Although the fully data-driven nature of such methods makes them language-agnostic, it forces them to rely only on the statistics of the surface forms when transforming words into subwords, since they do not exploit any language-specific morphological regularities. To illustrate, the past tenses of go, take, and keep are the inflected forms went, took, and kept, respectively, which have little to no overlap with their base forms[4] or with each other, even though they share the same tense. These six surface forms would likely have no subwords in common in the vocabulary. Consequently, the neural model would have the burden of learning both the relation between base forms and inflected forms and the relation between inflections for the same tense. Additionally, since vocabularies are fixed before model training, such an encoding does not optimally use a limited vocabulary.

Even when inflections do not orthographically alter the base form and there is a significant overlap between the base and inflected forms, e.g., the -ed and -d suffixes, the suffix may be encoded as a separate subword, and base forms and suffixes may not be consistently represented. To illustrate, encoding danced as [dance, d] and dancing as [danc, ing] results in two different "base forms" for the same word, dance. This again burdens the model with learning that the two "base forms" mean the same thing and makes inefficient use of a limited vocabulary. When encoded in conjunction with another inflected form like entered, which should be encoded as [enter, ed], this encoding scheme also produces two different subwords for the same type of inflection, -ed vs. -d. As in the first example, the burden of learning that the two suffixes correspond to the same tense is transferred to the learning model.

A possible solution is to instead encode danced as [danc, ed] and dancing as [danc, ing], but there is no guarantee that a data-driven encoding scheme will learn this pattern without some language-specific linguistic supervision. In addition, this unnecessarily splits up the base form into two subwords, danc and e; the latter contains no extra semantic or grammatical information yet increases the encoded sequence length. Although individually minor, encoding many base words in this manner increases the computational cost for any encoder or decoder network.

Finally, although it is theoretically possible to force a data-driven tokenizer to segment inflected forms into morphologically logical subwords by limiting the vocabulary size, many inflected forms are represented as individual symbols at common vocabulary sizes (30-40k). We found that the BERTbase WordPiece tokenizer and BPE[5] encoded each of the above examples as single symbols.

[4] Base (no quotes) is synonymous with lemma in this paper.
[5] Trained on Wikipedia+BookCorpus (1M) with a vocabulary size of 30k symbols.
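The inconsistency described above is easy to observe directly. The short check below is not part of the paper's released code; it loads the pretrained bert-base-cased WordPiece vocabulary through the transformers library and prints how the words discussed in this section are segmented, and the exact splits depend on that particular vocabulary.

```python
# Quick check of how a standard WordPiece vocabulary segments the examples
# discussed above (illustrative only, not the paper's released code).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
for word in ["went", "took", "kept", "danced", "dancing", "entered"]:
    print(word, "->", tokenizer.tokenize(word))
# Common inflected forms typically come back as single symbols, so "went",
# "took", and "kept" share no subwords with "go", "take", and "keep".
```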
3.1 Base-Inflection Encoding

To address these issues, we propose the Base-InflecTion Encoding framework (or BITE), which encodes the base form and inflection of content words separately. Similar to how existing subword encoding schemes improve the model's ability to approximate the semantics of out-of-vocabulary words with in-vocabulary subwords, BITE helps the model better handle out-of-distribution inflection usage by keeping a content word's base form consistent even when its inflected form changes drastically. This distributional deviation could manifest as adversarial examples, such as those generated by MORPHEUS (Tan et al., 2020), or as sentences produced by L2 or World Englishes speakers. By keeping the base forms consistent, BITE provides adversarial robustness to the model.

BITE (Fig. 1). Given an input sentence S = [w_1, ..., w_N] where w_i is the ith word, BITE generates a sequence of symbols S' = [w'_1, ..., w'_N] such that w'_i = [BASE(w_i), INFLECT(w_i)], where BASE(w_i) is the base form of the word and INFLECT(w_i) is the inflection (grammatical category) of the word (Algorithm 1). If w_i is not inflected, INFLECT(w_i) is NULL and excluded from the sequence of symbols to reduce the neural network's computational cost. In our implementation, we use Penn Treebank tags to represent inflections.

Algorithm 1: Base-InflecTion Encoding (BITE)
    Require: input sentence S = [w_1, ..., w_N]
    Ensure: encoded sequence S'
    S' <- []
    for i = 1, ..., N do
        if POS(w_i) in {NOUN, VERB, ADJ} then
            base <- GETLEMMA(w_i, POS(w_i))
            inflection <- GETINFLECTION(w_i)
            S' <- S' + [base, inflection]
        else
            S' <- S' + [w_i]
        end if
    end for
    return S'

By lemmatizing each inflected word to obtain the base form instead of segmenting it as in most data-driven encoding schemes, BITE ensures this base form is consistent for all inflected forms of a word, unlike a subword produced by segmentation, which can only contain characters present in the original word. For example, BASE(took), BASE(taking), and BASE(taken) all correspond to the same base form, take, even though take is orthographically very different from took.

Similarly, encoding all inflections of the same grammatical category (e.g., verb-past-tense) in a canonical form should help the model to learn each inflection's grammatical role more quickly. This is because the model does not need to first learn that the same grammatical category can manifest in orthographically different forms.

Crucially, the original sentence can usually be reconstructed from the base forms and the grammatical information preserved by the inflection symbols, except in cases of overabundance (Thornton, 2019).

Implementation details. We use the BertPreTokenizer from the tokenizers[6] library for whitespace and punctuation splitting. We use the NLTK (Bird et al., 2009) implementation of the averaged perceptron tagger (Collins, 2002) with greedy decoding to generate POS tags, which serve both to improve lemmatization accuracy and as inflection symbols. For lemmatization and reinflection, we use lemminflect[7], which uses a dictionary look-up together with rules for lemmatizing and inflecting words. A benefit of this approach is that the neural network can now generate orthographically appropriate inflected forms by generating the base form and the corresponding inflection symbol.

[6] github.com/huggingface/tokenizers
[7] github.com/bjascob/LemmInflect
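A minimal Python sketch of this encoder is given below. It mirrors Algorithm 1 using NLTK's perceptron tagger and lemminflect as described above; the mapping from Penn Treebank tags to content-word classes and the set of tags treated as inflections are our own reading of the paper rather than its released implementation.

```python
# Minimal sketch of Algorithm 1 using NLTK and lemminflect (assumes the
# 'averaged_perceptron_tagger' NLTK data has been downloaded). The tag
# sets below are illustrative choices, not the paper's released code.
import nltk
from lemminflect import getLemma

TAG_TO_UPOS = {"N": "NOUN", "V": "VERB", "J": "ADJ"}   # content words only
INFLECTED_TAGS = {"NNS", "NNPS",                       # plural nouns
                  "VBD", "VBG", "VBN", "VBZ",          # inflected verbs
                  "JJR", "JJS"}                        # comparative/superlative

def bite_encode(words):
    """Encode a pre-tokenized sentence as base forms plus inflection symbols."""
    encoded = []
    for word, tag in nltk.pos_tag(words):
        upos = TAG_TO_UPOS.get(tag[0])
        if upos is None:                      # non-content word: copy through
            encoded.append(word)
            continue
        lemmas = getLemma(word, upos=upos)    # dictionary look-up plus rules
        encoded.append(lemmas[0] if lemmas else word)
        if tag in INFLECTED_TAGS:             # reinject grammatical information
            encoded.append(tag)               # Penn Treebank tag as the symbol
    return encoded

print(bite_encode("He took the dogs out".split()))
# e.g. ['He', 'take', 'VBD', 'the', 'dog', 'NNS', 'out']
```

In the BITE+D pipeline described next in §3.2, the output of such a function would simply be handed on to the data-driven subword tokenizer.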
3.2 Compatibility with Data-Driven Methods

Although BITE has the numerous advantages outlined above, it suffers from the same weakness as regular word-level tokenization schemes when used alone: a limited ability to handle out-of-vocabulary words. Hence, we designed BITE as a general framework that seamlessly incorporates existing data-driven schemes to take advantage of their proven ability to handle OOV words.

To achieve this, a whitespace/punctuation-based pretokenizer is first used to transform the input into a sequence of words and punctuation characters, as is common in machine translation. Next, BITE is applied and the resulting sequence is converted into a sequence of integers by a data-driven encoding scheme (Fig. 6 in Appendix B). In our experiments, we use BITE in this manner and refer to the combined tokenizer as "BITE+D", where D refers to the data-driven encoding scheme.

4 Model-Based Experiments

We first demonstrate the effectiveness of BITE using the pretrained cased BERTbase (Devlin et al., 2019) before training a full Transformer (Vaswani et al., 2017) from scratch. We do not replace WordPiece and BPE but instead incorporate them into the BITE framework as described in §3.2. The advantages and disadvantages of this approach will be discussed in the next section. We do not do any hyperparameter tuning but use the original models' hyperparameters in all experiments (detailed in Appendix B).
Table 1: BERTbase results on clean and adversarial MultiNLI and SQuAD 2.0 examples. We compare BITE+WordPiece to WordPiece alone and to WordPiece with one epoch of adversarial fine-tuning. For a fair comparison with adversarial fine-tuning, we also trained the BITE+WordPiece model for an extra epoch (bottom row) on clean data.

                        SQuAD 2 Ans. (F1)    SQuAD 2 All (F1)     MNLI (Acc.)          MNLI-MM (Acc.)
Encoding                Clean    Morpheus    Clean    Morpheus    Clean    Morpheus    Clean    Morpheus
WordPiece (WP)          74.58    61.37       72.75    59.32       83.44    58.70       83.59    59.75
BITE + WP               74.50    71.33       72.71    69.23       83.01    76.11       83.50    76.64
WP + Adv. FT.           79.07    72.21       74.45    68.23       83.86    83.87       83.86    75.77
BITE + WP (+1 epoch)    75.46    72.56       73.69    70.66       82.21    81.05       83.36    81.04

4.1 Adversarial Robustness (Classification)

We evaluate BITE's ability to improve model robustness for question answering and natural language understanding using SQuAD 2.0 (Rajpurkar et al., 2018) and MultiNLI (Williams et al., 2018), respectively. We use MORPHEUS (Tan et al., 2020), an adversarial attack targeting inflectional morphology, to test the overall system's robustness to non-standard inflections. Tan et al. (2020) previously demonstrated MORPHEUS's ability to generate plausible and semantically equivalent adversarial examples resembling L2 English sentences. We attack each BERTbase model separately and report F1 scores on the answerable questions and on the full SQuAD 2.0 dataset, following Tan et al. (2020). In addition, for MNLI, we report scores for both the in-domain (MNLI) and out-of-domain dev. sets (MNLI-MM).

BITE+WordPiece vs. only WordPiece. First, we demonstrate the effectiveness of BITE at making the model robust to inflectional adversaries. After fine-tuning two separate BERTbase models on SQuAD 2.0 and MultiNLI, we generate adversarial examples for them using MORPHEUS. From Table 1, we observe that the BITE+WordPiece model not only achieves similar performance (±0.5) on clean data, but is significantly more robust to inflectional adversaries (10-point difference for SQuAD 2.0, 17-point difference for MultiNLI).

BITE vs. adversarial fine-tuning. Next, we compare BITE to adversarial fine-tuning (Tan et al., 2020), an economical variation of adversarial training (Goodfellow et al., 2015) for making models robust to inflectional variation. In adversarial fine-tuning, an adversarial training set is generated by randomly sampling inflectional adversaries k times from the adversarial distribution found by MORPHEUS and adding them to the original training set. Rather than retraining the model on this adversarial training set, the previously trained model is simply trained for one extra epoch. We follow this methodology and adversarially fine-tune the WordPiece-only BERTbase for one epoch with k set to 4. To ensure a fair comparison, we also train the BITE+WordPiece BERTbase on the original training set for an extra epoch.

From Table 1, we observe that BITE is often more effective than adversarial fine-tuning at making the model robust against inflectional adversaries, and in some cases (SQuAD 2.0 All and MNLI-MM) even without needing the additional epoch of training. However, the adversarially fine-tuned model consistently achieves better performance on clean data. This is likely because, even though adversarial fine-tuning requires only a single epoch of extra training, the process of generating the training set increases its size by a factor of k and hence the number of updates. In contrast, BITE requires no extra training and is more economical.

Adversarial fine-tuning is also less effective at inducing model robustness when the adversarial example comes from an out-of-domain distribution (8-point difference between MNLI and MNLI-MM). This makes it less useful for practical scenarios, where this is often the case. In contrast, BITE performs equally well on both in- and out-of-domain data, demonstrating its applicability to practical scenarios where the training and testing domains may not match. This is the result of preserving the base forms, which we investigate further in §5.2.
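For concreteness, the adversarial fine-tuning baseline can be sketched as follows. The reinflect_verbs helper below is a hypothetical stand-in that randomly re-inflects verbs with lemminflect; the actual procedure samples from the adversarial distribution found by MORPHEUS rather than from a uniform distribution over tags.

```python
# Sketch of assembling the k-times-sampled adversarial fine-tuning set
# described above (assumes NLTK tagger data and lemminflect; the random
# re-inflection is only a stand-in for MORPHEUS's adversarial sampling).
import random
import nltk
from lemminflect import getLemma, getInflection

VERB_TAGS = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]

def reinflect_verbs(sentence, p=0.5):
    """Randomly swap each verb's inflection while keeping its lemma fixed."""
    words = []
    for word, tag in nltk.pos_tag(sentence.split()):
        if tag.startswith("VB") and random.random() < p:
            lemma = (getLemma(word, upos="VERB") or (word,))[0]
            forms = getInflection(lemma, tag=random.choice(VERB_TAGS))
            words.append(forms[0] if forms else word)
        else:
            words.append(word)
    return " ".join(words)

def build_adversarial_finetuning_set(examples, k=4):
    """Append k perturbed copies of every (text, label) pair to the data."""
    adversarial = [(reinflect_verbs(text), label)
                   for text, label in examples
                   for _ in range(k)]
    return list(examples) + adversarial
```

The already fine-tuned model is then trained for one extra epoch on this enlarged set, which is why its number of gradient updates grows with k.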
4.2 Machine Translation

Next, we evaluate BITE's impact on machine translation using the Transformer-big architecture (Ott et al., 2018) and the WMT'14 English-German (En-De) task. We apply BITE+BPE to the English examples and compare it to the BPE-only baseline. More details about our experimental setup can be found in Appendix B.3.

Table 2: Results on newstest2014 for Transformer-big trained on WMT'16 English-German (En-De).

Condition   Encoding     BLEU    METEOR
Clean       BPE only     29.13   47.80
            BITE + BPE   29.61   48.31
Morpheus    BPE only     14.71   39.54
            BITE + BPE   17.77   41.58

To obtain the final models, we perform early stopping based on the validation perplexity and average the last ten checkpoints. We observe that the BITE+BPE model converges 28% faster (Fig. 7) than the baseline (20k vs. 28k updates), in addition to outperforming it by 0.48 BLEU on the standard data and 3.06 BLEU on the MORPHEUS adversarial examples (Table 2). This suggests that explicit encoding of morphological information helps models learn better and more robust representations faster.

4.3 Dialectal Variation

Apart from second languages, dialects are another common source of non-standard inflections. However, there is a dearth of task-specific datasets in English dialects like AAVE and CSE. Therefore, in this section's experiments, we use the model's pseudo perplexity (pPPL) (Wang and Cho, 2019) on monodialectal corpora as a proxy for its performance on downstream tasks in the corresponding dialect. The pPPL measures how certain the pretrained model is about its predictions and reflects its generalization ability on the dialectal datasets. To ensure fair comparisons across different subword segmentations, we normalize the pseudo log-likelihoods by the number of word tokens fed into the WordPiece component of each tokenization pipeline (Mielke, 2019). This avoids unfairly penalizing BITE for inevitably generating longer sequences. Finally, we scale the pseudo log-likelihoods by the masking probability (0.15) so that the final pPPLs are within a reasonable range.

Corpora. For AAVE, we use the Corpus of Regional African American Language (CORAAL) (Kendall and Farrington, 2018), which comprises transcriptions of interviews with African Americans born between 1891 and 2005. For our evaluation, only the interviewee's speech was used. In addition, we strip all in-line glosses and annotations from the transcriptions before dropping all lines with fewer than three words. After preprocessing, this corpus consists of slightly under 50k lines of text (1,144,803 word tokens, 17,324 word types).

To obtain a CSE corpus, we scrape the Infotech Clinics section of the Hardware Zone Forums[8], a forum frequented by Singaporeans and where CSE is commonly used. Similar preprocessing to the AAVE data yields a 2.2M-line corpus (45,803,898 word tokens, 253,326 word types).

Setup. We take the same pretrained BERTbase model and fine-tune two separate variants (with and without BITE) on English Wikipedia and BookCorpus (Zhu et al., 2015) using the masked language modeling (MLM) loss without the next sentence prediction (NSP) loss. We fine-tune for one epoch on increasingly large subsets of the dataset, since this has been shown to be more effective than doing the same number of gradient updates on a fixed subset (Raffel et al., 2019). Preprocessing steps are described in Appendix B.1.

[8] forums.hardwarezone.com.sg

Figure 2: Pseudo perplexity of BERTbase on the CSE, AAVE, and Standard English corpora. BITEabl refers to the ablated version without grammatical information. [Panels: (a) Colloquial Singapore English (forum threads), (b) African American Vernacular English (CORAAL), (c) Standard English (Wikipedia+BookCorpus); pseudo perplexity is plotted against the number of MLM training examples for WordPiece only and WordPiece + BITE.]
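The sketch below shows one way to compute this stochastic pseudo perplexity with a masked language model from the transformers library. It is not the paper's evaluation script; in particular, the exact normalization (dividing the summed negative pseudo log-likelihood by the masking probability times the number of word tokens fed to WordPiece) is our reading of the description above.

```python
# Stochastic pseudo-perplexity sketch (assumes torch and transformers).
import math
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased").eval()

def pseudo_perplexity(lines, mask_prob=0.15):
    total_nll, total_word_tokens = 0.0, 0
    for line in lines:
        total_word_tokens += len(line.split())   # word tokens before WordPiece
        enc = tokenizer(line, return_tensors="pt", truncation=True)
        ids = enc["input_ids"]
        special = (ids == tokenizer.cls_token_id) | (ids == tokenizer.sep_token_id)
        mask = (torch.rand(ids.shape) < mask_prob) & ~special
        masked_ids = ids.clone()
        masked_ids[mask] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked_ids, attention_mask=enc["attention_mask"]).logits
        log_probs = torch.log_softmax(logits, dim=-1)
        token_ll = log_probs.gather(-1, ids.unsqueeze(-1)).squeeze(-1)
        total_nll -= token_ll[mask].sum().item()
    # Normalize by word tokens and rescale by the masking probability.
    return math.exp(total_nll / (mask_prob * total_word_tokens))
```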
Next, we evaluate the models' pPPLs on the AAVE and CSE corpora, which we consider to come from dialectal distributions that differ from the training data, which is considered to be Standard English. Since calculating the stochastic pPPL requires randomly masking a certain percentage of symbols for prediction, we also experiment with doing this multiple times per sentence before averaging. However, we find no significant difference between doing the calculation once or five times; the random effects likely cancel out due to the large sizes of our corpora.

Table 3: Effect of reinjecting grammatical information via inflection symbols. BITEabl refers to the ablation with the dummy symbol instead of inflection symbols.

                           Clean                Morpheus
Dataset                    BITEabl   BITE       BITEabl   BITE
SQuAD 2 (F1), Ans. Qns.    68.85     74.50      70.68     71.33
SQuAD 2 (F1), All Qns.     72.90     72.71      69.29     69.23
MNLI (Acc.), Matched       82.28     83.01      80.17     76.11
MNLI (Acc.), Mismatched    83.18     83.50      81.21     76.64
WMT'14 (BLEU)              28.14     29.61      20.91     17.77

Results. From Fig. 2, we observe that the BITE+WordPiece model initially has a much higher pPPL on the dialectal corpora, before converging to 50-65% of the standard model's pPPL as the model adapts to the presence of the new inflection symbols (e.g., VBD, NNS, etc.). Crucially, the models are not trained on the dialectal corpora, which demonstrates the effectiveness of BITE at helping models better generalize to unseen dialects. For Standard English, WordPiece+BITE performs slightly worse than WordPiece, reflecting the results on QA and NLI in Table 1. However, it is important to note that the WordPiece vocabulary used was not optimized for BITE; results from §4.2 indicate that training the data-driven tokenizer from scratch with BITE might improve performance.

CSE vs. AAVE. Astute readers might notice that there is a large difference in pPPL between the two dialectal corpora, even for the same tokenizer combination. One possible explanation is that CSE differs significantly from Standard English in morphology and syntax due to its Austronesian and Sinitic influences (Tongue, 1974). In addition, loan words and discourse particles not found in Standard English, like lah, lor, and hor, are commonplace in CSE (Leimgruber, 2009). AAVE, however, generally shares the same syntax as Standard English due to its largely English origins (Poplack, 2000) and is more similar linguistically. These differences are likely responsible for the significant increase in pPPL for CSE compared to AAVE.

Another possible explanation is that the BookCorpus may contain examples of AAVE, since the BookCorpus' source, Smashwords, also publishes African American fiction. We believe the reason for the difference is a mixture of these two factors.

4.4 Ablation Study

To tease apart the effects of BITE's two components, lemmatization and the inflection symbols, on task performance, we ablate the extra grammatical information from the encoding by replacing all inflection symbols with a dummy symbol (BITEabl). As expected, BITEabl is significantly more robust to adversarial inflections (Table 3), and the slight performance drop is likely due to the POS tagger being adversarially affected. However, different tasks likely require different levels of attention to inflections, and BITE allows the network to learn this for each task. For example, NLI performance on clean data is only slightly affected by the absence of morphosyntactic information, while MT and QA performance is more significantly affected.

In a similar ablation for the pPPL experiments, we find that both the canonicalizing effect of the base form and knowledge of each word's grammatical role contribute to the lower pPPL on dialectal data (Table 4 in the Appendix). We discuss this in greater detail in Appendix B.2 and also report the pseudo log-likelihoods and per-symbol pPPLs in the spirit of transparency and reproducibility.
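Concretely, the ablated BITEabl encoding can be produced by post-processing BITE's output, as in the sketch below; the tag set and the "[INFL]" dummy symbol name are illustrative stand-ins rather than the paper's released implementation, and bite_encode refers to the sketch given in §3.1.

```python
# BITEabl sketch: replace every inflection symbol with a single dummy
# symbol, leaving only the canonical base forms informative.
PTB_INFLECTION_TAGS = {"NNS", "NNPS", "VBD", "VBG", "VBN", "VBZ", "JJR", "JJS"}

def bite_ablated(encoded_symbols, dummy="[INFL]"):
    return [dummy if s in PTB_INFLECTION_TAGS else s for s in encoded_symbols]

# e.g. ['He', 'take', 'VBD', 'the', 'dog', 'NNS']
#   -> ['He', 'take', '[INFL]', 'the', 'dog', '[INFL]']
```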
5 Model-Independent Analyses

Finally, we analyze WordPiece, BPE, and unigram LM subword tokenizers trained with and without BITE. Implementation details can be found in Appendix B.4. Through our experiments, we explore how BITE improves adversarial robustness and helps the data-driven tokenizer use its vocabulary more efficiently. We use 1M examples from Wikipedia+BookCorpus for training.

5.1 Vocabulary Efficiency

We may operationalize the question of whether BITE improves vocabulary efficiency in numerous ways. We discuss two vocabulary-level measures here and a sequence-level measure in Appendix C.

Figure 3: Comparison of coverage between BITE and a trivial baseline (word counts). [Plot of coverage (%) against vocabulary size in symbols for the Baseline and BITE vocabularies.]

Figure 4: Symbol complexities of tokenizer vocabularies as computed in Eqs. (1) and (2). Lower is better. [Plot of symbol complexity against vocabulary size for BPE, BPE + BITE, WordPiece, WordPiece + BITE, Unigram LM, and Unigram LM + BITE.]

Vocabulary coverage. One measure of vocabulary efficiency is the coverage of a representative corpus by a vocabulary's symbols. We measure coverage by computing the total number of tokens (words and punctuation) in the corpus that are represented in the vocabulary, divided by the total number of tokens in the corpus. We use the 1M subset of Wikipedia+BookCorpus as our representative corpus. Since BITE does not require a vocabulary size to be fixed before training, we take the N most frequent types (base forms and inflections) as our vocabulary. We use the N most frequent types in the unencoded text as our baseline vocabulary.

From Fig. 3, we observe that the BITE vocabulary achieves a higher coverage of the corpus than the baseline, hence demonstrating the efficacy of BITE at improving vocabulary efficiency. Additionally, we note that this advantage is most significant (5-7%) when the vocabulary contains fewer than 10k symbols. This implies that inflected word forms comprise a large portion of frequently occurring types, which comports with intuition.

Symbol complexity. Another measure of vocabulary efficiency is the total number of symbols needed to encode a representative set of word types. We term this the symbol complexity. Formally, given N, the total number of word types in the evaluation corpus; S_i, the sequence of symbols obtained from encoding the ith type; and u_i, the number of unknown symbols in S_i, we define:

    SymbComp(S_1, ..., S_N) = sum_{i=1}^{N} f(S_i),                          (1)

    f(S_i) = |S_i| + u_i  if |S_i| - u_i > 0,  and f(S_i) = 0 otherwise.     (2)

While not strictly necessary when comparing vocabularies on the same corpus, normalizing Eq. (1) by the number of word types in the corpus may be helpful for cross-corpus comparisons. For simplicity, we define f(S_i) = 0 when there are only unknown symbols in the encoded sequence, and we set the penalty of each extra unknown symbol to be double that of a symbol in the vocabulary.[9] A general form of Eq. (2) is included in Appendix B.4.

To measure the symbol complexities of our vocabularies, we use WordNet's single-word lemmas (Miller, 1995) as our "corpus" (N = 83,118). From Fig. 4, we see that training data-driven tokenizers with BITE produces vocabularies with lower symbol complexities. Additionally, we observe that tokenizer combinations incorporating WordPiece or unigram LM generally outperform the BPE ones. We believe this to be the result of using a language model to inform vocabulary creation. It is logical that a symbol that maximizes a language model's likelihood on the training data is also semantically "denser"; hence, prioritizing such symbols produces efficient vocabularies. We leave an in-depth investigation of this relationship to future work.

5.2 Adversarial Robustness

BITE's ability to make models more robust to inflectional variation can be directly attributed to its preservation of consistent, inflection-independent base forms. We demonstrate this by measuring the similarity between the encoded clean and adversarial sentences with the Ratcliff/Obershelp algorithm (Ratcliff and Metzener, 1988). We use the MultiNLI in-domain development set and the MORPHEUS adversaries generated in §4.1. We find that the clean and adversarial sequences encoded by the BITE+D tokenizers were more similar (1-2.5%) than those encoded without BITE (Fig. 5).

[9] The |S_i| term contributes the extra count.
Figure 5: Mean percentage of symbols that are the same in the clean and adversarial encoded sequences. [Plot of % similarity against vocabulary size for BPE only, BPE + BITE, WordPiece only, WordPiece + BITE, Unigram LM only, and Unigram LM + BITE.]

6 Limitations

Our BITE implementation relies on an external POS tagger to assign inflection tags to each word. This tagger requires language-specific training data, which can be a challenge for low-resource languages. However, this could also be an advantage, since the overall system can be improved by training the tagger on dialect-specific datasets, or readily extended to other languages given a suitable tagger.

Another drawback of BITE is that it increases the length of the encoded sequence, which may lead to extremely long sequences if used on morphologically rich languages. However, this is not an issue for English Transformer models since the increase in length will always be
Acknowledgments

We are grateful to Michael Yoshitaka Erlewine from the NUS Dept. of English Language and Literature and our anonymous reviewers for their invaluable feedback. We also thank Xuan-Phi Nguyen for his help with reproducing the Transformer-big baseline. Samson is supported by Salesforce and Singapore's Economic Development Board under its Industrial Postgraduate Programme.

References

Oded Avraham and Yoav Goldberg. 2017. The interplay of semantics and morphology in word embeddings. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 422-426, Valencia, Spain.
Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In 6th International Conference on Learning Representations, Vancouver, BC, Canada.
Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2001. A neural probabilistic language model. In Advances in Neural Information Processing Systems 13, pages 932-938. MIT Press.
Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media.
Aditi Chaudhary, Chunting Zhou, Lori Levin, Graham Neubig, David R. Mortensen, and Jaime Carbonell. 2018. Adapting word embeddings to new languages with morphological and phonological subword representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3285-3295, Brussels, Belgium.
Yong Cheng, Lu Jiang, and Wolfgang Macherey. 2019. Robust neural machine translation with doubly adversarial inputs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4324-4333, Florence, Italy.
Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 1-8.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(76):2493-2537.
Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 21-30.
David Crystal. 2003. English as a Global Language. Cambridge University Press.
Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota.
David M. Eberhard, Gary F. Simons, and Charles D. Fennig, editors. 2019. Ethnologue: Languages of the World, 22nd edition. SIL International.
Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31-36, Melbourne, Australia.
Alexander Erdmann, Salam Khalifa, Mai Oudah, Nizar Habash, and Houda Bouamor. 2019. A little linguistics goes a long way: Unsupervised segmentation with limited language specific guidance. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 113-124, Florence, Italy.
Pauline Foster and Gillian Wigglesworth. 2016. Capturing accuracy in second language performance: The case for a weighted clause ratio. Annual Review of Applied Linguistics, 36:98-116.
John Goldsmith. 2000. Linguistica: An automatic morphological analyzer. In Proceedings of the 36th Meeting of the Chicago Linguistic Society.
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, San Diego, California.
James Henderson. 2020. The unstoppable rise of computational linguistics in deep learning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Seattle, Washington.
Alex Hern. 2017. Facebook translates 'good morning' into 'attack them', leading to arrest. The Guardian.
Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591-598, Berlin, Germany.
Po-Sen Huang, Robert Stanforth, Johannes Welbl, Chris Dyer, Dani Yogatama, Sven Gowal, Krishnamurthy Dvijotham, and Pushmeet Kohli. 2019. Achieving verified robustness to symbol substitutions via interval bound propagation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4083-4093, Hong Kong, China.
Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875-1885, New Orleans, Louisiana.
Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021-2031, Copenhagen, Denmark.
Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019. Certified robustness to adversarial word substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4129-4142, Hong Kong, China.
Braj B. Kachru, Yamuna Kachru, and Cecil Nelson, editors. 2009. The Handbook of World Englishes. Wiley-Blackwell.
Katharina Kann and Hinrich Schütze. 2016. Single-model encoder-decoder with explicit morphological representation for reinflection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 555-560, Berlin, Germany.
Tyler Kendall and Charlie Farrington. 2018. The Corpus of Regional African American Language.
Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868-876, Prague, Czech Republic.
Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28-39, Vancouver.
Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66-75, Melbourne, Australia.
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32, pages 7059-7069. Curran Associates, Inc.
Jacob R. E. Leimgruber. 2009. Modelling Variation in Singapore English. Ph.D. thesis, Oxford University.
Sabrina J. Mielke. 2019. Can you compare perplexity across different segmentations?
George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38:39-41.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 1-9, Brussels, Belgium.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318.
Aleksandra Piktus, Necati Bora Edizel, Piotr Bojanowski, Edouard Grave, Rui Ferreira, and Fabrizio Silvestri. 2019. Misspelling oblivious word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3226-3234, Minneapolis, Minnesota.
Shana Poplack. 2000. The English History of African American English. Blackwell, Malden, MA.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, arXiv:1910.10683.
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784-789, Melbourne, Australia.
John W. Ratcliff and David E. Metzener. 1988. Pattern matching: The gestalt approach. Dr. Dobb's Journal, 13(7):46.
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856-865, Melbourne, Australia.
Fatiha Sadat and Nizar Habash. 2006. Combination of Arabic preprocessing schemes for statistical machine translation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 1-8, Sydney, Australia.
Julian Salazar, Davis Liang, Toan Q. Nguyen, and Katrin Kirchhoff. 2020. Masked language model scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2699-2712, Online.
Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149-5152. IEEE.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725.
Deven Shah, H. Andrew Schwartz, and Dirk Hovy. 2019. Predictive biases in natural language processing models: A conceptual framework and overview. arXiv e-prints, arXiv:1912.11078.
Zhouxing Shi, Huan Zhang, Kai-Wei Chang, Minlie Huang, and Cho-Jui Hsieh. 2020. Robustness verification for transformers. In 8th International Conference on Learning Representations, Addis Ababa, Ethiopia.
John Sylak-Glassman, Christo Kirov, David Yarowsky, and Roger Que. 2015. A language-independent feature schema for inflectional morphology. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 674-680, Beijing, China.
Samson Tan, Shafiq Joty, Min-Yen Kan, and Richard Socher. 2020. It's morphin' time! Combating linguistic discrimination with inflectional perturbations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2920-2935, Online.
Rachael Tatman. 2017. Gender and dialect bias in YouTube's automatic captions. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 53-59, Valencia, Spain.
Anna M. Thornton. 2019. Overabundance in morphology. In Oxford Research Encyclopedia of Linguistics.
Ray K. Tongue. 1974. The English of Singapore and Malaysia. Eastern Universities Press.
Lav R. Varshney, Nitish Shirish Keskar, and Richard Socher. 2019. Pretrained AI models: Performativity, mobility, and change. arXiv e-prints, arXiv:1909.03290.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998-6008. Curran Associates, Inc.
Alex Wang and Kyunghyun Cho. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. arXiv e-prints, arXiv:1902.04094.
Changhan Wang, Kyunghyun Cho, and Jiatao Gu. 2019. Neural machine translation with byte-level subwords. arXiv e-prints, arXiv:1909.03341.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112-1122.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv e-prints, arXiv:1910.03771.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
Yichao Zhou, Jyun-Yu Jiang, Kai-Wei Chang, and Wei Wang. 2019a. Learning to discriminate perturbations for blocking adversarial attacks in text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4904-4913, Hong Kong, China.
Yichao Zhou, Jyun-Yu Jiang, Kai-Wei Chang, and Wei Wang. 2019b. Learning to discriminate perturbations for blocking adversarial attacks in text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4906-4915, Hong Kong, China.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In The IEEE International Conference on Computer Vision (ICCV).
Jacob Ziv and Abraham Lempel. 1978. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, IT-24(5):530-536.
A Examples of Inflectional Variation in English Dialects

African American Vernacular English (Kendall and Farrington, 2018)

• I dreamed about we was over my uh, father mother house, and then we was moving.
• I be over with my friends.
• And this boy name RD-NAME-3, he was tryna be tricky, pretend like he don't do nothing all the time.

Colloquial Singapore English (Singlish) (Source: forums.hardwarezone.com.sg)

• Anyone face the problem after fresh installed the Win 10 Pro, under NetWork File sharing after you enable this function (Auto discovery), the computer still failed to detect our Users connected to the same NetWork?
• I have try it already, but no solutions appear.
• How do time machine works??

B Implementation/Experiment Details

All models are trained on 8 16GB Tesla V100s.

Datasets and metrics. MultiNLI (Williams et al., 2018) is a natural language inference dataset of 392,702 training examples, 10k in-domain and 10k out-of-domain dev. examples, and 10k in-domain and 10k out-of-domain test examples spanning 10 domains. Each example comprises a premise, a hypothesis, and a label indicating whether the premise entails, contradicts, or is irrelevant to the hypothesis. Models are evaluated using Accuracy = (# correct predictions) / (# predictions).

SQuAD 2.0 (Rajpurkar et al., 2018) is an extractive question answering dataset comprising more than 100k answerable questions and 50k unanswerable questions (130,319 training examples, 11,873 development examples, and 8,862 test examples). Each example is composed of a question, a passage, and an answer. Answerable questions can be answered by a span in the passage; unanswerable questions cannot. Models are evaluated using the F1 score.

Wikipedia+BookCorpus is a combination of English Wikipedia and BookCorpus. We use Lample and Conneau (2019)'s script to download and preprocess the Wikipedia dump before removing blank lines, overly short lines (fewer than three words or four characters), and lines with doc tags. We also remove blank and overly short lines from BookCorpus before concatenating and shuffling both datasets.

Figure 6: How BITE fits into the tokenization pipeline.

B.1 Classification Experiments

For our BERT experiments, we build BITE on top of the BertTokenizer class in Wolf et al. (2019) and use their BERT implementation and fine-tuning scripts[11]. BERTbase has 110M parameters. We do not perform a hyperparameter search and instead use the example hyperparameters for the respective scripts.

[11] github.com/huggingface/transformers/.../examples

B.2 Discussion for Perplexity Experiments

Effect of lemmatization and inflection symbols. We conduct two ablations to investigate the effects of lemmatization and inflection symbols on the models' pseudo perplexities: the first simply lemmatizes the input before encoding it with WordPiece (WordPiece+LEMM), and the second replaces every inflection symbol generated by BITE with a dummy symbol (WordPiece+BITEabl). The latter is the same ablation used in Table 3, and from Table 4 we see that this condition consistently achieved the lowest pPPL on all three corpora. However, we believe that the highly predictable dummy symbols likely account for the significant drops in pseudo perplexity.

To test this hypothesis, we perform another ablation, WordPiece+LEMM, where the dummy symbols are removed entirely. If the dummy symbols were not truly responsible for the large drops in pPPL, we should observe similar results for both WordPiece+LEMM and WordPiece+BITEabl.
Table 4: Effect of lemmatization, inflection symbols, and the dummy symbol on pseudo perplexity (pPPL). We also show the effect of normalizing by the word token vs. the subword symbol count. Lower is better. Bolded values in the original indicate the lowest row-wise pPPLs, excluding WP+BITEabl due to the confounding effect of the highly predictable dummy symbols.

                                        WordPiece (WP)   WP + LEMM      WP + BITE          WP + BITEabl
                                        —                (Lemmatize)    (+Infl. Symbols)   (+Dummy Symbol)
Colloquial Singapore English
  Total word tokens before WP           45803898         45803898       51982873           51982873
  Pseudo Negative Log-Likelihood        30910290         30558864       31110740           30292923
  pPPL (per word token before WP)       92.58            85.43          52.67              48.66
  pPPL (per symbol after WP)            49.10            46.39          32.02              30.20
African American Vernacular English
  Total word tokens before WP           1144803          1144803        1320730            1320730
  Pseudo Negative Log-Likelihood        452269           444021         453031             434621
  pPPL (per word token before WP)       13.92            13.27          9.84               8.96
  pPPL (per symbol after WP)            12.90            12.41          9.18               8.43
Standard English
  Total word tokens before WP           252153           252153         290391             290391
  Pseudo Negative Log-Likelihood        77339            78074          90148              75467
  pPPL (per word token before WP)       7.72             7.87           7.92               5.65
  pPPL (per symbol after WP)            6.34             6.36           6.07               4.86

From Table 4 (pPPL per word token before WP), we see that the decrease in pPPL between WordPiece+LEMM and WordPiece is less drastic, thereby lending evidence for rejecting the null hypothesis.

Poorer performance on Standard English. We observe that lemmatizing all content words and/or reinjecting the grammatical information appears to have the opposite effect on Standard English data compared to the dialectal data. Intuitively, such an encoding should result in even more significant reductions in perplexity on Standard English, since the POS tagger and lemmatizer were trained on Standard English data. A possible explanation for these results is that the WordPiece tokenizer and BERT model are overfitted on Standard English, since they were both (pre-)trained on Standard English data.

Normalizing log-likelihoods. In an earlier version of this paper, we computed pseudo perplexity by normalizing the pseudo log-likelihoods with the number of masked subword symbols (the default). A reviewer pointed out that per-subword-symbol perplexities are not directly comparable across different subword segmentations/vocabularies, but per-word perplexities are (Mielke, 2019; Salazar et al., 2020). However, using the same denominator would unfairly penalize models using BITE, since it inevitably increases the symbol sequence length, which affects the predicted log-likelihoods. In addition, with the exception of the inflection/dummy symbols that replaced some unused tokens, the vocabularies of all the WordPiece tokenizers used in our pseudo perplexity experiments are exactly the same since we do not retrain them. Therefore, we attempt to balance these two factors by normalizing by the number of word tokens fed into the WordPiece component of each tokenization pipeline in Fig. 2. We also report the per-subword pPPL and the raw pseudo negative log-likelihood in Table 4.

B.3 Machine Translation Experiments

Figure 7: Validation perplexity over the course of training for Transformer-big. [Plot of validation perplexity against the number of updates for BPE only and BITE + BPE.]

For our Transformer-big experiments, we use the fairseq (Ott et al., 2019) implementation and the hyperparameters from Ott et al. (2018):
• Parameters: 210,000,000
• BPE operations: 32,000