You should evaluate your language model on marginal likelihood over tokenisations
Kris Cao and Laura Rimell
DeepMind, London, UK
{kriscao, laurarimell}@deepmind.com

arXiv:2109.02550v2 [cs.CL] 21 Sep 2021

Abstract

Neural language models typically tokenise input text into sub-word units to achieve an open vocabulary. The standard approach is to use a single canonical tokenisation at both train and test time. We suggest that this approach is unsatisfactory and may bottleneck our evaluation of language model performance. Using only the one-best tokenisation ignores tokeniser uncertainty over alternative tokenisations, which may hurt model out-of-domain performance. In this paper, we argue that instead, language models should be evaluated on their marginal likelihood over tokenisations. We compare different estimators for the marginal likelihood based on sampling, and show that it is feasible to estimate the marginal likelihood with a manageable number of samples. We then evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities, and show that the marginal perplexity can be significantly better than the one best, especially on out-of-domain data. We link this difference in perplexity to the tokeniser uncertainty as measured by tokeniser entropy. We discuss some implications of our results for language model training and evaluation, particularly with regard to tokenisation robustness.

1 Introduction

Neural end-to-end language models have largely done away with traditional pipeline approaches towards building NLP systems. However, one component which stubbornly remains is the tokenisation step, used right at the start of preprocessing. At the time of writing, the most widely used tokenisers, such as BPE (Sennrich et al., 2016) and unigram (Kudo, 2018), break up the input text into subword units, potentially backing off to character-level segmentation if necessary. This allows for coverage of every possible input sequence; on the downside, a single input sequence may now have multiple possible tokenisations.

Typically, language models are trained and evaluated using a single canonical tokenisation out of the multitude of possible ones, but this tokenisation may be suboptimal (Bostrom and Durrett, 2020) for many reasons. For example, different tokenisations – that is, different surface segmentations – can reveal different morphological analyses of the word in question (think un-ion-izeable vs. union-izable), and committing to a particular analysis can discard useful information, particularly if the best analysis from the tokeniser is erroneous (Dyer, 2010).

Further, tokenisers themselves are trained using an objective which optimises the likelihood of the data. This can be explicit (the unigram tokeniser of Kudo (2018) optimises a unigram language modelling objective) or implicit (BPE aims to minimise the description length of the training data, which has close connections to probabilistic methods; MacKay 2003). In this sense they are also language models, albeit far less powerful than the neural language models we train on their outputs. This raises a difficult question: to what extent are our large language models bottlenecked by the tokenisers that we use to train them?

We argue that rather than evaluating language models using the one-best tokenisation from the tokeniser, one should evaluate language models using the marginal likelihood over all possible tokenisations of an input. This divorces language model performance from the performance of the tokenisation model, and we believe this gives a better indicator of the intrinsic quality of the language model.

In this paper, we take a language model pretrained using a single tokenisation, and estimate the marginal likelihood of the model on test data, taking multiple tokenisations of each input into account. While summing exactly over exponentially many tokenisations is intractable, we can estimate the marginal likelihood using importance sampling. One contribution of this paper is to showcase low-variance estimators of the marginal likelihood based on sampling without replacement. We cast the tokeniser as the proposal distribution for our importance sampling estimator, which clearly delimits the role of the tokeniser. Indeed, as the number of samples we consider increases, the language model becomes less and less coupled to the tokeniser, and our evaluation becomes more intrinsic to the language model itself, rather than the language model + tokeniser combination.
We demonstrate that there can be a significant difference – which we call the marginal gap – in marginal likelihood compared to one-best tokenisation likelihood, especially on out-of-domain evaluation sets. This suggests that the tokeniser is failing to generalise well to out-of-domain data, and is a potential bottleneck on language model performance. Finally, we discuss the implications of our results for language model training and evaluation, in particular the prospect of using sampled tokenisations at training time, and how this might improve language model robustness.

2 Taking multiple tokenisations into consideration

We denote by D (for document) a string of text whose score we would like to calculate. Given a vocabulary V of sub-word tokens (which is usually induced by the tokeniser), we denote by T_i potential tokenisations of D – i.e. sequences of tokens t_1 t_2 ... t_{n_i} such that each t_i \in V and the sequence detokenises to D. An autoregressive neural language model (with parameters \theta) is a model which decomposes the probability of the full sequence into a series of left-to-right predictions:

    P_\theta(T, D) = \prod_{i=1}^{n_i} P_\theta(t_i \mid t_{<i}).
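As a concrete reading of this factorisation, the following sketch scores one tokenisation by accumulating per-token log-probabilities. The next_token_log_prob callable is a hypothetical stand-in for whatever scoring interface the language model exposes; it is not an interface from the paper.

    from typing import Callable, Sequence

    def tokenisation_log_prob(
        tokens: Sequence[str],
        next_token_log_prob: Callable[[Sequence[str], str], float],
    ) -> float:
        """log P_theta(T, D) = sum_i log P_theta(t_i | t_<i) for one tokenisation T."""
        total = 0.0
        for i, token in enumerate(tokens):
            # Condition on the prefix t_<i and score the next token.
            total += next_token_log_prob(tokens[:i], token)
        return total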
To evaluate the model on the document D itself, rather than on one particular tokenisation of it, we should marginalise over all possible tokenisations. As summing exactly over exponentially many tokenisations is intractable, we use the importance sampling estimator

    P(D) = \sum_T P(T, D) = E_{T \sim Q}\left[\frac{P(T, D)}{Q(T \mid D)}\right]    (1)

Now, it remains to find a suitable proposal distribution Q(T | D). In this paper, we use the unigram tokeniser of Kudo (2018), as this is the only probabilistic tokeniser that we are aware of. This tokeniser first constructs a lattice of all possible tokenisations given an input and a lexicon of word pieces. Distinct tokenisations of the input correspond to paths through this lattice, and the score of a tokenisation is the sum of the scores of the tokens along the path. As the score decomposes along lattice segments, many interesting quantities, such as Q(D) (the marginal likelihood of an input text under the tokeniser), are exactly calculable. This allows not only for sampling from the lattice of possible tokenisations, but also calculating the score of a given tokenisation (i.e. estimate Q(T | D) = Q(T, D)/Q(D)), which is necessary to estimate the importance weight.

Tokenising consistently.  There is prior evidence (Lazaridou et al., 2021) to suggest that Transformer language models are able to effectively leverage memory, and that perplexities of repeated words in a document can be much lower than the perplexity of the first occurrence of that word. We show in Section 4.3 that this copying ability is tied to the exact tokenisation of that word: if a word reoccurs in a document with a different tokenisation, its perplexity is much higher than if it reappears with the same tokenisation.

Armed with this insight, we design an alternative proposal distribution which samples a single tokenisation for each unique whitespace-delimited type in a document, and then shares that tokenisation for each token of that type in the document.

We note that it is possible to adapt a pre-trained unigram tokeniser to do this, by passing in only the unique whitespace types in a document to the tokeniser and reconstructing the document from the sampled tokenisations. This is possible because the unigram tokeniser does not consider context when tokenising, and whitespace tokens are tokenised independently. We note that this two-stage word generation process, where first we generate the vocabulary for a document, and then generate the document from that vocabulary, has close connections to the two-stage language models proposed in Goldwater et al. (2011). The problem of tokenising consistently only arises when sampling from the tokeniser; the one-best tokenisation of an input from the unigram tokeniser will always tokenise each occurrence of a type identically.
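A minimal sketch of this consistent-tokenisation proposal, assuming a trained SentencePiece unigram model and using its sampling API; the model file name and the alpha smoothing value are illustrative placeholders rather than the paper's settings.

    from typing import Dict, List
    import sentencepiece as spm

    def sample_consistent_tokenisation(document: str,
                                       sp: spm.SentencePieceProcessor,
                                       alpha: float = 1.0) -> List[str]:
        """Sample one segmentation per unique whitespace-delimited type, then
        reuse it for every occurrence of that type when rebuilding the document."""
        words = document.split()
        per_type: Dict[str, List[str]] = {}
        for w in set(words):
            # nbest_size=-1 samples from the full segmentation lattice of the word.
            per_type[w] = sp.sample_encode_as_pieces(w, -1, alpha)
        pieces: List[str] = []
        for w in words:
            pieces.extend(per_type[w])
        return pieces

    # Usage (hypothetical model file):
    # sp = spm.SentencePieceProcessor(model_file="unigram_en.model")
    # print(sample_consistent_tokenisation("the cat sat on the mat", sp))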
2.1 Lowering the variance of the estimator

A naive approach to estimating the marginal likelihood using Equation 1 would be to sample n tokenisations T_1, ..., T_n at random from Q(T | D), score the resulting tokenisations using the language model P_\theta(T_i, D), and average the resulting importance weighted scores. However, due to Jensen's inequality, this is only a lower bound of the true marginal likelihood. We can obtain a tighter bound with the same number of samples by taking the average in probability space rather than log space (as in Burda et al. (2016)):

    \log P_\theta(D) \approx \log\left(\frac{1}{n} \sum_i \frac{P_\theta(T_i, D)}{Q(T_i \mid D)}\right)    (2)
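A short sketch of the with-replacement estimator in Equation 2, computed stably in log space. It assumes the per-sample language model scores log P_theta(T_i, D) and tokeniser scores log Q(T_i | D) have already been computed.

    import numpy as np
    from scipy.special import logsumexp

    def log_marginal_with_replacement(lm_log_probs, proposal_log_probs):
        """Equation 2: log P(D) ~ log((1/n) * sum_i P_theta(T_i, D) / Q(T_i | D)).

        lm_log_probs[i]       = log P_theta(T_i, D) for the i-th sampled tokenisation
        proposal_log_probs[i] = log Q(T_i | D) under the tokeniser
        """
        log_weights = np.asarray(lm_log_probs) - np.asarray(proposal_log_probs)
        return logsumexp(log_weights) - np.log(len(log_weights))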
Changing the sampling procedure.  Taking n independent samples from Q can result in high-variance estimates if the entropy of Q is low and it assigns low probability to tokenisations with high posterior probability under the language model P_\theta. In this case, one would expect to see multiple repeated samples, which do not sufficiently explore the sample space. One option to lower the variance of the estimate is to instead sample without replacement (WOR). By enforcing that all samples are distinct, we can explore the sample space better.

However, sampling without replacement without exactly enumerating all possible sample outcomes is tricky. Kool et al. (2019) show how to sample without replacement for sequence models using stochastic beam search (SBS). Unfortunately, the segmentation lattice used in the unigram tokeniser is not locally normalised, and we cannot naively use SBS. We therefore adapt the SBS algorithm by first running the forward algorithm on the segmentation lattice to calculate the normalising constant at each point of the lattice; we can then combine Viterbi backwards n-best search with the constrained Gumbel-max trick used in SBS to exactly sample n tokenisations WOR.

If we sample without replacement, the inclusion probability of a tokenisation T_i is no longer equal to Q(T_i | D). Kool et al. (2019) show that, for the expectation of a function f under a distribution Q, an unbiased estimator using a set of k samples without replacement is given by

    E_{T \sim Q}[f(T)] \approx \sum_{i=1}^{k} \frac{Q(T_i)}{q_\kappa(T_i)} f(T_i)    (3)

Here \kappa is the perturbed score of the (k+1)-th item during search, and q_\kappa(T) = 1 - \exp(-\exp(\log Q(T) - \kappa)) is the probability that a Gumbel variable with location \log Q(T) takes a value greater than \kappa. In our case, f(T) = P_\theta(T, D)/Q(T), and if we calculate this sum before taking the logarithm to obtain a tighter bound, then the Q(T) terms cancel and we obtain the following estimator for the marginal likelihood of a document:

    \log P_\theta(D) \geq \log\left(\sum_i \frac{P_\theta(T_i, D)}{q_\kappa(T_i)}\right)    (4)
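A sketch of the without-replacement estimator in Equation 4. It assumes the k samples and their tokeniser log-probabilities come from a Gumbel-top-k / stochastic beam search procedure that also reports κ, the perturbed score of the (k+1)-th candidate; that sampling machinery itself is not shown here.

    import numpy as np
    from scipy.special import logsumexp

    def log_marginal_without_replacement(lm_log_probs, proposal_log_probs, kappa):
        """Equation 4: log P(D) >= log(sum_i P_theta(T_i, D) / q_kappa(T_i)).

        q_kappa(T) = 1 - exp(-exp(log Q(T) - kappa)) is the probability that a
        Gumbel variable with location log Q(T) exceeds kappa.
        """
        proposal_log_probs = np.asarray(proposal_log_probs)
        # log q_kappa, computed as log(1 - exp(x)) with x = -exp(log Q - kappa) <= 0.
        x = -np.exp(proposal_log_probs - kappa)
        log_q_kappa = np.log1p(-np.exp(x))
        return logsumexp(np.asarray(lm_log_probs) - log_q_kappa)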
Including the best tokenisation.  To lower the variance of the estimate further (at the cost of introducing some bias), we can always include the best tokenisation T* from the tokeniser in our set of samples (Botev et al., 2017). This method decomposes estimating \sum_T P_\theta(T, D) as P_\theta(T^*, D) + \sum_{T \neq T^*} P_\theta(T, D). We can then estimate the sum over all tokenisations using exactly the same methods as before, using the new distribution Q^* which places 0 mass on T^* and renormalises the resulting probabilities for other tokenisations. It remains to simulate samples from Q^* using samples from Q. We note that for sampling with replacement, a simple technique to sample from Q^* is rejection sampling, where we discard any sample from Q that equals T^*. However, if Q(T) is particularly peaked around T^*, then this procedure may require many rejection steps. Therefore, we do not investigate this estimator further.

When sampling without replacement, we have to be a little more careful. We note that the following scheme samples k times exactly without replacement from Q^*:

1. Take k+1 items T_1, ..., T_{k+1} WOR from Q.
2. If any T_i = T^*, discard it from the sample.
3. Otherwise, discard T_{k+1}.

We also note (by conditioning on the event that T^* appears in the sample) that the inclusion probabilities are easily calculated: if T^* appears in the sample, take \kappa to be the perturbed score of the (k+2)-th item; otherwise take it to be the perturbed score of the (k+1)-th item.

2.2 Summing over the n-best tokenisations

An alternative approach to estimating \sum_T P_\theta(T, D) is to restrict the sum to a smaller set of suitable candidates. As the unigram tokenisation objective decomposes over segments, one can use Viterbi search to find exactly the n highest scoring tokenisations from the tokeniser. We then score each tokenisation using the language model, and sum the contribution of each estimate to obtain a (lower bound) estimate of the marginal likelihood.

This estimator is high-bias and low-variance compared to the sampling-based estimators; we show in Section 4.1 that, although the n-best estimator performs well, it is possible to tune the sample-based estimators to perform better by trading bias for variance.
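A sketch of the n-best estimator of Section 2.2, using SentencePiece's n-best segmentation call to enumerate candidates. The score_with_lm callable, which returns log P_theta(T, D) for a tokenisation given as a list of pieces, is a hypothetical interface.

    import sentencepiece as spm
    from scipy.special import logsumexp

    def log_marginal_nbest(document, sp, score_with_lm, n=128):
        """Lower bound on log P(D) from the n highest-scoring tokenisations."""
        # Exact n-best segmentations under the unigram tokeniser.
        nbest = sp.NBestEncodeAsPieces(document, n)
        return logsumexp([score_with_lm(pieces) for pieces in nbest])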
3 Measuring segmentation lattice entropy

We believe that the entropy of the tokeniser segmentation lattice is an important quantity to measure. The entropy quantifies the uncertainty of the tokeniser, and has a nice interpretation as the (logarithm of the) size of the set of alternatives the tokeniser is choosing uniformly over. While the entropy over hidden states of other structured models like HMMs and CRFs has previously been published (Hernando et al., 2005; Mann and McCallum, 2007; Ilic, 2011), and a uniform treatment in terms of expectation semirings is given in Li and Eisner (2009), we are unaware of previous elementary derivations of the entropy of a segmentation lattice. We give the algorithm in Algorithm 1.

Algorithm 1: Recursive algorithm for lattice entropy
    Result: entropy H_n of the segmentation lattice
    init H_0 = 0, \alpha[i] the forward marginals
    for i = 1 to n do
        for each token w terminating at position i do
            j = start position of w
            // \phi(w) is the score of token w
            p(w) = \exp(\alpha[j] + \phi(w) - \alpha[i])
            H_i += p(w) (H_j - \log p(w))
        end
    end
    return H_n

Note that the recursion has a particularly nice interpretation in terms of information theory. Recall that the entropy of a random variable can be thought of as the necessary number of bits to transmit the random variable. The recursion states that, to transmit the lattice up to position i (which takes H_i bits), we can transmit a prefix of the lattice (using H_j bits), and then transmit the token w that goes from j to i (using -\log p(w) bits). The total number of bits necessary is then the weighted sum of all possible ways of doing this, where the weights are given by the probability of that particular decomposition.
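A sketch of Algorithm 1 under simplifying assumptions: the lattice is built by matching substrings of the input against a dictionary of piece scores (for example, unigram log-probabilities), and every single character is assumed to be in the vocabulary so that the lattice is connected.

    import numpy as np
    from scipy.special import logsumexp

    def lattice_entropy(text, piece_log_scores, max_piece_len=16):
        """Entropy (in nats) of the segmentation lattice of `text` (Algorithm 1).

        piece_log_scores maps word pieces to their log scores phi(w).
        """
        n = len(text)
        alpha = np.full(n + 1, -np.inf)  # forward marginals over prefixes
        alpha[0] = 0.0
        H = np.zeros(n + 1)              # H[i] = entropy of the lattice over text[:i]
        for i in range(1, n + 1):
            incoming = []
            for j in range(max(0, i - max_piece_len), i):
                w = text[j:i]
                if w in piece_log_scores and np.isfinite(alpha[j]):
                    incoming.append((j, alpha[j] + piece_log_scores[w]))
            if not incoming:
                continue  # position unreachable under this vocabulary
            alpha[i] = logsumexp([score for _, score in incoming])
            for j, score in incoming:
                p = np.exp(score - alpha[i])    # p(w) = exp(alpha[j] + phi(w) - alpha[i])
                H[i] += p * (H[j] - np.log(p))  # entropy recursion
        return H[n]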
proved results in various tasks (Kool et al., 2019; For each method outlined in Section 2, we sam- Melis et al., 2019; Adlam et al., 2020), and can be ple 128 different tokenisations of each document, understood as a way of tuning the bias-variance and calculate Pθ (Ti , D) for each sample, before tradeoff with the n-best estimator at the high-bias, aggregating the sample scores into an estimate of low variance end, and independently sampling at the marginal likelihood. We parallelise evaluating the other. We compare the WOR with 1-best es- all the samples for a document on a multi-host TPU timator at a various rate of temperatures on our setup; each dataset takes 15-30 minutes to evaluate. English datasets, and show the results in Figure
4.1 Measuring the marginal likelihood

For both English and German, we use 500 documents sampled randomly from the WMT train and test data and 500 randomly sampled Wikipedia documents (WIKI). For English, we also use 500 documents from the CUSTOMNEWS and arXiv abstracts (ARXIV) datasets of Lazaridou et al. (2021), and for German, we additionally use 200 documents from the MC4 dataset in Xue et al. (2020).

For each method outlined in Section 2, we sample 128 different tokenisations of each document, and calculate P_\theta(T_i, D) for each sample, before aggregating the sample scores into an estimate of the marginal likelihood. We parallelise evaluating all the samples for a document on a multi-host TPU setup; each dataset takes 15-30 minutes to evaluate.

Further, to ensure results are comparable across different tokenisations with potentially different numbers of tokens, we calculate perplexity by dividing the total likelihood across all documents by the total number of whitespace-delimited tokens. We present our results in Table 1.

Table 1: Comparing the different estimators of model marginal perplexity on evaluation sets. The number in brackets represents the one-best tokenisation perplexity. Consistent vs. inconsistent tokenisation refers to whether we tokenise each appearance of a whitespace-delimited type consistently in a document or not.

                            Consistent tokenisation              Inconsistent tokenisation
                            WR      WOR     WOR 1-best  n-best   WR      WOR     WOR 1-best  n-best
English
  WMT train (16.49)         16.59   16.58   16.48       16.47    16.81   16.79   16.48       16.48
  WMT test (22.62)          22.73   22.72   22.59       22.56    23.07   23.01   22.60       22.58
  CUSTOMNEWS (37.09)        37.11   37.12   36.93       36.88    37.90   37.89   37.03       36.95
  WIKI (60.22)              61.09   61.02   59.82       59.71    63.37   63.33   60.06       59.92
  ARXIV (179.20)           176.38  176.11  175.87      175.98   179.76  179.74  177.52      176.90
German
  WMT train (31.84)         32.51   32.58   31.80       31.77    33.04   33.12   31.80       31.78
  WMT test (37.16)          37.68   38.16   37.12       37.08    38.87   38.91   37.13       37.09
  WIKI (66.08)              69.44   69.30   65.86       65.63    72.37   72.41   66.01       65.78
  MC4 (194.02)             206.89  207.15  192.84      192.21   219.63  219.19  193.68      192.87

Our results show that there can be a significant difference between the one-best tokenisation likelihood and the marginal likelihood, particularly as one moves further away from the training data domain. Indeed, the relative perplexity improvement reaches up to 1.9% on EN-ARXIV, and 0.9% on DE-MC4. Further, tokenising words consistently in a document has a large impact on the marginal likelihood estimation. We investigate this effect further in Section 4.3. While the n-best estimator appears to perform the best in this comparison, we show in the next section that by tuning the sampling temperature of the WOR 1-best estimator, it is possible to obtain even better estimates of the marginal likelihood.

The effect of sampling temperature.  We also investigate sharpening the tokeniser distribution before sampling, multiplying the log-probability of each tokenisation by a factor of 1/\tau. Using \tau < 1 has often been shown to give improved results in various tasks (Kool et al., 2019; Melis et al., 2019; Adlam et al., 2020), and can be understood as a way of tuning the bias-variance tradeoff, with the n-best estimator at the high-bias, low-variance end, and independent sampling at the other. We compare the WOR 1-best estimator at a variety of temperatures on our English datasets, and show the results in Figure 1. One can see that it is possible to improve on the n-best estimator by trading some bias for variance, and this can result in a better estimate of the marginal, especially for out-of-domain datasets.

[Figure 1: The effect of temperature scaling on the estimated perplexity on all English datasets, using WOR 1-best. The y-axis is the percentage difference in perplexity relative to the n-best baseline (lower is better). Note the x-axis is scaled as 1/τ, rather than τ.]
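The sketch below illustrates the 1/τ sharpening on an explicitly enumerated candidate set; this is a simplification of sampling from the full segmentation lattice, intended only to show how τ interpolates between the tokeniser's own distribution and its mode.

    import numpy as np
    from scipy.special import logsumexp

    def sharpen_log_probs(log_q, tau):
        """Scale tokenisation log-probabilities by 1/tau and renormalise.

        tau = 1 keeps the tokeniser distribution; tau < 1 pushes mass towards the
        one-best tokenisation (the high-bias, low-variance end of the tradeoff).
        """
        scaled = np.asarray(log_q) / tau
        return scaled - logsumexp(scaled)

    log_q = np.log([0.6, 0.3, 0.1])                    # toy candidate probabilities
    print(np.exp(sharpen_log_probs(log_q, tau=1.0)))   # [0.6  0.3  0.1]
    print(np.exp(sharpen_log_probs(log_q, tau=0.5)))   # sharper: ~[0.78 0.20 0.02]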
4.2 Tokeniser entropy and the marginal gap

Next, we investigate what causes the gap between marginal likelihood and one-best likelihood, and whether there are easily measurable factors that might predict this difference. We hypothesise that the more uncertain the tokeniser is, the bigger this gap becomes. We pool together the documents in all our evaluation sets, and test whether there is a correlation between tokeniser entropy and marginal gap. Our results, shown in Figure 2, demonstrate that there is a correlation between entropy and the marginal gap (Spearman r = 0.57 for English, 0.49 for German); interestingly, it appears that high tokeniser entropy is predictive of a bigger marginal gap, but large marginal gaps are possible even if the tokeniser has low entropy.

[Figure 2 (a) English, (b) German: The correlation between entropy per token and the marginal gap per token in nats (not in perplexity), categorised by evaluation dataset. Some data points which extend beyond the right of the graph are truncated; they follow the same trend.]

4.3 Analysing the caching behaviour of language models

Our results show that tokenising word types consistently within a document leads to significantly tighter estimates of the marginal likelihood compared to independently tokenising input tokens. We hypothesise that this is linked to the ability of the language model to reuse the tokenisations of words it has already seen in the document. To test this, we view a document as a sequence of whitespace-delimited words w_1, ..., w_n; each word w_i has a tokenisation T_{w_i} = (t^1_{w_i}, ..., t^{n_{w_i}}_{w_i}), and each sampled tokenisation T_i can have different token sequences T^i_{w_i} for the same underlying word. We look for words w_k such that:

1. For some sampled tokenisation T_i, there is some l < k with w_l = w_k and T^i_{w_k} = T^i_{w_l} (the word has appeared before with the same tokenisation).
2. For some other sampled tokenisation T_j, for all l < k such that w_l = w_k, T^j_{w_k} ≠ T^j_{w_l} (all previous occurrences of this word in the document were tokenised differently).

We then calculate P_\theta(w_k | w_{<k}) under both tokenisations, conditioning in each case on the corresponding preceding sequence of tokens. By comparing the loss of words paired in this way, we can control for extra confounding factors (such as token unigram probability), and isolate the ability of the language model to recognise whether different token sequences correspond to the same underlying form.

We show our results for our various datasets, together with selected subsets of words, in Table 2. We see that, if the language model sees a word after already seeing it in the same tokenisation, its loss is significantly lower than the loss associated with the first time the word is seen (as was also reported in Lazaridou et al. (2021)). However, this ability is strongly tied to the exact tokenisation of the word: if it appears again, but in a different tokenisation, then its loss can in fact be even greater.

Table 2: Investigating the caching ability of language models. For words which appear multiple times with different tokenisations, we show the average loss of the first occurrence of that word, of subsequent occurrences of that word with the same tokenisation (1), and of subsequent occurrences of that word in a different tokenisation (2). WMT Tr and WMT Te are the WMT training and test evaluation sets respectively.

                All words                    Multi-token words
                First    (1)      (2)        First    (1)      (2)
  WMT Tr        3.88     2.59     17.01      10.73    4.07     21.11
  WMT Te        4.19     2.59     16.69      12.15    4.11     20.40
  CNEWS         6.31     2.99     16.19      17.01    4.88     20.36
  WIKI          7.84     3.62     16.54      17.80    5.63     19.81
  ARXIV         9.94     3.97     14.93      17.56    5.41     18.03

4.4 How many samples are necessary?

Next, we investigate how many samples are necessary to obtain an accurate estimate of the marginal likelihood. We experiment on the EN-ARXIV dataset, as this showed the biggest relative improvement between the marginal likelihood and the one-best likelihood. We take the samples from our n-best estimator with n = 128, and incrementally sum the samples (which are given in decreasing order of likelihood under the tokeniser) to simulate having smaller n. As an oracle experiment to see how many samples contribute significantly to the marginal likelihood, we also order the samples by their language model scores (i.e. we order according to P_\theta(T, D) rather than Q(T | D)) before taking the incremental sum. We show the results in Figure 3. Our results show that, although ostensibly many samples are necessary to estimate the marginal likelihood accurately, only very few samples (on the order of 5) actually contribute significantly.

[Figure 3: The performance of the n-best marginal likelihood estimator on the ARXIV evaluation set as we vary the number of samples, taken in order of Q(T|D) in orange and Pθ(T, D) in blue.]

In practical terms, our results suggest that one needs to take many samples with current tokenisers to accurately estimate the marginal likelihood, but that many of these samples are not effective. We therefore believe that a prerequisite for more widespread adoption of marginal likelihood as an evaluation metric is tokenisers that better fit the language model posterior over tokenisations. Current tokenisers make very strong independence assumptions to make learning and inference tractable, and we believe there is significant scope to design tokenisers which relax these assumptions.
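A sketch of the incremental summation used in this experiment: given log P_theta(T_i, D) for the n-best candidates, the candidates are ranked either by tokeniser score or, for the oracle, by their language model scores, and the running log-sum is accumulated.

    import numpy as np

    def incremental_log_marginal(lm_log_probs, order_by):
        """Running estimate of log sum_i P_theta(T_i, D) as candidates are added.

        order_by: the scores used to rank candidates, e.g. tokeniser log-probabilities
        log Q(T|D), or the language model scores themselves for the oracle ordering.
        """
        lm_log_probs = np.asarray(lm_log_probs, dtype=float)
        order = np.argsort(np.asarray(order_by))[::-1]   # highest-scoring first
        return np.logaddexp.accumulate(lm_log_probs[order])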
5 Related Work

5.1 Tokenisation and segmentation

Unsupervised word segmentation has a long and illustrious history. The earliest motivation came from information retrieval, where collapsing a set of related query terms might help smooth counts over each of those terms individually and result in better retrieval results. The earliest approaches, such as the Porter stemmer (Porter, 1997), were rule-based. However, the power of data-driven statistical methods quickly became apparent, and tools such as Morfessor (Virpioja et al., 2013) used likelihood-based objectives, typically with Bayesian smoothing methods (see also Goldwater et al. (2011)), to induce segmentations.

Sennrich et al. (2016) used a different algorithm to induce segmentations: byte-pair encoding (Gage, 1994). Originally designed as a data compression algorithm, BPE tokenisers are now some of the predominantly used tokenisation methods. Alternative approaches, such as WordPiece (Schuster and Nakajima, 2012) and SentencePiece (Kudo, 2018), explicitly use a language modelling objective to induce a token lexicon. Previous methods have used train-time tokenisation randomisation as a regularisation aid (Kudo, 2018; Provilkov et al., 2020), but still use the one-best tokenisation at test time.

Another strand of work has investigated whether tokenisers that capture linguistic morphology can improve language models. Bostrom and Durrett (2020) showed that unigram and BPE tokenisers for English and Japanese have low recall on recovering linguistic segments, since many morphologically complex words are treated as a single token. Linguistically aligned tokenisers have been shown to result in better language model perplexity (Schwartz et al., 2020; Park et al., 2021) and better downstream task performance (Alkaoud and Syed, 2020), especially for morphologically rich languages. These experiments also use one-best tokenisation at test time.
Rather than considering one-best or stochastic samples of tokenisations, one can use entire segmentation lattices as input to a model. This approach has been considered for morphological tagging (Seker and Tsarfaty, 2020), parsing (Goldberg and Tsarfaty, 2008), and spoken intent recognition (Ladhak et al., 2016), among others.

5.2 Tokenisation-free approaches

An alternative approach to inducing a tokenisation is to decompose input sequences into well-defined orthographic units, such as characters. These approaches circumvent the problem of inducing a lexicon, and have been used for text classification (Conneau et al., 2017), language modelling (Al-Rfou et al., 2019), machine translation (Lee et al., 2017), and word representation (Cao and Rei, 2016). One downside is that dependency lengths become longer at the character level, and lexical information has to be memorised by the compositional machinery of the model. For this reason, traditionally fully character-based approaches did not perform as well as their token-level counterparts, although recent progress suggests this may change soon (Choe et al., 2019; Clark et al., 2021). There also exist approaches which mix character-level and segment-level modelling (Buckman and Neubig, 2018; Kawakami et al., 2019; He et al., 2020), although these segmental language models require more complex inference procedures.

6 Conclusion

In this paper, we argue for using model marginal likelihood over tokenisations as an evaluation metric for language models, rather than one-best tokenisation likelihood. We introduce practical low-variance estimators for measuring the marginal likelihood, and demonstrate that there can be a significant difference between the marginal and the one-best likelihoods, particularly on strongly out-of-domain evaluation sets. Evaluating with marginal likelihood thus goes some way toward loosening the bottleneck imposed by tokeniser quality in the currently dominant language modelling paradigm, and our results suggest that the field may be underestimating the generalisation capability of modern language models. We further demonstrate that tokeniser entropy is a good predictor of this "marginal gap", suggesting that tokeniser entropy, especially when out-of-domain, can be a guide to the number of samples needed for evaluation.

More broadly, our experiments suggest that the field should continue seeking better ways to incorporate tokenisation into end-to-end language modelling. Sampling from the tokeniser during training is an obvious possibility; alternatively, one could incorporate the segmentation lattice into the model directly, which has been beneficial for parsing morphologically rich languages (Goldberg and Tsarfaty, 2008; Tsarfaty et al., 2020). Further, developing more contextual tokenisers which make fewer independence assumptions can also result in both better language models trained on their one-best tokenisation, and better evaluation estimates of the marginal likelihood with fewer samples.
We conduct experiments on German and English corpora in this paper. However, these two languages are only a small sample in the full space of language typology. English is a morphologically impoverished language, and while German compounding and inflection offer some additional challenges, many languages have more complex patterns of word formation and inflection. We believe that estimating marginal likelihood will be important for morphologically richer languages, where tokenisation makes a bigger difference (Gerz et al., 2018; Mielke et al., 2019).

Finally, improved understanding of the interaction between tokenisation and language modelling has implications for evaluating language models on both downstream tasks and language generation tasks. Evidence has shown that gains in language modelling, as measured in perplexity, often lead to improvements in downstream task performance (Radford et al., 2019). It would be instructive to extend our marginal likelihood approach to downstream task evaluation. On generation tasks, since the tokeniser affects language model training but is only implicitly used when sampling (via the tokeniser vocabulary), the effect of tokenisation algorithms requires careful investigation.

Acknowledgements

The authors would like to thank Dani Yogatama and the rest of the Language group at DeepMind for comments and discussion, Gábor Melis and Phil Blunsom for comments on an earlier draft, and Mark Rowland for clarification remarks on sampling without replacement. We would also like to thank our anonymous reviewers.

References

Ben Adlam, Jasper Snoek, and Samuel L. Smith. 2020. Cold posteriors and aleatoric uncertainty.

Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2019. Character-level language modeling with deeper self-attention. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3159–3166.

Mohamed Alkaoud and Mairaj Syed. 2020. On the importance of tokenization in Arabic embedding models. In Proceedings of the Fifth Arabic Natural Language Processing Workshop (WANLP), pages 119–129.

Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55, Online. Association for Computational Linguistics.

Kaj Bostrom and Greg Durrett. 2020. Byte pair encoding is suboptimal for language model pretraining. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4617–4624, Online. Association for Computational Linguistics.

Aleksandar Botev, Bowen Zheng, and David Barber. 2017. Complementary sum sampling for likelihood approximation in large scale classification. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1030–1038, Fort Lauderdale, FL, USA. PMLR.

Jacob Buckman and Graham Neubig. 2018. Neural lattice language models. Transactions of the Association for Computational Linguistics.

Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. 2016. Importance weighted autoencoders. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

Kris Cao and Marek Rei. 2016. A joint model for word embedding and word morphology. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 18–26, Berlin, Germany. Association for Computational Linguistics.

Dokook Choe, Rami Al-Rfou, Mandy Guo, Heeyoung Lee, and Noah Constant. 2019. Bridging the gap for tokenizer-free language models. CoRR, abs/1908.10322.

Jonathan H. Clark, Dan Garrette, Iulia Turc, and John Wieting. 2021. CANINE: Pre-training an efficient tokenization-free encoder for language representation. CoRR, abs/2103.06874.
Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2017. Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1107–1116, Valencia, Spain. Association for Computational Linguistics.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.

Christopher Dyer. 2010. A Formal Model of Ambiguity and its Applications in Machine Translation. Ph.D. thesis, University of Maryland.

Philip Gage. 1994. A new algorithm for data compression. C Users Journal, 12(2):23–38.

Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. Association for Computational Linguistics.

Yoav Goldberg and Reut Tsarfaty. 2008. A single generative model for joint morphological segmentation and syntactic parsing. In Proceedings of ACL-08: HLT, pages 371–379, Columbus, Ohio. Association for Computational Linguistics.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2011. Producing power-law distributions and damping word frequencies with two-stage language models. Journal of Machine Learning Research, 12(68):2335–2382.

Xuanli He, Gholamreza Haffari, and Mohammad Norouzi. 2020. Dynamic programming encoding for subword segmentation in neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3042–3051, Online. Association for Computational Linguistics.

D. Hernando, V. Crespi, and G. Cybenko. 2005. Efficient computation of the hidden Markov model entropy for a given observation sequence. IEEE Transactions on Information Theory, 51(7):2681–2685.

Velimir M. Ilic. 2011. Entropy semiring forward-backward algorithm for HMM entropy computation. CoRR, abs/1108.0347.

Kazuya Kawakami, Chris Dyer, and Phil Blunsom. 2019. Learning to discover, ground and use words with segmental neural language models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6429–6441, Florence, Italy. Association for Computational Linguistics.

Wouter Kool, Herke Van Hoof, and Max Welling. 2019. Stochastic beams and where to find them: The Gumbel-top-k trick for sampling sequences without replacement. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3499–3508. PMLR.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.

Faisal Ladhak, Ankur Gandhe, Markus Dreyer, Lambert Mathias, Ariya Rastrow, and Björn Hoffmeister. 2016. LatticeRNN: Recurrent neural networks over lattices. In Interspeech 2016, pages 695–699.

Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Sebastian Ruder, Dani Yogatama, Kris Cao, Tomas Kocisky, Susannah Young, and Phil Blunsom. 2021. Pitfalls of static language modelling.

Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5:365–378.

Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 40–51, Singapore. Association for Computational Linguistics.

David J. C. MacKay. 2003. Information Theory, Inference, and Learning Algorithms.

Gideon Mann and Andrew McCallum. 2007. Efficient computation of entropy gradient for semi-supervised conditional random fields. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 109–112, Rochester, New York. Association for Computational Linguistics.

Gábor Melis, Charles Blundell, Tomáš Kočiský, Karl Moritz Hermann, Chris Dyer, and Phil Blunsom. 2019. Pushing the bounds of dropout.

Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4975–4989, Florence, Italy. Association for Computational Linguistics.

Hyunji Hayley Park, Katherine J. Zhang, Coleman Haley, Kenneth Steimel, Han Liu, and Lane Schwartz. 2021. Morphology matters: A multilingual language modeling analysis. Transactions of the Association for Computational Linguistics, 9:261–276.

M. F. Porter. 1997. An Algorithm for Suffix Stripping, pages 313–316. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. 2020. BPE-dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892, Online. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Technical Report.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152.

Lane Schwartz, Francis Tyers, Lori Levin, Christo Kirov, Patrick Littell, Chi-kiu Lo, Emily Prud'hommeaux, Hyunji Hayley Park, Kenneth Steimel, Rebecca Knowles, Jeffrey Micher, Lonny Strunk, Han Liu, Coleman Haley, Katherine J. Zhang, Robbie Jimmerson, Vasilisa Andriyanets, Aldrian Obaja Muis, Naoki Otani, Jong Hyuk Park, and Zhisong Zhang. 2020. Neural polysynthetic language modelling. Final Report of the Neural Polysynthetic Language Modelling Team at the 2019 Frederick Jelinek Memorial Summer Workshop.

Amit Seker and Reut Tsarfaty. 2020. A pointer network architecture for joint morphological segmentation and tagging. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4368–4378, Online. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Reut Tsarfaty, Dan Bareket, Stav Klein, and Amit Seker. 2020. From SPMRL to NMRL: What did we learn (and unlearn) in a decade of parsing morphologically-rich languages (MRLs)? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7396–7408, Online. Association for Computational Linguistics.

Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. Technical report.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2020. mT5: A massively multilingual pre-trained text-to-text transformer. CoRR, abs/2010.11934.