You should evaluate your language model on marginal likelihood over tokenisations
Kris Cao and Laura Rimell
DeepMind, London, UK
{kriscao, laurarimell}@deepmind.com

arXiv:2109.02550v2 [cs.CL] 21 Sep 2021

Abstract

Neural language models typically tokenise input text into sub-word units to achieve an open vocabulary. The standard approach is to use a single canonical tokenisation at both train and test time. We suggest that this approach is unsatisfactory and may bottleneck our evaluation of language model performance. Using only the one-best tokenisation ignores tokeniser uncertainty over alternative tokenisations, which may hurt model out-of-domain performance. In this paper, we argue that instead, language models should be evaluated on their marginal likelihood over tokenisations. We compare different estimators for the marginal likelihood based on sampling, and show that it is feasible to estimate the marginal likelihood with a manageable number of samples. We then evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities, and show that the marginal perplexity can be significantly better than the one best, especially on out-of-domain data. We link this difference in perplexity to the tokeniser uncertainty as measured by tokeniser entropy. We discuss some implications of our results for language model training and evaluation, particularly with regard to tokenisation robustness.

1 Introduction

Neural end-to-end language models have largely done away with traditional pipeline approaches towards building NLP systems. However, one component which stubbornly remains is the tokenisation step, used right at the start of preprocessing. At the time of writing, the most widely used tokenisers, such as BPE (Sennrich et al., 2016) and unigram (Kudo, 2018), break up the input text into subword units, potentially backing off to character-level segmentation if necessary. This allows for coverage of every possible input sequence; on the downside, a single input sequence may now have multiple possible tokenisations.

Typically, language models are trained and evaluated using a single canonical tokenisation out of the multitude of possible ones, but this tokenisation may be suboptimal (Bostrom and Durrett, 2020) for many reasons. For example, different tokenisations – that is, different surface segmentations – can reveal different morphological analyses of the word in question (think un-ion-izeable vs. union-izable), and committing to a particular analysis can discard useful information, particularly if the best analysis from the tokeniser is erroneous (Dyer, 2010).

Further, tokenisers themselves are trained using an objective which optimises the likelihood of the data. This can be explicit (the unigram tokeniser of Kudo (2018) optimises a unigram language modelling objective) or implicit (BPE aims to minimise the description length of the training data, which has close connections to probabilistic methods; MacKay 2003). In this sense they are also language models, albeit far less powerful than the neural language models we train on their outputs. This raises a difficult question: to what extent are our large language models bottlenecked by the tokenisers that we use to train them?

We argue that rather than evaluating language models using the one-best tokenisation from the tokeniser, one should evaluate language models using the marginal likelihood over all possible tokenisations of an input. This divorces language model performance from the performance of the tokenisation model, and we believe this gives a better indicator of the intrinsic quality of the language model.

In this paper, we take a language model pretrained using a single tokenisation, and estimate the marginal likelihood of the model on test data, taking multiple tokenisations of each input into account. While summing exactly over exponentially many tokenisations is intractable, we can estimate the marginal likelihood using importance sampling. One contribution of this paper is to showcase low-variance estimators of the marginal likelihood based on sampling without replacement. We cast the tokeniser as the proposal distribution for our importance sampling estimator, which clearly delimits the role of the tokeniser. Indeed, as the number of samples we consider increases, the language model becomes less and less coupled to the tokeniser, and our evaluation becomes more intrinsic to the language model itself, rather than the language model + tokeniser combination.
We demonstrate that there can be a significant difference – which we call the marginal gap – in marginal likelihood compared to one-best tokenisation likelihood, especially on out-of-domain evaluation sets. This suggests that the tokeniser is failing to generalise well to out-of-domain data, and is a potential bottleneck on language model performance. Finally, we discuss the implications of our results for language model training and evaluation, in particular the prospect of using sampled tokenisations at training time, and how this might improve language model robustness.

2 Taking multiple tokenisations into consideration

We denote by D (for document) a string of text whose score we would like to calculate. Given a vocabulary V of sub-word tokens (which is usually induced by the tokeniser), we denote by T_i potential tokenisations of D – i.e. sequences of tokens t_1 t_2 ... t_{n_i} such that each t_i \in V and the sequence detokenises to D. An autoregressive neural language model (with parameters \theta) is a model which decomposes the probability of the full sequence into a series of left-to-right predictions:

    P_\theta(T, D) = \prod_{i=1}^{n_i} P_\theta(t_i \mid t_{<i}).
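As a concrete reading of this factorisation, the following sketch scores one tokenisation by accumulating per-token log-probabilities. The next_token_log_prob callable is a hypothetical stand-in for whatever scoring interface the language model exposes; it is not an interface from the paper.

    from typing import Callable, Sequence

    def tokenisation_log_prob(
        tokens: Sequence[str],
        next_token_log_prob: Callable[[Sequence[str], str], float],
    ) -> float:
        """log P_theta(T, D) = sum_i log P_theta(t_i | t_<i) for one tokenisation T."""
        total = 0.0
        for i, token in enumerate(tokens):
            # Condition on the prefix t_<i and score the next token.
            total += next_token_log_prob(tokens[:i], token)
        return total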
To evaluate the model on the document D itself, rather than on one particular tokenisation of it, we should marginalise over all possible tokenisations. As summing exactly over exponentially many tokenisations is intractable, we use the importance sampling estimator

    P(D) = \sum_T P(T, D) = E_{T \sim Q}\left[\frac{P(T, D)}{Q(T \mid D)}\right]    (1)

Now, it remains to find a suitable proposal distribution Q(T | D). In this paper, we use the unigram tokeniser of Kudo (2018), as this is the only probabilistic tokeniser that we are aware of. This tokeniser first constructs a lattice of all possible tokenisations given an input and a lexicon of word pieces. Distinct tokenisations of the input correspond to paths through this lattice, and the score of a tokenisation is the sum of the scores of the tokens along the path. As the score decomposes along lattice segments, many interesting quantities, such as Q(D) (the marginal likelihood of an input text under the tokeniser), are exactly calculable. This allows not only for sampling from the lattice of possible tokenisations, but also calculating the score of a given tokenisation (i.e. estimate Q(T | D) = Q(T, D)/Q(D)), which is necessary to estimate the importance weight.

Tokenising consistently.  There is prior evidence (Lazaridou et al., 2021) to suggest that Transformer language models are able to effectively leverage memory, and that perplexities of repeated words in a document can be much lower than the perplexity of the first occurrence of that word. We show in Section 4.3 that this copying ability is tied to the exact tokenisation of that word: if a word reoccurs in a document with a different tokenisation, its perplexity is much higher than if it reappears with the same tokenisation.

Armed with this insight, we design an alternative proposal distribution which samples a single tokenisation for each unique whitespace-delimited type in a document, and then shares that tokenisation for each token of that type in the document.

We note that it is possible to adapt a pre-trained unigram tokeniser to do this, by passing in only the unique whitespace types in a document to the tokeniser and reconstructing the document from the sampled tokenisations. This is possible because the unigram tokeniser does not consider context when tokenising, and whitespace tokens are tokenised independently. We note that this two-stage word generation process, where first we generate the vocabulary for a document, and then generate the document from that vocabulary, has close connections to the two-stage language models proposed in Goldwater et al. (2011). The problem of tokenising consistently only arises when sampling from the tokeniser; the one-best tokenisation of an input from the unigram tokeniser will always tokenise each occurrence of a type identically.
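A minimal sketch of this consistent-tokenisation proposal, assuming a trained SentencePiece unigram model and using its sampling API; the model file name and the alpha smoothing value are illustrative placeholders rather than the paper's settings.

    from typing import Dict, List
    import sentencepiece as spm

    def sample_consistent_tokenisation(document: str,
                                       sp: spm.SentencePieceProcessor,
                                       alpha: float = 1.0) -> List[str]:
        """Sample one segmentation per unique whitespace-delimited type, then
        reuse it for every occurrence of that type when rebuilding the document."""
        words = document.split()
        per_type: Dict[str, List[str]] = {}
        for w in set(words):
            # nbest_size=-1 samples from the full segmentation lattice of the word.
            per_type[w] = sp.sample_encode_as_pieces(w, -1, alpha)
        pieces: List[str] = []
        for w in words:
            pieces.extend(per_type[w])
        return pieces

    # Usage (hypothetical model file):
    # sp = spm.SentencePieceProcessor(model_file="unigram_en.model")
    # print(sample_consistent_tokenisation("the cat sat on the mat", sp))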
2.1 Lowering the variance of the estimator

A naive approach to estimating the marginal likelihood using Equation 1 would be to sample n tokenisations T_1, ..., T_n at random from Q(T | D), score the resulting tokenisations using the language model P_\theta(T_i, D), and average the resulting importance weighted scores. However, due to Jensen's inequality, this is only a lower bound of the true marginal likelihood. We can obtain a tighter bound with the same number of samples by taking the average in probability space rather than log space (as in Burda et al. (2016)):

    \log P_\theta(D) \approx \log\left(\frac{1}{n} \sum_i \frac{P_\theta(T_i, D)}{Q(T_i \mid D)}\right)    (2)
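A short sketch of the with-replacement estimator in Equation 2, computed stably in log space. It assumes the per-sample language model scores log P_theta(T_i, D) and tokeniser scores log Q(T_i | D) have already been computed.

    import numpy as np
    from scipy.special import logsumexp

    def log_marginal_with_replacement(lm_log_probs, proposal_log_probs):
        """Equation 2: log P(D) ~ log((1/n) * sum_i P_theta(T_i, D) / Q(T_i | D)).

        lm_log_probs[i]       = log P_theta(T_i, D) for the i-th sampled tokenisation
        proposal_log_probs[i] = log Q(T_i | D) under the tokeniser
        """
        log_weights = np.asarray(lm_log_probs) - np.asarray(proposal_log_probs)
        return logsumexp(log_weights) - np.log(len(log_weights))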
Changing the sampling procedure.  Taking n independent samples from Q can result in high-variance estimates if the entropy of Q is low and it assigns low probability to tokenisations with high posterior probability under the language model P_\theta. In this case, one would expect to see multiple repeated samples, which do not sufficiently explore the sample space. One option to lower the variance of the estimate is to instead sample without replacement (WOR). By enforcing that all samples are distinct, we can explore the sample space better.

However, sampling without replacement without exactly enumerating all possible sample outcomes is tricky. Kool et al. (2019) show how to sample without replacement for sequence models using stochastic beam search (SBS). Unfortunately, the segmentation lattice used in the unigram tokeniser is not locally normalised, and we cannot naively use SBS. We therefore adapt the SBS algorithm by first running the forward algorithm on the segmentation lattice to calculate the normalising constant at each point of the lattice; we can then combine Viterbi backwards n-best search with the constrained Gumbel-max trick used in SBS to exactly sample n tokenisations WOR.

If we sample without replacement, the inclusion probability of a tokenisation T_i is no longer equal to Q(T_i | D). Kool et al. (2019) show that, for the expectation of a function f under a distribution Q, an unbiased estimator using a set of k samples without replacement is given by

    E_{T \sim Q}[f(T)] \approx \sum_{i=1}^{k} \frac{Q(T_i)}{q_\kappa(T_i)} f(T_i)    (3)

Here \kappa is the perturbed score of the (k+1)-th item during search, and q_\kappa(T) = 1 - \exp(-\exp(\log Q(T) - \kappa)) is the probability that a Gumbel variable with location \log Q(T) takes a value greater than \kappa. In our case, f(T) = P_\theta(T, D)/Q(T), and if we calculate this sum before taking the logarithm to obtain a tighter bound, then the Q(T) terms cancel and we obtain the following estimator for the marginal likelihood of a document:

    \log P_\theta(D) \geq \log\left(\sum_i \frac{P_\theta(T_i, D)}{q_\kappa(T_i)}\right)    (4)
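A sketch of the without-replacement estimator in Equation 4. It assumes the k samples and their tokeniser log-probabilities come from a Gumbel-top-k / stochastic beam search procedure that also reports κ, the perturbed score of the (k+1)-th candidate; that sampling machinery itself is not shown here.

    import numpy as np
    from scipy.special import logsumexp

    def log_marginal_without_replacement(lm_log_probs, proposal_log_probs, kappa):
        """Equation 4: log P(D) >= log(sum_i P_theta(T_i, D) / q_kappa(T_i)).

        q_kappa(T) = 1 - exp(-exp(log Q(T) - kappa)) is the probability that a
        Gumbel variable with location log Q(T) exceeds kappa.
        """
        proposal_log_probs = np.asarray(proposal_log_probs)
        # log q_kappa, computed as log(1 - exp(x)) with x = -exp(log Q - kappa) <= 0.
        x = -np.exp(proposal_log_probs - kappa)
        log_q_kappa = np.log1p(-np.exp(x))
        return logsumexp(np.asarray(lm_log_probs) - log_q_kappa)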
Including the best tokenisation.  To lower the variance of the estimate further (at the cost of introducing some bias), we can always include the best tokenisation T* from the tokeniser in our set of samples (Botev et al., 2017). This method decomposes estimating \sum_T P_\theta(T, D) as P_\theta(T^*, D) + \sum_{T \neq T^*} P_\theta(T, D). We can then estimate the sum over all tokenisations using exactly the same methods as before, using the new distribution Q^* which places 0 mass on T^* and renormalises the resulting probabilities for other tokenisations. It remains to simulate samples from Q^* using samples from Q. We note that for sampling with replacement, a simple technique to sample from Q^* is rejection sampling, where we discard any sample from Q that equals T^*. However, if Q(T) is particularly peaked around T^*, then this procedure may require many rejection steps. Therefore, we do not investigate this estimator further.

When sampling without replacement, we have to be a little more careful. We note that the following scheme samples k times exactly without replacement from Q^*:

1. Take k+1 items T_1, ..., T_{k+1} WOR from Q.
2. If any T_i = T^*, discard it from the sample.
3. Otherwise, discard T_{k+1}.

We also note (by conditioning on the event that T^* appears in the sample) that the inclusion probabilities are easily calculated: if T^* appears in the sample, take \kappa to be the perturbed score of the (k+2)-th item; otherwise take it to be the perturbed score of the (k+1)-th item.

2.2 Summing over the n-best tokenisations

An alternative approach to estimating \sum_T P_\theta(T, D) is to restrict the sum to a smaller set of suitable candidates. As the unigram tokenisation objective decomposes over segments, one can use Viterbi search to find exactly the n highest scoring tokenisations from the tokeniser. We then score each tokenisation using the language model, and sum the contribution of each estimate to obtain a (lower bound) estimate of the marginal likelihood.

This estimator is high-bias and low-variance compared to the sampling-based estimators; we show in Section 4.1 that, although the n-best estimator performs well, it is possible to tune the sample-based estimators to perform better by trading bias for variance.
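A sketch of the n-best estimator of Section 2.2, using SentencePiece's n-best segmentation call to enumerate candidates. The score_with_lm callable, which returns log P_theta(T, D) for a tokenisation given as a list of pieces, is a hypothetical interface.

    import sentencepiece as spm
    from scipy.special import logsumexp

    def log_marginal_nbest(document, sp, score_with_lm, n=128):
        """Lower bound on log P(D) from the n highest-scoring tokenisations."""
        # Exact n-best segmentations under the unigram tokeniser.
        nbest = sp.NBestEncodeAsPieces(document, n)
        return logsumexp([score_with_lm(pieces) for pieces in nbest])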
3 Measuring segmentation lattice entropy

We believe that the entropy of the tokeniser segmentation lattice is an important quantity to measure. The entropy quantifies the uncertainty of the tokeniser, and has a nice interpretation as the (logarithm of the) size of the set of alternatives the tokeniser is choosing uniformly over. While the entropy over hidden states of other structured models like HMMs and CRFs has previously been published (Hernando et al., 2005; Mann and McCallum, 2007; Ilic, 2011), and a uniform treatment in terms of expectation semirings is given in Li and Eisner (2009), we are unaware of previous elementary derivations of the entropy of a segmentation lattice. We give the algorithm in Algorithm 1.

Algorithm 1: Recursive algorithm for lattice entropy
    Result: entropy H_n of the segmentation lattice
    init H_0 = 0, \alpha[i] the forward marginals
    for i = 1 to n do
        for each token w terminating at position i do
            j = start position of w
            // \phi(w) is the score of token w
            p(w) = \exp(\alpha[j] + \phi(w) - \alpha[i])
            H_i += p(w) (H_j - \log p(w))
        end
    end
    return H_n

Note that the recursion has a particularly nice interpretation in terms of information theory. Recall that the entropy of a random variable can be thought of as the necessary number of bits to transmit the random variable. The recursion states that, to transmit the lattice up to position i (which takes H_i bits), we can transmit a prefix of the lattice (using H_j bits), and then transmit the token w that goes from j to i (using -\log p(w) bits). The total number of bits necessary is then the weighted sum of all possible ways of doing this, where the weights are given by the probability of that particular decomposition.
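A sketch of Algorithm 1 under simplifying assumptions: the lattice is built by matching substrings of the input against a dictionary of piece scores (for example, unigram log-probabilities), and every single character is assumed to be in the vocabulary so that the lattice is connected.

    import numpy as np
    from scipy.special import logsumexp

    def lattice_entropy(text, piece_log_scores, max_piece_len=16):
        """Entropy (in nats) of the segmentation lattice of `text` (Algorithm 1).

        piece_log_scores maps word pieces to their log scores phi(w).
        """
        n = len(text)
        alpha = np.full(n + 1, -np.inf)  # forward marginals over prefixes
        alpha[0] = 0.0
        H = np.zeros(n + 1)              # H[i] = entropy of the lattice over text[:i]
        for i in range(1, n + 1):
            incoming = []
            for j in range(max(0, i - max_piece_len), i):
                w = text[j:i]
                if w in piece_log_scores and np.isfinite(alpha[j]):
                    incoming.append((j, alpha[j] + piece_log_scores[w]))
            if not incoming:
                continue  # position unreachable under this vocabulary
            alpha[i] = logsumexp([score for _, score in incoming])
            for j, score in incoming:
                p = np.exp(score - alpha[i])    # p(w) = exp(alpha[j] + phi(w) - alpha[i])
                H[i] += p * (H[j] - np.log(p))  # entropy recursion
        return H[n]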
proved results in various tasks (Kool et al., 2019; For each method outlined in Section 2, we sam- Melis et al., 2019; Adlam et al., 2020), and can be ple 128 different tokenisations of each document, understood as a way of tuning the bias-variance and calculate Pθ (Ti , D) for each sample, before tradeoff with the n-best estimator at the high-bias, aggregating the sample scores into an estimate of low variance end, and independently sampling at the marginal likelihood. We parallelise evaluating the other. We compare the WOR with 1-best es- all the samples for a document on a multi-host TPU timator at a various rate of temperatures on our setup; each dataset takes 15-30 minutes to evaluate. English datasets, and show the results in Figure
4.1 Measuring the marginal likelihood

For both English and German, we use 500 documents sampled randomly from the WMT train and test data and 500 randomly sampled Wikipedia documents (WIKI). For English, we also use 500 documents from the CUSTOMNEWS and arXiv abstracts (ARXIV) datasets of Lazaridou et al. (2021), and for German, we additionally use 200 documents from the MC4 dataset in Xue et al. (2020).

For each method outlined in Section 2, we sample 128 different tokenisations of each document, and calculate P_\theta(T_i, D) for each sample, before aggregating the sample scores into an estimate of the marginal likelihood. We parallelise evaluating all the samples for a document on a multi-host TPU setup; each dataset takes 15-30 minutes to evaluate.

Further, to ensure results are comparable across different tokenisations with potentially different numbers of tokens, we calculate perplexity by dividing the total likelihood across all documents by the total number of whitespace-delimited tokens. We present our results in Table 1.

Table 1: Comparing the different estimators of model marginal perplexity on evaluation sets. The number in brackets represents the one-best tokenisation perplexity. Consistent vs. inconsistent tokenisation refers to whether we tokenise each appearance of a whitespace-delimited type consistently in a document or not.

                            Consistent tokenisation              Inconsistent tokenisation
                            WR      WOR     WOR 1-best  n-best   WR      WOR     WOR 1-best  n-best
English
  WMT train (16.49)         16.59   16.58   16.48       16.47    16.81   16.79   16.48       16.48
  WMT test (22.62)          22.73   22.72   22.59       22.56    23.07   23.01   22.60       22.58
  CUSTOMNEWS (37.09)        37.11   37.12   36.93       36.88    37.90   37.89   37.03       36.95
  WIKI (60.22)              61.09   61.02   59.82       59.71    63.37   63.33   60.06       59.92
  ARXIV (179.20)           176.38  176.11  175.87      175.98   179.76  179.74  177.52      176.90
German
  WMT train (31.84)         32.51   32.58   31.80       31.77    33.04   33.12   31.80       31.78
  WMT test (37.16)          37.68   38.16   37.12       37.08    38.87   38.91   37.13       37.09
  WIKI (66.08)              69.44   69.30   65.86       65.63    72.37   72.41   66.01       65.78
  MC4 (194.02)             206.89  207.15  192.84      192.21   219.63  219.19  193.68      192.87

Our results show that there can be a significant difference between the one-best tokenisation likelihood and the marginal likelihood, particularly as one moves further away from the training data domain. Indeed, the relative perplexity improvement reaches up to 1.9% on EN-ARXIV, and 0.9% on DE-MC4. Further, tokenising words consistently in a document has a large impact on the marginal likelihood estimation. We investigate this effect further in Section 4.3. While the n-best estimator appears to perform the best in this comparison, we show in the next section that by tuning the sampling temperature of the WOR 1-best estimator, it is possible to obtain even better estimates of the marginal likelihood.

The effect of sampling temperature.  We also investigate sharpening the tokeniser distribution before sampling, multiplying the log-probability of each tokenisation by a factor of 1/\tau. Using \tau < 1 has often been shown to give improved results in various tasks (Kool et al., 2019; Melis et al., 2019; Adlam et al., 2020), and can be understood as a way of tuning the bias-variance tradeoff, with the n-best estimator at the high-bias, low-variance end, and independent sampling at the other. We compare the WOR 1-best estimator at a variety of temperatures on our English datasets, and show the results in Figure 1. One can see that it is possible to improve on the n-best estimator by trading some bias for variance, and this can result in a better estimate of the marginal, especially for out-of-domain datasets.

[Figure 1: The effect of temperature scaling on the estimated perplexity on all English datasets, using WOR 1-best. The y-axis is the percentage difference in perplexity relative to the n-best baseline (lower is better). Note the x-axis is scaled as 1/τ, rather than τ.]
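The sketch below illustrates the 1/τ sharpening on an explicitly enumerated candidate set; this is a simplification of sampling from the full segmentation lattice, intended only to show how τ interpolates between the tokeniser's own distribution and its mode.

    import numpy as np
    from scipy.special import logsumexp

    def sharpen_log_probs(log_q, tau):
        """Scale tokenisation log-probabilities by 1/tau and renormalise.

        tau = 1 keeps the tokeniser distribution; tau < 1 pushes mass towards the
        one-best tokenisation (the high-bias, low-variance end of the tradeoff).
        """
        scaled = np.asarray(log_q) / tau
        return scaled - logsumexp(scaled)

    log_q = np.log([0.6, 0.3, 0.1])                    # toy candidate probabilities
    print(np.exp(sharpen_log_probs(log_q, tau=1.0)))   # [0.6  0.3  0.1]
    print(np.exp(sharpen_log_probs(log_q, tau=0.5)))   # sharper: ~[0.78 0.20 0.02]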
4.2 Tokeniser entropy and the marginal gap

Next, we investigate what causes the gap between marginal likelihood and one-best likelihood, and whether there are easily measurable factors that might predict this difference. We hypothesise that the more uncertain the tokeniser is, the bigger this gap becomes. We pool together the documents in all our evaluation sets, and test whether there is a correlation between tokeniser entropy and marginal gap. Our results, shown in Figure 2, demonstrate that there is a correlation between entropy and the marginal gap (Spearman r = 0.57 for English, 0.49 for German); interestingly, it appears that high tokeniser entropy is predictive of a bigger marginal gap, but large marginal gaps are possible even if the tokeniser has low entropy.

[Figure 2 (a) English, (b) German: The correlation between entropy per token and the marginal gap per token in nats (not in perplexity), categorised by evaluation dataset. Some data points which extend beyond the right of the graph are truncated; they follow the same trend.]

4.3 Analysing the caching behaviour of language models

Our results show that tokenising word types consistently within a document leads to significantly tighter estimates of the marginal likelihood compared to independently tokenising input tokens. We hypothesise that this is linked to the ability of the language model to reuse the tokenisations of words it has already seen in the document. To test this, we view a document as a sequence of whitespace-delimited words w_1, ..., w_n; each word w_i has a tokenisation T_{w_i} = (t^1_{w_i}, ..., t^{n_{w_i}}_{w_i}), and each sampled tokenisation T_i can have different token sequences T^i_{w_i} for the same underlying word. We look for words w_k such that:

1. For some sampled tokenisation T_i, there is some l < k with w_l = w_k and T^i_{w_k} = T^i_{w_l} (the word has appeared before with the same tokenisation).
2. For some other sampled tokenisation T_j, for all l < k such that w_l = w_k, T^j_{w_k} ≠ T^j_{w_l} (all previous occurrences of this word in the document were tokenised differently).

We then calculate P_\theta(w_k | w_{<k}) under both tokenisations, conditioning in each case on the corresponding preceding sequence of tokens. By comparing the loss of words paired in this way, we can control for extra confounding factors (such as token unigram probability), and isolate the ability of the language model to recognise whether different token sequences correspond to the same underlying form.

We show our results for our various datasets, together with selected subsets of words, in Table 2. We see that, if the language model sees a word after already seeing it in the same tokenisation, its loss is significantly lower than the loss associated with the first time the word is seen (as was also reported in Lazaridou et al. (2021)). However, this ability is strongly tied to the exact tokenisation of the word: if it appears again, but in a different tokenisation, then its loss can in fact be even greater.

Table 2: Investigating the caching ability of language models. For words which appear multiple times with different tokenisations, we show the average loss of the first occurrence of that word, of subsequent occurrences of that word with the same tokenisation (1), and of subsequent occurrences of that word in a different tokenisation (2). WMT Tr and WMT Te are the WMT training and test evaluation sets respectively.

                All words                    Multi-token words
                First    (1)      (2)        First    (1)      (2)
  WMT Tr        3.88     2.59     17.01      10.73    4.07     21.11
  WMT Te        4.19     2.59     16.69      12.15    4.11     20.40
  CNEWS         6.31     2.99     16.19      17.01    4.88     20.36
  WIKI          7.84     3.62     16.54      17.80    5.63     19.81
  ARXIV         9.94     3.97     14.93      17.56    5.41     18.03

4.4 How many samples are necessary?

Next, we investigate how many samples are necessary to obtain an accurate estimate of the marginal likelihood. We experiment on the EN-ARXIV dataset, as this showed the biggest relative improvement between the marginal likelihood and the one-best likelihood. We take the samples from our n-best estimator with n = 128, and incrementally sum the samples (which are given in decreasing order of likelihood under the tokeniser) to simulate having smaller n. As an oracle experiment to see how many samples contribute significantly to the marginal likelihood, we also order the samples by their language model scores (i.e. we order according to P_\theta(T, D) rather than Q(T | D)) before taking the incremental sum. We show the results in Figure 3. Our results show that, although ostensibly many samples are necessary to estimate the marginal likelihood accurately, only very few samples (on the order of 5) actually contribute significantly.

[Figure 3: The performance of the n-best marginal likelihood estimator on the ARXIV evaluation set as we vary the number of samples, taken in order of Q(T|D) in orange and Pθ(T, D) in blue.]

In practical terms, our results suggest that one needs to take many samples with current tokenisers to accurately estimate the marginal likelihood, but that many of these samples are not effective. We therefore believe that a prerequisite for more widespread adoption of marginal likelihood as an evaluation metric is tokenisers that better fit the language model posterior over tokenisations. Current tokenisers make very strong independence assumptions to make learning and inference tractable, and we believe there is significant scope to design tokenisers which relax these assumptions.
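A sketch of the incremental summation used in this experiment: given log P_theta(T_i, D) for the n-best candidates, the candidates are ranked either by tokeniser score or, for the oracle, by their language model scores, and the running log-sum is accumulated.

    import numpy as np

    def incremental_log_marginal(lm_log_probs, order_by):
        """Running estimate of log sum_i P_theta(T_i, D) as candidates are added.

        order_by: the scores used to rank candidates, e.g. tokeniser log-probabilities
        log Q(T|D), or the language model scores themselves for the oracle ordering.
        """
        lm_log_probs = np.asarray(lm_log_probs, dtype=float)
        order = np.argsort(np.asarray(order_by))[::-1]   # highest-scoring first
        return np.logaddexp.accumulate(lm_log_probs[order])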
5 Related Work

5.1 Tokenisation and segmentation

Unsupervised word segmentation has a long and illustrious history. The earliest motivation came from information retrieval, where collapsing a set of related query terms might help smooth counts over each of those terms individually and result in better retrieval results. The earliest approaches, such as the Porter stemmer (Porter, 1997), were rule-based. However, the power of data-driven statistical methods quickly became apparent, and tools such as Morfessor (Virpioja et al., 2013) used likelihood-based objectives, typically with Bayesian smoothing methods (see also Goldwater et al. (2011)), to induce segmentations.

Sennrich et al. (2016) used a different algorithm to induce segmentations: byte-pair encoding (Gage, 1994). Originally designed as a data compression algorithm, BPE tokenisers are now some of the predominantly used tokenisation methods. Alternative approaches, such as WordPiece (Schuster and Nakajima, 2012) and SentencePiece (Kudo, 2018), explicitly use a language modelling objective to induce a token lexicon. Previous methods have used train-time tokenisation randomisation as a regularisation aid (Kudo, 2018; Provilkov et al., 2020), but still use the one-best tokenisation at test time.

Another strand of work has investigated whether tokenisers that capture linguistic morphology can improve language models. Bostrom and Durrett (2020) showed that unigram and BPE tokenisers for English and Japanese have low recall on recovering linguistic segments, since many morphologically complex words are treated as a single token. Linguistically aligned tokenisers have been shown to result in better language model perplexity (Schwartz et al., 2020; Park et al., 2021) and better downstream task performance (Alkaoud and Syed, 2020), especially for morphologically rich languages. These experiments also use one-best tokenisation at test time.
Rather than considering one-best or stochastic samples of tokenisations, one can use entire segmentation lattices as input to a model. This approach has been considered for morphological tagging (Seker and Tsarfaty, 2020), parsing (Goldberg and Tsarfaty, 2008), and spoken intent recognition (Ladhak et al., 2016), among others.

5.2 Tokenisation-free approaches

An alternative approach to inducing a tokenisation is to decompose input sequences into well-defined orthographic units, such as characters. These approaches circumvent the problem of inducing a lexicon, and have been used for text classification (Conneau et al., 2017), language modelling (Al-Rfou et al., 2019), machine translation (Lee et al., 2017), and word representation (Cao and Rei, 2016). One downside is that dependency lengths become longer at the character level, and lexical information has to be memorised by the compositional machinery of the model. For this reason, traditionally fully character-based approaches did not perform as well as their token-level counterparts, although recent progress suggests this may change soon (Choe et al., 2019; Clark et al., 2021). There also exist approaches which mix character-level and segment-level modelling (Buckman and Neubig, 2018; Kawakami et al., 2019; He et al., 2020), although these segmental language models require more complex inference procedures.

6 Conclusion

In this paper, we argue for using model marginal likelihood over tokenisations as an evaluation metric for language models, rather than one-best tokenisation likelihood. We introduce practical low-variance estimators for measuring the marginal likelihood, and demonstrate that there can be a significant difference between the marginal and the one-best likelihoods, particularly on strongly out-of-domain evaluation sets. Evaluating with marginal likelihood thus goes some way toward loosening the bottleneck imposed by tokeniser quality in the currently dominant language modelling paradigm, and our results suggest that the field may be underestimating the generalisation capability of modern language models. We further demonstrate that tokeniser entropy is a good predictor of this "marginal gap", suggesting that tokeniser entropy, especially when out-of-domain, can be a guide to the number of samples needed for evaluation.

More broadly, our experiments suggest that the field should continue seeking better ways to incorporate tokenisation into end-to-end language modelling. Sampling from the tokeniser during training is an obvious possibility; alternatively, one could incorporate the segmentation lattice into the model directly, which has been beneficial for parsing morphologically rich languages (Goldberg and Tsarfaty, 2008; Tsarfaty et al., 2020). Further, developing more contextual tokenisers which make fewer independence assumptions can also result in both better language models trained on their one-best tokenisation, and better evaluation estimates of the marginal likelihood with fewer samples.
We conduct experiments on German and English corpora in this paper. However, these two languages are only a small sample in the full space of language typology. English is a morphologically impoverished language, and while German compounding and inflection offer some additional challenges, many languages have more complex patterns of word formation and inflection. We believe that estimating marginal likelihood will be important for morphologically richer languages, where tokenisation makes a bigger difference (Gerz et al., 2018; Mielke et al., 2019).

Finally, improved understanding of the interaction between tokenisation and language modelling has implications for evaluating language models on both downstream tasks and language generation tasks. Evidence has shown that gains in language modelling, as measured in perplexity, often lead to improvements in downstream task performance (Radford et al., 2019). It would be instructive to extend our marginal likelihood approach to downstream task evaluation. On generation tasks, since the tokeniser affects language model training but is only implicitly used when sampling (via the tokeniser vocabulary), the effect of tokenisation algorithms requires careful investigation.

Acknowledgements

The authors would like to thank Dani Yogatama and the rest of the Language group at DeepMind for comments and discussion, Gábor Melis and Phil Blunsom for comments on an earlier draft, and Mark Rowland for clarification remarks on sampling without replacement. We would also like to thank our anonymous reviewers.

References

Ben Adlam, Jasper Snoek, and Samuel L. Smith. 2020. Cold posteriors and aleatoric uncertainty.

Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2019. Character-level language modeling with deeper self-attention. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3159–3166.

Mohamed Alkaoud and Mairaj Syed. 2020. On the importance of tokenization in Arabic embedding models. In Proceedings of the Fifth Arabic Natural Language Processing Workshop (WANLP), pages 119–129.

Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55, Online. Association for Computational Linguistics.

Kaj Bostrom and Greg Durrett. 2020. Byte pair encoding is suboptimal for language model pretraining. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4617–4624, Online. Association for Computational Linguistics.

Aleksandar Botev, Bowen Zheng, and David Barber. 2017. Complementary sum sampling for likelihood approximation in large scale classification. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1030–1038, Fort Lauderdale, FL, USA. PMLR.

Jacob Buckman and Graham Neubig. 2018. Neural lattice language models. Transactions of the Association for Computational Linguistics.

Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. 2016. Importance weighted autoencoders. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

Kris Cao and Marek Rei. 2016. A joint model for word embedding and word morphology. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 18–26, Berlin, Germany. Association for Computational Linguistics.

Dokook Choe, Rami Al-Rfou, Mandy Guo, Heeyoung Lee, and Noah Constant. 2019. Bridging the gap for tokenizer-free language models. CoRR, abs/1908.10322.

Jonathan H. Clark, Dan Garrette, Iulia Turc, and John Wieting. 2021. CANINE: Pre-training an efficient tokenization-free encoder for language representation. CoRR, abs/2103.06874.
Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2017. Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1107–1116, Valencia, Spain. Association for Computational Linguistics.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.

Christopher Dyer. 2010. A Formal Model of Ambiguity and its Applications in Machine Translation. Ph.D. thesis, University of Maryland.

Philip Gage. 1994. A new algorithm for data compression. C Users Journal, 12(2):23–38.

Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. Association for Computational Linguistics.

Yoav Goldberg and Reut Tsarfaty. 2008. A single generative model for joint morphological segmentation and syntactic parsing. In Proceedings of ACL-08: HLT, pages 371–379, Columbus, Ohio. Association for Computational Linguistics.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2011. Producing power-law distributions and damping word frequencies with two-stage language models. Journal of Machine Learning Research, 12(68):2335–2382.

Xuanli He, Gholamreza Haffari, and Mohammad Norouzi. 2020. Dynamic programming encoding for subword segmentation in neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3042–3051, Online. Association for Computational Linguistics.

D. Hernando, V. Crespi, and G. Cybenko. 2005. Efficient computation of the hidden Markov model entropy for a given observation sequence. IEEE Transactions on Information Theory, 51(7):2681–2685.

Velimir M. Ilic. 2011. Entropy semiring forward-backward algorithm for HMM entropy computation. CoRR, abs/1108.0347.

Kazuya Kawakami, Chris Dyer, and Phil Blunsom. 2019. Learning to discover, ground and use words with segmental neural language models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6429–6441, Florence, Italy. Association for Computational Linguistics.

Wouter Kool, Herke Van Hoof, and Max Welling. 2019. Stochastic beams and where to find them: The Gumbel-top-k trick for sampling sequences without replacement. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3499–3508. PMLR.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.

Faisal Ladhak, Ankur Gandhe, Markus Dreyer, Lambert Mathias, Ariya Rastrow, and Björn Hoffmeister. 2016. LatticeRNN: Recurrent neural networks over lattices. In Interspeech 2016, pages 695–699.

Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Sebastian Ruder, Dani Yogatama, Kris Cao, Tomas Kocisky, Susannah Young, and Phil Blunsom. 2021. Pitfalls of static language modelling.

Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5:365–378.

Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 40–51, Singapore. Association for Computational Linguistics.

David J. C. MacKay. 2003. Information Theory, Inference, and Learning Algorithms.

Gideon Mann and Andrew McCallum. 2007. Efficient computation of entropy gradient for semi-supervised conditional random fields. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 109–112, Rochester, New York. Association for Computational Linguistics.

Gábor Melis, Charles Blundell, Tomáš Kočiský, Karl Moritz Hermann, Chris Dyer, and Phil Blunsom. 2019. Pushing the bounds of dropout.

Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4975–4989, Florence, Italy. Association for Computational Linguistics.

Hyunji Hayley Park, Katherine J. Zhang, Coleman Haley, Kenneth Steimel, Han Liu, and Lane Schwartz. 2021. Morphology matters: A multilingual language modeling analysis. Transactions of the Association for Computational Linguistics, 9:261–276.

M. F. Porter. 1997. An Algorithm for Suffix Stripping, pages 313–316. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. 2020. BPE-dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892, Online. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Technical Report.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152.

Lane Schwartz, Francis Tyers, Lori Levin, Christo Kirov, Patrick Littell, Chi-kiu Lo, Emily Prud'hommeaux, Hyunji Hayley Park, Kenneth Steimel, Rebecca Knowles, Jeffrey Micher, Lonny Strunk, Han Liu, Coleman Haley, Katherine J. Zhang, Robbie Jimmerson, Vasilisa Andriyanets, Aldrian Obaja Muis, Naoki Otani, Jong Hyuk Park, and Zhisong Zhang. 2020. Neural polysynthetic language modelling. Final Report of the Neural Polysynthetic Language Modelling Team at the 2019 Frederick Jelinek Memorial Summer Workshop.

Amit Seker and Reut Tsarfaty. 2020. A pointer network architecture for joint morphological segmentation and tagging. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4368–4378, Online. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Reut Tsarfaty, Dan Bareket, Stav Klein, and Amit Seker. 2020. From SPMRL to NMRL: What did we learn (and unlearn) in a decade of parsing morphologically-rich languages (MRLs)? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7396–7408, Online. Association for Computational Linguistics.

Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. Technical report.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2020. mT5: A massively multilingual pre-trained text-to-text transformer. CoRR, abs/2010.11934.