What Context Features Can Transformer Language Models Use?
Joe O'Connor    Jacob Andreas
Massachusetts Institute of Technology
{joeoc,jda}@mit.edu

arXiv:2106.08367v1 [cs.CL] 15 Jun 2021

Abstract

Transformer-based language models benefit from conditioning on contexts of hundreds to thousands of previous tokens. What aspects of these contexts contribute to accurate model prediction? We describe a series of experiments that measure usable information by selectively ablating lexical and structural information in transformer language models trained on English Wikipedia. In both mid- and long-range contexts, we find that several extremely destructive context manipulations—including shuffling word order within sentences and deleting all words other than nouns—remove less than 15% of the usable information. Our results suggest that long contexts, but not their detailed syntactic and propositional content, are important for the low perplexity of current transformer language models.[1]

1 Introduction

Recent years have seen a significant improvement in the predictive accuracy of neural language models (LMs), owing to a combination of improvements in model architecture (especially transformers; Vaswani et al. 2017) and training infrastructure (Wolf et al., 2020). The most striking change, relative to both recurrent neural LMs (Mikolov et al., 2010) and count-based models (Kneser and Ney, 1995), is the length of the context that these models can effectively condition on. While count-based LMs in production speech recognition and machine translation systems typically used 10–20 tokens at a maximum (e.g., Brown, 2011), and recurrent LMs have an effective context size of about 200 tokens (Khandelwal et al., 2018), the predictive accuracy of transformer LMs appears to improve when conditioning on as many as a thousand previous tokens (Beltagy et al., 2020). A significant amount of recent work has focused on making use of even longer contexts computationally feasible (Rae et al., 2019; Wang et al., 2020; Child et al., 2019; Dai et al., 2019; Kitaev et al., 2020).

But despite empirical evidence that long contexts are helpful, little is understood about why. If the future of language modeling includes a focus on contexts of increasing size, it is important to first understand what contextual information contributes to accurate prediction in current models. This paper offers an answer to that question via the V-information framework of Xu et al. (2020). V-information, discussed more in Section 2, provides a formal framework for reasoning about how much usable information a computationally constrained predictor (like a neural LM) can extract from an input. Our experiments measure the amount of usable information that is added when increasing LM context size, then attempt to pinpoint the source of this information by ablating features of the added context (via controlled shuffling and word deletion) and measuring the resulting loss of model predictive power. While this framework is general, we focus on transformer LMs.

Our work is closely related to an earlier study by Khandelwal et al. (2018), which measured changes in a pre-trained LSTM LM when context words were permuted and deleted at evaluation time. But neural language models are known to be highly sensitive to distributional shifts—and in particular might be unable to use information from long-range context but still be adversely affected when the structure of that context changes at evaluation time. Directly measuring usable information makes it possible to clearly distinguish accuracy decreases that result from loss of information from decreases that result from out-of-distribution inputs.

Our experiments reveal a number of surprising facts about the use of long- and mid-range context in transformers. While increasing context length from 256 to 768 tokens is beneficial (decreasing perplexity by roughly 4%), many destructive transformations of this context (including transformations that cause large changes in the paradigm of Khandelwal et al. 2018) remove essentially no usable information. Our results suggest that for current models, the primary carriers of information in long-range context are content words and local co-occurrence statistics: deleting function words and shuffling within local windows both have very little effect on models' predictive power. Context matters, but not all features of context matter equally; as discussed in Section 5, these results motivate future language modeling research focused on alternative context representations rather than simply more tokens.

[1] Code for all experiments in this paper is available at https://github.com/lingo-mit/context-ablations.
2 Approach

A language model (LM) places a probability distribution p(x) over discrete token sequences x. Most learned LMs do so by decomposing p(x) according to the chain rule and modeling the conditional distribution over a single target token given a (fixed- or variable-length) context of previous tokens:

    p(x) = ∏_i p(x_i | x_0, x_1, ..., x_{i−1}) .    (1)

In transformer language models, this conditional distribution is modeled via a sequence of alternating neural feed-forward layers and self-attention layers; see Vaswani et al. (2017) for more details. While input sequences x can in principle be made arbitrarily long, there are both theoretical and practical limits to transformers' ability to make effective use of them (Hahn, 2020; Wang et al., 2019). Here, we wish to understand when (and why) increasing the size of the context improves model predictions.

Usable information
Consider a hypothetical LM context consisting of the tokens "The user's password is...". This context suggests that subsequent tokens will be a password: (hopefully!) a high-entropy sequence. Now suppose this context is extended to include earlier tokens, becoming "The user's hashed password is ave$@To9!. The user's password is...". Information-theoretically, this context is extremely informative: only a small number of passwords will hash to the given string, and a predictor capable of testing all passwords would be able to identify the candidates and significantly reduce its uncertainty about future tokens. But in practice, this extra context is useless: no known efficient predictor can learn anything about the password from its hash code, and the extra context has not made the language modeling problem any easier. This is an extreme case, but a similar intuition applies to more conventional questions about language models. A newspaper article whose first sentence begins "A dog bit a man" is likely to end very differently from one that begins "A man bit a dog". Can LMs reason effectively about this distinction, or is it (like a hashed password) computationally inaccessible to current models?

A framework for answering questions of this kind was introduced by Xu et al. (2020):

Definition 1. The usable predictive information (formally, predictive V-information) from a random variable X to a random variable Y is defined as:

    I_V(X → Y) = inf_{p_1 ∈ V} E[−log p_1(Y)] − inf_{p_2 ∈ V} E[−log p_2(Y | X)]    (2)

for a class V of distributions p.

Intuitively, this definition measures how much extra information about Y can be extracted from X by any predictor in V. In language modeling, we will take Y to be the target word, X its context, and V a class of parametric models. While this definition generalizes Shannon mutual information (Shannon, 1948) and has deep connections to other information-theoretic quantities (see Xu et al. 2020 for details), it ultimately corresponds to a simple and common-sense evaluation: if we want to know how much the extra context X helps a language model, we should train a model p_1 without access to X, train a model p_2 with access to X, and compare the accuracy of their predictions.
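This "train with and without the extra context, then compare held-out likelihoods" recipe can be written down directly. The sketch below is illustrative only (it is not the authors' released code, and the numbers in the usage example are placeholders); it simply converts two held-out cross-entropies, measured in bits per token, into an estimate of the usable information.

```python
# Minimal sketch: estimate usable (predictive V-) information as the gap between the
# held-out cross-entropy of a predictor trained WITHOUT the context X and one trained
# WITH it. Both inputs are assumed to be bits per token on the same evaluation set.

def usable_information_bits(nll_without_x: float, nll_with_x: float) -> float:
    """Approximates I_V(X -> Y) = inf E[-log p1(Y)] - inf E[-log p2(Y | X)]."""
    return nll_without_x - nll_with_x

# Illustrative numbers only: if a context-free model scores 4.45 bits/token and a model
# conditioned on the extra context scores 4.19, the context contributes ~0.26 usable bits.
print(usable_information_bits(4.45, 4.19))
```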
Measuring what is used
But the original question raised by the introduction was not just how much information is contributed by context. It is already well-established that conditioning on long contexts is helpful, with existing experiments on long-range transformers effectively implementing the measurement in Eq. (2). Instead, we want to know what information in this context is actually used by models.

As a prototypical example, let us hypothesize that more than five tokens away from the target, models are only able to extract usable information from nouns. (In our experiments in Section 3, this "long-range context" will be considerably longer than 5 words.) For example, given the sentence:

    Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

we hypothesize that the LM distributions:

    p_1(director | Pierre Vinken, 61 years old, will join the board as a nonexecutive)    (3)
    ≈ p_2(director | Pierre Vinken years, board as a nonexecutive)    (4)

(where "Pierre Vinken years" is the noun-only context and "board as a nonexecutive" is the ordinary context), and more generally that

    I_V(X_{0:n} → X_n) ≈ I_V([nouns(X_{0:n−5}), X_{n−5:n}] → X_n)    (5)

where X_{i:j} is the sequence of tokens [X_i, X_{i+1}, ..., X_{j−1}], V is a class of LMs, and nouns is a context ablation that extracts only the nouns from a given string. That is, we hypothesize that the amount of usable information contributed by the full context X_{0:n} is the same as the amount contributed by the ablated context [nouns(X_{0:n−5}), X_{n−5:n}], so ablation removes no information.

The experiments in this paper generalize this experimental framework to other context ablations and hypotheses. Let f be an ablation and k an integer offset, and denote an ablated context:

    f_k(X) = [f(X_{0:n−k}), X_{n−k:n}]    (6)

and an ablated negative log-likelihood:

    L(θ, f, k) = −E log p_θ(X_n | f_k(X_{0:n})) .    (7)

Then, we can measure the effect of each ablation f on usable information via the following quantity:

Definition 2. The ablated information due to an ablation f at an offset k is:

    A(f, k) = [I_V(X_{0:n} → X_n) − I_V(f_k(X_{0:n}) → X_n)] / [I_V(X_{0:n} → X_n) − I_V(X_{n−k:n} → X_n)]    (8)
            = [inf_θ L(θ, f, k) − inf_{θ′} L(θ′, n)] / [inf_{θ″} L(θ″, n − k) − inf_{θ′} L(θ′, n)] ,    (9)

where L(θ, i) is the (unablated) negative log-likelihood −E log p_θ(X_n | X_{n−i:n}).

Intuitively, A(f, k) measures how much of the usable information added by an extra k tokens (the denominator) is removed by applying the ablation f to those k tokens (the numerator). If it is close to 0, almost no information is removed; if it is close to 1, almost all information is removed.
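Because Eq. (9) is just a ratio of held-out negative log-likelihood gaps, computing it reduces to simple arithmetic over three trained models. The sketch below is an assumed helper, not the paper's codebase, and the numbers in the example are illustrative placeholders.

```python
# A(f, k) per Eq. (9), computed from three held-out losses: a model trained on the
# f-ablated long context, a model trained on the full unablated context, and a model
# trained without the extra k tokens at all.

def ablated_information(loss_ablated: float, loss_full: float, loss_truncated: float) -> float:
    """loss_ablated   ~ inf_theta   L(theta, f, k)
       loss_full      ~ inf_theta'  L(theta', n)
       loss_truncated ~ inf_theta'' L(theta'', n - k)"""
    return (loss_ablated - loss_full) / (loss_truncated - loss_full)

# Near 0: the ablation removed almost none of the usable information added by the extra
# k tokens. Near 1: it removed almost all of it. Illustrative values only.
print(ablated_information(loss_ablated=4.23, loss_full=4.19, loss_truncated=4.45))
```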
Evaluation in practice
Eq. (9) provides a general framework for answering our core question in this paper: for a diverse set of context ablations and offsets, we will measure how much information is lost when a given ablation is applied at a given offset. A few modifications are required to turn this equation into a practical evaluation scheme:

Held-out evaluation: Eq. (7) involves an expectation over the sequence distribution p(X). In practice, LMs must be trained on finite corpora, creating a risk of overfitting (Zhang et al., 2016). To address this issue, we approximate the infimum in Eq. (7) by fitting θ on a training set, and computing ablated information on a held-out validation set. All reported results are an average of held-out likelihoods from two random initializations.

Batching: Given a fixed (training or test) dataset of strings X and a maximum context size of m, Eq. (7) should be estimated empirically as −(1/|X|) Σ_{x ∈ X} (1/|x|) Σ_{i=0}^{|x|} log p(X_i | f_k(X_{i−m:i})). This requires re-computing model predictions once for every token in the dataset. However, the transformer models we use here support efficient batch inference: training data is pre-segmented into sequences of at most length n, and −(1/(|X| n)) Σ_{x ∈ X} Σ_{i=0}^{n} log p(X_i | f_k(X_{0:i})) can be computed in a single forward pass. This is considerably more efficient but means that most tokens are evaluated with a context of length < n. As a compromise to ensure that evaluations contain long-range context, we accumulate losses on a subset:

    L(θ, f, ℓ : m ∼ n) = −(1/(|X| (n − m))) Σ_{x ∈ X} Σ_{i=ℓ+m}^{ℓ+n} log p_θ(X_i | [f(X_{0:ℓ}), X_{ℓ:i}])    (10)

(visualized in Fig. 1). This can be read as "ℓ tokens of f-ablated context, followed by m to n tokens of unablated context". We will write L(θ, m ∼ n) when only unablated context is used. Because of the large number of experiments in this paper, we use Eq. (10) for all training and evaluation.

Figure 1: Calculation of the ablated likelihood L(nouns, ℓ : m ∼ n) (Eq. (10)). A context ablation nouns (which deletes all non-noun words) is applied to the first ℓ tokens of the context, and likelihood is computed on the last n − m (unablated) context tokens.
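The restriction of the loss to the last n − m context positions in Eq. (10) can be computed from a single forward pass over one (ablated + unablated) window. The following is an assumed PyTorch sketch, not the paper's released code; `model` stands for any transformers-style causal LM whose output exposes `.logits`.

```python
# Sketch of the per-window ablated likelihood from Eq. (10): run the LM over a window whose
# first l positions were already ablated by f, then average the per-token loss only over
# the last n - m (unablated) context positions.

import torch
import torch.nn.functional as F

def ablated_window_nll(model, window: torch.LongTensor, l: int, m: int, n: int) -> torch.Tensor:
    """Mean NLL (nats/token) of tokens at positions l+m .. l+n-1 of a length-(l+n) window."""
    with torch.no_grad():
        logits = model(window.unsqueeze(0)).logits[0]       # (l + n, vocab_size)
    log_probs = F.log_softmax(logits[:-1], dim=-1)           # position i predicts token i + 1
    targets = window[1:]
    token_nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # token_nll[i] scores window[i + 1]; keep only the predictions for positions l+m .. l+n-1.
    return token_nll[l + m - 1 : l + n - 1].mean()
```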
Model, data and training details
For all experiments, our LM uses the GPT-2 model architecture (Radford et al., 2019) in the implementation of Wolf et al. (2020) with default hyperparameters. All models are trained from scratch on the WikiText-103 dataset (Merity et al., 2016), an English language modeling benchmark. Aside from ablations, no preprocessing is applied. A special separator token is inserted between ablated and unablated context. The training set contains 103,221,021 words, while the evaluation set contains 217,646 words.
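For concreteness, the following is a rough sketch of that setup using the Hugging Face transformers and datasets libraries. It is not the authors' training script: the batch size, epoch count, and other hyperparameters are placeholders, and the context-ablation step is only indicated by a comment.

```python
# Assumed sketch: GPT-2 architecture trained from scratch (randomly initialized weights)
# on WikiText-103. Hyperparameters are illustrative, not the paper's values.

from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2Config, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")     # reuse only the GPT-2 BPE vocabulary
tokenizer.pad_token = tokenizer.eos_token

raw = load_dataset("wikitext", "wikitext-103-raw-v1")

def tokenize(batch):
    # A real run would apply a context ablation f to the first 512 tokens of each
    # 1024-token window here (and insert the separator token) before training.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = (raw.map(tokenize, batched=True, remove_columns=["text"])
                .filter(lambda ex: len(ex["input_ids"]) > 1))

model = GPT2LMHeadModel(GPT2Config())                      # fresh weights, trained from scratch

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="context-ablation-lm",
                           per_device_train_batch_size=2, num_train_epochs=1),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```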
A note on evaluation
As in past work on evaluating language models (Brown et al., 1992), our evaluation of relative predictive information ultimately bottoms out in a conditional entropy (log-perplexity). Recent work has shown that other metrics, such as diversity of outputs, are important for evaluating the quality of LMs as models for language generation (Hashimoto et al., 2019; Caccia et al., 2020). Generation also depends on a number of other factors, such as choice of decoding procedure (Caglayan et al., 2020). Here, we focus on LMs as predictive models, measuring their ability to place an accurate distribution over future words and sentences, rather than their ability to generate useful or coherent text (see Appendix C). We want to emphasize that the results below apply to language models specifically, and not transformers applied to NLP tasks in general—the same analysis might give very different conclusions if applied to, e.g., question answering or summarization.

3 Experiments

In this section, we attempt to determine what information in transformer LM contexts is usable by measuring ablated information (Eq. (9)). Sections 3.1 and 3.2 describe our main results, with Section 3.1 focused on ordering and Section 3.2 focused on lexical information. Section 3.3 compares these results to ablations applied at evaluation time. Section 3.4 explores whether contexts can be further manipulated to improve model predictions.

3.1 Does order matter?

In this section we will examine the effects of different augmentations to the order within long-range context. We first train a no information model to minimize L(θ, 0 ∼ 512) and a full information model to minimize L(θ, 512 ∼ 1024). For each context ablation f, we train a model to minimize L(θ, f, 512 : 0 ∼ 512). Each ablation has access to more information than the no information model (because it conditions on extra tokens) and less information than the full information model (because an ablation has been applied to those tokens). Note that the LM operates on BPE-derived subword tokens for consistency with the way GPT-2 is typically used, but all ablations are defined at the word level, meaning, e.g., that we shuffle words rather than tokens.

We use these trained models to calculate ablated information (Eq. (9)). To explore the effect of different context lengths, we stratify evaluation of the ablated information into two conditions: a mid-range condition in which likelihoods in Eq. (9) are of the form L(·, f, 512 : 0 ∼ 256), and a long-range condition with likelihoods L(·, f, 512 : 256 ∼ 512). (We call the former "mid-range" rather than "short-range" because most tokens are still predicted with significant unablated context; our experiments do not characterize sentence-internal modeling of syntactic well-formedness.) Results are shown in Figure 2 and discussed below.

Figure 2: Effect of word order on usable information. Bar labels show "change in ablated likelihood (ablated information)". The x axis shows ablated likelihood. Error bars represent 95% confidence intervals. Word-order changes that preserve local ordering remove only a small amount of information, while shuffling or replacement with thematically similar text remove more. (a) Mid-range condition (first 256 tokens after ablation); (b) long-range condition (tokens 256–512 after ablation).

Overall word order

shuffle all:
61 N.V., director the of Mr. Vinken Dutch group. as nonexecutive the 29. is Vinken, years Elsevier join old, publishing a Nov. will Pierre board chairman

shuf. trigrams globally:
publishing group. N.V., the Dutch Mr. Vinken is join the board as a nonexecutive years old, will chairman of Elsevier Pierre Vinken, 61 director Nov. 29.

In the shuffle all ablation, f shuffles words uniformly at random, forcing the model to treat ablated context as a bag of words. In the shuf. trigrams globally ablation, the context is divided up into non-overlapping trigrams, the order of which is then permuted uniformly at random. Shuffling all words removes 41% of usable information in the mid-range condition and 84% in the long-range condition: ordering information is important even very far from the target. On the other hand, shuffling all trigrams removes 31% of usable information in the mid-range condition and 50% in the long-range condition: local co-occurrence statistics carry a significant amount of usable information.
Word order within sentences

shuf. within sent.:
61 director as the old, join will a Nov. board nonexecutive years Vinken, 29. Pierre is publishing the Vinken N.V., Mr. group. chairman Elsevier of Dutch

shuf. within trigrams:
Vinken, Pierre 61 will old, years the board join a nonexecutive as Nov. director 29. Mr. Vinken is of Elsevier chairman the Dutch N.V., group. publishing

shuf. trigrams within sent.:
years old, will as a nonexecutive join the board Pierre Vinken, 61 director Nov. 29. N.V., the Dutch chairman of Elsevier Mr. Vinken is publishing group.

Words are shuffled only within sentences according to one of three procedures: (1) a uniform random permutation of all the words in the sentence (shuf. within sent.), (2) a uniform random permutation of the words within each non-overlapping trigram in the sentence (shuf. within trigrams), and (3) a uniform random permutation of the order of the trigrams within the sentence (shuf. trigrams within sent.). (1) and (2) were also recently explored by Pham et al. (2020) in models for entailment, and more complex shuffling procedures have been explored in neuroscience contexts (Mollica et al., 2020). Here, (2) and (3) are chosen because they preserve local co-occurrence statistics ((3) more than (2)), while (2) also preserves the general linear information flow of the sentence.

Notably, the shuf. within trigrams (14% and 41%) and the shuf. trigrams within sent. (16% and 35%) ablations both remove relatively little usable information in both the mid- and long-range conditions. Usable information is decreased only slightly by ablations that preserve local co-occurrence statistics and/or linear information flow. (This includes transformations like man bites dog → dog bites man with significant effects on semantics!) In the long-range condition, uniform shuffling within sentences produces a larger effect, removing 55% of usable information.

Sentence order

sent. shuf.:
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

Next, sentences are shuffled within the context while their internal word order is unchanged. In the mid-range condition, this produces results comparable to the trigram shuffling experiments above (removing 17% of usable information); in the long-range condition, it has an even smaller effect (14%). Together with the previous experiment these results suggest that prediction accuracy depends on information about local word co-occurrence, but not fine-grained word order or global position.
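The order ablations above are all simple word-level permutations. The functions below are an assumed re-implementation (operating on pre-split word and sentence lists), not the paper's released code; tokenization and sentence segmentation are left out, and the within-sentence variants apply the same trigram operations separately to each sentence's word list.

```python
# Assumed sketches of the order ablations: shuffle all, shuf. trigrams globally,
# shuf. within trigrams, and sent. shuf.

import random

def shuffle_all(words):
    """shuffle all: uniform random permutation of every word in the ablated context."""
    words = list(words)
    random.shuffle(words)
    return words

def _trigrams(words):
    return [words[i:i + 3] for i in range(0, len(words), 3)]

def shuffle_trigrams_globally(words):
    """shuf. trigrams globally: permute the order of non-overlapping trigrams."""
    chunks = _trigrams(list(words))
    random.shuffle(chunks)
    return [w for chunk in chunks for w in chunk]

def shuffle_within_trigrams(words):
    """shuf. within trigrams: shuffle the words inside each non-overlapping trigram."""
    out = []
    for chunk in _trigrams(list(words)):
        random.shuffle(chunk)
        out.extend(chunk)
    return out

def shuffle_sentences(sentences):
    """sent. shuf.: permute sentence order, leaving each sentence's internal order intact."""
    sentences = list(sentences)
    random.shuffle(sentences)
    return sentences
```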
Order of entire sections

replace w/ old:
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a nonexecutive director of this British industrial conglomerate.

A possible hypothesis about LM behavior is that the main function of long-range context is to provide more information about the general topic of the document, including clues about vocabulary and style. To test this, the ablation replaces its entire input with the 512 tokens that immediately precede it in the source document (which in general will be topically similar). This transformation removes significant information in both mid- and long-range conditions (55% and 69%). Long-range context is not simply a source of topic information: earlier text on the same theme is in some cases nearly as uninformative as no text at all.

3.2 Do all words matter?

Our next experiments focus on lexical rather than structural information, using ablations that delete selected words from the context. Training and evaluation setups are exactly as in Section 3.1. Here, unlike the previous section, ablations will generally cause the number of tokens in a given context to decrease; in this case ablations also insert padding tokens at the beginning of the context window to preserve the original number of tokens. Results are shown in Fig. 3.

Figure 3: Effect of word identity on usable information. Labels are as in Fig. 2. Several ablations, including deletion of all words except nouns, preserve most usable information in the mid-range condition, and improve model accuracy in the long range. (a) Mid-range condition (first 256 tokens after ablation); (b) long-range condition (tokens 256–512 after ablation).

Parts of speech

N:
Pierre Vinken years board director Nov. Mr. Vinken chairman Elsevier N.V. publishing group

N & VB:
Pierre Vinken years will join board director Nov. Mr. Vinken chairman Elsevier N.V. publishing group

cont. words (N & VB & ADJ & ADV):
Pierre Vinken years old will join board nonexecutive director Nov. Mr. Vinken chairman Elsevier N.V. Dutch publishing group

func. words:
, 61 , the as a 29 . is of , the .

As in the initial example from Section 2, we retain only words whose part of speech tag is in a given set. We use the spaCy model (Honnibal et al., 2020) for part-of-speech tagging, and examine five sets: (1) nouns only, (2) nouns and verbs, (3) nouns, verbs, and adjectives, (4) content words (nouns, verbs, adjectives, and adverbs), and (5) function words (all words except nouns, verbs, adjectives, and adverbs).

In the mid-range condition, deleting all words but nouns removes only 20% of usable information; deleting all but nouns and verbs removes only 13%. Most usable information, even in mid-range context, appears to be captured by nouns and verbs. Retaining only function words causes a considerably greater loss of information.

In the long-range condition, results are even more striking: retaining only content words improves predictions over the "full information" experiment. Like Shannon information, V-information is defined to be non-negative (Xu et al., 2020), and the result in Fig. 3 is a consequence of our finite-sample approximation based on held-out likelihood. The effect is robust across multiple training runs from random initializations. As there is a significant gap between the training and validation perplexity of our model (roughly 11%), we hypothesize that this change occurs because the ablation preserves semantic content while reducing the original model's ability to overfit. We believe this is an important subject for future investigation.
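A minimal, assumed implementation of this family of ablations is sketched below using spaCy's coarse part-of-speech tags. The exact tag sets (for example, whether proper nouns and auxiliaries are counted) and the word-level padding are assumptions rather than details taken from the paper; the func. words ablation would keep the complement of the content-word tags.

```python
# Assumed sketch of the part-of-speech ablations: keep only words whose coarse spaCy POS
# tag is in a chosen set, then left-pad so the ablated context keeps (roughly) its original
# number of word-level positions.

import spacy

nlp = spacy.load("en_core_web_sm")

POS_SETS = {
    "N":            {"NOUN", "PROPN"},
    "N & VB":       {"NOUN", "PROPN", "VERB", "AUX"},
    "N & VB & ADJ": {"NOUN", "PROPN", "VERB", "AUX", "ADJ"},
    "cont. words":  {"NOUN", "PROPN", "VERB", "AUX", "ADJ", "ADV"},
}

def pos_ablation(text: str, keep: set, pad_token: str = "<pad>") -> str:
    tokens = [tok for tok in nlp(text) if not tok.is_space]
    kept = [tok.text for tok in tokens if tok.pos_ in keep]
    return " ".join([pad_token] * (len(tokens) - len(kept)) + kept)

print(pos_ablation("Pierre Vinken, 61 years old, will join the board as a "
                   "nonexecutive director Nov. 29.", POS_SETS["N & VB"]))
```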
Named entities

named entities:
Pierre Vinken 61 years old Nov. 29 Vinken Elsevier N.V. Dutch

As an alternative to the topic hypothesis evaluated under "Order of entire sections" above, we might hypothesize that long-range contexts are useful because they provide a reservoir of named entities likely to be referred to again. Here, the ablation retains only spans tagged as named entities or quantities by spaCy. While significantly worse than the noun ablation discussed above, retaining only entities removes only about a third of usable information in both conditions (39% and 31%).

Word frequency

common:
Pierre years old join board director . Mr. chairman Dutch publishing group .

rare:
Vinken nonexecutive Nov. Vinken Elsevier N.V.

Another natural question is whether rare words or frequent words are more important: information about frequent context words might help models estimate fine-grained document-level frequencies of those words, which account for most of the terms in Eq. (7); rare words are likely to be more informative about the content of the document itself.

We partition the vocabulary into a set of rare words, corresponding to the least frequent ∼98% of word types and 20% of word tokens, and frequent words, the most frequent ∼2% of types and 80% of tokens. Both ablations remove a significant amount of information relative to the POS-based ablations above, but retaining only frequent words improves perplexity relative to rare words in both the mid- and long-range conditions.
Appendix B presents versions of these experiments trained and evaluated on even longer contexts. Conclusions are largely the same as above.

3.3 Evaluating on augmented data

We motivated the use of V-information in Section 2 by arguing that it more clearly distinguished between prediction errors attributable to loss of information and prediction errors attributable to malformed and out-of-distribution model inputs. To put our results in context, we repeat several of the previous experiments in the evaluation paradigm of Khandelwal et al. (2018), which is designed to measure test-time sensitivity rather than usable information.

We train a new model to minimize L(θ, 512 ∼ 1024) while randomly truncating the first 512 context tokens and replacing them with padding tokens (to ensure that the model has seen padding tokens at training time). We then evaluate this model on the set of ablations shown in Section 3.1 and Section 3.2. For the full information model in Fig. 4, we evaluate on ordered context windows with no padding tokens; for the no information model, we evaluate on context windows in which the first 512 tokens are all padding tokens.

Figure 4: Loss of information resulting from ablations at evaluation time only. x-axis and labels show ablated negative log-likelihoods. Some locality-preserving ablations (high PMI, shuf. sent.) have a small effect, but most affect likelihood significantly (including lexical ablations that do not remove usable information). (a) Mid-range condition (first 256 tokens after ablation); (b) long-range condition (tokens 256–512 after ablation).

In the mid-range condition, the least destructive ablations are shuffling within trigrams and shuffling the order of trigrams within sentences: models appear to be reasonably robust to this kind of data transformation without specific training on it. Importantly, lexical ablation experiments have a large impact in this evaluation, underlining the extent to which the two experimental paradigms characterize different aspects of model behavior. Figure 5 in Appendix A shows a side-by-side comparison of these experiments and the ones in Sections 3.1–3.2.
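The training-time augmentation used for this paradigm (randomly replacing part of the long-range context with padding so the model has seen padding tokens during training) is simple to implement; the sketch below is one assumed way to do it at the token-id level, and the choice of a uniformly random cut point is an assumption.

```python
# Assumed sketch of the padding augmentation for the eval-only paradigm: for each
# 1024-token training window, blank out a random-length prefix of the first 512 tokens
# with the padding id, leaving the final 512 tokens untouched.

import random

def randomly_pad_prefix(token_ids, pad_id, prefix_len=512):
    cut = random.randint(0, prefix_len)          # how many leading tokens to replace
    return [pad_id] * cut + list(token_ids[cut:])
```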
3.4 Making better language models?

The lexical ablation experiments in Section 3.2 indicated that model accuracy could be improved by selective deletion of context words. Can this effect be exploited to further improve models? As a simple experiment, we attempted to replace all padding tokens in the nouns+verbs ablation of Section 3.2 with nouns and verbs from further back in the context—effectively providing the model with an even longer-range view of an informative context representation.

This experiment slightly increased usable information in the mid-range condition (0.2%), but decreased it in the long-range condition (0.6%). Longer contexts, even of a kind previously found to be informative, did not provide additional usable information. These results are consistent with our earlier hypothesis that the previously observed effect resulted from a reduction in overfitting—if removing information increased performance by reducing overfitting, then it is reasonable that adding information back results in more overfitting.

4 Related Work

Context in count-based and discriminative LMs
The earliest learned LMs were count-based (e.g., Kneser and Ney, 1995): they estimated p(x_n | x_{0:n}) based on a (smoothed) empirical n-gram frequency #(x_{0:n}) / #(x_{0:n−1}) (where #(x) is the number of times the sequence x appears in training data). As the number of distinct n-gram counts grows exponentially in n, it was typically set to a small value. Count-based models have a clear dependence on context: any token within the last n words that also appears in a training n-gram is relevant; anything further back is not.

Subsequent models improved on these by allowing the use of skip-grams, caches, and feature-based models (Goodman, 2001; Bengio et al., 2003). Some of these in principle allowed the use of unlimited-length contexts, but only by imposing strong restrictions on the ways in which context features could interact.

Context in RNN LMs
Recurrent neural network language models (Mikolov et al., 2010; Elman, 1990) provide a more expressive mechanism for the use of long-range context: models write to a recurrent "state vector" which can be carried arbitrarily far into the future. Computational issues limit the effective context size such models can be practically trained on, but this size is still significantly greater than that of the models mentioned above: as previously noted, Khandelwal et al. (2018) revealed influence from up to 200 tokens of context. Similar effects are reported by Sankar et al. (2019) for neural dialogue models, and Li et al. (2016) describe an alternative procedure for ablating contexts.

Context in Transformer LMs
Transformers introduce yet another mechanism for extracting information from long-range context: attention. Attention is also used with RNNs, but typically with just a single head—the hidden state still carries most of the information. In transformers, context enters into predictions primarily via unbounded random access. These models appear to benefit from significantly longer contexts than previous models.

Some recent work investigates the behavior of individual transformer attention heads (Clark et al., 2019; Voita et al., 2019). This work finds that certain attention heads are sensitive to things like word frequency, positional information, and certain syntactic phenomena. While extremely informative about the computational structures implemented by fixed models, these approaches do not necessarily reveal anything about usable information: indeed, patterns of attention do not necessarily correlate with model predictions (Jain and Wallace, 2019).

Other related work
Our finding that fine-grained ordering information contributes little usable information is consistent with Rae et al. (2019)'s finding that long-range contexts could be informatively summarized in fixed-size vectors; our finding that most usable information is carried by nouns is consistent with earlier findings about both specialized neural architectures (Henaff et al., 2016) and discourse representations in feature-based models (Barzilay and Lapata, 2008). Our approach also shares similar motivations to information-theoretic work on probing (Voita and Titov, 2020; Pimentel et al., 2020), which uses related tools to interpret linguistic structure in LM representations rather than characterizing their effect on LM predictions. Several recent papers have explored the effect of training-time and test-time ablations in models for other data analysis tasks: Pham et al. (2020) find that shuffling experiments have a limited effect on the accuracy of models for natural language inference, while Perez et al. (2021) describe several experiments aimed at introducing usable information for several question answering and sentence understanding tasks.
5 Discussion

We have investigated the extent to which transformer models can use structural and lexical information in long-range contexts for English language modeling. Experiments demonstrated that this information is primarily contained in content words and local ordering statistics: ablations that remove other kinds of information from context have little effect on models' predictive accuracies. In contrast, retaining only information about document identity or named entities causes significant drops in predictive accuracy: the effectiveness of long contexts is not explained by the presence of topic or named entity information alone.

Crucial to obtaining these results was a measure of ablated usable information grounded in the accuracy of models trained and tested on ablated contexts. Past work on context in LMs has primarily measured the influence of evaluation-time ablations. Sometimes these two notions of context-sensitivity coincide (e.g., trigram shuffling) and sometimes they do not (e.g., removal of lexical information). Our results also offer a jumping-off point for future modeling work. They motivate more efficient, compressed context representations that better preserve the information that is usable by current models. They motivate more accurate models by developing new context representations that make currently unusable information more prominent.

Several questions remain unanswered by our experiments. Do ablations affect the quality of text generated by models? (In particular, does the usable information added by long contexts improve predictability of syntax, semantics, or simply document-level word frequency statistics?) More fundamentally, do observations about usable information reflect limitations of transformers or fundamental, (Shannon-)information-theoretic properties of English? Our results suggest that at least some of these effects are model-specific: deleting function words cannot add information, but improves held-out model accuracy. A complete answer to this question will require more detailed exploration, including a better understanding of human predictions in comparable settings.

Acknowledgments

Thanks to Carina Kauf, Greta Tuckute, Evelina Fedorenko, and Roger Levy for valuable discussions. We acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing HPC resources that contributed to the results reported within this paper.

Impact Statement

Across initial exploration, evaluation conditions and training runs, experiments in this paper required roughly 100 training runs on the WikiText-103 dataset. As discussed in Section 2, model size and batched evaluation were both used to minimize the energy demands of these experiments; experiments themselves were performed at the Massachusetts Green HPC center, a carbon-neutral supercomputing facility. Ultimately, results in Section 3 provide guidance toward the design of models that use context more efficiently and motivate the large-scale empirical study conducted here.

References

Regina Barzilay and Mirella Lapata. 2008. Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1):1–34.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Jennifer C. Lai, and Robert L. Mercer. 1992. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31–40.

Ralf D. Brown. 2011. The CMU-EBMT machine translation system. Machine Translation, 25(2):179.

Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. 2020. Language GANs falling short. In International Conference on Learning Representations.

Ozan Caglayan, Pranava Madhyastha, and Lucia Specia. 2020. Curious case of language generation evaluation metrics: A cautionary tale. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2322–2328, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. URL https://openai.com/blog/sparse-transformers.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Joshua T. Goodman. 2001. A bit of progress in language modeling. Computer Speech & Language, 15(4):403–434.

Michael Hahn. 2020. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156–171.

Tatsunori Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1689–1701, Minneapolis, Minnesota. Association for Computational Linguistics.

Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. 2016. Tracking the world state with recurrent entity networks. In ICLR.

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In NAACL-HLT.

Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 284–294, Melbourne, Australia. Association for Computational Linguistics.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In International Conference on Learning Representations.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In 1995 International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 181–184. IEEE.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. CoRR, abs/1609.07843.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

F. Mollica, Matthew Siegelman, Evgeniia Diachek, S. Piantadosi, Zachary Mineroff, Richard Futrell, Hope H. Kean, Peng Qian, and E. Fedorenko. 2020. Composition is the core driver of the language-selective network. Neurobiology of Language, 1:104–134.

Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. Rissanen data analysis: Examining dataset characteristics via description length. arXiv preprint arXiv:2103.03872.

Thang M. Pham, Trung Bui, Long Mai, and Anh Nguyen. 2020. Out of order: How important is the sequential order of words in a sentence in natural language understanding tasks? arXiv preprint arXiv:2012.15180.

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. 2020. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4609–4622, Online. Association for Computational Linguistics.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, and Timothy P. Lillicrap. 2019. Compressive transformers for long-range sequence modelling.

Chinnadhurai Sankar, Sandeep Subramanian, Christopher Pal, Sarath Chandar, and Yoshua Bengio. 2019. Do neural dialog systems use the conversation history effectively? An empirical study. arXiv preprint arXiv:1906.01603.

Claude E. Shannon. 1948. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 183–196, Online. Association for Computational Linguistics.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.

Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Yilun Xu, Shengjia Zhao, Jiaming Song, Russell Stewart, and Stefano Ermon. 2020. A theory of usable information under computational constraints. In International Conference on Learning Representations.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2016. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
A  Comparison of Experimental Paradigms

In Figure 5 we show the contrast between the experimental paradigm of Sections 3.1–3.2 and that of Section 3.3. Especially for the experiments involving parts of speech, we see a significant difference in both the quantitative and qualitative results across the two paradigms.

Figure 5: Comparison of model performance in the train+eval and eval-only settings. The units represent the percentage of the gap between the full information and no information models/contexts. That way, if a point falls on the dotted y = x line, then that ablation has the same relative effect in each paradigm. If a point falls above the dotted line, then that ablation leads to better relative performance in the train+eval paradigm, and if a point falls below the dotted line, then that ablation leads to better relative performance in the eval-only paradigm. (a) Mid-range condition (first 256 tokens after ablation); (b) long-range condition (tokens 256–512 after ablation).

B  Longer Context Window

Here we report the results of repeating the experiments of Sections 3.1 and 3.2 with ablated contexts of size 1024 tokens instead of 512 tokens in order to verify that the behavior we observed is not specific to the size of context window we chose.

Figure 6: Effect of word order on usable information. Bar labels show "change in ablated likelihood (ablated information)". The x axis shows ablated likelihood. Error bars represent 95% confidence intervals. Ablated contexts contain 1024 tokens, but results are consistent with results on 512-token contexts. (a) Mid-range condition (first 256 tokens after ablation); (b) long-range condition (tokens 256–512 after ablation).

Figure 7: Effect of word identity on usable information. Labels are as in Fig. 6. Ablated contexts contain 1024 tokens, but results are consistent with results on 512-token contexts. (a) Mid-range condition (first 256 tokens after ablation); (b) long-range condition (tokens 256–512 after ablation).

C  Sample Generations

The purpose of this section is to verify that models trained on ablated contexts can still generate text that is comparable to text generated by a model trained with full contextual information. We select a prompt from a randomly chosen Wikipedia article in the WikiText-103 validation set; each model generates a sentence (after finishing the sentence in progress) given the appropriately ablated version of the prompt. The prompt consists of 768 tokens, the last 256 of which remain unchanged for all versions of the prompt, so that the ablations are in the long range relative to the point of generation. The prompt and generations are as follows:

prompt:

at two independent schools for boys: Sussex House School, a day school in Chelsea's Cadogan Square, and the City of London School, a day school on the North Bank of the River Thames in London's financial district (known as the City of London). Attending school became difficult for Radcliffe after the release of the first Harry Potter film, with some fellow pupils becoming hostile, though he says it was people just trying to "have a crack at the kid that plays Harry Potter" rather than jealousy.

As his acting career began to consume his schedule, Radcliffe continued his education through on-set tutors. He admitted he was not very good at school, considering it useless and finding the work "really difficult." He achieved A grades in the three AS-level exams that he took in 2006, but decided to take a break from education and did not go to college or university. Part of his reasoning was that he already knew he wanted to act and write, and that it would be difficult to have a normal college experience. "The paparazzi, they'd love it," he told Details magazine in 2007. "If there were any parties going on, they'd be tipped off as to where they were."

= = Career = =

= = = Harry Potter = = =

In 2000, producer David Heyman asked Radcliffe to audition for the role of Harry Potter for the film adaptation of Harry Potter and the Philosopher's Stone, the best-selling book by British author J.K. Rowling. Rowling had been searching for an unknown British actor to personify the character, and the movie's director Chris Columbus recalled thinking, "This is what I want. This is Harry Potter", after he saw a video of the young actor in David Copperfield. Eight months later, and after several auditions, Radcliffe was selected to play the part. Rowling also endorsed the selection saying, "I don't think Chris Columbus could have found a better Harry." Radcliffe's parents originally turned down the offer, as they had been told that it would involve six films shot in Los Angeles. Warner Bros. instead offered Radcliffe a two-movie contract with shooting in the UK; Radcliffe was unsure at the time if he would do any more than that.

The release of Harry Potter and the Philosopher's Stone (released as Harry Potter and the Sorcerer's Stone in the United States) took place in 2001. Radcliffe received a seven figure salary for the lead role, but asserted that the fee was "not that important" to him; his parents chose to invest the money for him. The film was highly popular and was met with positive reviews, and critics took notice of Radcliffe: "Radcliffe is the embodiment of every reader's imagination. It is wonderful to see a young hero who is so scholarly looking and filled with curiosity and who connects with very real emotions, from solemn intelligence and the delight of discovery to deep family longing," wrote Bob Graham of the San Francisco Chronicle.

A year later Radcliffe starred in Harry Potter and the Chamber of Secrets, the second installment of the series. Reviewers were positive about the lead actors' performances but had polarised opinions on the movie as a whole. The 2004 release Harry Potter and the Prisoner of Azkaban was the third film in the series. Radcliffe's performance was panned by New York Times journalist A.O. Scott, who wrote that Watson had to carry him with her performance. Next was Harry Potter and the Goblet of Fire in 2005. The film was the second-highest gros
full information:
sing Potter film in the series at the time of its release. After appearing in Harry Potter and the Philosopher's Stone, Radcliffe tried to avoid trying to improve his role.

shuffle all:
sing Potter film in the series (behind Harry Potter and the Chamber of Secrets), the highest grossing Harry Potter film in the series. The following year, Harry Potter and the Goblet of Fire was the top-grossing Potter film, making it the highest grossing Harry Potter film in the series.

shuf. within sent.:
sing Potter film in the series (until 2009's Gold Glove), selling 1.5 million tickets in its first week. The title of the film's sequel was also a major factor in the success of Philosopher, while its underperformance dropped because of series 3's Sci-Fi Channel.

shuf. within trigrams:
sing Potter film at the time of its release. The film had a financial success worldwide and received five Academy Award nominations at the 56th Academy Awards.

shuf. trigrams within sent.:
sing Potter film in the series at the United States box office. His appearance in Harry Potter and the Philosopher's Stone did not make it to the United States until 2006.

shuf. trigrams globally:
sing Potter film at the time and was the third highest grossing film of 2006, the highest-grossing film in the series. In 2006, Radcliffe acted in his first major role in Harry Potter and the Deathly Hallows – Part I, where he played Harry's Chief Engineer, Harry Potter's primary antagonist.

sent. shuf.:
sing Potter film in the series at the time of its release. At the end of 2009, new backers were looking for a leading role in the film, with Radcliffe starting the role of Harry Potter and the Philosopher's Stone.

named entities:
sing animated film in the series.
= = Appearances extended Persons (1990 – 2014) = =

rare:
sing animated film in the series.
= = Part two = =

common:
sing animated film in the series. A review in The New York Times found that Hans was not as strong as Watson but as well as Mr. Trough and Mr. Trough.

cont. words:
sing Harry Potter film of all time after Harry Potter and the Goblet of Fire, the highest-grossing film of the 2007–2008 financial year. The film's success meant that he and his wife bought a four-bedroom apartment in the 1960s, "ready to keep them warm", adding "They could never make another effort to make something exciting for people who were usually very aware of the housing shortage but hadn't been very conscious of that".

N & VB & ADJ:
sing animated film in the series after The Lion King. A similar amount of praise came in 2008 when The Magic King was released.

N & VB:
sing animated film in the series. A significant amount of critical reviews were given by publications in The Washington Post, San Francisco Chronicle, and The Washington Post. The film was released in theaters in 2008.

N:
sing animated film in the series.
= Appearances = =