Time-Aware Language Models as Temporal Knowledge Bases
Bhuwan Dhingra*, Jeremy R. Cole*, Julian Martin Eisenschlos, Daniel Gillick, Jacob Eisenstein, William W. Cohen
Google Research
{bdhingra,jrcole,eisenjulian,dgillick,jeisenstein,wcohen}@google.com
(* Equal contribution.)

arXiv:2106.15110v1 [cs.CL] 29 Jun 2021

Abstract

Many facts come with an expiration date, from the name of the President to the basketball team Lebron James plays for. But language models (LMs) are trained on snapshots of data collected at a specific moment in time, and this can limit their utility, especially in the closed-book setting where the pretraining corpus must contain the facts the model should memorize. We introduce a diagnostic dataset aimed at probing LMs for factual knowledge that changes over time and highlight problems with LMs at either end of the spectrum—those trained on specific slices of temporal data, as well as those trained on a wide range of temporal data. To mitigate these problems, we propose a simple technique for jointly modeling text with its timestamp. This improves memorization of seen facts from the training time period, as well as calibration on predictions about unseen facts from future time periods. We also show that models trained with temporal context can be efficiently "refreshed" as new data arrives, without the need for retraining from scratch.

1 Introduction

Language models (LMs) have recently been suggested as repositories of real-world knowledge (Petroni et al., 2019) and there is much interest in using them for tasks such as closed-book question answering (QA; Roberts et al., 2020), fact verification (Lee et al., 2020) and open-ended dialogue (Adiwardana et al., 2020). Many facts, however, change with time. This raises two questions: Do pretrained LMs learn the appropriate temporal scope for the facts that they encode? And what is the best way to update temporally-scoped knowledge in models that have already been trained?

Pretraining corpora are typically derived from a snapshot of the web crawled at a specific moment in time (Liu et al., 2019; Yang et al., 2019; Raffel et al., 2019). While the impact on language modeling itself has been highlighted in recent work (e.g., Lazaridou et al., 2021; Röttger and Pierrehumbert, 2021; Hombaiah et al., 2021), there are several potential problems that are specific to the encoding of factual knowledge:

• Averaging: For temporally-scoped knowledge, the model may see conflicting information, e.g., "Lebron James plays for the Cavaliers / Lakers." Because LM training generally ignores temporal metadata, this can lead to an averaging effect, in which the model has low confidence in any of the correct answers.

• Forgetting: Corpora such as Wikipedia and web crawls are constantly growing, with documents distributed non-uniformly across time: there are more recent documents than older ones, both because old documents can be updated and because more web documents are generated recently than in the past. As a result, the model may fail to memorize facts that were valid only during underrepresented periods of time, and therefore do worse when asked questions about the more distant past.

• Poor temporal calibration: As language models become "stale", they are increasingly likely to be queried about facts that are outside the temporal scope of their training data. While it may seem undesirable for a model to guess the answer to such questions, there are many cases in which it is perfectly reasonable to assume that the future will be like the present: for example, in twenty years the capital of Alaska is unlikely to change, even though the governor of Alaska is nearly impossible to predict. Ideally, the confidence with which the model responds to such queries should reflect this difficulty.
Temporally-scoped facts are common in practice; however, QA datasets such as SQuAD (Rajpurkar et al., 2018) or Natural Questions (Kwiatkowski et al., 2019) focus on a single time period, even for questions whose answers are temporally scoped. Thus, our first contribution in this paper is a diagnostic dataset, TempLAMA (short for TEMPoral LAnguage Model Analysis), of fill-in-the-blank queries for probing time-sensitive knowledge in LMs. The queries in TempLAMA are chosen such that the answer varies with time (§ 2.1). Using this dataset, we find empirical evidence of the problems mentioned above (§ 3).

As a first step towards addressing these problems, we propose a lightweight modification to pretraining. We parametrize the masked language modeling objective (MLM; Devlin et al., 2019) with temporal information, P(y|x, t; θ), where y is a masked token or span, x is the textual context, and t is the time (§ 2.3). The parameters θ must learn a representation of both text and time. In the T5 framework (Raffel et al., 2019), this can be accomplished by prefixing the input x with a string representation of t, e.g. "year: 2018". In addition, we pretrain from documents that are uniformly sampled from the timespan of the training corpus which, in our case, consists of news articles ranging from 2010-2018 (Lazaridou et al., 2021) (§ 2.1). These interventions accomplish two goals: the model is exposed to facts from the entire time range instead of just the most recent one, which avoids forgetting certain temporally scoped facts. Additionally, it prevents averaging because the facts are assigned to different time buckets (in our case years). This leads to improved recall of facts from the timespan of the training corpus (§ 3.1).

These interventions also improve the model's temporal calibration. We find that jointly modeling text and time improves perplexity on future years unseen during training. On TempLAMA, the joint model degrades more gracefully than a model unaware of time. We also examine the model's calibration farther into the future using hand-crafted sets of queries whose answer is likely to change frequently, rarely, or never. We find qualitative evidence that the entropy of models trained uniformly across the training timespan increases most rapidly for the frequently-changing facts (§ 3.2).

While calibration is desirable, models should be refreshed with new data when it becomes available. A standard practice for doing this is to combine the new and old data and retrain the model from scratch (e.g., Liu et al., 2021), but retraining can be costly for large-scale models (Strubell et al., 2019). On the other hand, finetuning only on the new data leads to catastrophic forgetting of the old data (Zhu et al., 2020), since standard LMs have no knowledge of what is "new" and what is "old", unlike a model trained with temporal context. We show that our temporally-scoped pretraining procedure makes LMs more amenable to post-hoc finetuning, as the data is implicitly bucketed into non-overlapping time slices. We observe a similar performance to models retrained from scratch with 30× fewer steps, and without degradation on the knowledge encoded by the older data (§ 3.3).

Summary of contributions: (1) We offer TempLAMA, a new dataset of temporally-scoped knowledge probes; (2) We propose a simple modification to pretraining that facilitates the acquisition of temporal knowledge; (3) We conduct evaluations that demonstrate the impact of temporal shift on the knowledge encoded by existing LMs and the improvements offered by temporally-scoped pretraining; (4) We perform a qualitative analysis of temporal calibration into the future, again demonstrating the positive impact of temporally-scoped pretraining; (5) We show that temporally-scoped pretraining also facilitates efficient updates to existing pretrained LMs.

2 Methods

We probe factual knowledge in masked LMs using span prediction—given an input statement x with a span y replaced by a special character, the task is to reconstruct that span. Additionally, we assume that each (x, y) pair has a timestamp t denoting the time at which it was written or a point in time at which its assertion is valid. In this paper, we discretize t into yearly buckets and leave more fine-grained groupings (e.g. at the level of months or days) for future work.
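To make this setup concrete, here is a minimal Python sketch (not the authors' code) of turning a dated sentence into a salient-span-masking example; the Example class, the make_ssm_example helper, and the "_X_" mask marker are illustrative choices based on the examples shown in Figure 1 and Table 1.

```python
from dataclasses import dataclass

MASK = "_X_"  # mask marker used in the paper's examples

@dataclass
class Example:
    """One salient-span-masking example: masked text, target span, and year bucket."""
    inputs: str   # sentence with the salient span replaced by the mask
    target: str   # the masked span (a named entity or date)
    year: int     # the timestamp t, discretized to a yearly bucket

def make_ssm_example(sentence: str, span: str, year: int) -> Example:
    """Replace one salient span (e.g. a named entity) in the sentence with the mask."""
    if span not in sentence:
        raise ValueError(f"span {span!r} not found in sentence")
    return Example(inputs=sentence.replace(span, MASK, 1), target=span, year=year)

# Usage, mirroring Figure 1:
ex = make_ssm_example(
    "Kamala Harris, the junior Senator from California, today proposed ...",
    span="Kamala Harris",
    year=2018,
)
print(ex.inputs)  # _X_, the junior Senator from California, today proposed ...
print(ex.target)  # Kamala Harris
```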
For simplicity and efficiency, all of our models are text-to-text Transformers (Vaswani et al., 2017) initialized from publicly available T5 checkpoints (Raffel et al., 2019) and then adapted to more time-dependent datasets. We first describe these datasets, followed by the approaches for jointly modeling text and time.

2.1 Datasets

We experiment with a large-scale news corpus (CustomNews) for pretraining our models, combined with a smaller diagnostic dataset of factual queries (TempLAMA) for evaluation.

CustomNews: The CustomNews dataset is a subset of web documents that are determined to be news (Lazaridou et al., 2021) and have an associated date that is of high confidence. We adapt this dataset in two main ways. First, we focus on a subset created by randomly sampling 1M news articles from each of the years 2010-2020 which had the maximum number of articles. Second, while Lazaridou et al. (2021) used this data for classic autoregressive language modeling, we instead adapt it for the MLM objective. Specifically, we split the articles into sentences x and then identify salient spans y in the text corresponding to named entities and dates. The salient span masking (SSM) paradigm improves question answering performance in both open-book (Guu et al., 2020) and closed-book settings (Roberts et al., 2020). SSM restricts the inputs to those which have a higher chance of requiring world knowledge and better aligns with our objective of measuring the factual knowledge captured by the LMs. Following Guu et al. (2020), we identify named entities using a BERT-based tagger trained on CoNLL-2003 data (Tjong Kim Sang and De Meulder, 2003) and a regular expression for dates.

TempLAMA: We also construct a more targeted masked LM evaluation for probing temporally sensitive knowledge. Starting with the November 2020 Wikidata snapshot (Vrandečić and Krötzsch, 2014) we first identify all facts which have either a start or an end date after 2010 and whose subjects and objects are both entities with Wikipedia pages.[1] Among these 482K facts, we identify subject and relation pairs which have multiple objects at different times and select nine relations with the most such subjects. For these relations we manually write template cloze queries (e.g. "Subject works for __X__.") and populate them with the 1000 most frequent subjects per relation. For each subject we construct a separate query for each year from 2010 to 2020, varying the answer entity appropriately. For multiple correct answers in the same year, we only retain the most "popular" entity based on the number of facts about it in Wikidata. The query and the corresponding year form the inputs x and t, while the object entity is the target y. In total we construct 50,310 queries across 11 years.[2] Note that these types of cloze-style questions naturally follow the salient span masking paradigm, where the answer to the question is the span to be masked. Table 1 shows examples from both CustomNews and TempLAMA. A full list of the relations in TempLAMA and their template queries is included in Appendix A.

Table 1: Examples from CustomNews, which masks named entities and dates from news articles, and TempLAMA, a novel synthetic dataset of temporally-scoped factual statements built from Wikidata.

Year | Input | Target
CustomNews
2017 | The pound faces pressure from the US, but the _X_ election could hit euro | French
2020 | _X_ accused Liverpool of 'crossing the line' during win over his Chelsea side. | Frank Lampard
TempLAMA
2012 | Cristiano Ronaldo plays for _X_. | Real Madrid
2019 | Cristiano Ronaldo plays for _X_. | Juventus FC

[1] We use SLING (Ringgaard et al., 2017) for preprocessing.
[2] Code to reconstruct the data is available at https://github.com/google-research/language/tree/master/language/templama
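The per-year query construction described above can be sketched as follows; this is an illustrative re-implementation, not the released TempLAMA code, and the template string and helper names are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical template in the style of Appendix A.
TEMPLATE_P54 = "<subject> plays for <object>."

def yearly_queries(template: str, subject: str, obj: str,
                   start_year: int, end_year: Optional[int],
                   last_year: int = 2020) -> List[Tuple[int, str, str]]:
    """Instantiate one cloze query per year for which a temporally-scoped fact holds.

    The subject slot is filled with the entity name and the object slot becomes the
    mask "__X__"; the object entity is the answer. A missing end date is treated here
    as "still valid" (an assumption of this sketch).
    """
    query = template.replace("<subject>", subject).replace("<object>", "__X__")
    end = end_year if end_year is not None else last_year
    return [(year, query, obj) for year in range(start_year, end + 1)]

# Example with an illustrative fact:
for year, query, answer in yearly_queries(TEMPLATE_P54, "Cristiano Ronaldo",
                                          "Juventus FC", 2018, None):
    print(year, query, "->", answer)
```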
2.2 Training and evaluation

We train and evaluate each of our models on a mixture of CustomNews and TempLAMA. All models are initialized from a public T5 checkpoint, and then further adapted for 300K steps on our data. From CustomNews we hold out 2000 articles each for validation and testing from each of the yearly subsets. From TempLAMA we reserve 10% and 70% of the queries from each of the yearly subsets for validation and testing, respectively, ensuring that none of the subject entities overlap between train, validation, or test sets. Splitting along subject entities ensures that none of the facts required to answer the test queries are seen during training on TempLAMA (Lewis et al., 2021). Instead they must be learned in an unsupervised manner either from the T5 pretraining or when adapting to CustomNews. We train over the combination of the two training sets such that for every 1000 inputs from CustomNews, the model sees 1 input from TempLAMA. The purpose of mixing in queries from the latter is to inform the model about the expected format of the answers (e.g. "Liverpool F.C." vs "Liverpool"), as TempLAMA and CustomNews do not follow the same format.

We also partition the data into two groups based on the year: 2010-18 and 2019-20. Models are trained only on the former, but tested on both to measure their performance for both seen and future time periods. This split was informed by the fact that the T5 checkpoints were pretrained on web text extracted in April 2019. The main metric for evaluation is a token-level F1 score between the predicted and ground truth targets, which are always entities in both datasets. This is computed in the same way as is done for the SQuAD benchmark (Rajpurkar et al., 2018) and gives partial credit to answers which miss tokens or add extra ones.
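The token-level F1 can be sketched as below; this is a minimal re-implementation of SQuAD-style token F1 (the official script additionally normalizes punctuation and articles), not the exact evaluation code used in the paper.

```python
from collections import Counter

def token_f1(prediction: str, target: str) -> float:
    """Token-level F1 between a predicted and a gold entity string.

    Gives partial credit for overlapping tokens; punctuation/article normalization
    from the official SQuAD script is omitted here for brevity.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = target.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Liverpool", "Liverpool F.C."))  # ~0.67, partial credit
```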
2.3 Jointly Modeling Text and Time

Given a dataset of (x, y, t) triples we model P(y|x, t; θ) using variants of the T5 model where, given x as the input sequence, we maximize the likelihood of the target sequence y. We compare two approaches to condition the predictions on the time t (also see Figure 1).

Figure 1: Three training setups to train T5 on CustomNews: The Uniform model (left) is trained on all the data without explicit time information. The Yearly model (middle) avoids averaging over similar contexts by training separate models depending on the year, while the Temporal model (right) prepends a time prefix to each example.

Yearly: In the first approach we use the temporal context by training separate models specialized to different time buckets (in our case years), so P(y|x, t; θ) = P(y|x; θ_t). Hence, we train an ensemble of nine T5 models adapted to each year between 2010-2018 for an additional 300K steps. When provided with a test input, this approach routes it to the appropriate yearly expert based on its timestamp. If the timestamp falls outside 2010-18, we use the closest yearly expert (e.g. 2018 for all test inputs ≥ 2018).

Temporal: Training a separate expert for each time slice reduces the averaging across conflicting contexts (§ 1), but keeping an ensemble of large-scale LMs is undesirable in practice. Moreover, there are regularities in how often facts change (e.g. the FIFA World Cup happens every 4 years, whereas NBA Championships happen every year), which a model specialized to a single time slice might not be able to learn. Hence we also train a single T5 model on the entire dataset from 2010-2018 for 300K steps. In this model, the time t is concatenated to the input, i.e. P(y|x, t; θ) = P(y|t ⊕ x; θ), using a simple string representation of t as a prefix for the input x, e.g. "year: 2014".

Baselines: The T5 checkpoints released by Raffel et al. (2019) are pretrained on long inputs with multiple masks and cannot directly be tested using our factual knowledge probes. Instead, we establish a baseline on the datasets introduced above using the pretrained models from Roberts et al. (2020), which were trained using SSM on Wikipedia for an additional 100K steps. This is referred to as T5-CBQA (closed-book question answering). We also experiment with additionally finetuning this model on TempLAMA for 5K steps (T5-CBQA-ft).

The Temporal model has two interventions: it is trained on a uniform distribution of data from 2010-18, and it has temporal context. To isolate the effect of each intervention, we also train a Uniform model, which trains on the same data for the same number of steps, but without the time provided as an input. During training, examples are shuffled rather than presented in chronological order.

Hyperparameters: We primarily focus on the Large-sized T5 models with 770M parameters, but we also investigate the scaling with size by comparing to the Small (110M) and XXL (11B) versions. We use the same set of hyperparameters as Raffel et al. (2019), with a batch size of 2048, a fixed learning rate of 0.001 and a dropout rate of 0.1. All our models are trained for a fixed number of 300K steps, except when adapting to new data (§ 3.3), and then evaluated on the test set. We found the loss on held out CustomNews was still improving at the end of 300K steps, but the overall trends were stable; to limit the experimentation time we did not explore longer training runs.
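A minimal sketch of the two ways of conditioning on time described in § 2.3; the prefix format follows the "year: 2014 text: ..." pattern shown in Figure 1, and the function names are illustrative, not from the paper's codebase.

```python
def temporal_input(text: str, year: int) -> str:
    """Temporal model: prepend a string representation of the time to the input."""
    return f"year: {year} text: {text}"

def yearly_expert(year: int, first: int = 2010, last: int = 2018) -> int:
    """Yearly ensemble: pick the expert for the example's year, clamped to the
    training range (e.g. all test inputs from 2018 onwards go to the 2018 model)."""
    return min(max(year, first), last)

print(temporal_input("_X_, the junior Senator from California, today proposed ...", 2018))
# year: 2018 text: _X_, the junior Senator from California, today proposed ...
print(yearly_expert(2020))  # 2018
```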
3 Experiments

We design several experiments to highlight the problems around temporally-scoped knowledge in LMs and to test whether they can be addressed by joint models of text and time.

3.1 Memorizing Facts Across Time

To understand the interplay of memorization and time, we examine the TempLAMA and CustomNews performance on the 2010-18 slice. This permits us to analyze the forgetting and averaging effects discussed in § 1 by comparing models trained on different slices of the data and with or without the temporal context.

Results: Table 2 shows performance on the 2010-18 test sets of CustomNews and TempLAMA. T5-CBQA and T5-CBQA-ft fare significantly worse on TempLAMA (16.0) than the more standard Natural Questions benchmark (28.5, c.f. Roberts et al. (2020)). In particular, we find that training on the news domain leads to significant improvements on the temporally scoped knowledge required by TempLAMA (comparing T5-CBQA-ft and Uniform). The two approaches which condition the predictions on time, Yearly and Temporal, improve over Uniform which trains on the same data but without temporal context. The Yearly ensemble, however, has linearly more parameters and requires linearly more compute to train. For 2010-18, the Yearly model performs better on CustomNews, but the Temporal model is better on TempLAMA, which is focused only on temporally-scoped facts.

We show empirical evidence of averaging and forgetting effects in Figure 2, which plots the F1 score of the year-specific models as we vary the gap between test and train years. The performance drops quickly on both sides, showing forgetting; however, the decline is larger for future years. The right plot compares F1-scores on TempLAMA for queries grouped by the number of years for which their answer is valid. This is computed from the duration of their corresponding facts in Wikidata. The uniformly trained model has higher performance on queries whose answers persist for a long time, but it does worse on queries whose answers persist for less than 5 years. The opposite is true for the year-specific models, which is intuitive due to the averaging effect of training on data from long periods of time. Adding temporal context strikes a trade-off between these two extremes, leading to the overall higher F1 in Table 2.

Scaling: Table 3 shows the effect of increasing model size on the overall F1 scores on CustomNews and TempLAMA. In general, larger model sizes lead to a bigger improvement when training with temporal context.

Table 3: Overall F1-score averaged from 2010-20 for Uniform and Temporal models for different model sizes. Larger models benefit more from the temporal context.

Size | CustomNews Uniform | CustomNews Temporal | TempLAMA Uniform | TempLAMA Temporal
Small | 21.1 | 21.9 | 18.8 | 18.5
Large | 30.1 | 31.6 | 24.1 | 25.6
XXL | 32.3 | 33.8 | 25.6 | 27.7

CronQuestions: To explore whether the improved memorization of facts translates to downstream tasks, we finetune the Uniform and Temporal models on CronQuestions, a dataset of 410K time-dependent questions based on temporal knowledge graphs (Saxena et al., 2021). It consists of questions where the answer is either an entity or a temporal expression. Similar to TempLAMA, the questions are based on Wikidata across time. We focus on a closed-book version of the task, similar to the setup in Roberts et al. (2020), where the model is trained to predict the first answer in the list of correct answers for an input question. During evaluation, it is compared to each answer in the set of correct answers, and we take the maximum score among them. Table 4 lists the SQuAD-based EM and F1 metrics on the test set. We see an improvement in memorization for the Uniform and Temporal models, with the latter doing slightly better on the Large and XXL model sizes.

Table 4: Test set results for models finetuned on the CronQuestions dataset in a closed-book manner. "None" refers to finetuning the T5 baseline; the "2018" model is adapted to the 2018 slice of CustomNews.

Size | Model | EM | F1
Small | None | 3.63 | 9.51
Small | Uniform | 4.01 | 10.27
Small | Temporal | 4.05 | 10.20
Large | None | 4.10 | 10.78
Large | 2018 | 4.39 | 10.87
Large | Uniform | 4.70 | 11.34
Large | Temporal | 5.13 | 11.93
XXL | None | 5.44 | 12.19
XXL | Uniform | 5.71 | 12.61
XXL | Temporal | 5.81 | 12.88

3.2 Better Calibration in the Future

We examine the model's performance on future slices of data at two different time scales. In the first, we look at graceful degradation, mimicking the life-cycle of a model that has been deployed, and thus has not seen the newest slices of data yet. In the second, we ask the models to predict relations in the more distant future. While this may seem unreasonable, it is possible to articulate coherent intuitions about the future: for example, the capitals of U.S. states change far less frequently than their governors, and the probabilities emitted by language models should reflect this.

3.2.1 Graceful Degradation

Here we examine the TempLAMA and CustomNews performance on the 2019-20 slices. Note that none of the models were pretrained or adapted to this slice, so these experiments allow us to measure degradation. We additionally look at the perplexity of the masked LM, which we compute as:

    ppl = exp( − Σ_(x,y,t) log P(y|x, t; θ) / Σ_y len(y) ),

where the sums run over the evaluation examples.
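Given per-example target log-likelihoods and target lengths, the perplexity above can be computed as in this small sketch (illustrative only, with made-up numbers):

```python
import math
from typing import Iterable, Tuple

def mlm_perplexity(scored_examples: Iterable[Tuple[float, int]]) -> float:
    """Corpus-level masked-LM perplexity from (log P(y|x,t), len(y)) pairs:
    exp of the negative total log-likelihood divided by the total target length."""
    total_logprob = 0.0
    total_length = 0
    for logprob, target_length in scored_examples:
        total_logprob += logprob
        total_length += target_length
    return math.exp(-total_logprob / total_length)

# Illustrative numbers only: two targets of length 2 and 3 tokens.
print(mlm_perplexity([(-4.2, 2), (-7.5, 3)]))  # ~10.4
```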
Following Lazaridou et al. (2021), we expect perplexity to increase for slices that are not covered in the training data, but we expect the temporally-conditioned model to be relatively more robust.

Table 2: F1 scores of Large-sized model variants for salient span mask prediction on CustomNews and TempLAMA. T5-CBQA is the pretrained model from Roberts et al. (2020), and T5-CBQA-ft is further finetuned on TempLAMA. The Yearly model is an ensemble of 9 models each finetuned on a yearly slice of the training data between 2010 and 2018. We use the 2018 model when testing on 2019-20. The Uniform and Temporal models are trained on the entire data from 2010-18, and the latter has additional temporal context. The F1 scores are macro-averaged across the evaluation years. The Temporal model performs better on TempLAMA, which is focused only on temporally-scoped facts, as well as on the unseen years for CustomNews.

Model | #Parameters | CustomNews 2010-18 | 2019-20 | Overall | TempLAMA 2010-18 | 2019-20 | Overall
T5-CBQA | 737M | 20.2 | 19.8 | 20.1 | 4.7 | 3.9 | 4.6
T5-CBQA-ft | 737M | 15.2 | 15.7 | 15.3 | 16.0 | 14.1 | 15.7
Uniform | 737M | 30.6 | 27.8 | 30.1 | 25.4 | 18.2 | 24.1
Yearly | 6.6B | 33.4 | 26.7 | 32.2 | 25.5 | 20.0 | 24.5
Temporal | 737M | 32.1 | 29.5 | 31.6 | 26.7 | 20.3 | 25.6

Figure 2: F1 score of models trained on data from a specific year on CustomNews (Left) and TempLAMA (Middle) as the gap between test and train years varies. Negative gaps indicate that the model is tested on data from before the slice on which it was trained. The F1-score is macro-averaged across all possible pairs of train/test years between 2010-18. For comparison we also show the F1 score of Uniform and Temporal models averaged across 2010-18. Shaded area shows the 95% confidence interval around the macro-average. The performance drop on both sides shows the forgetting effect. (Right) F1 scores on TempLAMA grouped by the number of years for which the answer to a query persists. Shaded area shows the 95% confidence interval using bootstrap.

Results: Comparing the Uniform and Temporal models in Table 2, we can see that training with temporal context improves F1 scores on the 2019-20 slices. The Yearly ensemble, which uses the latest 2018 model when tested on 2019-20, is significantly worse on CustomNews but comparable on TempLAMA; potentially because some of the answers remain the same. A closer look at the model predictions reveals that, unsurprisingly, none of the models are able to predict the TempLAMA facts that change after the training period. Adding temporal context simply allows the Temporal model to persist the unchanged facts to 2019-20. On CustomNews it has higher performance on the SSM objective, which includes both dates and entities in articles from an unseen time period.

Table 5 shows MLM perplexity on the CustomNews test set. The Temporal model has lowest perplexities on both the seen and unseen slices of evaluation data. Interestingly, the Uniform model has lower perplexity than the Yearly one, especially on the future slices where we use the 2018 expert for the latter. Averaging over multiple contexts allows the Uniform model to learn a better-calibrated distribution of likely answers, whereas the Yearly model is often confidently incorrect.
Table 5: Masked language modeling perplexity on CustomNews (lower is better). The Temporal model degrades less when evaluated on the future time slice.

Model | 2010-18 | 2019-20
T5-CBQA | 26.11 | 29.22
Uniform | 11.68 | 14.37
Yearly | 13.62 | 23.30
Temporal | 11.33 | 13.58

Do the models learn how soon an answer is likely to change in the future? We do a qualitative analysis by partitioning the TempLAMA test queries where each model was correct in the 2018 evaluation into two sets: those with Single or Multiple answers across 2010-20. Then we measure the log-likelihood of that correct answer as we change the input year t from 2019 to 2029, and plot the change in log-likelihood relative to 2018 in Figure 3. For the T5-CBQA-ft and Uniform models, we vary the input years by prefixing queries with "In year,...". The confidence for all models decreases as we get into the future, which is reasonable since all relations in TempLAMA are time-sensitive. Additionally, the Temporal model makes a clear distinction between queries with single or multiple correct answers—its confidence decreases more rapidly for the latter. This reflects the intuition that facts which have changed in the past are likely to change again in the future.

Figure 3: Change in log-likelihood over time of the most recent answer (from 2018) for TempLAMA queries with Single or Multiple answers. The difference is taken from the value for the 2018 answer. The Temporal model exhibits a more pronounced confidence gap for facts that changed in the past.

3.2.2 Future Relations

To further probe the models' understanding of expected versus unexpected changes in the future, we curate a small diagnostic dataset of queries about future relations. We restrict the queries such that the answer is always either one of the 200 largest US cities or one of the 249 countries in the world. This allows us to compute the entropy of the predictions over a fixed set. To relate model predictions to commonsense intuitions, we construct three sets of queries based on how frequently they are expected to change: frequent, rare and never. For example, the location of an awards show might change every year, while the city an athlete plays in changes every few years, and the location of a landmark almost never changes. Then, given queries like "In 2022, the Space Needle will be in __X__." and "In 2022, the NBA All-Star Game will be in __X__.", a model with a reasonable representation of time should have higher entropy for the latter than the former. Moreover, the entropy should increase with time as the queries address the more distant future, and the rate of increase should be greatest for frequently-changing relations. In total, we constructed 86 queries across the three sets, which are included in Appendix B.

Results: Figure 4 shows the entropy of different model variants averaged across the three sets of queries and plotted over time. The baseline T5-CBQA-ft model has a low constant entropy throughout, irrespective of the query type. Combined with its low accuracy on future slices from Table 2, this suggests it remains confidently incorrect and has poor calibration about which facts are likely to change. Both the Uniform and Temporal models have increasing uncertainty in the future, which is ordered correctly according to intuition: highest for the queries of frequently-changing facts, and lowest for queries whose answers are expected not to change. Interestingly, the Temporal model has a largely constant entropy for rare- and never-changing queries until 2022, after which it begins to increase. While this agrees with intuition, ideally a model should have low entropy on the never-changing set further into the future.

Figure 4: Entropy over time for frequent, rare, and never-changing queries. The Temporal model is more uncertain about frequently changing queries as time passes, and has a flatter entropy for constant facts.
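The entropy used in this analysis is taken over a fixed candidate set (US cities or countries). A minimal sketch, assuming the model's log-likelihood has already been scored for each candidate answer (the candidate names and scores below are illustrative, not model outputs):

```python
import math
from typing import Dict

def entropy_over_candidates(candidate_logprobs: Dict[str, float]) -> float:
    """Entropy (in nats) of the model's distribution restricted to a fixed answer set.

    `candidate_logprobs` maps each candidate answer (e.g. a US city or a country) to
    the model's log-likelihood for that answer; scores are renormalized over the set.
    """
    m = max(candidate_logprobs.values())
    log_z = m + math.log(sum(math.exp(lp - m) for lp in candidate_logprobs.values()))
    probs = [math.exp(lp - log_z) for lp in candidate_logprobs.values()]
    return -sum(p * math.log(p) for p in probs if p > 0)

# Illustrative scores for "In 2022, the Space Needle will be in __X__.":
# one dominant answer gives low entropy, matching the "never-changing" intuition.
print(entropy_over_candidates({"Seattle": -0.1, "Portland": -5.0, "Denver": -6.0}))
```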
We also compare the accuracy of models in ordering pairs of queries from (never, rare) and (rare, frequent) sets by entropy: in each pair the entropy of the first query should be lower than that of the second one. This is plotted against the year tested in Figure 5. Again, we observe that the Temporal model has higher accuracy, but only until 2022, after which both Uniform and Temporal models are similar.

Figure 5: Accuracy over time of ordering pairs of queries from (never, rare) and (rare, frequent) sets based on the entropy assigned by the model.

Overall, these results suggest that: (1) models trained uniformly over a wide range of time-sensitive data show improved calibration about expected changes in the future; and (2) training with temporal context further improves this calibration for the first few years beyond the training period, in our case from 2019 to 2022. We also note the limitations with this evaluation, however: (1) due to manual curation by the authors there are only 86 queries in these sets, which are likely to be biased in the facts they probe; and (2) entropy mixes different kinds of uncertainty: that which is inherent in the query (e.g. there are more distinct countries than cities with NFL teams), as well as that due to the lack of confidence in the model. We are interested in the latter, but our evaluation does not disentangle the two effects.

3.3 Cheaper Adaptation to New Data

Improved calibration about the future can help minimize mistakes after the training time period (e.g. by abstaining), but eventually models need to be refreshed as the world changes and new data arrives. In this section, we consider the setting where we have an already trained model on the 2010-18 slices, as well as new data from the 2019 slice, and we attempt to maximize performance on both the heldout (2019-2020) and training (2010-2018) datasets of TempLAMA and CustomNews with as few additional training steps as possible. These experiments are similar to the task posed by Lazaridou et al. (2020), but we compare the impact of adapting versus retraining from scratch. We have already seen (§ 3.1) that training only on the newest data (2019) is suboptimal as the model forgets facts about the past. We observed the same to be true if finetuning an already trained model on the new data only, which was also reported by Zhu et al. (2020). On the other hand, combining the new and old data and retraining from scratch is costly for large-scale LMs (Strubell et al., 2019). Here we explore a simple alternative – finetuning already-trained models on a mix of old and new data with temporal context. We use a 50-50 split between old data from 2010-18 and new data from 2019 to update the Uniform and Temporal models.
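A sketch of the 50-50 adaptation mixture described above; this is illustrative only and is not the authors' data pipeline, which operates inside the T5 training setup rather than on Python lists.

```python
import random
from typing import List, Sequence

def adaptation_stream(old_data: Sequence, new_data: Sequence,
                      num_examples: int, seed: int = 0) -> List:
    """Sample old (2010-18) and new (2019) examples with equal probability, so the
    model keeps revisiting old slices while being updated on the new one."""
    rng = random.Random(seed)
    stream = []
    for _ in range(num_examples):
        source = old_data if rng.random() < 0.5 else new_data
        stream.append(rng.choice(source))
    return stream

# Illustrative usage with toy (input, target) pairs:
old = [("year: 2014 text: _X_, the junior Senator from California ...", "Barbara Boxer")]
new = [("year: 2019 text: _X_, the junior Senator from California ...", "Kamala Harris")]
print(adaptation_stream(old, new, num_examples=4))
```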
Results: Figure 6 shows how the F1-score changes on CustomNews and TempLAMA as we update the Uniform and Temporal models on this 50-50 mixture. For comparison, we also show the performance of Uniform and Temporal models retrained from scratch on a uniform distribution of data from 2010-19, as well as a model trained only on the 2019 slice of data, each for 300K steps. Among the retrained models we observe similar trends as we did in § 3.1. Among the adapted models, the Uniform model improves significantly on the 2019 slice, but this comes at the cost of degrading on the 2010-18 slices. The Temporal model, on the other hand, adapts quickly to 2019 without degrading on the 2010-18 evaluation. Its performance with 10K additional steps matches that of the Temporal model trained from scratch for 300K steps. Additional training improves performance of the adapted model further, likely because it is already trained for 300K steps on the 2010-18 data.

Figure 6: CustomNews (left) and TempLAMA (right) F1 score as models are adapted to a mix of 50-50% data from 2010-18 and 2019. Dotted lines indicate models retrained from scratch on either 2019 alone or equal proportions of all data from 2010-19. The Temporal model does not degrade on the 2010-18 slice while adapted.

4 Discussion & Limitations

Our experiments have shown that current models have practical limitations in their ability to memorize the past and reasonably estimate the future. Our interventions show the importance of both training on uniformly sampled data and providing the model with the date the text comes from. While our results show consistent advantages, they also represent a narrow understanding of time.

We have focused on closed-book question answering, but temporal staleness of language models may have impacts in other applications as well. For example, in open-book question answering, it is still necessary to align the question with relevant text in the retrieved passage, and this could be challenging when the question cannot be properly encoded by a stale LM: for example, the query "which countries were affected by the 2020 hurricane season?" would not match the passage "Iota caused damages of $564 million in Nicaragua" in an LM that did not have access to training data mentioning "Iota" as a hurricane.

Another limitation of our work is that TempLAMA is constructed in a synthetic manner from Wikidata. Incomplete or incorrect facts in the KB can result in incorrect queries in TempLAMA; for instance, we assume a missing start date implies the fact is valid from the beginning of our time period of interest. We partition the TempLAMA and CustomNews datasets on the same yearly slices despite the nature of the datasets being quite different. Moreover, we did not investigate using longer or shorter temporal partitions. Additionally, we did not test the ability to model temporal expressions such as "before" or "during", and we did not investigate temporal commonsense (e.g., Ben Zhou and Roth 2019), temporal ordering (e.g., Ning et al. 2020) or events (e.g., Zhou et al. 2020).

We note that time is only one example of information that contextualizes our understanding of language. Authorship, intent, publication source, location, and relevant history all condition the way language is used, and if parameterized properly, could improve model behavior. We hope this initial work will lead to further contextualization and common-sense analysis of language models.

Lastly, our studies have focused on medium- to large-scale LMs, with 110M to 11B parameters. In general, we find that larger model sizes are able to use our interventions effectively. Nonetheless, as model sizes expand beyond 100B (Brown et al., 2020), it is unclear how our studies will translate to models of that scale.
5 Related Work

Several studies have focused on degradation of models on test data from a different time period than their training data (Huang and Paul, 2018, 2019; Jaidka et al., 2018; Lukes and Søgaard, 2018; Florio et al., 2020). Delasalles et al. (2019) introduced an LSTM language model which conditions on dynamic author representations computed separately, and showed that it improves perplexity on both seen and unseen (future) time periods. Most recently, Röttger and Pierrehumbert (2021) analyzed the interplay between temporal adaptation during pretraining and finetuning, and concluded that while both stages benefit from adaptation separately, adaptation during pretraining does not help the downstream task. Here we show that the benefits of adaptation can be achieved using a single model that conditions on time. We further show that the benefits of adaptation come, at least in part, from better memorization of time-sensitive facts.

In production contexts, an important form of temporal generalization is the deployment of models trained on data up to a certain time T but applied on data after T: i.e., the present. Lazaridou et al. (2021) show that language models gradually degrade in performance under such a time-stratified setting, and propose dynamic evaluation (Krause et al., 2018) as a potential mitigation. However, LMs are frequently applied to past data as well, e.g. for extracting representations, and here we show that updating on only the new data degrades performance on old data. Our approach of conditioning on the temporal context alleviates this issue.

A related line of work has explored editing neural predictions after training given a dataset of revised input and output pairs (Sinitsin et al., 2020; Zhu et al., 2020; Cao et al., 2021). Here we introduce a different setting where we have access to new unlabeled text after model training, which must be used implicitly to update the factual predictions of the model. In this case the update procedure also needs to figure out which facts must be updated and which ones remain the same.

Petroni et al. (2019) introduced the LAMA benchmark for probing the factual knowledge memorized by LMs, which consists of fill-in-the-blank cloze queries about facts, e.g. "Dante was born in __X__". Follow up studies have introduced improved prompts for eliciting such knowledge (Jiang et al., 2020b) as well as multilingual versions (Jiang et al., 2020a; Kassner et al., 2021). However, all these benchmarks assume a static view of the knowledge inside an LM, and consider all answers across time to be correct for a given query. The TempLAMA dataset instead focuses on relations where the answers change with time, e.g. sports players and teams, or public offices, and uses temporal scopes to determine the correct answer.

TempLAMA is similar in spirit to KB-QA benchmarks which focus on temporal reasoning such as TempQuestions (Jia et al., 2018) and CronQuestions (Saxena et al., 2021). Its format, however, mimics the masked LM task typically used in pretraining, since it is intended as a zero/few-shot probe to measure temporally-scoped knowledge in pretrained models. Unlike those datasets, we further restrict the queries to subject and relation pairs for which multiple objects exist at different points in time, and ensure a balanced distribution over the entire time period of interest from 2010-2020.

6 Conclusion

Though temporally-scoped facts are common in practice, there has been little prior work exploring how these are encoded in pretrained language models. In this paper we show that T5 does poorly on such facts, and training on the news domain improves it significantly. However, simply training on more data is sub-optimal; conditioning on the temporal context of the text data improves memorization of facts further. To this end, we propose a time-aware language model which conditions on time using string prefixes. Other related benefits of time-aware LMs include a better calibration of expected changes in the future, and a cheaper adaptation to new slices of timestamped data. Finally, this paper opens up new research directions for modeling context in the language modeling space.

References
Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. 2020. Towards a human-like open-domain chatbot. CoRR, abs/2001.09977.

Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. "Going on a vacation" takes longer than "going for a walk": A study of temporal commonsense understanding. In EMNLP.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing factual knowledge in language models.

Edouard Delasalles, Sylvain Lamprier, and Ludovic Denoyer. 2019. Learning dynamic author representations with temporal language models. In 2019 IEEE International Conference on Data Mining (ICDM), pages 120–129.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Komal Florio, Valerio Basile, Marco Polignano, Pierpaolo Basile, and Viviana Patti. 2020. Time of your hate: The challenge of time in hate speech detection on social media. Applied Sciences, 10(12).

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929–3938. PMLR.

Spurthi Amba Hombaiah, Tao Chen, Mingyang Zhang, Mike Bendersky, and Marc Najork. 2021. Dynamic language models for continuously evolving content. In Knowledge Discovery and Data Mining (KDD).

Xiaolei Huang and Michael J. Paul. 2018. Examining temporality in document classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 694–699, Melbourne, Australia. Association for Computational Linguistics.

Xiaolei Huang and Michael J. Paul. 2019. Neural temporality adaptation for document classification: Diachronic word embeddings and domain adaptation models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4113–4123, Florence, Italy. Association for Computational Linguistics.

Kokil Jaidka, Niyati Chhaya, and Lyle Ungar. 2018. Diachronic degradation of language models: Insights from social media. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 195–200, Melbourne, Australia. Association for Computational Linguistics.

Zhen Jia, Abdalghani Abujabal, Rishiraj Saha Roy, Jannik Strötgen, and Gerhard Weikum. 2018. TempQuestions: A benchmark for temporal question answering. In Companion Proceedings of The Web Conference 2018, pages 1057–1062.

Zhengbao Jiang, Antonios Anastasopoulos, Jun Araki, Haibo Ding, and Graham Neubig. 2020a. X-FACTR: Multilingual factual knowledge retrieval from pretrained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5943–5959, Online. Association for Computational Linguistics.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020b. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.

Nora Kassner, Philipp Dufter, and Hinrich Schütze. 2021. Multilingual LAMA: Investigating knowledge in multilingual pretrained language models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3250–3258, Online. Association for Computational Linguistics.

Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2018. Dynamic evaluation of neural sequence models. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2766–2775. PMLR.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.

Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Sebastian Ruder, Dani Yogatama, et al. 2021. Pitfalls of static language modelling. arXiv preprint arXiv:2102.01951.

Konstantina Lazaridou, Alexander Löser, Maria Mestre, and Felix Naumann. 2020. Discovering biased news articles leveraging multiple human annotations. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1268–1277, Marseille, France. European Language Resources Association.

Nayeon Lee, Belinda Z. Li, Sinong Wang, Wen-tau Yih, Hao Ma, and Madian Khabsa. 2020. Language models as fact checkers? In Proceedings of the Third Workshop on Fact Extraction and VERification (FEVER), pages 36–41, Online. Association for Computational Linguistics.

Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel. 2021. Question and answer test-train overlap in open-domain question answering datasets. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1000–1008, Online. Association for Computational Linguistics.

Jialu Liu, Tianqi Liu, and Cong Yu. 2021. NewsEmbed: Modeling news through pre-trained document representations. arXiv preprint arXiv:2106.00590.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Jan Lukes and Anders Søgaard. 2018. Sentiment analysis under temporal shift. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 65–71, Brussels, Belgium. Association for Computational Linguistics.

Qiang Ning, Hao Wu, Rujun Han, Nanyun Peng, Matt Gardner, and Dan Roth. 2020. TORQUE: A reading comprehension dataset of temporal ordering questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1158–1172, Online. Association for Computational Linguistics.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

Michael Ringgaard, Rahul Gupta, and Fernando C. N. Pereira. 2017. SLING: A framework for frame semantic parsing. arXiv preprint arXiv:1710.07032.

Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5418–5426, Online. Association for Computational Linguistics.

Paul Röttger and Janet B. Pierrehumbert. 2021. Temporal adaptation of BERT and performance on downstream document classification: Insights from social media. arXiv preprint arXiv:2104.08116.

Apoorv Saxena, Soumen Chakrabarti, and Partha Talukdar. 2021. Question answering over temporal knowledge graphs. arXiv preprint arXiv:2106.01515.

Anton Sinitsin, Vsevolod Plokhotnyuk, Dmitry Pyrkin, Sergei Popov, and Artem Babenko. 2020. Editable neural networks. In International Conference on Learning Representations.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Commun. ACM, 57(10):78–85.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Ben Zhou, Kyle Richardson, Qiang Ning, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2020. Temporal reasoning on implicit events from distant supervision. arXiv preprint arXiv:2010.12753.

Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, and Sanjiv Kumar. 2020. Modifying memories in transformer models.
A TempLAMA Templates

Table 6 lists the 9 Wikidata relations used for constructing TempLAMA. Given a fact, we instantiate the template for the relation in the fact by replacing "<subject>" with the name of the subject entity, and "<object>" with "__X__". The answer to the query is the name of the corresponding object entity. We construct a separate query for each of the years that the fact is valid.

Table 6: Templates used for converting Wikidata facts into natural language queries.

WikiData ID | Relation | # Queries | Template
P54 | member of sports team | 9033 | <subject> plays for <object>.
P39 | position held | 7343 | <subject> holds the position of <object>.
P108 | employer | 9049 | <subject> works for <object>.
P102 | political party | 7324 | <subject> is a member of the <object>.
P286 | head coach | 4886 | <object> is the head coach of <subject>.
P69 | educated at | 1672 | <subject> attended <object>.
P488 | chairperson | 4190 | <object> is the chair of <subject>.
P6 | head of government | 4125 | <object> is the head of the government of <subject>.
P127 | owned by | 2688 | <subject> is owned by <object>.

B Future Relations

Table 7 shows the queries used as part of the Future Relations experiment in § 3.2. These queries were constructed by searching for lists of events[3], popular athletes[4] and issuing targeted queries to the Wikidata Query Service.[5]

[3] E.g. https://www.bizbash.com/bizbash-lists/top-100-events/top-list/21089088/top-100-events-in-the-united-states-2019
[4] E.g. https://www.sportscasting.com/the-top-10-list-of-most-popular-athletes-in-the-u-s-includes-some-surprising-names/
[5] E.g. https://w.wiki/3TN4
Table 7: The Future Relations dataset used to test model calibration over future years. The three groups contain queries whose answers, intuitively, change frequently or every year, rarely or once every few years, and never. The top section includes queries whose answer is a US city, while the bottom section includes queries whose answer is a country.

Cities — Frequent:
The Super Bowl will take place in _X_.
The NCAA Men's Final Four will take place in _X_.
The first game of the World Series will take place in _X_.
The US PGA Championship will take place in _X_.
The golf US Open will take place in _X_.
The NBA all-star game will take place in _X_.
The NFL Draft will take place in _X_.
The Netroots Nation conference will take place in _X_.
The MLB all-star game will take place in _X_.
The team from _X_ won the NBA championship.
The team from _X_ won the Stanley Cup.
The team from _X_ won the World Series.
The team from _X_ won the Super Bowl.
The golf US Women's Open will take place in _X_.
Wrestlemania will take place in _X_.
The Democratic National Convention will next take place in _X_.
The Republican National Convention will next take place in _X_.

Cities — Rare:
Visa Inc.'s headquarters are located in _X_.
SEGA of America's headquarters are located in _X_.
Barack Obama lives in _X_.
Hillary Clinton lives in _X_.
Donald Trump works in _X_.
The Chargers play their home games in _X_.
The Raiders play their home games in _X_.
The Rams play their home games in _X_.
General Electric's headquarters are located in _X_.
Toyota's US headquarters are located in _X_.
Nestle's headquarters are located in _X_.
Tesla's headquarters are located in _X_.
Lebron James plays in _X_.
Tom Brady plays in _X_.
Kevin Durant plays in _X_.
Stephen Curry plays in _X_.
Sidney Crosby plays in _X_.
Mike Trout plays in _X_.

Cities — Never:
South by Southwest will take place in _X_.
Lollapalooza will take place in _X_.
Summerfest will take place in _X_.
Outside Lands will take place in _X_.
Spoleto Festival USA will take place in _X_.
CMA Music Festival will take place in _X_.
Made in America Festival will take place in _X_.
The US Open Tennis Championships will take place in _X_.
The Masters tournament will take place in _X_.
The Kentucky Derby will take place in _X_.
The capital of Washington state is _X_.
The capital of California state is _X_.
The capital of Texas is _X_.
The capital of Florida is _X_.
The Space Needle is located in _X_.
The Statue of Liberty is located in _X_.
Golden Gate Bridge is located in _X_.
The White House is located in _X_.
The Liberty Bell is located in _X_.

Countries — Frequent:
The Six Nations Championship will be held in _X_.
The Association for Computational Linguistics will meet in _X_.
The Neural Information Processing Systems conference will be held in _X_.
The Palme d'Or winner is from _X_.
The Tour De France winner is from _X_.
The Wimbledon Men's Singles winner is from _X_.
The UEFA Champions League final will take place in _X_.
The G20 summit will be held in _X_.
The G7 summit will be held in _X_.
The United Nations Climate Change conference will take place in _X_.

Countries — Rare:
The UN Secretary general is from _X_.
The Pope hails from _X_.
The FIFA world cup was last held in _X_.
The Cricket world cup was last held in _X_.
The UEFA European Football Championship was last held in _X_.
The Olympics were last held in _X_.
The Winter Olympics were last held in _X_.
The FIFA world cup was last won by _X_.
The Cricket world cup was last won by _X_.
_X_ won the most gold medals in the last Olympics.

Countries — Never:
The Oxford Literary Festival will take place in _X_.
Wimbledon will take place in _X_.
Tomorrowland will take place in _X_.
Hajj will take place in _X_.
The Eiffel Tower is located in _X_.
The Taj Mahal is located in _X_.
Burj Khalifa is located in _X_.
Machu Picchu is located in _X_.
Stonehenge is located in _X_.
The world's largest country by land area is _X_.
The world's longest river is in _X_.
The world's tallest mountain is in _X_.