Relational Memory Augmented Language Models
Qi Liu [2,*], Dani Yogatama [1], and Phil Blunsom [1,2]
[1] DeepMind, [2] University of Oxford
{qi.liu,phil.blunsom}@cs.ox.ac.uk, dyogatama@deepmind.com
arXiv:2201.09680v1 [cs.CL] 24 Jan 2022
[*] Work completed during an internship at DeepMind.

Abstract

We present a memory-augmented approach to condition an autoregressive language model on a knowledge graph. We represent the graph as a collection of relation triples and retrieve relevant relations for a given context to improve text generation. Experiments on the WikiText-103, WMT19, and enwik8 English datasets demonstrate that our approach produces a better language model in terms of perplexity and bits per character. We also show that relational memory improves coherence, is complementary to token-based memory, and enables causal interventions. Our model provides a simple yet effective way to combine an autoregressive language model and a knowledge graph for more coherent and logical generation.

1 Introduction

A core function of language is to communicate propositions (e.g., who did what to whom). As such, language models need to be able to generate this information reliably and coherently. Existing language models (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020) do not have explicit representations for such information and rely on it being implicitly encoded in their parameters (Liu et al., 2019; Petroni et al., 2019; Wang et al., 2020). This encoding mechanism makes it difficult to interpret what the language models know and often leads to generating illogical and contradictory content. For example, Logan et al. (2019) observe that existing language models rely heavily on word correlation and fall short of logical reasoning. This causes the model to hallucinate, e.g. asserting that Barack Obama's wife is Hillary Clinton based on the high co-occurrence of the two entities. In another example, Lake and Murphy (2020) notice that GPT-2 (Radford et al., 2019) states that unicorns have four horns directly after stating that unicorns have one horn.

In this work, we explore ways to combine an autoregressive language model with a knowledge graph. We design a memory-augmented architecture that stores relations from a knowledge graph and investigate the effect of conditioning on this relational memory in an autoregressive language model. In contrast to existing token-based memory-augmented language models that store context-target pairs (Khandelwal et al., 2020b; Yogatama et al., 2021), our memory stores relation triples (head entity, relation, tail entity). Relation triples form the basis of knowledge bases, empowering a wide range of applications such as question answering (Yasunaga et al., 2021), machine reading (Yang and Mitchell, 2019), and reasoning (Minervini et al., 2020). From a cognitive science perspective, we can consider the neural language model to be an instance of System 1, which performs fast inference, and the symbolic relational memory as a world model that supports the slow and logical reasoning of System 2 (Kahneman, 2011). (This view is also advocated in parallel work by Nye et al. (2021), which presents a model for story generation and instruction following.) We hypothesise that relational memory can improve the performance and coherence of an autoregressive language model.

Given an observed context, we first run an entity tagger to identify entities in the context. We then use tf-idf (Ramos et al., 2003) to select salient entities. We retrieve relations (from a knowledge base) for the selected entities and design a gating function that allows the language model to adaptively combine information from the extracted relations and the observed textual context to predict the next token. Existing knowledge bases such as Freebase and Wikidata can be used as a source of information to retrieve relations from. However, they are often incomplete and do not contain relations that are suitable for the particular dataset that we want to work with. Instead of using these predefined knowledge bases, we choose to perform open information extraction (OpenIE) on each language modelling dataset to get relations. As a result, our model is able to move beyond simple co-occurrence statistics and generate text that is more grounded in real-world relations observed in a particular corpus.
Our main contributions are as follows:

• We evaluate the model on three English language modelling datasets. We show that our model outperforms a strong transformer-XL baseline (Dai et al., 2019) on both word-level (WikiText-103 and WMT19) and character-level (enwik8) language modelling in terms of perplexity and bits per character, respectively (§3.3).

• We conduct comprehensive ablation and design choice studies to understand the contributions of different components of our model (§4.1).

• We measure coherence with human evaluation and two automatic metrics (knowledge perplexity and knowledge F1) and demonstrate that relational memory improves coherence (§4.2).

• We study the relationship between our method and a typical memory-augmented language model which stores word tokens in its memory (Yogatama et al., 2021). We show that relational memory is complementary to token-based memory and that combining them improves performance further (§3.3).

• We perform qualitative analysis by examining gate values and retrieved relations. In line with our main motivation, we find that the relational memory is particularly useful for predicting entities. Further, we demonstrate that such explicit propositional representations allow causal interventions and increase the interpretability of language models (§4.3).

2 Model

An autoregressive language model defines the probability of a sequence of tokens p(x) = p(x_1, ..., x_T). It is common to factorise this joint probability as a product of conditional probabilities with the chain rule (Jelinek, 1980; Bengio et al., 2003):

p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_0, \ldots, x_{t-1}),    (1)

where x_0 is a special start token. Our language model is based on transformer-XL (§2.1), which is augmented with a relational memory (§2.2). We discuss them in detail below.

2.1 Transformer-XL

We use transformer-XL (Dai et al., 2019), which is based on the transformer (Vaswani et al., 2017), to parametrise the conditional probabilities in Eq. 1. The transformer stacks multiple self-attention layers to obtain contextualised representations.

Language modelling datasets usually consist of articles of different lengths. It is impractical to apply a transformer to encode long articles, as its computational complexity is quadratic in the sequence length. In practice, each article is usually truncated into fixed-length text segments {x_{t-N+1}, ..., x_t} of length N to train and evaluate the model. However, this approximation prevents the transformer from capturing long-term dependencies beyond text segments. Transformer-XL reuses hidden states from previous text segments to extend the context window.

More specifically, denote the hidden state of x_t at layer \ell as h^\ell_t. Given a text segment {x_{t-N+1}, ..., x_t} and its extended context {x_{t-N-M+1}, ..., x_{t-N}} of length M, both the hidden states of the text segment {h^\ell_{t-N+1}, ..., h^\ell_t} and the hidden states of the extended context {h^\ell_{t-N-M+1}, ..., h^\ell_{t-N}} are used. When performing self-attention, each token in the text segment can attend to the preceding tokens in the text segment and to all the tokens in the extended context, enabling longer-term dependency compared to a vanilla transformer. Importantly, transformer-XL does not backpropagate through the hidden states of the extended context during training (by adding stop-gradient operators to all the hidden states in the extended context).
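The segment-level recurrence can be summarised in a short sketch. This is a minimal illustration rather than the paper's implementation: the function name, shapes, and cache handling are our assumptions, and the attention computation itself is omitted. It only shows how cached states are appended as extra keys/values while being excluded from backpropagation.

```python
import jax
import jax.numpy as jnp

def xl_layer_inputs(segment_h, cached_h):
    """Prepare the attention inputs of one transformer-XL layer.

    segment_h: [N, d] hidden states of the current text segment.
    cached_h:  [M, d] hidden states reused from previous segments.
    """
    # The extended context is treated as a constant: no gradients flow
    # back into previous segments during training.
    cached_h = jax.lax.stop_gradient(cached_h)
    queries = segment_h                                           # queries come from the current segment only
    keys_values = jnp.concatenate([cached_h, segment_h], axis=0)  # [M + N, d]
    return queries, keys_values
```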
2.2 Relational Memory

In this section, we first introduce how we obtain relation triples using OpenIE (§2.2.1). We then use tf-idf to score entities in the observed context and retrieve relation triples related to these entities (§2.2.2) to construct the relational memory. Finally, we show an integrated architecture that allows transformer-XL to incorporate the relational memory for predicting the next token (§2.2.3). We show our architecture in Figure 1. The pseudocode for training or evaluating with the relational memory is given in Algorithm 1. In the pseudocode, we use TRAIN(xc, M) and EVAL(xc, M) to refer to training with the cross-entropy loss and evaluating (e.g. calculating perplexity) on the text segment xc conditioned on the relational memory M, respectively.

Figure 1: We identify salient entities in the previous text segment and extract relations to build our relational memory. We encode each relation with an LSTM encoder, aggregate the resulting representations into a vector, and use a gate mechanism that allows our language model to adaptively take advantage of relational information for predicting the next token. (The figure illustrates a previous segment about Obama, where entity scoring selects Obama, president, and United States, and relation retrieval returns triples such as (Obama, alma mater, Columbia University) and (Obama, bachelor of arts, 1983).)

Algorithm 1 Train/Eval w/ Relational Memory
 1: procedure TRAIN/EVAL SPLIT(S)
 2:   for each article A in S do
 3:     Initialise M to empty
 4:     for each text segment xc in A do
 5:       if S is train set then
 6:         TRAIN(xc, M)
 7:       else
 8:         EVAL(xc, M)
 9:         Run dynamic OpenIE on xc
10:       end if
11:       Perform relation retrieval with xc
12:       Update M with retrieved triples
13:     end for
14:   end for
15: end procedure
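A Python rendering of Algorithm 1 may make the control flow easier to follow. This is a schematic sketch only: the `steps` bundle and its callables (`train`, `eval`, `dynamic_openie`, `retrieve`, `update`) are hypothetical stand-ins for the components described in this section, not the released code.

```python
def run_split(split, articles, memory, steps):
    """Schematic of Algorithm 1: train or evaluate one split with a relational memory.

    `steps` bundles the callables described in this section:
    steps.train, steps.eval, steps.dynamic_openie, steps.retrieve, steps.update.
    """
    for article in articles:
        memory.clear()                              # M is re-initialised per article
        for segment in article:
            if split == "train":
                steps.train(segment, memory)        # cross-entropy loss conditioned on M
            else:
                steps.eval(segment, memory)         # e.g. accumulate perplexity
                steps.dynamic_openie(segment)       # extract triples from the now-seen segment
            triples = steps.retrieve(segment)       # tf-idf entity scoring + one-hop lookup
            steps.update(memory, triples)           # FIFO update; conditions the next segment
```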
2.2.1 Open Information Extraction

A key challenge of utilising relational information for language modelling is obtaining high-quality relation triples. There are several well-established knowledge bases, such as Freebase (Bollacker et al., 2007) and YAGO (Rebele et al., 2016). However, existing knowledge bases suffer from missing relations and often do not contain relation triples related to observed contexts in a target corpus, even though research on knowledge base completion has resulted in significant advances (Bordes et al., 2013; Trouillon et al., 2016; Zhang et al., 2019).

In this work, we use OpenIE (Angeli et al., 2015; Etzioni et al., 2008) to obtain relation triples. Since OpenIE directly extracts relation triples from each dataset D, it provides a structured way to represent knowledge in D. (We provide a comparison of using relations extracted from OpenIE and from Freebase in §4.1.) Specifically, we perform OpenIE on the training set of D. Given an entity e, we retrieve a set of relation triples Re = {r_1, ..., r_O}, where e is either the head entity or the tail entity in these relation triples. Conceptually, Re consists of all the relation triples from the one-hop subgraph centred at the entity e in the knowledge graph constructed from D. Therefore, Re can provide "global" information about the entity.
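The one-hop lookup R_e can be realised as a simple inverted index from entity strings to the triples they participate in. The sketch below is illustrative only; the real pipeline operates on OpenIE output and involves normalisation that is omitted here.

```python
from collections import defaultdict

def add_triples(index, triples):
    """Index each triple under both its head and its tail entity.

    The one-hop neighbourhood R_e of an entity e is then simply index[e].
    """
    for head, relation, tail in triples:
        index[head].append((head, relation, tail))
        index[tail].append((head, relation, tail))
    return index

# The index is built once from the training-set OpenIE output; the dynamic
# OpenIE mechanism described next can call add_triples again for newly seen
# segments of the evaluation set.
index = add_triples(defaultdict(list),
                    [("Obama", "alma mater", "Columbia University")])
```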
Dynamic OpenIE. Dynamic OpenIE takes advantage of the autoregressive nature of language modelling, where text segments are processed sequentially. In addition to extracting relations from the training set of D, we can also extract relations from previously seen text segments of our evaluation set. We refer to this extraction mechanism as dynamic OpenIE. After a text segment {x_{t-N+1}, ..., x_t} has been evaluated, e.g. after calculating perplexity on this text segment, we perform OpenIE on it to obtain new relation triples to be added to our knowledge graph. Note that we only perform OpenIE on previously seen text segments and do not use unseen text. We expect that the relation triples extracted from seen text segments are potentially useful for predicting the next tokens. This extraction mechanism does not violate the autoregressive nature of language modelling, and metrics such as perplexity and bits per character are calculated as usual. The idea of using seen text segments during evaluation to improve language modelling is related to dynamic evaluation (Krause et al., 2018, 2019). In dynamic evaluation, the model is adapted based on recent history during evaluation via gradient descent so that it can assign higher probabilities to re-occurring patterns. In contrast to dynamic evaluation, we do not update model parameters; we only extract new relations from seen text segments to enrich our corpus-specific knowledge graph.

Mismatch between training and evaluation. As shown in Algorithm 1, we do not use dynamic OpenIE during training due to its additional efficiency overhead (see the speed comparison in §4.1), which results in a mismatch between training and evaluation. We extract all the relation triples from the training set of each dataset D before training on D. As a result, during training we may retrieve relation triples extracted from as-yet-unseen text of the training set when performing relation retrieval (§2.2.2). We do not suffer from this issue during evaluation, as we extract relations only from previously seen text of our evaluation set. We believe this mismatch is minor given the superior performance of our model in the experiments.
2.2.2 Relation Retrieval

Given a knowledge graph (represented as a collection of triples), an ideal relational memory consists of a set of triples that are relevant to the observed context. There are many choices for measuring the relatedness between the observed context and the relation triples in our knowledge graph, e.g. keyword search or dense retrieval (Karpukhin et al., 2020; Guu et al., 2020; Yogatama et al., 2021). In this work, we use keyword search due to its simplicity and leave methods based on dense retrieval to future work. Specifically, given the observed context, we perform entity recognition (Ratinov and Roth, 2009; Nadeau and Sekine, 2007) on this context and score the tagged entities with tf-idf (Ramos et al., 2003). The top-K scored entities (K is set to 5 in our experiments) are used to retrieve relations {R_{e_1}, ..., R_{e_K}}. These retrieved relations are used to construct the relational memory M. Note that the entities are selected from the observed context, so that unseen text is not utilized. We limit the capacity of M to P triples. If the number of newly retrieved triples is larger than P, we randomly drop relations and only select P of them to be inserted into M. Otherwise, the relational memory operates on a first-in-first-out principle: when M is full, older retrieved relations are overwritten by newly retrieved relations. The relational memory is re-initialized to empty when an article ends.

As shown in Algorithm 1, since we update M only after processing an entire text segment, all the tokens in the same text segment are conditioned on the same relational memory. This approach is more efficient than updating M each time a new entity is encountered and is more amenable to batch training.
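A sketch of the retrieval step follows, under the same caveat that the exact scoring and bookkeeping are our assumptions rather than the released implementation (in particular, the tf-idf computation here is deliberately simplistic).

```python
import random
from collections import deque

def retrieve_relations(entities, context, idf, index, top_k=5):
    """Score tagged entities with tf-idf and gather their one-hop triples.

    entities: entity strings found in the observed context by an entity tagger.
    context:  the observed context as a string.
    idf:      mapping from a term to its inverse document frequency.
    index:    mapping from an entity to its (head, relation, tail) triples.
    """
    def tfidf(entity):
        return context.count(entity) * idf.get(entity, 0.0)

    top_entities = sorted(entities, key=tfidf, reverse=True)[:top_k]  # K = 5 in the paper
    triples = []
    for entity in top_entities:
        triples.extend(index.get(entity, []))
    return triples

def update_memory(memory, new_triples):
    """First-in-first-out relational memory with capacity memory.maxlen."""
    if len(new_triples) > memory.maxlen:
        new_triples = random.sample(new_triples, memory.maxlen)  # randomly keep P of them
    memory.extend(new_triples)  # a full deque silently evicts its oldest triples

memory = deque(maxlen=300)  # e.g. P = 300 for WikiText-103
```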
2.2.3 Integration with Transformer-XL

We now show how we integrate the relational memory with transformer-XL. We refer to our model as RelationLM.

Relation triple encoding. We first discuss how we encode relation triples in the relational memory M. We treat relation triples as text and serialise each relation triple into a sequence, e.g. (Barack Obama, president of, United States) is converted into the sequence "Barack Obama, president of, United States". This sequential representation can well capture the order of head entities and tail entities and is also adopted by KG-BERT (Yao et al., 2019) and Kepler (Wang et al., 2021b). Since each example in a batch corresponds to P retrieved relations, we obtain B · P relation sequences for each batch, where B and P denote the batch size and the relational memory length, respectively. With hundreds of relation triples per example, this prevents us from using large models (e.g. a multi-layer transformer) to encode these sequences due to memory constraints. In our preliminary experiments, we compare LSTM (Hochreiter and Schmidhuber, 1997), GRU (Cho et al., 2014), and a one-layer transformer and find that LSTM performs marginally better. Therefore, for each relation triple r_p, we reuse the transformer-XL word embedding matrix W_e to map each token in the sequence to its embedding vector. We then run the LSTM to encode the sequence and use the hidden representation of the last token as the relation representation r_p. There are other approaches to encoding relation triples, e.g. embedding-based (Bordes et al., 2013; Trouillon et al., 2016) and graph-based (Schlichtkrull et al., 2018; Zhang and Chen, 2018) methods. We leave a comparison of these approaches to future work.

Integration. Given a text segment x_c = {x_{t-N+1}, ..., x_t}, after L self-attention layers of transformer-XL we obtain contextualised representations {h^L_{t-N+1}, ..., h^L_t}. At each timestep t, we use the hidden representation h^L_t as the query vector to attend over the P encoded contents of M, i.e., {r_1, ..., r_P}. We use standard scaled dot-product attention (Vaswani et al., 2017) to aggregate all triples into a single vector:

m_t = \sum_{p=1}^{P} \frac{\exp(h^L_t \cdot r_p / \sqrt{d})}{\sum_{j=1}^{P} \exp(h^L_t \cdot r_j / \sqrt{d})} \, r_p,

where d denotes the hidden size of our transformer-XL. Finally, we combine m_t and the transformer-XL representation h^L_t via a gate:

g_t = \sigma(W_g [h^L_t, m_t])
z_t = g_t \odot h^L_t + (1 - g_t) \odot m_t
p(x_{t+1} \mid x_{\le t}) = \mathrm{softmax}(W_e z_t),

where \sigma is the sigmoid function, [,] denotes the concatenation of two vectors, \odot is element-wise multiplication, and W_e is the embedding matrix shared by both input and output embeddings (Inan et al., 2016). The only new parameters introduced by our method are the LSTM relation encoder and the gate matrix W_g. This gating mechanism allows our model to adaptively take advantage of relational information for predicting the next token.
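The aggregation and gating step can be written compactly. The sketch below uses jax.numpy (the paper states the models are implemented in JAX), but the function itself is ours and mirrors the equations above for a single position; batching and parameter management are omitted.

```python
import jax.numpy as jnp
from jax.nn import sigmoid, softmax

def integrate(h_t, relation_reps, W_g):
    """Fuse a transformer-XL state with the encoded relational memory.

    h_t:           [d]     top-layer hidden state at the current position.
    relation_reps: [P, d]  LSTM encodings r_1..r_P of the triples in memory.
    W_g:           [d, 2d] gate parameters.
    """
    d = h_t.shape[-1]
    # Scaled dot-product attention over the P relation representations.
    scores = relation_reps @ h_t / jnp.sqrt(d)        # [P]
    m_t = softmax(scores) @ relation_reps             # [d]
    # Gate between the contextual state and the relational summary.
    g_t = sigmoid(W_g @ jnp.concatenate([h_t, m_t]))  # [d]
    z_t = g_t * h_t + (1.0 - g_t) * m_t
    return z_t  # softmax(W_e @ z_t) then gives p(x_{t+1} | x_{<=t})
```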
3 Experiments

Our experiments seek to evaluate the effect of augmenting language models with a relational memory. We introduce the datasets used for evaluation (§3.1), discuss implementation details (§3.2), and present our main results (§3.3). We show ablation studies and further analysis of our model in §4.

3.1 Datasets and OpenIE

We use three English language modelling datasets: WikiText-103 (Merity et al., 2017), WMT19 (Barrault et al., 2019), and enwik8 (Hutter, 2012). Descriptive statistics of these datasets are shown in Table 1. WikiText-103 and WMT19 are (sub)word-level datasets, while enwik8 is a character-level dataset.

Table 1: Statistics of the datasets used in our experiments. For each subset, we show the number of (sub)words for WikiText-103 and WMT19, or the number of characters for enwik8.

Dataset    # Train  # Valid  # Test  # Articles  # Vocab   # Entities  # Relations  # Relations/Entity
WikiText   103M     0.2M     0.2M    28,595      267,735   980K        8.9M         9.03
WMT19      151M     0.3M     0.3M    169,180     50,259    976K        7.8M         7.97
enwik8     94M      5M       5M      12,350      256       361K        2.4M         6.66

WikiText-103 is a knowledge-driven dataset consisting of featured articles from English Wikipedia. WMT19 contains English news from the WMT19 workshop (http://www.statmt.org/wmt19/). The news data is segmented by month; we use the news from January to October for training, and the news from November and December for development and test, respectively. Compared to Wikipedia articles, news contains more dynamic and temporal information, exposing new challenges for utilising relational information. We reuse the vocabulary of GPT-2 (Radford et al., 2019) with 50,259 tokens to tokenise this dataset. enwik8 contains more than 100M bytes of Wikipedia text. Character-level language modelling has a much smaller vocabulary than (sub)word-level language modelling.

We perform OpenIE on each dataset. For enwik8, OpenIE is performed after detokenizing its text into words. Statistics of the extracted relations are also included in Table 1. Each entity in WikiText-103, WMT19, and enwik8 has 9.03, 7.97, and 6.66 relation triples on average, respectively.
3.2 Implementation Details

All models are implemented with JAX (Bradbury et al., 2018; https://github.com/google/jax) and Haiku (Hennigan et al., 2020; https://github.com/deepmind/dm-haiku). We set the hidden size to 512 and the number of layers to 16 for all models. In (sub)word-level language modelling, we use adaptive softmax (Grave et al., 2017) for efficiency. We use GELU (Hendrycks and Gimpel, 2016) as our activation function and Adam (Kingma and Ba, 2015) as the optimizer. For training, we use batch size 128 and train the models on 64 16GB TPUs. We apply 4,000 warmup steps, before utilizing cosine annealing to decay the learning rate. Dropout (Srivastava et al., 2014) is applied during training with a rate of 0.25.

We set the lengths of the text segment N, the extended context M, and the relational memory P to (512, 512, 300), (384, 384, 800), and (768, 1536, 400) for WikiText-103, WMT19, and enwik8, respectively. These values are determined by grid searches on the development sets.
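For concreteness, the warmup-plus-cosine schedule can be written as follows. The peak learning rate and total step count are placeholder values, not numbers reported in the paper; only the 4,000 warmup steps and the cosine decay are taken from the text.

```python
import math

def learning_rate(step, peak_lr=2.5e-4, warmup_steps=4000, total_steps=250_000):
    """Linear warmup for `warmup_steps`, then cosine annealing towards zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```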
3.3 Main Results

We compare with a strong transformer-XL baseline trained under the same settings as our model. Our main results are shown in Table 2.

Table 2: We use perplexity (↓) on WikiText-103 and WMT19 and bits per character (↓) on enwik8 for evaluation.

Dataset       Model               # Params  Dev   Test
WikiText-103  Transformer-XL      122M      19.0  19.9
WikiText-103  RelationLM          124M      18.5  19.2
WikiText-103  Spalm               122M      18.1  19.0
WikiText-103  Spalm + RelationLM  124M      17.7  18.6
WMT19         Transformer-XL      114M      21.7  21.5
WMT19         RelationLM          116M      21.0  20.7
WMT19         Spalm               114M      20.4  20.3
WMT19         Spalm + RelationLM  116M      19.8  19.6
enwik8        Transformer-XL      93M       1.05  1.03
enwik8        RelationLM          95M       1.04  1.02
enwik8        Spalm               93M       1.04  1.02
enwik8        Spalm + RelationLM  95M       1.03  1.01

We make three observations comparing transformer-XL and RelationLM. First, RelationLM consistently outperforms transformer-XL on all three datasets, demonstrating the effectiveness of relational memory. Note that a decrease of 0.01 is considerable on enwik8 under the bits-per-character metric. Second, relational memory not only improves language modelling on knowledge-driven articles (WikiText-103), but also generalises to the challenging news domain (WMT19), where information is more dynamic and temporal. Last, the results indicate that relational memory improves both (sub)word-level and character-level language modelling.

Complementarity to Spalm. Spalm (Yogatama et al., 2021) is a state-of-the-art memory-augmented language model. Instead of retrieving relation triples, it retrieves a set of related tokens at each timestep. Specifically, it first stores (context, next token) pairs from the training data. It then uses a pre-trained transformer language model to measure the similarities between the stored contexts and the observed context during training/evaluation. The next tokens of similar contexts are retrieved and integrated with the observed context via a gating mechanism for generation.

We investigate whether RelationLM is complementary to Spalm. Since Spalm also uses a gating mechanism for integrating the retrieved tokens, we first apply RelationLM to combine the transformer-XL output h^L_t with relational information to obtain z_t (as shown in §2.2.3), before using Spalm to integrate z_t with the retrieved tokens. The results are shown in Table 2. Spalm outperforms transformer-XL and even performs comparably to or better than RelationLM on the three datasets, demonstrating the effectiveness of retrieving related tokens. However, integrating RelationLM and Spalm further improves performance, indicating that these two models are not mutually exclusive. Therefore, retrieving relation triples brings complementary benefits to retrieving tokens.
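The stacking order used for the combined model is simple: the relational gate is applied first, and its output is what the token-retrieval gate then consumes. The sketch below only illustrates that ordering; the gate functions and the retrieved-token summary are schematic placeholders, and the exact form of Spalm's gate is not reproduced here.

```python
def combined_step(h_t, m_relations, m_tokens, relation_gate, token_gate):
    """Stack the two memories: relational gating first, token gating second."""
    g_r = relation_gate(h_t, m_relations)
    z_t = g_r * h_t + (1.0 - g_r) * m_relations   # RelationLM output (Section 2.2.3)
    g_s = token_gate(z_t, m_tokens)
    y_t = g_s * z_t + (1.0 - g_s) * m_tokens      # Spalm then integrates retrieved tokens
    return y_t                                    # fed to the shared output embedding
```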
4 Analysis

In this section, we study several design choices of the relational memory, including its knowledge source, input components, capacity, dynamic OpenIE, the entity scoring method used, and speed. We then present quantitative and qualitative analyses to better understand our model.

4.1 Ablations and Design Choice Studies

For these ablation studies, we use the development set of WikiText-103.

Source of relation triples. We compare relation triples extracted from Freebase with triples obtained using OpenIE. In the Freebase case, we use the Freebase API (https://developers.google.com/freebase) to obtain relation triples for each entity. For WikiText-103, there are 10.74 Freebase relations per entity on average, which is comparable to the OpenIE relations (9.03 relations per entity). The results are shown in Table 3. Although Freebase relations have been observed to improve performance on smaller datasets (e.g. WikiText-2; Logan et al., 2019) and particular domains (e.g. movies and actors; Ahn et al., 2016), we find that RelationLM with Freebase relations does not improve over transformer-XL on the much larger WikiText-103 dataset. We observe that a large portion of Freebase relations comes from the infoboxes of Wikipedia pages, which only cover information such as occupation, birth place, and religion. We believe these triples are too general to be useful for most contexts. The result of RelationLM with OpenIE shows the advantage of extracting relations from each dataset over using Freebase relations.

Table 3: RelationLM with OpenIE or Freebase triples.

Model                   Dev
Transformer-XL          19.0
RelationLM + Freebase   19.0
RelationLM + OpenIE     18.5

Ablating relation triples. We ablate the relation and/or the tail entity from a relation triple (head entity, relation, tail entity) to study the contribution brought by each component. The results are shown in Table 4. We find that ablating both the relation and the tail entity performs comparably to transformer-XL. As head entities are extracted from the observed context, we believe the extended memory of transformer-XL can offset the effect brought by conditioning on head entities. Ablating the relation performs better than transformer-XL, which shows the advantage of introducing tail entities. Using complete relation triples performs best, demonstrating the effectiveness of the triple representation of knowledge.

Table 4: Ablating the relation and/or tail entity from a relation triple.

Model                      Dev
Transformer-XL             19.0
Triple - Relation - Tail   19.0
Triple - Relation          18.7
Triple                     18.5
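The conditions in Table 4 amount to serialising progressively smaller pieces of each triple before the LSTM encoder. The helper below is a hypothetical illustration of those three conditions, not code from the paper.

```python
def serialise(triple, keep_relation=True, keep_tail=True):
    """Turn (head, relation, tail) into the text fed to the relation encoder."""
    head, relation, tail = triple
    parts = [head]
    if keep_relation:
        parts.append(relation)   # dropped in the "Triple - Relation" ablation
    if keep_tail:
        parts.append(tail)       # additionally dropped in "Triple - Relation - Tail"
    return ", ".join(parts)

# Full triple: "Obama, alma mater, Columbia University"
print(serialise(("Obama", "alma mater", "Columbia University")))
```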
Length of relational memory. We study how many relation triples need to be stored in the relational memory. As shown in Figure 2, perplexity improves with more relation triples. However, the curve becomes flat with more than 300 relation triples.

Figure 2: Perplexity on WikiText-103 with different numbers of relation triples (x-axis: relational memory length, 0 to 500; y-axis: perplexity, roughly 19.0 down to 18.5).

Length of transformer-XL memory. As increasing the length of the context window can capture longer dependencies, we study whether increasing the length of the extended (transformer-XL) memory removes the performance gap between RelationLM and transformer-XL. As shown in Figure 3, the performance of both RelationLM and transformer-XL improves with larger extended memory. However, RelationLM still outperforms transformer-XL even with an extended memory length of 3072. We conclude that relational memory brings complementary benefits to simply expanding the extended memory, since it provides global information about entities in each dataset.

Figure 3: Increasing the extended memory length (x-axis: memory length, 0 to 3072; y-axis: perplexity; RelationLM remains below transformer-XL throughout).

Dynamic OpenIE. All our main results use dynamic OpenIE. We show results without dynamic OpenIE in Table 5, with results on all three datasets for comparison. RelationLM with dynamic OpenIE performs comparably to RelationLM without dynamic OpenIE on WikiText-103 and enwik8, while larger improvements are obtained on WMT19. This indicates that dynamic OpenIE is more helpful for the news domain, which is more dynamic and temporal compared to knowledge-driven articles.

Table 5: Perplexity (bits per character for enwik8) with and without dynamic OpenIE.

Model                            WikiText  WMT19  enwik8
Transformer-XL                   19.0      21.7   1.05
RelationLM w/o dynamic OpenIE    18.6      21.4   1.04
RelationLM w/ dynamic OpenIE     18.5      21.0   1.04
Entity scoring. We study different entity scoring mechanisms for relation retrieval. We consider random selection (where entities extracted from the observed context are selected at random), frequency-based scoring, and tf-idf scoring. As shown in Table 6, tf-idf performs best.

Table 6: Perplexity with different entity scoring methods.

Model      Dev
Random     19.1
Frequency  18.7
tf-idf     18.5

Speed comparison. The wall-clock time for training and evaluation is shown in Table 7. RelationLM is 1.5 and 2.1 times slower during training and evaluation, respectively. Evaluation slows down further due to dynamic OpenIE, as shown in Algorithm 1.

Table 7: Wall-clock time in seconds per step. We use batch size 128 for training and batch size 1 for evaluation.

Model           Train  Eval
Transformer-XL  0.51   0.31
RelationLM      0.76   0.65

4.2 Does Relational Memory Improve Coherence?

For evaluating coherence, we use two automatic metrics (knowledge perplexity and knowledge F1) to investigate whether the models can faithfully use entities. We further perform a human evaluation to study whether the language models can generate coherent and knowledgeable sequences. We believe human evaluation is a reliable way of evaluating coherence; this view is also advocated in Barzilay and Lapata (2005). We note that question answering is also often used to evaluate coherence (Guu et al., 2020; Lin et al., 2021); we leave this to future work.

Knowledge perplexity. While vanilla perplexity considers all words in an evaluation set, knowledge perplexity only considers entities when calculating perplexity. We use it to evaluate whether the model can assign higher probabilities to the correct entities under different contexts. Table 8 shows the numbers of entity and non-entity words in our corpora.

Table 8: Statistics of entity and non-entity tokens.

Dataset   Subset  # Entity  # Non-Entity
WikiText  Dev     61.6K     155.9K
WikiText  Test    65.8K     179.7K
WMT19     Dev     84.9K     262.2K
WMT19     Test    81.0K     256.6K
enwik8    Dev     1.7M      3.3M
enwik8    Test    1.7M      3.3M

We show the results in Table 9. We observe that the gap between RelationLM and transformer-XL is larger on knowledge perplexity, while RelationLM only performs comparably to or slightly better than transformer-XL on non-entity perplexity. This shows that relational memory is helpful for predicting entity words. Note that knowledge perplexity tends to be much higher than perplexity on non-entity words, indicating the difficulty of predicting entity words. This collection of results indicates that relational memory helps the model use entities coherently and consistently under different contexts.
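Knowledge perplexity is ordinary perplexity restricted to entity positions. A minimal sketch of the computation, assuming per-token log-probabilities and a boolean entity mask have already been obtained from the model and an entity tagger:

```python
import math

def knowledge_perplexity(token_log_probs, is_entity):
    """Perplexity computed over entity tokens only.

    token_log_probs: natural-log probabilities the model assigned to each token.
    is_entity:       parallel booleans produced by an entity tagger.
    """
    entity_log_probs = [lp for lp, ent in zip(token_log_probs, is_entity) if ent]
    return math.exp(-sum(entity_log_probs) / len(entity_log_probs))
```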
Table 9: Knowledge perplexity (↓) and non-entity perplexity (↓).

Metric          Dataset   Model           Dev   Test
Knowledge PPX   WikiText  Transformer-XL  47.3  52.3
                WikiText  RelationLM      45.6  50.9
                WMT19     Transformer-XL  77.2  77.0
                WMT19     RelationLM      73.2  73.1
                enwik8    Transformer-XL  2.25  2.21
                enwik8    RelationLM      2.22  2.19
Non-entity PPX  WikiText  Transformer-XL  13.3  13.8
                WikiText  RelationLM      13.0  13.4
                WMT19     Transformer-XL  14.4  14.4
                WMT19     RelationLM      14.2  14.3
                enwik8    Transformer-XL  1.98  1.95
                enwik8    RelationLM      1.98  1.95

Knowledge F1. We use knowledge F1 to explore whether our model generates tokens that are grounded in its contexts. Given a context as input, we sequentially generate 32 words (or 128 characters) for word- (character-) level language modelling by sampling from the distribution over the next word (character). To reduce variance, we generate 100 continuations for each context. We then perform entity recognition on both the generated sequences and their corresponding ground-truth sequences and calculate an F1 score based on these two sets of entities. For example, given the context "...Ayola was nominated and shortlisted for the 'Female Performance in TV' award", we compare the generated text and the ground truth "in the 2006 Screen Nation Awards, for her role as Kyla Tyson in Holby City..." to calculate F1. The results are shown in Table 10. We notice that RelationLM performs better than transformer-XL. We conclude that models with relational memory can generate more coherent and logical text.

Table 10: Knowledge F1 (↑).

Dataset   Model           Dev   Test
WikiText  Transformer-XL  9.9   9.4
WikiText  RelationLM      11.4  11.2
WMT19     Transformer-XL  11.4  11.0
WMT19     RelationLM      12.6  12.3
enwik8    Transformer-XL  16.0  18.9
enwik8    RelationLM      16.6  19.4
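Knowledge F1 compares the entity sets of a generated continuation and its reference. A sketch of the per-continuation score, assuming a caller-supplied `extract_entities` stands in for the paper's entity recogniser (the paper averages this over 100 sampled continuations per context):

```python
def knowledge_f1(generated, reference, extract_entities):
    """F1 between the entity sets of a generated and a ground-truth continuation."""
    gen_entities = set(extract_entities(generated))
    ref_entities = set(extract_entities(reference))
    if not gen_entities or not ref_entities:
        return 0.0
    overlap = len(gen_entities & ref_entities)
    precision = overlap / len(gen_entities)
    recall = overlap / len(ref_entities)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```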
Human evaluation. We conduct a human evaluation to study whether language models can generate coherent and knowledgeable sequences. We take 1,000 contexts from the test set of WikiText-103. We show the contexts, the ground-truth sequences, and the continuations generated by RelationLM and transformer-XL to five annotators. We use greedy decoding for both models. We shuffle the order of the continuations generated by RelationLM and transformer-XL so that the annotators are unaware of the sources of the sequences. We then pose the following questions to the annotators:

1. Coherent. Given the context and its ground-truth continuation for reference, which generated sequence is more logical and coherent?

2. Knowledgeable. Given the context and its ground-truth continuation, which generated sequence provides more insights and is more knowledgeable?

We show the results in Table 11. We find that RelationLM outperforms transformer-XL in the human evaluation. These results are consistent with the two automatic metrics, knowledge perplexity and knowledge F1. This corroborates our claim that relational memory improves coherence in language modelling.

Table 11: The number of contexts for which a continuation from a particular model is chosen by the human evaluators under each criterion. The total number of contexts used for human evaluation is 1,000. Since we have five annotators, we use majority voting to decide the favoured model for each continuation. We use the Kappa statistic to measure inter-annotator agreement; the statistic is 0.64, which shows substantial agreement among the annotators.

Model           Coherent  Knowledgeable
Transformer-XL  388       416
RelationLM      612       584
4.3 Qualitative Analysis

Gate values. As we use a gating function to integrate transformer-XL with relational information, we study the gate values in this section. The histogram of gate values is shown in Figure 5. We notice that the histogram concentrates around 0.9. This is expected because non-entity words, which account for a large portion of the text (according to Table 8), benefit less from the relational memory and mainly rely on the observed context for prediction, as shown in §4.2. We further calculate the average gate values for entity words and non-entity words. The average gate value for entity words is 0.87, while the average for non-entity words is 0.92. This confirms that entity words rely more on relational information for prediction than non-entity words do. We also plot a heatmap of gate values; a cherry-picked example is shown in Figure 4. Note that we randomly select 100 of the 512 dimensions for readability. We notice that the entities Aberdeen and Alec Flett use more relational information than other positions (as shown by the horizontal blue lines). These results demonstrate that RelationLM can adaptively incorporate relational information for prediction.

Figure 4: Heatmap of gate values for the text segment "shipwreck occurred on December when the Aberdeen trawler Elinor Viking A278, skipper Alec Flett, foundered on the ...".

Figure 5: Histogram of gate values g_t (values range from 0.0 to 1.0 and concentrate around 0.9).

Example. We show three cherry-picked examples in Table 12. We take the first for illustration; it shows a text segment from the article Joe Biden 2008 presidential campaign (https://en.wikipedia.org/wiki/Joe_Biden_2008_presidential_campaign) and some retrieved relations. We find that the first two relations, (Joe Biden, senior Senator, Delaware) and (Joe Biden presidential campaign, began, January 7 2007), are extracted from previous text segments, while (Joe Biden, was nominated, vice president) and (Biden, withdrew nomination, 1987) are extracted from other articles, Joe Biden (https://en.wikipedia.org/wiki/Joe_Biden) and Joe Biden 1988 presidential campaign (https://en.wikipedia.org/wiki/Joe_Biden_1988_presidential_campaign), respectively. We notice that the relation (Joe Biden, was nominated, vice president) is highly predictive of the sequence "Biden was selected to be Democratic presidential nominee Barack Obama's vice presidential running mate". From the observed context, the model also identifies a closely related entity, Barack Obama, and retrieves the relation (Barack Obama, president of, United States). Therefore, we conclude that the relational memory can give a global picture of related entities and provide relevant information for language modelling.

Table 12: Three examples of text segments and relations retrieved based on previous text segments.

Text segment: "Seven months after conclusion of his campaign, Biden was selected to be Democratic presidential nominee Barack Obama's vice presidential running mate. The pair won in the general election, and were sworn in on January 20, 2009 ..."
Retrieved relations: (Joe Biden, senior Senator, Delaware); (Joe Biden presidential campaign, began, January 7 2007); (Joe Biden, was nominated, vice president); (Biden, withdrew nomination, 1987); (Barack Obama, president of, United States)

Text segment: "From 7 February 2006 to 9 December 2008, Ayola starred in BBC medical drama Holby City as nurse Kyla Tyson. She had previously appeared in Holby City's sister show Casualty ..."
Retrieved relations: (Holby City, is, BBC medical drama); (Rakie Ayola, played the role, Kyla Tyson)

Text segment: "Independiente became Arjona's fourth number one album on the Billboard Top Latin Albums where it debuted for the week ending 22 October 2011. For thirteen non-consecutive weeks it topped the Latin Pop Albums chart ..."
Retrieved relations: (Independiente, number one on, Top Latin Albums chart); (Independiente, became number one on, 22 October 2011)

Causal intervention. We use causal intervention to study whether changing the contents of the relational memory affects language model prediction. Given the relation (Obama, born in, Hawaii) along with other relations about Barack Obama, we let the model complete the sequence "Obama was born in". With greedy decoding, RelationLM outputs "Obama was born in and raised in Hawaii." However, after modifying the relation to (Obama, born in, Kenya), we obtain "Obama was born in Kenya and was the first African-American president." We further change it to (Obama, born in, Paris) and the model outputs "Obama was born in Paris, France." This indicates that RelationLM can take advantage of relation triples for making predictions. While we can also use prompts as interventions for vanilla language models, it remains challenging to select appropriate prompts across different applications (Liu et al., 2021a).
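The interventions above only require editing the relational memory before decoding; no model parameters change. A schematic of the procedure, where `model.greedy_decode` is a hypothetical interface rather than the actual codebase:

```python
def intervene(model, memory, old_triple, new_triple, prompt="Obama was born in"):
    """Swap one triple in the relational memory and re-run greedy decoding."""
    edited = [new_triple if t == old_triple else t for t in memory]
    return model.greedy_decode(prompt, relational_memory=edited)

# e.g. replace (Obama, born in, Hawaii) with (Obama, born in, Kenya) in the
# memory and compare the two continuations produced for the same prompt.
```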
5 Related Work

Knowledge-enhanced architectures. Injecting symbolic knowledge into machine learning models is widely adopted to improve performance in natural language understanding (Annervaz et al., 2018; Ostendorff et al., 2019), question answering (Zhang et al., 2018; Huang et al., 2019; Hixon et al., 2015), dialogue systems (Zhou et al., 2018; Moon et al., 2019; Guo et al., 2018; Liu et al., 2021b), and recommendation systems (Zhang et al., 2016; Wang et al., 2018a, 2019). Different from these models, we focus on using symbolic knowledge for language modelling. Existing language models are prone to generating illogical and contradictory content, and we believe that connecting language modelling and knowledge graphs is a promising direction for overcoming this problem. Next, we review previous knowledge-enhanced language models.

Knowledge-enhanced language models. Our model is closely related to previous work on grounding autoregressive language models with knowledge graphs (Ahn et al., 2016; Logan et al., 2019; Hayashi et al., 2020; Wang et al., 2021a). However, these models rely on complex and ad-hoc preprocessing or rules to link text with knowledge bases, e.g. Freebase and Wikidata. As a result, previous work is more aligned with conditional language modelling, e.g. graph-to-text generation p(x|G) in Wang et al. (2021a), which contrasts with the unconditional language modelling p(x) considered in this work. As the graph G is constructed from the unseen text x, predicting x given G is easier for Wang et al. (2021a) due to this information leakage. Also, in Hayashi et al. (2020), topic entities are required for language modelling, which may not be available in most datasets, e.g. the news domain. We do not compare with these previous models due to the different settings. In contrast, we adopt OpenIE relations and use a tf-idf search to retrieve relation triples for connecting language models and knowledge graphs. In the experiments, we demonstrate the effectiveness of our approach on three datasets: WikiText-103, WMT19, and enwik8.

There are also language models that incorporate entity information, such as entity coreference annotations (Ji et al., 2017; Clark et al., 2018), surface forms of entities (Kiddon et al., 2016; Yang et al., 2017; Cao et al., 2021), entity types (Parvez et al., 2018; Wang et al., 2018b), and entity descriptions (Bahdanau et al., 2017). Different from these models, we augment language models with a relational memory consisting of relation triples. We demonstrate the effectiveness of using relation triples by ablating tail entities and relations in §4.1.

Knowledge-enhanced pretraining. Using knowledge information for pretraining language models (Peters et al., 2019; Sun et al., 2019; Liu et al., 2020; Guu et al., 2020; Wang et al., 2021b; Agarwal et al., 2021; Verga et al., 2021) has recently grown in popularity and has achieved substantial improvements on knowledge-driven tasks such as question answering and named entity recognition. Instead of using knowledge information to improve downstream knowledge-driven tasks, we focus on using knowledge information to improve the generation capability of the language model itself.
tion for improving downstream knowledge-driven References tasks, we focus on using knowledge information Oshin Agarwal, Heming Ge, Siamak Shakeri, and for improving the generation capability of the Rami Al-Rfou. 2021. Knowledge graph based language model itself. synthetic corpus generation for knowledge- enhanced language model pre-training. In Pro- Retrieval-augmented models. Retrieval- ceedings of the 2021 Conference of the North augmented models are now widely adopted in American Chapter of the Association for Com- open-domain question answering (Chen et al., putational Linguistics: Human Language Tech- 2017; Lewis et al., 2020; de Masson d’Autume nologies, pages 3554–3565. et al., 2019; Izacard and Grave, 2021), dialogue (Dinan et al., 2019; Fan et al., 2021; Thulke et al., Sungjin Ahn, Heeyoul Choi, Tanel Pärna- 2021) and machine translation (Bapna and Firat, maa, and Yoshua Bengio. 2016. A neural 2019; Khandelwal et al., 2020a). We focus on knowledge language model. arXiv preprint retrieval augmentation for language modelling arXiv:1608.00318. (Merity et al., 2017; Grave et al., 2016; Khandel- Gabor Angeli, Melvin Jose Johnson Premkumar, wal et al., 2020b; Yogatama et al., 2021). These and Christopher D. Manning. 2015. Leverag- algorithms are specifically tailored for language ing linguistic structure for open domain infor- modelling, where related tokens are retrieved to mation extraction. In Proceedings of the 53rd help predict the next token. In this work, we Annual Meeting of the Association for Compu- move beyond token augmentation and show the tational Linguistics and the 7th International benefits of retrieving relation triples. We also Joint Conference on Natural Language Pro- demonstrate that our model is complementary to cessing of the Asian Federation of Natural Lan- a token augmentation model, S PALM (Yogatama guage Processing, ACL 2015, July 26-31, 2015, et al., 2021) in the experiments. Beijing, China, Volume 1: Long Papers, pages 344–354. The Association for Computer Lin- 6 Conclusion guistics. K. M. Annervaz, Somnath Basu Roy Chowdhury, We presented R ELATION LM, a language model and Ambedkar Dukkipati. 2018. Learning be- that is augmented with relational memory. We yond datasets: Knowledge graph augmented showed how to obtain relevant knowledge graphs neural networks for natural language process- for a given corpus and how to combine them ing. In Proceedings of the 2018 Conference of with a state-of-the-art language model such the North American Chapter of the Association as transformer-XL. We demonstrated that our for Computational Linguistics: Human Lan- model improves performance and coherence on guage Technologies, NAACL-HLT 2018, New WikiText-103, WMT19 and enwik8. We also per- Orleans, Louisiana, USA, June 1-6, 2018, Vol- formed a comprehensive analysis to better under- ume 1 (Long Papers), pages 313–322. Associa- stand how our model works. Our model provides a tion for Computational Linguistics. way to combine an autoregressive language model with general knowledge graphs. Dzmitry Bahdanau, Tom Bosc, Stanislaw Jas- trzebski, Edward Grefenstette, Pascal Vincent, and Yoshua Bengio. 2017. Learning to com- Acknowledgements pute word embeddings on the fly. CoRR, abs/1706.00286. We would like to thank our action editor (Xavier Carreras) and three anonymous reviewers for their Ankur Bapna and Orhan Firat. 2019. Non- insightful comments. We also thank Angeliki parametric adaptation for neural machine trans- Lazaridou, Cyprien de Masson d’Autume, Ling- lation. 
In Proceedings of the 2019 Conference peng Kong, Laura Rimell, Aida Nematzadeh, and of the North American Chapter of the Associ- the DeepMind language team for their helpful dis- ation for Computational Linguistics: Human cussions. Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1-61.

Regina Barzilay and Mirella Lapata. 2005. Modeling local coherence: An entity-based approach. In Proceedings of ACL 2005, pages 141-148.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155.

Kurt D. Bollacker, Robert P. Cook, and Patrick Tufts. 2007. Freebase: A shared database of structured general human knowledge. In Proceedings of AAAI 2007, pages 1962-1963.

Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26, pages 2787-2795.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. 2018. JAX: Composable transformations of Python+NumPy programs.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2021. Autoregressive entity retrieval. In ICLR 2021.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of ACL 2017, pages 1870-1879.

KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259.

Elizabeth Clark, Yangfeng Ji, and Noah A. Smith. 2018. Neural text generation in stories using entity representations as context. In Proceedings of NAACL-HLT 2018, pages 2250-2260.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of ACL 2019, pages 2978-2988.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171-4186.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of Wikipedia: Knowledge-powered conversational agents. In ICLR 2019.

Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S. Weld. 2008. Open information extraction from the web. Communications of the ACM, 51(12):68-74.

Angela Fan, Claire Gardent, Chloé Braud, and Antoine Bordes. 2021. Augmenting transformers with KNN-based composite memory for dialog. Transactions of the Association for Computational Linguistics, 9:82-99.

Édouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. 2017. Efficient softmax approximation for GPUs. In Proceedings of ICML 2017, pages 1302-1310.

Edouard Grave, Armand Joulin, and Nicolas Usunier. 2016. Improving neural language models with a continuous cache. CoRR, abs/1612.04426.

Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin. 2018. Dialog-to-Action: Conversational question answering over a large-scale knowledge base. In Advances in Neural Information Processing Systems 31, pages 2946-2955.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. CoRR, abs/2002.08909.

Hiroaki Hayashi, Zecong Hu, Chenyan Xiong, and Graham Neubig. 2020. Latent relation language models. In Proceedings of AAAI 2020, pages 7911-7918.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

Tom Hennigan, Trevor Cai, Tamara Norman, and Igor Babuschkin. 2020. Haiku: Sonnet for JAX.

Ben Hixon, Peter Clark, and Hannaneh Hajishirzi. 2015. Learning knowledge graphs for question answering through conversational dialog. In Proceedings of NAACL-HLT 2015, pages 851-861.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Xiao Huang, Jingyuan Zhang, Dingcheng Li, and Ping Li. 2019. Knowledge graph embedding based question answering. In Proceedings of WSDM 2019, pages 105-113.

Marcus Hutter. 2012. The human knowledge compression contest. http://prize.hutter1.net.

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying word vectors and word classifiers: A loss framework for language modeling. CoRR, abs/1611.01462.

Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of EACL 2021, pages 874-880.

Frederick Jelinek. 1980. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice.

Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi, and Noah A. Smith. 2017. Dynamic entity representations in neural language models. In Proceedings of EMNLP 2017, pages 1830-1839.

Daniel Kahneman. 2011. Thinking, Fast and Slow. Farrar, Straus and Giroux.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of EMNLP 2020, pages 6769-6781.

Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020a. Nearest neighbor machine translation. CoRR, abs/2010.00710.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020b. Generalization through memorization: Nearest neighbor language models. In ICLR 2020.

Chloé Kiddon, Luke Zettlemoyer, and Yejin Choi. 2016. Globally coherent text generation with neural checklist models. In Proceedings of EMNLP 2016, pages 329-339.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR 2015.

Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2018. Dynamic evaluation of neural sequence models. In Proceedings of ICML 2018, pages 2771-2780.

Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. 2019. Dynamic evaluation of transformer language models. CoRR, abs/1904.08378.

Brenden M. Lake and Gregory L. Murphy. 2020. Word meaning in minds and machines. CoRR, abs/2008.01766.

Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems 33.

Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. TruthfulQA: Measuring how models mimic human falsehoods. CoRR, abs/2109.07958.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of NAACL-HLT 2019, pages 1073-1094.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. CoRR, abs/2107.13586.

Qi Liu, Lei Yu, Laura Rimell, and Phil Blunsom. 2021b. Pretraining the noisy channel model for task-oriented dialogue. Transactions of the Association for Computational Linguistics.