Augmenting Transformers with KNN-Based Composite Memory for Dialog
Angela Fan, Facebook AI Research / LORIA, angelafan@fb.com
Claire Gardent, CNRS/LORIA, Université de Lorraine, claire.gardent@loria.fr
Chloé Braud, CNRS/IRIT, chloe.braud@irit.fr
Antoine Bordes, Facebook AI Research, abordes@fb.com

Transactions of the Association for Computational Linguistics, vol. 9, pp. 82–99, 2021. https://doi.org/10.1162/tacl_a_00356

Abstract

Various machine learning tasks can benefit from access to external information of different modalities, such as text and images. Recent work has focused on learning architectures with large memories capable of storing this knowledge. We propose augmenting generative Transformer neural networks with KNN-based Information Fetching (KIF) modules. Each KIF module learns a read operation to access fixed external knowledge. We apply these modules to generative dialog modeling, a challenging task where information must be flexibly retrieved and incorporated to maintain the topic and flow of conversation. We demonstrate the effectiveness of our approach by identifying relevant knowledge required for knowledgeable but engaging dialog from Wikipedia, images, and human-written dialog utterances, and show that leveraging this retrieved information improves model performance, measured by automatic and human evaluation.

1 Introduction

Machine learning approaches to various tasks, such as game-playing or dialog, are often dependent on external information. This information can take multimodal forms, including structured knowledge bases, free text, and images, and also comes in overwhelmingly large quantities. A pressing challenge is to create models that can identify which specific elements of multiple information sources are relevant in a particular context, and incorporate them into standard architectures on each task. In this work, we focus on human–machine dialog and how to efficiently retrieve external knowledge that is relevant to the dialog. We consider two scenarios and, for each scenario, retrieve two types of knowledge: (i) knowledge about similar dialog contexts and (ii) external knowledge used to ground the conversation in real-world information.

Knowledge about similar dialog contexts allows for a hybrid retrieval/generative approach to dialog where the system response is generated based not only on a representation of the current dialog context and of the relevant world knowledge, but also on a response retrieved from a similar dialog context. The retrieved knowledge can be viewed as providing information about the structure of dialog sentences, or utterances: which response is likely given a similar context?

External knowledge is also retrieved to improve the semantic content of the dialog model. In one scenario, Wizard of Wikipedia (Dinan et al., 2018), general topics are provided to crowdworkers, who are asked to have in-depth and specific conversations about these topics by referencing specific Wikipedia sentences as knowledge. In this scenario, external knowledge is retrieved from a pre-selected set of Wikipedia sentences associated with the current dialog topic. Retrieval aims to select the sentence that is most relevant at each step of the dialog and thereby to ground system responses in relevant world knowledge (e.g., by referring to Star Wars when talking about science fiction).

In the other scenario, Engaging ImageChat (Shuster et al., 2020), crowdworkers are provided with images and asked to have a conversation
inspired by or about the image. In this case, the retrieved external knowledge is images and their associated dialogs. By retrieving images that are similar to the image being talked about, we aim to enrich system responses with knowledge about what is typically mentioned when describing similar images (e.g., when talking about an image with dogs, mentioning their breed).

Our work on incorporating different types and modalities of knowledge is related to methods that strive to add external memory, such as knowledge bases, to neural networks. Previous work has explored incorporating large external memories into neural network layers (Weston et al., 2015; Sukhbaatar et al., 2015, 2019; Lample et al., 2019). Many existing approaches focus on using attention over the memory slots, which is computationally intensive and becomes less effective as the size of the memory grows. In this work, we propose representing multiple sources of external information as fixed encodings and using K Nearest Neighbors (KNN) search to fetch relevant information. KNN search is computationally efficient and scalable, and libraries like faiss (Johnson et al., 2019) allow KNN to be easily used on GPUs and integrated into neural networks. Further, the external memories are pre-encoded, so the information encoding is only computed once. As the external memories are kept fixed, they do not require any training to learn the memories along with the model. We can thus scale easily to larger memories by learning only the KNN-based read operation to identify relevant information from the memory.

Our core contribution proposes an efficient, KNN-based Information Fetching (KIF) module that can access relevant external knowledge, combine knowledge from different sources, and integrate this information into standard sequence to sequence architectures. We apply these flexible modules to two dialog datasets that challenge generative models to leverage external information to write coherent, on-topic responses. Both of our chosen tasks require models to leverage external information, such as information from Wikipedia or images, to engage in the conversation. We show that relevant information can be identified from hundreds of thousands of candidates in a multimodal, multi-knowledge-source setting to improve the performance of generative dialog models. Further, the output of the KIF modules is interpretable, as specific human-readable knowledge elements are selected, allowing users to better understand the information the generative model conditions upon when writing the subsequent utterance. On both datasets, we achieve state-of-the-art results compared to generative models and find there is no statistically significant difference in the interestingness or human preference of our model output compared to state-of-the-art retrieval models.

2 Related Work

We discuss related work on learning to incorporate external knowledge into neural networks and efficiently access relevant information. We then describe work in generative dialog that incorporates knowledge.

2.1 Incorporating External Knowledge

Augmenting neural networks with memory, or longer-term components that can be accessed with read and write operations, has been explored in various proposed architectures. For example, Memory Networks (Weston et al., 2015; Sukhbaatar et al., 2015, 2019) introduce attention mechanisms over large external memories. Neural cache models (Grave et al., 2017b) simplify these to access previous memories with a dot product. Previous work has also studied how to read and write into these memory architectures (Rae et al., 2016; Graves et al., 2014; Joulin and Mikolov, 2015). In contrast, we focus on how to read large memories.

Another line of research has focused on computational scalability for larger external memories to allow efficient access of information. For example, Chandar et al. (2016) propose a hierarchical memory network rather than a flat one, and Rae et al. (2016) learn sparse operations to read and write. Lample et al. (2019) focus on learning memories of up to one million slots and how to efficiently access the slots using product keys. Khandelwal et al. (2019) use nearest neighbor operations to augment language models by performing retrieval at the token level—in contrast, we focus on multimodal retrieval of multiple pieces of knowledge based on an entire dialog context. Beyond explicit memory representations, it may be possible to store information implicitly during training time by memorizing common patterns present in text (Petroni et al., 2019). We focus on learning
to fetch relevant information from multiple explicit external multimodal knowledge sources and integrate them into one network. Further, our work allows the retrieved information to be interpreted, as each memory slot is an explicit fact that can be read as text, rather than a learned vector such as in Lample et al. (2019).

Work has also focused on computationally efficient softmax operations (Mnih and Hinton, 2009; Grave et al., 2017a; Chen et al., 2016). Many approximate softmax techniques use KNN-like operations to form clusters, and the overall softmax operation is constrained by the slow calculation of the exponential. Our usage of KNN benefits from efficient and scalable libraries such as faiss and nmslib.

2.2 Generative Dialog

We develop a general architecture for incorporating external information and apply it to the case of generative dialog models. Previous work in dialog has leveraged knowledge as necessary information to accomplish the task. For example, airline and restaurant booking tasks often use API calls to access information about reservation times and availability (Bordes et al., 2017). In contrast, our work focuses on how to incorporate unstructured knowledge, such as free text found on the Web. Previous work has used architectures that attend over the available knowledge and identify relevant pieces of information, which scales poorly with large quantities of information (Dinan et al., 2018; Qin et al., 2019; Lian et al., 2019). We replace the use of attention over external information with the output of a KNN module. Other work has investigated incorporating information retrieval in language modeling and question answering (Chen et al., 2017; Fan et al., 2019; Seo et al., 2019; Guu et al., 2020), while we focus on dialog applications and flexibly incorporating knowledge from multiple, multimodal sources.

On the modeling side, work has explored both generative (Serban et al., 2016a, 2016b) and retrieval-based models (Zhang et al., 2018), which identify the best utterance from the training set to return as the dialog response. This often leverages self-attention or cross-attention mechanisms (Humeau et al., 2019). Further work has explored hybrid models, for example, using the output of a retrieval model as input for a generative model (Dinan et al., 2018; Weston et al., 2018; Cai et al., 2019; Zhu et al., 2020). Some of this work has specialized to use both types of models to generate conversations in an ensemble (Song et al., 2016) or to specifically improve consistency (Song et al., 2020). We extend these approaches by augmenting generative models with retrieval-like operations based on KNN search, allowing dialog models to flexibly incorporate various sources of external knowledge at the same time and scale to large quantities of retrieval candidates.

3 KNN-based Information Fetching Modules

Broadly, the KIF module assumes an encoder model M can access inputs X = {x_1, x_2, ..., x_n}. For example, X can be a collection of sentences, and x_i represents an individual sentence. In a setting without additional supporting information, the encoder will process an input x_i and produce the encoder output M(x_i). If x_i is a sequence such as a sentence, then M(x_i) is a representation of variable size: the sequence length by the fixed hidden size of encoder M. However, in many tasks, additional information is present, represented as E = {e_1, e_2, ..., e_m}. We encode each element of X and E into a vector representation using the encoder. To identify the closest information in E that is relevant to x_i, our general approach will be to use KNN by comparing the representation of x_i with the representation of each element in the set E. KNN is a fully differentiable operation (Plötz and Roth, 2018), so it can be incorporated in a straightforward way into neural models. The most relevant information in E will then be available in the model. We display a KIF-augmented model in Figure 1 and describe how the KIF module operates.

One challenge to overcome is that the representations of all elements of the knowledge source E are pre-computed and kept fixed, creating M(E)—we do not backpropagate to affect the embeddings of the pre-encoded knowledge. In the early stages of training, the model receives large amounts of loss, which would affect the quality of the pre-encoded embeddings if we backpropagated to them. Further, encoding the fixed external knowledge once and re-using it allows for greater scalability. However, this lack of backpropagation can introduce a mismatch
between the encoding of E and the encodings produced by a model that is training, as the training model has constantly changing representations because the weights are being learned. We use M to represent the original encoder model used to encode E and M′ to represent the constantly training model that is encoding X. The model must learn a function to align M′(x_i) to the pre-encoded elements of the external memory M(E).

Figure 1: KIF modules fetch relevant information from multimodal external knowledge. External knowledge sources E_1 and E_2 are pre-encoded by encoder M (green). In the model, input x_i is encoded by encoder M′ (blue) to produce M′(x_i). KIF modules (orange) operate on M′(x_i) and identify the nearest neighbors encoded in M(E_1) and M(E_2) using KNN. Identified relevant elements from E_1 and E_2 are re-encoded by M′ in a gating mechanism with a weighted sum (represented by σ(WS1_i) · WS1_i, where WS stands for weighted sum), then concatenated to M′(x_i). A full description with notation can be found in Section 3.

To circumvent this misalignment, we learn a mapping operator f_E(M′(x_i)) that trains to map elements of the model's representation of X, M′(X), into the additional information representation space M(E). Concretely, f_E(M′(x_i)) is a multilayer perceptron with ReLU nonlinearities. From the input elements of X, f_E(M′(x_i)) learns representations of an output close to the corresponding projection of X into E. This can be interpreted as learning a read operation on a fixed external memory. If there were no change to the encoding of the model compared to the pre-computed knowledge, then the ideal mapping operator would be the identity function (as M′ would equal M). However, as the model changes significantly during the training process, the nonlinear mapping capability of f_E(M′(x_i)) is essential to be able to identify the correct knowledge E from the input X.

Thus, a model augmented with KIF will incorporate external knowledge in the following manner. First, we find the k nearest elements to f_E(M′(x_i)) in M(E), based on KNN search with inner product. Then, the relevant elements identified by KNN are re-encoded by M′. For example, if element e_j is retrieved by KIF, it would produce M′(e_j). We use the optimized faiss library for KNN search, which can conduct billion-scale KNN efficiently on GPUs.

The KNN output for an element x_i is produced by using faiss to search for the k nearest representations to f_E(M′(x_i)) in M(E). Note that as the encoders M and M′ produce output representations of variable length (for example, in the case where x_i is a variable-length sequence, such as a sentence), we average across the length dimension to produce a fixed-size representation r to conduct the KNN search:

    r_{x_i} = Avg(f_E(M′(x_i)))                      (1)
    R_E = { Avg(M(e)) | e ∈ E }                      (2)
    KNN_{x_i} = KNearest(k, r_{x_i}, R_E)            (3)

Then, the KIF module output for an element x_i is the set of all re-encoded representations of the KNN-retrieved knowledge:

    KIF_{x_i} = { M′(e) | e ∈ KNN_{x_i} }            (4)

These elements are weighted by their normalized nearest neighbor scores and then summed. This is subsequently concatenated to M′(x_i) to form the final encoder output:

    [M′(x_i), WeightedSum(KIF_{x_i})]                (5)
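To make the read operation concrete, the following is a minimal sketch of a single KIF read in PyTorch with faiss. It is our illustration, not the released implementation: the class and helper names are hypothetical, the memory is assumed to be pre-encoded into length-averaged float32 vectors M(E), and re-scoring the retrieved neighbors inside the framework is one simple way to keep gradients flowing into the mapping operator f_E.

```python
# Minimal sketch of one KIF read (Equations 1-5), assuming PyTorch and faiss.
# All names are illustrative; `memory` holds length-averaged M(E) as float32.
import faiss
import torch
import torch.nn as nn

class KIFRead(nn.Module):
    def __init__(self, hidden, memory, k=5):
        super().__init__()
        # f_E: MLP with ReLU that maps the training space M'(X) into M(E)
        self.f_e = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.k = k
        self.register_buffer("memory", memory)   # [m, hidden], kept fixed
        self.index = faiss.IndexFlatIP(hidden)   # inner-product KNN (Eq. 3)
        self.index.add(memory.cpu().numpy())

    def forward(self, enc_out, reencode):
        # enc_out: [batch, seq, hidden] = M'(x_i); average over length (Eq. 1)
        query = self.f_e(enc_out).mean(dim=1)                   # [batch, hidden]
        _, ids = self.index.search(query.detach().cpu().numpy(), self.k)
        ids = torch.as_tensor(ids, device=enc_out.device)
        # `reencode` is a callback that re-runs the k fetched elements through
        # M' (Eq. 4) and pools them to [batch, k, hidden]
        fetched = reencode(ids)
        # re-score neighbors against the fixed memory so the softmax weights
        # stay differentiable with respect to f_E
        scores = torch.einsum("bh,bkh->bk", query, self.memory[ids])
        weights = torch.softmax(scores, dim=-1)                 # normalized scores
        return (weights.unsqueeze(-1) * fetched).sum(dim=1)     # WeightedSum(KIF)
```

The returned vector is then concatenated to M′(x_i) as in Equation (5).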
This can be easily extended to using multiple modules simultaneously. For instance, two sources of external information, E_1 and E_2, can be combined by identifying the top candidates of each information source. The weighted sum of the KIF output on each information source is concatenated with the encoded input M′(x_i). The KIF output dimensionality is the same as the hidden size of M′(x_i), so they can be directly concatenated.

Finally, different sources of information may not be required for every prediction, and some information sources can be more important than others. To allow the model to make more fine-grained decisions about what information to use from which source, and how much of it, we add a gating mechanism using a sigmoid function around each weighted sum of KNN representations. KIF1_i and KIF2_i denote the KIF module from Equation (4) applied to E_1 and E_2, respectively:

    WS1_i = WeightedSum(KIF1_i)                      (6)
    WS2_i = WeightedSum(KIF2_i)                      (7)

which produces the final encoder output, a concatenation of M′(x_i) with the output of multiple KIF modules:

    [M′(x_i), σ(WS1_i) · WS1_i, σ(WS2_i) · WS2_i]    (8)

This concatenation represents the output of the encoder M′ and can be used for various purposes, such as providing the encoder output to a decoder in a sequence to sequence model.
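As a sketch of how two module outputs could be combined under this gating, assume ws1 and ws2 are the weighted sums from Equations (6)–(7). The paper states that the KIF output matches the encoder hidden size so the pieces "can be directly concatenated"; appending the gated vectors along the sequence axis is our reading of that concatenation, chosen so the decoder can attend over the gated knowledge alongside the encoder states.

```python
# Illustrative gated combination of two KIF outputs (Equation 8);
# the concatenation axis is an assumption, not spelled out in the paper.
import torch

def combine(enc_out, ws1, ws2):
    # enc_out: [batch, seq, hidden]; ws1, ws2: [batch, hidden]
    g1 = (torch.sigmoid(ws1) * ws1).unsqueeze(1)   # sigma(WS1) * WS1
    g2 = (torch.sigmoid(ws2) * ws2).unsqueeze(1)   # sigma(WS2) * WS2
    return torch.cat([enc_out, g1, g2], dim=1)     # [M'(x_i), gated sums]
```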
4 Applying KIF to Dialog Tasks

We describe how to apply KIF to the task of generative dialog, a setting where models must generate engaging and on-topic responses. We investigate dialog for two reasons. First, dialog agents must be able to consult relevant information to maintain the topic of the conversation. Second, retrieval-based agents have strong performance compared to generative ones, due to their ability to copy dialog utterances from the training set. Using KIF, we can incorporate the benefits of retrieval architectures into generative, knowledge-based models.

4.1 KIF for Generative Dialog

In dialog, x_i represents the text of conversation i. A conversation consists of multiple back-and-forth utterances (or turns). For example, a conversation could consist of 4 turns: x_i = [x_{i,1}, x_{i,2}, x_{i,3}, x_{i,4}], where x_{i,4} is the direct utterance the model should respond to, and the earlier utterances are the conversation context.

Standard generative dialog models use a Transformer neural network as the encoder M and want to produce an output that is an appropriate response to the conversation. However, in many cases, the conversation history alone does not include all of the information required to produce an appropriate response. For example, if a model needs to chat about a specific movie, it can be helpful to provide the model with more information about that movie so a more interesting dialog response could be produced. To incorporate knowledge, models often concatenate a knowledge source E such as Wikipedia to x_i and use attention modules to identify the most relevant knowledge. However, this approach is computationally intensive when handling large quantities of information. Further, attention mechanisms have been found to operate poorly over long sequences, as the mechanism becomes blurry due to the softmax and struggles to make fine-grained decisions (Fan et al., 2018b). The same is true for hierarchical approaches, which lack scalability.

We augment Transformer sequence to sequence (seq2seq) networks on the encoder side with KIF to improve generative dialog models. We experiment on two dialog tasks, Wizard of Wikipedia (Dinan et al., 2018) and Engaging ImageChat (Shuster et al., 2020). In both datasets, models must leverage information external to the dialog history alone—in Wizard of Wikipedia, the chat requires access to knowledgeable facts, and in Engaging ImageChat, discussion about a specific image. As models must process multiple inputs and ground responses in the knowledgeable facts or images, these tasks challenge existing seq2seq approaches.
4.2 Wizard of Wikipedia

The goal of the Wizard of Wikipedia dataset is to train knowledgeable agents that can chat in any domain. The dataset contains 1,365 topics discussed in 18,430 dialogs in the training set, totalling 166,787 training utterances. Each topic is a general concept, such as dogs or ice cream, and is included as the first utterance of the conversation. The conversation is meant to be in-depth and detailed, so individual utterances must reference specific knowledge as a basis for the utterance. The knowledge takes the form of Wikipedia sentences. For example, the chat utterance "I love Toy Story! It was released in 1995" would reference the Wikipedia sentence "Toy Story is a 1995 American computer-animated buddy comedy [...]". For each utterance, a set of sentences was identified by an information retrieval system, and the crowdworker selected one knowledge sentence as the basis for their utterance.

Knowledge Sources. Our model for Wizard of Wikipedia has access to two sources of external information, E_1 and E_2:

• E_1 is Wikipedia Knowledge provided by the dataset as evidence to support knowledgeable chitchat (initially curated by the information retrieval system used in Dinan et al. [2018]). The scale of this KNN search is to filter through an average of 34 sentences. The KIF module uses dialog features to fetch relevant knowledge to condition upon when generating the subsequent utterance.

• E_2 is Training Utterances. To bring the benefits of retrieval-based dialog models to the generative setting, we use KIF to identify relevant utterances from the training set and take their responses as input. If many conversations about dogs have already occurred, models should be able to take advantage of these human-written examples to improve their generations. For example, likely conversations could occur about the breed of the dog, the daily routine with a pet, and similar topics. There are around 170K dialog utterances as inputs to the KNN search. This can be interpreted as incorporating the benefits of retrieval models by identifying an utterance with similar structure to the text the model would like to generate. We do not allow the module to fetch the correct response for the current conversation context.

Access to these two sources of knowledge can be seen as learning a template and a topic separately. Sample templates can be identified from the training utterances, and topic-specific information learned by accessing the Wikipedia knowledge.

Additional KNN Features. To better identify relevant training utterances from the large quantity available, we break down x_i into conversation sub-features for a more fine-grained match in the KNN search step. By conducting KNN on more features, we can achieve higher quality retrieval. We leverage the nature of dialog to decide these features. We concatenate the encoding of the most recent dialog utterance (e.g., x_{i,last}) with the encoding of the dialog context from the current conversation and the turn number t, such that [M′(x_{i,last}), M′(x_{i,−last}), t] is the representation used for KNN search. Concretely, if the model is trying to produce the 5th turn of the conversation, then x_{i,last} is the most recent utterance from the dialog partner, x_{i,−last} would be the last 3 turns of exchange, and t would be 4. Note that the turn number is represented as a standalone number. These are known to be salient conversation features: the most recent dialog utterance is the direct turn the model is responding to, and the dialog context may provide additional clues. The turn number is important, as earlier turns are often generic (e.g., how are you doing today) and later turns are more specific.
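As an illustration, the KNN query for a Wizard of Wikipedia conversation could be assembled as follows; the helper names and the use of a single pooled vector per text are our assumptions.

```python
# Hypothetical construction of the KNN query features described above:
# encoded last utterance, encoded earlier context, and the turn number.
import torch

def wow_knn_query(encode, turns):
    # turns: utterance strings, oldest first; encode(text) -> [hidden] vector
    last = encode(turns[-1])                # M'(x_last): the turn being answered
    context = encode(" ".join(turns[:-1]))  # M'(x_-last): earlier exchange
    t = torch.tensor([float(len(turns))])   # turn number as a standalone feature
    return torch.cat([last, context, t])    # query vector for the KNN search
```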
4.3 Engaging ImageChat

The goal of Engaging ImageChat is to create agents capable of chitchatting about images selected from the YFCC100M dataset (Thomee et al., 2016). The dataset contains 186,782 dialogs in the training set, each about a unique image, totalling 355,862 utterances. Agents are assigned one of 215 personalities (e.g., sweet, caring, excited) to increase engagingness. Previous work (Shuster et al., 2020, 2019) identified that both crowdworkers and models, when provided with personalities, produced more diverse, interesting responses, as evaluated by humans.

We use a multimodal neural network designed to handle both image input and text input. Following Shuster et al. (2020), the images are encoded using a pre-trained ResNeXt network (Xie et al., 2017). To extract the final image representation, we project the 2048-dimensional output of the image encoder to 512 dimensions using a deep multilayer perceptron with ReLU activation units. The conversation history, which includes the one-word personality, is encoded with a Transformer encoder network. The image and conversation are integrated using the Multimodal-Sum-Combiner module proposed in Shuster et al. (2020).

Knowledge Sources. Our model for Engaging ImageChat has access to two sources of external information, E_1 and E_2:

• E_1 is Chat on Similar Images. Although there are over 180K different images in this dataset, many of the images are similar. For example, conversations associated with two pictures of dogs could be relevant to each other. The model is able to use KIF directly on the current image features to fetch from around 180K different images and return 6 turns of related chat for each fetched image. Fetching from E_1 consists of identifying related image chats, or conversations on related topics.

• E_2 is Training Utterances. Similar to the motivation for the previous dataset, we allow the model to identify training utterances that could be useful for responding in the current conversation. The scale of this fetching task is large: 350K dialog utterances. This could be interpreted as identifying utterances with similar structure to what the model would like to generate, and is complementary to the topic-based related image chats.

Additional KNN Features. To identify relevant information from training utterances, we use the same dialog features as for Wizard of Wikipedia in the KNN search step, with one modification: we add the personality provided by the dataset. We represent the personality feature as the personality word, such as caring, and embed it with the encoder M′. As utterances from speakers with the same personality are more likely to be similar, this feature improves the quality of the fetched information. For example, conversations with the sweet personality often include similar text such as "aww, that's wonderful". We use two additional features for the KNN search: t, the turn number, and p, the personality. The personality feature is explicitly used in Shuster et al. (2020) to improve the engagingness and flow of the conversation. Similar to Wizard of Wikipedia, we represent the conversation turn t as a number. The Transformer model is used to encode the text x_i and produce a representation of the text; the turn number t and personality p are represented separately. As the personality is a word, we use the same Transformer to encode it. The concatenation of features used for KNN search is: [M′(x_{i,last}), M′(x_{i,−last}), t, p].

5 Experimental Setup

5.1 Implementation Details

Parameter Settings. We use parl.ai (Miller et al., 2017) to implement our models. The data for both datasets is available for download from parl.ai as well. We use byte-pair encoding (Sennrich et al., 2016) to represent the text to better handle the rare word problem (Dinan et al., 2018; Fan et al., 2018a). Our generative Transformer models have 8 encoder layers and 8 decoder layers, with FFN size 2048, embedding dimension 512, and 4 attention heads. We optimize using Adam (Kingma and Ba, 2015) and the inverse square root learning schedule (Vaswani et al., 2017) with 10k warmup updates. The initial learning rate is 0.0001 and we optimize for model perplexity. We use a dropout of 0.5 and set gradient clipping to 0.1. We set k = 5 for all cases. For both datasets, we model a vocabulary size of 54,944 based on the BPE vocabulary from the Reddit pre-training. We tuned the learning rate and batch size hyperparameters together.
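For concreteness, a common form of the inverse square root schedule with linear warmup, instantiated with the values reported above (initial rate 1e-4, 10k warmup updates), is sketched below; the exact parameterization used in the authors' setup may differ.

```python
# Sketch of an inverse square root learning rate schedule with linear warmup
# (Vaswani et al., 2017), using the values from Section 5.1 as an assumption.
def lr_at(step, base_lr=1e-4, warmup=10000):
    if step < warmup:
        return base_lr * step / warmup       # linear warmup to the base rate
    return base_lr * (warmup / step) ** 0.5  # decay proportional to 1/sqrt(step)
```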
Pre-training. We pre-train the Transformer seq2seq model used for both datasets on 250M comments from Reddit. The Reddit dataset was made available by pushshift.io. The comments are parsed to maintain conversational threads of users responding to each other, so the encoder network has been exposed to conversational context at training time. Note that the Reddit dataset does not include aspects such as personality, as those are unique to specific datasets such as Engaging ImageChat. The context size in pre-training is set to 512 tokens. The ResNeXt encoder used to model images for the Engaging ImageChat dataset was pre-trained on 3.5 billion images (Mahajan et al., 2018).

5.2 Evaluation

Generation. We generate with beam search, setting the beam size to 4. We use 3-gram blocking. This technique disallows repeated n-grams from being generated multiple times and reduces repetition.
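A minimal sketch of what such a blocking check can look like during beam search follows; this is our illustration, not the parl.ai implementation.

```python
# Illustrative 3-gram blocking: a candidate token is rejected if appending it
# would repeat an n-gram already present in the hypothesis.
def repeats_ngram(hypothesis, candidate, n=3):
    tokens = hypothesis + [candidate]
    if len(tokens) < n:
        return False
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n)}
    return tuple(tokens[-n:]) in seen
```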
Automatic Metrics. Following Dinan et al. (2018), we compute F1, a metric of unigram overlap, between the generated utterance and the human-written reference utterance from the dataset. For generative models, utterances are generated using beam search. For retrieval models, the next utterance is predicted by ranking the entire set of training utterances, and the highest scoring utterance is chosen.

In Wizard of Wikipedia, there are two test sets: a set of seen topics, that is, topics that have been seen at training time with new test-time dialogs, and a set of unseen topics, that is, topics that have not been encountered at all during training time. We evaluate on both subsets.
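A simple version of this unigram F1 is shown below, with lowercased whitespace tokenization as a simplifying assumption; the evaluation code may normalize text differently.

```python
# Unigram-overlap F1 between a generated utterance and the reference,
# in the spirit of the metric from Dinan et al. (2018).
from collections import Counter

def unigram_f1(generated, reference):
    gen, ref = generated.lower().split(), reference.lower().split()
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```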
Human Evaluation. We follow the setup and use the analysis questions proposed in the Acute-Eval dialog evaluation system (Li et al., 2019). For reproducibility, we adopt this existing evaluation setting, which has been applied to several dialog datasets. We use the question wording suggested by Acute-Eval and follow their self-chat procedure and interface. As one of the original datasets assessed in this system was Wizard of Wikipedia, their evaluation setting extends naturally to ours. We collect 100 human-bot conversational dialogs on a crowdsourcing platform for both datasets. The dialogs are eight turns long. Then, we show pairs of the collected conversations side by side, one conversation with a human and model A and the other conversation with a human and model B. We ask annotators the following questions:

• Who would you prefer to talk to for a long conversation?
• If you had to say one of the speakers is interesting and one is boring, who would you say is more interesting?
• Which speaker sounds more human?
• Which speaker has more coherent responses in the conversation?
• If you had to say that one speaker is more knowledgeable and one is more ignorant, who is more knowledgeable? (Wizard of Wikipedia only)

We measure the percentage of time one model was chosen over the other, taking the majority agreement between three evaluators. To reduce variance, dialogs paired in the evaluation were collected on the same topic for Wizard of Wikipedia and on the same image and personalities for Engaging ImageChat. Topics and images selected for evaluation are unique and taken randomly from the test set.

5.3 Baselines

We compare Transformers augmented with KIF to other existing approaches on Wizard of Wikipedia and Engaging ImageChat. The best approaches, judged by human evaluation, are retrieval models: the Retrieval Transformer Memory Network from Dinan et al. (2018) and the Retrieval Transformer from Shuster et al. (2020). These have been shown to be strong baselines compared with other retrieval techniques based on TF-IDF (Chen et al., 2017). Thus, we report the existing retrieval models for both datasets, but focus on comparing to other generative baselines.

We compare to three additional generative baselines. Note that in Wizard of Wikipedia, the dataset is constructed such that sentences of Wikipedia knowledge are provided with the utterances in a concatenated form. Models must identify the relevant information in this provided knowledge, or can access more Wikipedia knowledge beyond the provided sentences. The following baseline methods always have access to the information provided in the datasets already, but no additional Wikipedia knowledge beyond that.

• Transformer Memory Networks. To contrast the ability of KIF with existing work, we compare our models to published Transformer Memory Networks (Dinan et al., 2018). These models encode each piece of external information independently with a Transformer encoder, and these are stored as memory slots. To access information in the memory slots, a model performs dot-product attention between the memory slots and the dialog context. In Dinan et al. (2018), the knowledge selection from Wikipedia was supervised with either (a) a two-stage model where the first model was trained to predict the right knowledge and a second model conditions on the predicted knowledge to generate the next utterance, or (b) an end-to-end model with an auxiliary loss for knowledge prediction accuracy.
• Retrieve and Refine. We implement a hybrid model (Weston et al., 2018) that incorporates top retrieval candidates as additional input to the Generative Transformer MemNet. In Retrieve and Refine, a fixed number of candidates are retrieved and concatenated to the conversational history in the encoder, making the input much longer. For both datasets, the Retrieve and Refine mechanism that fetches a fixed number of training utterances is added to the Generative Transformer MemNet with Reddit Pre-Training baseline. Unlike the KIF-Augmented Transformer, the retrieval is conducted with a separate model, so there is no backpropagation to affect the retrieval. With KIF, models can alter the retrieved candidates by learning the mapping operator. Further, a fixed amount of information is always retrieved, without the capability to easily rescale to focus on specific candidates. KIF modules have weighting mechanisms to focus more on certain information, and the modules are combined with gating so models can learn which knowledge sources are more important and adjust flexibly. Lastly, Retrieve and Refine is only used to retrieve one source of information: training set utterances.

• Response Generation with MR. We implement the model proposed in Qin et al. (2019), which encodes the conversation history and document contextually with a biLSTM before generating the next dialog utterance. The initial model was applied to a machine reading task where a knowledge document was provided along with the conversation history. For Wizard of Wikipedia, we replace the knowledge document with the Wikipedia sentences provided in the dataset. The model then uses the conversation to identify the most relevant information in the document using a cross-attention mechanism. For the Engaging ImageChat dataset, as there is no document provided with the dataset, we replace the expected document with the conversation history, and use the most recent utterance in the conversation to attend to the conversation history. We make an additional improvement to this baseline: in Qin et al. (2019), the embeddings used pre-trained CoVE vectors (McCann et al., 2017). We found our Reddit pre-trained Transformer embeddings to work more effectively, as they are trained for dialog. Thus, we replace the CoVE embeddings with domain-specific ones.

All of the Transformer generative baselines are initialized with the same pre-training on Reddit that we use for our models, for fair comparison on modeling quality.

6 Results

We describe the results of incorporating KIF modules into Transformer networks. We display an example conversation between a human and our model in Figure 4, and show the top scoring Wikipedia knowledge and training utterances fetched by KIF modules. We compare to various baselines using automatic and human evaluation, and discuss our experiments. We present various ablation settings to understand the key features that make our method function.

6.1 KIF is Effective for Incorporating Knowledge

Automatic Evaluation. Comparing KIF-augmented Transformer networks to published baselines and Retrieve and Refine, we find improved results.

For Wizard of Wikipedia, the improvement in F1 score over the best baseline is around 8 points (see Table 1). A major contributing factor is the construction of the dataset—as each dialog turn is grounded in a specific knowledge sentence from Wikipedia, improving the ability to identify the relevant fact strongly improves performance. Contrasting the results from the seen and unseen test sets in Table 1, the improvement on unseen is worse—it is harder to fetch training utterances for unseen topics.

While ImageChat has no explicit dependency on knowledge, we still see a 2 point improvement compared to the Generative Transformer MemNet (with the additional Reddit pre-training), indicating that KIF can be generally useful (see Table 2). Compared to an even stronger baseline that we tune in this work, Retrieve and Refine, we see a 1 point improvement.
Table 1: Results on the Wizard of Wikipedia dataset. We implement the Retrieve and Refine and Response Generation with MR approaches, all with Reddit Pre-Training, and evaluate them on Wizard of Wikipedia. The Seen test set consists of conversations on topics seen at training time, and the Unseen test set consists of conversations about new topics that were not in the training set.

Model | Test F1 (Seen) | Test F1 (Unseen)
Retrieval Baselines
Retrieval Transformer MemNet (Dinan et al., 2018) | 15.4 | 12.4
Generative Baselines
2-Stage Generative MemNet (Dinan et al., 2018) | 18.9 | 17.4
Generative Transformer MemNet (Dinan et al., 2018) | 16.9 | 14.4
+ Reddit Pre-Training | 17.6 | 16.3
Retrieve and Refine (Weston et al., 2018) | 18.2 | 17.9
Response Generation with MR (Qin et al., 2019) | 17.5 | 16.8
KIF-Augmented Transformer | 25.9 | 22.3

Table 2: Results on the Engaging ImageChat dataset. We implement the Generative Transformer Memory Network, Retrieve and Refine, and Response Generation with MR approaches, all with Reddit Pre-Training, and evaluate them on Engaging ImageChat.

Model | Test F1
Retrieval Baselines
Retrieval Transformer (Shuster et al., 2020)* | 9.81
Generative Baselines
Generative Transformer MemNet (Dinan et al., 2018) | 7.1
+ Reddit Pre-Training | 12.8
Retrieve and Refine (Weston et al., 2018) | 13.6
Response Generation with MR (Qin et al., 2019) | 13.2
KIF-Augmented Transformer | 14.4

*In Shuster et al. (2020), retrieval Transformer models report Hits@N using a fixed candidate set of 99 distractor candidates and 1 true candidate. We compute F1 using their open-sourced model by scoring the entire training set of over 350K utterances with the model and taking the top scoring candidate as the response.

Human Evaluation. Results are shown in Figure 2. On both datasets, we find there is large improvement over existing generative models (green bars) that is statistically significant for some of the evaluation questions. Evaluators agree that KIF-augmented Transformers are generally more coherent and human-sounding compared to the Generative MemNet.

Comparison with existing retrieval models (shown in blue) is more nuanced. Along the lines of existing work (Zhang et al., 2018; Dinan et al., 2018), we find that retrieval-based models score very well in human evaluations that ask how human or interesting a dialog sounds. This is because retrieval models return human-written utterances from the training set and do not suffer from decoding mistakes present in generative models. For example, on Engaging ImageChat, while our model has significantly improved over the generative baseline (see green bars in Figure 2, right), it does not beat retrieval-based methods in sounding more human or being more interesting (see blue bars in Figure 2, right). As the retrieval baseline returns human-written text for other humans to evaluate, we hypothesize that humans score each other's writing quite well. Compared with generative models, which we focus on improving, retrieval models often produce longer text with more interesting, nuanced vocabulary usage, and do not make generation mistakes such as repetition. These factors often lead to the stronger performance of retrieval models.
Figure 2: Human Evaluation Results on Both Datasets. More than 50% indicates the KNN Model is preferred. Stars indicate statistical significance at p < 0.05.

A surprising result is that KIF-augmented Transformers are more human-sounding than retrieval models on Wizard of Wikipedia. This is because the dataset's utterances are long and factual due to the tendency of crowdworkers to copy Wikipedia. Sometimes humans chatting with the retrieval bot would respond "uh... that's an interesting fact?" Otherwise, our model scores similarly to retrieval models, with most evaluations not having a statistically significant difference.

We conduct a second evaluation on the Unseen Test Set of the Wizard of Wikipedia dataset. Results are shown in Figure 3. Trends are similar compared to the results on the Seen Test Set, though the preference for the KIF-augmented Transformer over the retrieval baseline is greater. We hypothesize that because the Unseen Test Set is on entirely held-out topics, the retrieval baseline can struggle to identify relevant utterances. In contrast, the KIF-augmented Transformer, similar to the generative baseline from Dinan et al. (2018), can use its generative capability to produce utterances.

Figure 3: Human Evaluation on the Unseen Test Set of Wizard of Wikipedia. More than 50% indicates the KNN Model is preferred. Stars indicate statistical significance at p < 0.05.

Lastly, we conduct an additional study to examine the variance of the comparative dialog judgements. The evaluation study for Wizard of Wikipedia is repeated three times on different days, and evaluators who have answered on previous days are not allowed to evaluate again in any subsequent experiments. Overall, we find reasonable interannotator agreement rates, around 73% averaged across all evaluations, which is similar to the agreement rates reported in Li et al. (2019). We find there is greater variance on questions asking which dialog is more human and more interesting, most likely as different evaluators can interpret these in different ways. Further, we see that comparison with the Retrieval model has less variance compared to the Generative model, possibly because the Retrieval model's human-written text is devoid of mistakes. Overall, we find that the conclusions (and statistical significance) are stable across multiple evaluations.

6.2 Analysis of Fetched Knowledge

Example conversations from our KIF-augmented generative model are shown in Figure 4 on Wizard of Wikipedia. We find that relevant knowledge is identified that affects the content of the generated utterance. For example, the model finds knowledge sentences about Disney movies as the human conversationalist starts the conversation discussing Disney. The model leverages the fetched knowledge to write the content of the generated utterance. In a concrete example, the fetched sentence "disney announced intentions [...] after the success of the incredibles" leads the model to generate the utterance "i love the incredibles, they are my favorite disney movie".

In contrast, the model often uses the form of the fetched training utterance as a template for writing a response. For example, the model copies the training utterance "Ohhh... what do people with color blindness do to cope with the effects?" and starts its generation with "Ohhh..." and continues with the question "i think toy story is a classic?", following the form of the selected training utterance.
Figure 4: Conversation between Human and KIF-Augmented Transformer on Wizard of Wikipedia. The top-scoring Wikipedia knowledge and training utterances fetched by KIF are displayed with model output.

Figure 5 displays the top-3 fetched training set utterances and knowledge sentences on the Wizard of Wikipedia dataset when responding to a human utterance. KIF modules can identify multiple relevant items. In response to the human question about "blue skies the 1946 movie", the model identifies both the comedy film and the band.

Finally, the elements retrieved by KIF modules provide a more interpretable understanding of what the model is conditioning upon to generate a dialog response. In Table 3, we display, for the same dialog history, the effect of changing the model's fetched training utterance and knowledge sentence to our own examples. The model heavily incorporates our manual changes of the fetched information into the generated utterance. For example, changing the knowledge directly affects what the model generates as the favorite character—from buzz lightyear to mr potato head to slinky dog—while changing the fetched training utterance changes the form of the generated sentence.

6.3 Scaling KIF to Challenging Retrieval Settings

KIF modules can be used in more realistic and challenging settings for knowledge retrieval that test the scalability of the module. In Figure 6(a), we compare the Generative Transformer MemNet Baseline with KIF-Augmented Transformers in three settings. The first is the standard Wikipedia sentences provided by the dataset (on average, 34 sentences). Then, we extend to providing the model with the full Wikipedia article (on average, 57 sentences) and finally to multiple Wikipedia articles (on average, totaling 205 sentences), identified using the conversation's topic. This increasing size of available knowledge could be realistic for settings where it is unclear what information is most relevant, where filtering steps to preprocess the data remove potentially relevant information, or where information synthesis from multiple knowledge sources is necessary to produce a high-quality generation. As the Wikipedia knowledge becomes more difficult to identify, performance decreases, but still outperforms the baseline that uses the dataset-provided set of 34 sentences.

Comparing the scaling capability of KIF to the standard Generative Transformer MemNet Baseline highlights the advantage of using KNN.
Figure 5: Examples of Top-3 Fetched Training Utterances and Fetched Knowledge when responding to a human chat from the dataset using a trained Wizard of Wikipedia model. Examples are taken from validation.

The attention-based mechanism used in Dinan et al. (2018) struggles to identify salient information when given increasingly larger quantities of knowledge, unlike the KNN information fetch. We hypothesize the attention mechanism is challenged by softmax-ing over a larger quantity of inputs, as it can be difficult to make sharp distinctions.

6.4 Ablations

Importance of Multiple Knowledge Sources. One benefit of the KIF module approach is that several modules can be combined, each capturing information from a different source. In both settings, Wizard of Wikipedia and Engaging ImageChat, two modules were used to incorporate multiple forms of knowledge—training utterances to capture the capability of a retrieval-based model, and knowledge from Wikipedia or related chats based on image features. We perform here an ablation study to evaluate the impact of using only one source of information. As can be seen in Table 4, performance decreases when only one source of information is used.

For Engaging ImageChat, this study also underlines the importance of being able to fetch in a multimodal fashion. The general form of the KIF module—requiring only a feature vector to find nearest neighbors from—allows fetching on multiple modalities such as text and images. In Table 4, using the image-based KIF to fetch text from Related Images is important to reach the strongest performance (compare Training Utterances Only, which uses text-based KIF, with using both Training Utterances and Related Images).

Using Dialog Features for KNN Performance. The quality of the KNN search is critical to the performance of KIF modules. As the external knowledge is kept fixed, KIF must be able to align the dialog context with the knowledge to identify relevant pieces of information. In Table 5, we show that matching on more features can improve the quality of the retrieved information. Using only the encoding of the immediate previous utterance can improve results on Wizard of Wikipedia by 7 F1 points, but this is further improved by also leveraging the encoding of the context (+1.8 F1) and using the dialog turn number (+1 F1). These features are available in the datasets, and we leverage them to improve the relatedness of retrieved knowledge.
Table 3: Effect of Fetched Information on Generated Utterances. The top section provides examples for a fixed training utterance, changing the knowledge—the generated text maintains the construction of the training utterance but changes the favorite character to match the knowledge. The bottom section provides examples for fixed knowledge but changing the training utterance—the generated text modifies its form to match the training utterance, but the favorite character information remains consistent.

Knowledge | Training Utterance | Generation
buzz lightyear's name is in honor of astronaut edwin 'buzz' aldrin | my favorite character in that book series is hermione granger | cool! my favorite character in that movie is buzz lightyear
mr potato head is based on the real-life mr. potato head toy | my favorite character in that book series is hermione granger | my favorite character in that movie is real-life mr potato head
slinky dog is a toy dachshund with a metal slinky for a body | my favorite character in that book series is hermione granger | cool! my favorite character is the slinky dog
slinky dog is a toy dachshund with a metal slinky for a body | i really like the character hermione granger | cool! i really like slinky dog
slinky dog is a toy dachshund with a metal slinky for a body | my favorite character of all time has to be hermione granger | i love that movie, my favorite character has to be slinky dog the dachshund
slinky dog is a toy dachshund with a metal slinky for a body | i agree with you! that's my favorite character as well | i think so too! my favorite is slinky

Figure 6: Ablations on Wizard of Wikipedia. (a) KIF can scale to hundreds of relevant sentences (blue) while the baseline model, the Generative Transformer MemNet (gray), scales poorly. (b) Gating can remove irrelevant information. In the 3 Sources case, one source of external information is unrelated. (c) Performance as k varies.

Multi-Hop Retrieval with KIF. Work in memory networks (Weston et al., 2015; Sukhbaatar et al., 2015) utilized multi-hop mechanisms. Such capacity could be useful when multiple sources are necessary or information is incrementally fetched. To emulate multi-hop memory mechanisms, we use KIF to retrieve relevant information for N = 2 or N = 3 fixed hops. As the number of hops is fixed, the multi-hop operation remains differentiable. We do not allow the model to retrieve the same information in a second hop.

We experimented in two settings. First, the same KIF module is used multiple times to fetch different information, and then all of the fetched knowledge is concatenated. Results are shown in Table 6 (top).
Second, we examine spreading the fetches into different KIF modules at various encoder depths. This could be interpreted as the model learning to access more information at each layer. As the model progresses deeper, more abstract and high-level representations are built, which could allow different knowledge to be retrieved. Results are shown in Table 6 (bottom).

In both multi-hop settings, no improvement in performance on the Wizard of Wikipedia dataset is observed. We hypothesize that this can be partially attributed to the construction of the dataset—humans explicitly based their written dialog utterance on one knowledge sentence. Further, it is possible that concatenation brings together too much information for the model to incorporate, and thus adding additional fetches makes the retrieval more noisy.

Table 4: Using Multiple KIF Modules on Multiple Sources is important for improved performance.

Model | Test F1
Wizard of Wikipedia
Training Utterances Only | 18.1
Wiki Knowledge Only | 23.9
Training Utterances and Wiki Knowledge | 25.9
Engaging ImageChat
Training Utterances Only | 13.9
Related Images Only | 13.8
Training Utterances and Related Images | 14.4

Table 5: Important Features for KNN Search using KIF. Salient conversation features improve performance on both datasets.

Model | Valid F1
Wizard of Wikipedia
Previous Utterance Only | 24.6
+ Dialog Context | 26.4
+ Turn Embedding | 27.4
Engaging ImageChat
Previous Utterance Only | 13.3
+ Dialog Context | 14.5
+ Turn Embedding + Personality | 15.1

Table 6: Multi-hop with KIF to retrieve information with multiple fetch steps.

Model | Valid F1
KIF-Augmented Transformer | 27.4
One KIF Module fetches multiple times
2 Fetches | 26.9
3 Fetches | 26.0
Multiple KIF Modules fetch once each
2 Fetches | 26.5
3 Fetches | 25.9

Effect of Gating. We analyze the effect of the gating mechanism by evaluating the capability of the gate to identify and focus on salient information. On Wizard of Wikipedia, we concatenate a third source of information: dialog turns from a completely different corpus called PersonaChat (Zhang et al., 2018). This dataset looks quite different—short utterances without factual knowledge—and should be easy for the model to identify as distinct from Wizard of Wikipedia. As shown in Figure 6(b), if KIF on PersonaChat is included without gating, it has a harmful effect, as the model includes irrelevant information. When equipped with gating, the model learns to use the gate to ignore some inputs, and can recover almost the full performance of a model without this irrelevant information source.

Size of K in KNN. Figure 6(c) shows the performance on Wizard of Wikipedia when varying the amount of knowledge. Being able to access multiple relevant pieces of information is helpful, but too much information can be harmful. This is likely because the weighted sum becomes blurry if too many sentences are incorporated.

7 Conclusion

We present a KNN-based Information Fetching module that learns to identify relevant information from external knowledge sources by learning a mapping-based read operation. KIF modules benefit from the scalability and efficiency of KNN search, enabling computation with large external memories. We show in the context of two dialog datasets that relevant knowledge can be identified and incorporated to create more engaging, high-quality dialog.

Acknowledgments

We thank the reviewers and action editor for their comments and insightful discussion. We thank Emily Dinan and Kurt Shuster for providing assistance to reproduce their original works.
References

Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Deng Cai, Yan Wang, Wei Bi, Zhaopeng Tu, Xiaojiang Liu, and Shuming Shi. 2019. Retrieval-guided dialogue response generation via a matching-to-generation framework. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1866–1875. DOI: https://doi.org/10.18653/v1/D19-1195

Sarath Chandar, Sungjin Ahn, Hugo Larochelle, Pascal Vincent, Gerald Tesauro, and Yoshua Bengio. 2016. Hierarchical memory networks. CoRR, abs/1605.07427.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879. DOI: https://doi.org/10.18653/v1/P17-1171, PMCID: PMC5579958

Wenlin Chen, David Grangier, and Michael Auli. 2016. Strategies for training large vocabulary neural language models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1975–1985. DOI: https://doi.org/10.18653/v1/P16-1186

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.

Angela Fan, Claire Gardent, Chloé Braud, and Antoine Bordes. 2019. Using local knowledge graph construction to scale seq2seq models to multi-document inputs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4177–4187.

Angela Fan, David Grangier, and Michael Auli. 2018a. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 45–54.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018b. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898.

Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. 2017a. Efficient softmax approximation for GPUs. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1302–1310.

Edouard Grave, Armand Joulin, and Nicolas Usunier. 2017b. Improving neural language models with a continuous cache. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. Retrieval augmented language model pre-training. In Proceedings of the International Conference on Machine Learning, pages 5695–5704.

Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. In International Conference on Learning Representations.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data. DOI: https://doi.org/10.1109/TBDATA.2019.2921572

Armand Joulin and Tomas Mikolov. 2015. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in Neural Information Processing Systems.