Pan More Gold from the Sand: Refining Open-domain Dialogue Training with Noisy Self-Retrieval Generation
Yihe Wang1, Yitong Li2,3, Yasheng Wang2, Fei Mi2, Pingyi Zhou2, Xin Wang1, Jin Liu1*, Qun Liu2, Xin Jiang2
1 School of Computer Science, Wuhan University
2 Noah's Ark Lab, Huawei
3 Huawei Technologies Ltd.
* Corresponding Author

arXiv:2201.11367v1 [cs.CL] 27 Jan 2022

Abstract

Real human conversation data are complicated, heterogeneous, and noisy, and building open-domain dialogue systems from them remains a challenging task. In fact, such dialogue data still contain a wealth of information and knowledge; however, they are not fully explored. In this paper, we show that existing open-domain dialogue generation methods, which memorize context-response paired data with causal or encode-decode language models, underutilize the training data. Different from current approaches that use external knowledge, we explore a retrieval-generation training framework that can increase the usage of training data by directly considering the heterogeneous and noisy training data as the "evidence". Experiments over publicly available datasets demonstrate that our method can help models generate better responses, even when such training data are usually perceived as low-quality data. Such performance gain is comparable with, and sometimes better than, that obtained by enlarging the training set. We also found that the model performance has a positive correlation with the relevance of the retrieved evidence. Moreover, our method performed well in zero-shot experiments, which indicates that it can be more robust to real-world data.

1 Introduction

Open-domain dialogue is a long-standing problem in natural language processing and has aroused widespread interest among researchers. Many approaches have been studied, and recently, generation models trained on large-scale data have gained more attention (Adiwardana et al., 2020; Roller et al., 2020; Xu et al., 2021; Madotto et al., 2021; Bao et al., 2019, 2020; Zhang et al., 2019; Wang et al., 2020). Open-domain dialogue systems are born to deal with many diverse domains, and naturally their training data, usually crawled from online resources such as Reddit and Twitter, are heterogeneous and contain utterances with many different topics, more freedom of topic shifting, and vague responses (Kummerfeld et al., 2018). As a result, directly building generation models from such data is inefficient and usually requires "knowledge" during training.

One common solution is to introduce external knowledge, usually in the form of unstructured knowledge passages from Wikipedia (Dinan et al., 2018) or Internet articles (Komeili et al., 2021), and then to build retrieval-augmented generation (RAG) methods to improve the response quality (Lewis et al., 2020; Izacard and Grave, 2020). However, this assumes knowledge-intensive scenarios, which are not suitable for general open-domain tasks or not robust to noise. According to our preliminary study, in the Reddit dataset, 43% of the dialogues are merely chitchat and cannot match any "knowledge". Moreover, building such a knowledge-augmented dataset is very expensive as it relies on large amounts of high-quality human annotations w.r.t. knowledge grounding. These datasets are thus limited in size, making it hard for a knowledge-retrieval method to generalize at scale.

Motivated by the above, we would like to investigate whether there are better ways of utilizing open-domain data without introducing external resources. To tackle the aforementioned problem, we found that the context from other relevant dialogue sessions can still be very useful for dialogue generation. To utilise such unstructured context, we take inspiration from retrieval-augmented methods (Lewis et al., 2020).
Differently, we retrieve useful dialogue context as evidence, build context-evidence-response triples for each dialogue context, and treat open-domain generation as an evidence-aware generation task, such that our model learns to respond grounded on these useful evidences.

By this, we show that current training methods,
which learn merely from context-response pairs, have not fully unleashed the potential of the training data, and that our method, which only retrieves from the training data, can consistently improve generation performance. We also perform zero-shot experiments, demonstrating that our method is robust and generalizes to different domains. Moreover, we found that adding extra retrieval data only (without training on them) can still help the model gain performance, and it can even outperform traditional methods directly trained on that part of the retrieval data. This proves our method is compatible with current methods that use external knowledge.

Our contributions are summarized as follows:
• We explore a retrieval-generation training framework that can increase the usage of training data by directly considering the heterogeneous and noisy training data as the "evidence".
• We show that adding extra retrieval data without training on them can still gain performance benefits, even better than traditional training with the retrieval data attached.
• The proposed method performs well in zero-shot experiments, which indicates that our method can generalize well in real-world applications.

2 Related Work

Open-domain Dialogue System  Open-domain dialogue systems aim to perform chit-chat with people without task and domain restrictions. It is a long-standing problem in natural language processing, and recently it has aroused widespread interest among researchers. Adiwardana et al. (2020) proposed Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations.
Blender (Roller et al., 2020; Xu et al., 2021) learns to provide engaging talking points and listen to its partners, as well as displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. Adapter-bot (Madotto et al., 2021) explored prompt-based few-shot learning in dialogue tasks. Plato (Bao et al., 2019, 2020) introduced discrete latent variables to tackle the inherent one-to-many mapping problem in response generation. Zhang et al. (2019) proposed DialoGPT, which was trained on 147M conversation-like exchanges extracted from Reddit comment chains. Wang et al. (2020) introduced CDial-GPT, a pre-trained dialogue model trained on a large-scale cleaned Chinese conversation dataset.

Retrieval Augmented Generation  Retrieval is a long-considered intermediate step in dialogue systems, and recently it has become a more intensively studied topic for neural models. Song et al. proposed an ensemble of retrieval-based and generation-based conversation systems: the retrieved candidates, in addition to the original query, are fed to a response generator, in which an extra encoder is used for the retrieved response. Pandey et al. combined the input context and retrieved responses to create exemplar embeddings, which were used in the decoder to generate the response. Weston et al. took a standard generative model with a single encoder that takes the concatenation of the original query and the retrieved response as input. Wu et al. proposed to construct an edit vector by explicitly encoding the lexical differences between the input query and the retrieved query; the edit vector and the prototype response representation are then fed to a decoder to generate a new response. Cai et al. extracted skeletons from the retrieval results, and both the skeleton and the original query were used for response generation. Lewis et al. explored a fine-tuning recipe for retrieval-augmented generation, which combines pre-trained parametric and non-parametric memory for language generation. Izacard and Grave proposed the Fusion-in-Decoder method, which encodes each evidence independently with the context when a generative model processes the retrieved passages. Most of these works retrieve external knowledge, usually unstructured knowledge passages, such as Wizard of Wikipedia (Dinan et al., 2018), Persona-Chat (Zhang et al., 2018), and Wizard of the Internet (Komeili et al., 2021). Moreover, Li et al. (2020) proposed a zero-resource knowledge-grounded dialogue model which bridges a context and a response with knowledge expressed as a latent variable.

3 Self-retrieval Method

We start from an open-domain dialogue dataset D = {(c_i, r_i)}_{i=1}^{N}, where c_i denotes the multi-turn dialogue context, consisting of dialogue utterances, and r_i represents the response.

Generally, we aim to build open-domain dialogue systems that retrieve useful dialogue responses (as evidences) from other sessions to help response generation. To tackle this problem, we propose a two-step framework; its overview is shown in Figure 1 and the data structure it produces is sketched below.

Figure 1: Overview of our self-retrieval approach. Our retriever first retrieves useful dialogue instances from the training dataset, which extends the current data to context-evidence-response triples. Then, we adopt evidence-aware training models over the data with self-retrieval evidences.
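For concreteness, the context-response pairs above and the context-evidence-response triples they are extended to (step 1 below) can be represented as simple records. A minimal sketch in Python; the class and field names are our own, not from the paper:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class DialoguePair:
        context: List[str]   # multi-turn dialogue context c_i, one string per utterance
        response: str        # gold response r_i

    @dataclass
    class DialogueTriple(DialoguePair):
        # retrieved evidence set E_i = {e_1, ..., e_k}, filled in by the retriever
        evidences: List[str] = field(default_factory=list)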
1. Firstly, we extend an open-domain dialogue dataset with a retriever. Given the context of the current dialogue turn c_i, the retriever R(e_{·} | c_i) returns the top-k relevant evidences as the evidence set E_i = {e_{1:k}} from a retrieval set. Note that, different from existing knowledge-grounding methods, we do not introduce external data for the retriever; we only consider retrieving evidence from the training data at hand. By that, we extend the dataset into context-evidence-response triples D = {(c_i, E_i, r_i)}_{i=1}^{N}.

2. Secondly, we adopt an evidence-aware generation model, which is a conditional language model that generates the response y given the context and the retrieved evidence, p(y | c, E). We investigate two widely used architectures: an auto-regressive GPT, and an encoder-decoder based language model, T5.

Next, we introduce how to design an effective retriever in Section 3.1 and ways of implementing evidence-aware generation on the basis of state-of-the-art pre-trained generation models in Section 3.2.

3.1 Retrieve Dialogue Evidence

A variety of retrieval systems have been studied, including classic but effective bag-of-words systems (Robertson et al., 1995) and up-to-date dense retrievers, such as DPR (Karpukhin et al., 2020) and SparTerm (Bai et al., 2020). In this paper, we work with open-domain data and would like not to introduce more data or more parameters, which might lead to data selection bias in practice (Otterbacher, 2018; Melucci, 2016). Therefore, we adopt the classic retriever BM25 (Robertson et al., 1995), which only relies on a few parameters and has been proven to be robust in many different scenarios.[1]

Generally, BM25 gives match scores based on bag-of-words features, and its ranking function is based on term frequency and inverse document frequency. During retrieval, for each context-response pair (c_i, r_i), we define the retrieval set by applying leave-one-out on the original training set, S = D − {(c_i, r_i)}, to ensure the model cannot see the true response during generation.

We explore three retrieval strategies: context-to-context (C2C) retrieval, context-to-response (C2R) retrieval, and a MIX retrieval.

Context-to-context Matching  C2C matches the context c_i of the current dialogue against the contexts c_j from the retrieval set S. The evidence set of c_i is defined as:

E_i^{C2C}(c_i, S) = argmax-K_{(c_j, r_j) ∈ S} score(c_i, c_j),

where argmax-K means selecting the top-k corresponding responses r_{1:k} as evidences e_{1:k} with the best matching scores given by BM25.

[1] We also did preliminary experiments over BM25 and DPR and they show no significant differences for our findings.
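As an illustration of the C2C strategy, the sketch below scores every training context against the current one with BM25 and keeps the responses of the top-k matches, with leave-one-out so that an instance never retrieves itself. It uses the rank_bm25 package and whitespace tokenisation for simplicity, whereas the paper uses the Elastic implementation and queries with the latest utterance only (Section 4.3):

    import numpy as np
    from rank_bm25 import BM25Okapi

    def c2c_retrieve(train_contexts, train_responses, query_index, k=8):
        """Return the responses (and scores) of the top-k training contexts
        matching context `query_index`, leaving the instance itself out."""
        tokenised = [c.lower().split() for c in train_contexts]   # bag-of-words view used by BM25
        bm25 = BM25Okapi(tokenised)
        scores = bm25.get_scores(tokenised[query_index])          # one BM25 score per candidate context
        scores[query_index] = -np.inf                             # leave-one-out: never retrieve yourself
        top = np.argsort(scores)[::-1][:k]
        return [(train_responses[j], float(scores[j])) for j in top]

C2R retrieval is identical except that BM25 is built over the responses rather than the contexts.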
Context-to-response Matching  As the retrieval set contains the dialogue responses, we also perform context-to-response (C2R) matching. It is similar to C2C, but C2R directly matches against the responses in the retrieval set. In C2R, BM25 computes the matching score based on the response r_j of the retrieval set:

E_i^{C2R}(c_i, S) = argmax-K_{(c_j, r_j) ∈ S} score(c_i, r_j).

Mixed Matching  We observed that these two strategies, C2C and C2R, often obtain different results. Therefore, we complement the two retrieval sets of C2C and C2R with each other and combine them into a MIX retrieval set by re-ranking them using the BM25 score. Finally, we take their responses as evidences:

E_i^{MIX}(c_i, S) = argmax-K {E_i^{C2C}, E_i^{C2R}}.

Filter  During our preliminary studies, we found that some retrieved evidences are not relevant to the current query and context. It is arguable that very few relevant evidences can be retrieved for some dialogue instances; to study this, we perform analyses in Section 4.5.2 and Section 4.5.3, where we study different sizes of the retrieval set to ensure that more relevant evidences can be found. Undoubtedly, these low-relevance evidences are harmful to response generation. Therefore, we apply a simple filter to discard evidences with very low matching scores.

3.2 Evidence-aware Dialogue Generation

For generating more appropriate responses, our generator is a language model that is also conditional on the retrieved evidence set:

p(y | c_i, E_i) = ∏_t p(y_t | c_i, E_i, y_{<t}).

GPT-2  The auto-regressive model takes the retrieved evidences and the dialogue context as a single input sequence, and then it generates the response. More precisely, for any instance (c_i, E_i, r_i), all retrieved evidences are concatenated before the dialogue context c_i, and the model directly generates the response y after c_i. We add a special token [p] before each retrieved evidence passage, and following Wang et al. (2020), we add [speaker1], [speaker2] to each utterance to indicate the different speakers of the multi-turn dialogue.

Fusion-in-Decoder  In our setups, we have multiple evidences for one instance; thus we adopt a slightly different model from the standard encoder-decoder T5 (Raffel et al., 2020). We use FiD (Izacard and Grave, 2020), which was originally proposed for open-domain question answering. It encodes each evidence independently with the context, so that the evidences do not affect each other on the encoder side, which is a better solution for encoding multiple evidences. In detail, FiD encodes a concatenation of the context c_i with each retrieved evidence e_j. It then concatenates all the encoded hidden representations, which are passed to the decoder for generation. Slightly different from the original architecture, we add an additional passage that only encodes the dialogue context, in case a dialogue does not use any retrieved evidence (discussed in Section 4.5.5). Similarly, we add special tokens as we did for GPT-2.
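To illustrate the two input formats described above, the sketch below builds the flat GPT-2-style sequence (evidences marked with [p] and concatenated before the speaker-tagged context) and the per-evidence FiD-style sequences (context paired with each evidence, plus one context-only passage). Any ordering or separator choice beyond what the paper states is an assumption:

    from typing import List

    SPEAKERS = ["[speaker1]", "[speaker2]"]

    def tag_context(context: List[str]) -> str:
        """Prefix each utterance with an alternating speaker token."""
        return " ".join(f"{SPEAKERS[i % 2]} {utt}" for i, utt in enumerate(context))

    def gpt2_input(context: List[str], evidences: List[str]) -> str:
        """All evidences, each prefixed with [p], concatenated before the dialogue context;
        the response is generated after this sequence."""
        evidence_part = " ".join(f"[p] {e}" for e in evidences)
        return f"{evidence_part} {tag_context(context)}"

    def fid_inputs(context: List[str], evidences: List[str]) -> List[str]:
        """One encoder input per evidence, plus an extra context-only passage;
        each is encoded independently and fused in the decoder."""
        ctx = tag_context(context)
        return [ctx] + [f"{ctx} [p] {e}" for e in evidences]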
4 Experiments

4.1 Datasets

We experiment on two publicly available open-domain dialogue datasets, Reddit and Movie Dialog. The Movie Dialog data are conversations taken directly under the movie subreddit (Dodge et al., 2015).[5] We discard instances with long turns or long sentences. In total, the Movie Dialog dataset has 940k dialogue sessions after preprocessing.

For both datasets, we randomly sample a training set of 100k samples, a validation set of 10k samples, and a test set of 10k samples. Data outside the above sets can be considered as the retrieval resource. Note that in our main experiments, the retrieval set (for train/dev/test) is exactly the training set, i.e., we only retrieve from the training set. Experimental results using a larger retrieval set, which involves more evidence than the training set, are investigated and reported in Section 4.5.4.

[5] https://research.fb.com/downloads/babi/

4.2 Metrics

To compare the response quality of different models, we adopt both automatic metrics and human evaluations.

Automatic Metrics  We deploy four commonly used automatic metrics for dialogue generation: perplexity (PPL), unigram overlap (F1), BLEU, and distinct-1,2 (Dist-1,2). F1 and BLEU are commonly used to measure how similar the machine-generated responses are to the referenced golden responses (Miller et al., 2017; Papineni et al., 2002). Dist-1,2 measure the diversity of the generated responses (Li et al., 2016).

Human Evaluations  We also perform human evaluation over the generated responses. Following Song et al. (2021), we consider three conventional criteria: fluency (Flue.), informativeness (Info.), and relevance (Relv.). We recruit a team on Amazon Mechanical Turk consisting of several professional annotators, who are proficient in language tasks but know nothing about the models.[6] We sample 200 instances for each model's evaluation under every setting, and each sample is evaluated by three people. Each criterion is rated on a five-point scale, where 1, 3, and 5 indicate unacceptable, moderate, and perfect performance, respectively. We report the average Fleiss's kappa scores (Fleiss and Cohen, 1973) on Reddit and Movie Dialog, 0.47 and 0.44 respectively, indicating that the annotators have reached moderate agreement.

[6] https://www.mturk.com/
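The diversity and overlap metrics are straightforward to compute. Below is a minimal sketch of Dist-n (ratio of distinct n-grams to all generated n-grams) and of unigram F1 between a generated and a reference response, following a common ParlAI-style formulation; the paper's exact tokenisation and normalisation may differ:

    from collections import Counter
    from typing import List

    def dist_n(responses: List[str], n: int) -> float:
        """Distinct-n: unique n-grams divided by total n-grams over all generated responses."""
        ngrams = []
        for r in responses:
            toks = r.lower().split()
            ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
        return len(set(ngrams)) / max(len(ngrams), 1)

    def unigram_f1(pred: str, gold: str) -> float:
        """Harmonic mean of unigram precision and recall between prediction and reference."""
        p, g = Counter(pred.lower().split()), Counter(gold.lower().split())
        overlap = sum((p & g).values())
        if overlap == 0:
            return 0.0
        precision = overlap / sum(p.values())
        recall = overlap / sum(g.values())
        return 2 * precision * recall / (precision + recall)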
4.3 Implementation and Setup

For the retriever, we use the Elastic implementation with the default parameter setups.[7] As the context has a varying number of turns, we use the latest utterance of the dialogue context as the BM25 query in practice, which yields more consistent matching scores. The filter is used in all retrieval setups except the baselines.

We perform an in-domain evaluation over the two datasets. For each dataset, we adopt the three proposed self-retrieval (SR) methods, C2C, C2R, and MIX, comparing against the GPT-2 and FiD baselines. We experiment with different numbers of retrieved evidence passages (see Section 4.5.3). Note that FiD degenerates to a standard T5 model without any evidences. We train our models from the pretrained GPT-2 checkpoint[8] and the T5 checkpoint for FiD.[9] We do model selection based on PPL over the validation set.

We additionally perform a zero-shot cross-domain evaluation for both datasets using FiD.[10] In this setup, we only train our best in-domain FiD model on one dataset and then directly test it on the other, while the retrieval set for inference is the training set of the target domain. All other setups follow the in-domain experiments.

[7] https://www.elastic.co/
[8] https://huggingface.co/gpt2/tree/main
[9] https://huggingface.co/t5-small/tree/main
[10] We ensure there is no overlap between the two datasets.
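As a rough sketch of the retrieval setup described above (Elastic with default parameters, querying with the latest utterance), the snippet below indexes the training pairs and issues C2C- or C2R-style match queries. It assumes the Elasticsearch 8.x Python client and a locally running instance; the index and field names, and the toy data, are illustrative only:

    from elasticsearch import Elasticsearch

    # Toy stand-in for the (context, response) training pairs of the 100k training split.
    train_pairs = [
        ("[speaker1] hi , how was the movie ?", "it was great , you should watch it ."),
    ]

    es = Elasticsearch("http://localhost:9200")

    # Index every training instance once; with Elastic's defaults, "match" queries on
    # the text fields below are scored with BM25, which is all the retriever needs.
    for i, (context, response) in enumerate(train_pairs):
        es.index(index="dialogue_train", id=i, document={"context": context, "response": response})
    es.indices.refresh(index="dialogue_train")

    def retrieve(latest_utterance, self_id, field="context", k=8):
        """C2C (field='context') or C2R (field='response') retrieval, excluding the query instance."""
        resp = es.search(
            index="dialogue_train",
            query={"bool": {
                "must": {"match": {field: latest_utterance}},
                "must_not": {"ids": {"values": [str(self_id)]}},
            }},
            size=k,
        )
        return [(hit["_source"]["response"], hit["_score"]) for hit in resp["hits"]["hits"]]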
4.4 Results

In-domain  Table 1 reports the overall in-domain experimental results. Overall, our self-retrieval methods consistently achieve better performance across almost all automatic and human evaluation metrics in terms of generation quality. For generation diversity (Dist-1 and Dist-2), our SR still has comparable performance with the strong baselines. For both GPT-2 and FiD, all three matching strategies improve the overall performance, and MIX consistently outperforms the other two. Comparing GPT-2 and FiD, the two baselines achieve similar performance, while when adding our retrieved evidences, the FiD-based methods perform better, demonstrating the effectiveness of FiD's evidence-aware training in modeling multiple evidence passages. We also illustrate examples generated by our approach and the baselines in Table 2. Above all, these results demonstrate that our approach can utilise more of the dialogue data without introducing more data compared with the baselines.

Reddit                    PPL↓  F1↑  BLEU↑  Dist-1↑  Dist-2↑  Flue↑  Info↑  Relv↑
GPT-2   BASELINE          31.3  5.3  3.4    65.4     96.7     3.0    2.9    2.8
        w. SR  C2C        29.4  6.1  3.8    69.3     95.6     3.2    3.0    3.1
        w. SR  C2R        29.7  6.0  3.6    68.4     95.3     3.2    3.1    3.1
        w. SR  MIX        28.1  6.6  4.2    73.5     98.2     3.4    3.3    3.4
T5      BASELINE          25.5  5.2  3.7    95.7     96.3     3.1    3.0    3.1
FiD     w. SR  C2C        25.0  8.0  5.9    91.2     93.8     3.3    3.2    3.3
        w. SR  C2R        25.2  7.9  5.7    90.4     92.3     3.3    3.2    3.2
        w. SR  MIX        23.8  9.5  6.9    95.3     97.2     3.5    3.4    3.5

Movie                     PPL↓  F1↑  BLEU↑  Dist-1↑  Dist-2↑  Flue↑  Info↑  Relv↑
GPT-2   BASELINE          25.6  5.4  3.3    64.3     96.0     3.0    2.9    2.8
        w. SR  C2C        23.5  6.1  3.8    66.9     93.9     3.2    3.1    3.1
        w. SR  C2R        23.5  6.0  3.7    67.8     92.7     3.2    3.0    3.1
        w. SR  MIX        22.7  6.7  4.2    71.4     96.1     3.4    3.3    3.3
T5      BASELINE          20.5  5.2  3.7    95.2     95.8     3.1    2.9    2.9
FiD     w. SR  C2C        20.1  7.7  5.5    92.3     94.1     3.3    3.2    3.2
        w. SR  C2R        20.2  7.7  5.4    91.7     93.6     3.3    3.1    3.2
        w. SR  MIX        18.9  9.2  6.6    94.9     96.8     3.6    3.5    3.6

Table 1: Automatic and human evaluation of the in-domain setups over Reddit and Movie Dialog, using 8 evidence passages. GPT-2 and T5 are baselines and "w. SR" (with self-retrieval) indicates our methods. The best results are in bold.

Speaker1: Why do you get to decide who has something to offer?
Speaker2: He doesn't, he is entitled to his opinion, this is the internet and a forum discussion thread. People post their opinions not the truth.
Baseline Generation: Why have you already voted to make sure you for yourself to support yourself?
Key Evidence 1: I like some of them. However, we are all entitled to our own opinions.
Key Evidence 2: No you're entitled to your opinion. I'd just prefer an opinion that didn't contain a logical fallacy.
Our Generation: You are right. People are entitled to their opinion.
Ground Truth: I know, I was taking a round about way of trying to get him to questions his opinion.

Table 2: Examples of responses generated by the baselines and our approach based on FiD.

Zero-shot Cross-domain  Table 3 reports the results of the zero-shot experiments using FiD. Again, we find that our methods with evidences achieve better performance compared to the baselines without knowledge, and MIX performs the best. This result indicates that our approach has good generalisation and is robust to different datasets.

Overall, both the in-domain and zero-shot results demonstrate that our self-retrieval method can improve the performance of open-domain dialogue generation, and it is worth noting that our self-retrieval does not use any additional resource. This indicates that our methods can unleash more potential of the dialogue data compared with the vanilla training methods.

4.5 Analysis

4.5.1 Retrieval Strategies

Table 1 also shows the experimental results of the different retrieval strategies. We find that MIX performs better than context-to-context retrieval (C2C) and context-to-response retrieval (C2R), and the latter two methods show no significant difference. We believe that both C2C and C2R can retrieve useful evidences, although from different aspects, and thus mixing them yields more useful, informative and relevant evidences, and better performance as well.
                     Movie Dialog → Reddit               Reddit → Movie Dialog
                     PPL   F1   BLEU  Dist-1  Dist-2     PPL   F1   BLEU  Dist-1  Dist-2
T5    BASELINE       29.2  5.3  3.9   95.6    96.2       33.0  5.1  3.6   94.5    95.9
FiD   w. SR  C2C     28.0  7.4  5.6   95.6    98.3       30.3  7.4  5.7   94.4    97.2
      w. SR  C2R     28.2  7.3  5.4   94.8    98.0       30.6  7.3  5.6   93.8    96.6
      w. SR  MIX     26.6  9.0  6.5   96.0    98.6       27.9  8.7  6.5   94.9    97.9

Table 3: Automatic evaluation results of the zero-shot experiments over Reddit and Movie Dialog with 8 retrieved evidence passages. The best results are in bold.

                   Reddit                                Movie Dialog
                   PPL   F1   BLEU  Dist-1  Dist-2       PPL   F1   BLEU  Dist-1  Dist-2
GPT-2  baseline    31.3  5.3  3.4   65.4    96.7         25.6  5.4  3.3   64.3    96.0
  SR   p1          29.8  5.9  3.8   70.5    96.1         23.9  5.8  3.6   67.3    92.1
       p2          29.3  6.1  3.9   71.2    96.6         23.6  6.0  3.8   68.6    93.2
       p4          28.6  6.3  4.0   72.1    97.3         23.1  6.3  4.0   69.8    94.3
       p8          28.1  6.6  4.2   73.5    98.2         22.7  6.7  4.2   71.4    96.1
       p16         27.9  6.8  4.4   74.0    98.7         22.5  6.8  4.3   71.8    96.4
T5     baseline    25.5  5.2  3.7   95.7    96.3         20.5  5.2  3.7   95.2    95.8
FiD w. SR  p1      25.4  7.5  5.6   93.7    95.8         20.2  7.2  5.2   94.0    95.9
       p2          24.9  8.1  6.0   94.1    96.3         20.0  7.8  5.6   93.8    95.4
       p4          24.3  8.8  6.4   94.6    97.2         19.5  8.4  6.0   94.3    96.1
       p8          23.8  9.5  6.9   95.3    97.2         18.9  9.2  6.6   94.9    96.8
       p16         23.6  9.7  7.0   95.6    97.8         18.7  9.4  6.8   95.2    97.0

Table 4: Experimental results with different numbers of evidences used for generation on Reddit and Movie Dialog. p-k indicates the number of evidence passages used for generation. The best results are in bold.

Reddit                    PPL   F1   BLEU
GPT-2      BASELINE       31.3  5.3  3.4
           RANDOM         31.4  5.4  3.4
     SR    w/o FILTER     28.8  6.2  3.9
           w. FILTER      28.1  6.8  4.2
FiD (T5)   BASELINE       25.5  5.2  3.7
           RANDOM         25.7  5.2  3.6
     SR    w/o FILTER     24.7  8.3  6.1
           w. FILTER      23.8  9.5  6.9

Table 5: Effectiveness of the filter.

4.5.2 Effectiveness of Filter

Table 5 shows the ablation study of using the filter during the retrieval step on Reddit, compared with not using it. The finding is that the experiment with the filter (w. FILTER) performs better than the experiment without it (w/o FILTER), as well as a setup using random evidences (RANDOM). This shows that noisy evidences give no assistance, or even do harm, to the model, and demonstrates the necessity of discarding low-relevance evidence for dialogue generation in our method.
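Concretely, the MIX strategy with the filter amounts to merging the C2C and C2R candidate lists, re-ranking by BM25 score, dropping low-scoring candidates, and truncating to k. A minimal sketch; the score threshold and the duplicate handling are our assumptions, not values given in the paper:

    def mix_and_filter(c2c_candidates, c2r_candidates, k=8, min_score=1.0):
        """Merge (response, bm25_score) candidates from C2C and C2R,
        drop low-relevance ones (the filter), and keep the top-k."""
        best = {}
        for response, score in c2c_candidates + c2r_candidates:
            if score >= min_score:                       # the filter: discard very low matching scores
                best[response] = max(best.get(response, float("-inf")), score)
        reranked = sorted(best.items(), key=lambda x: x[1], reverse=True)
        return [response for response, _ in reranked[:k]]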
4.5.3 Number of Retrieved Evidences

We also carried out experiments with different numbers of retrieved evidences. Table 4 reports the experimental results of using k evidences (p-k) for generation. We observe that the experiment using more retrieved evidences (p16) performs better than the experiments with fewer retrieved evidences (i.e., p1, p2, p4, p8), although the performance gap gets smaller as the number of evidences increases. Considering the trade-off between efficiency and performance, we report results using 8 evidences as our main results, which we consider to be good enough. These results indicate that we can use more retrieved evidences, which leads to better experimental results, and that providing more supporting information is significant for the generative model.

4.5.4 Self-retrieval vs. Extra Evidences

We note that in our experiments we make the retrieval set exactly the same as the training set, denoted as the "self-retrieval (SR)" setup. One natural question is whether we can use extra data for the retrieval set. To further understand this question and to validate the usefulness of our method, we carried out experiments with different sizes of the training set and retrieval set. Specifically, we experiment with additional setups that enlarge the retrieval sets, i.e., +200k, +400k, +600k, where "+" means extra data for the retrieval set, and we also adopt baselines with different training sizes of 100k, 300k, 500k, 700k (denoted before "+").[11]

[11] Due to data size limitations, we did not cover all setups.
Figure 2: Results with different sizes of training set and retrieval set on Movie Dialog with 8 retrieved evidences. "Self" indicates the training set used for self-retrieval and "+" means adding extra data for retrieval. (Panels: (a) PPL on GPT-2, (b) F1 on GPT-2, (c) BLEU on GPT-2, (d) PPL on FiD, (e) F1 on FiD, (f) BLEU on FiD.)

Figure 3: Performance by different overlaps between evidences and ground-truth responses over the Reddit dataset. (Panels: (a) max setup over overlaps with bins = {0, ..., 9, ≥ 10}; (b) sum setup over overlaps using bin size = 5.)

Figure 2 shows the experimental results.[12] We observe that experiments with larger retrieval sets achieve better results than those with smaller retrieval sets across different training sizes. We believe larger retrieval sets can introduce more relevant evidences, which brings performance gains for the model. Another interesting finding is that adding extra data for retrieval (100+600k, 300+400k, 500+200k) in our method can outperform the baselines (700k) with the extra data added via direct training. Also, under the same amount of total data (700k), leveraging more data for retrieval (100+600k, 300+400k, 500+200k) approaches the performance of self-retrieval with the full data (self, 700k). This indicates that our method can increase the usage of the training data in a retrieval-only way, without directly training on these responses, and that our method generalises well over the retrieval evidences.

[12] We also report detailed results using the (100+600k) setup in Appendix A.1.

4.5.5 Relevance of Evidence and Ground-truth

To further study how our method works, we study how the relevance between the retrieved evidences and the ground-truth response influences the generation performance. For each instance (c_i, r_i) that uses n retrieval evidences E_i^{MIX} = {e_1, e_2, ..., e_n}, we compute the number of overlapping words between the ground-truth r_i and each retrieved evidence. We study two setups, computing the overall overlap(E, r_i) using max and sum over the individual overlaps.

Figure 3 shows the results of these two setups. We observe that higher overlap leads to better performance. These results indicate that highly relevant retrieval evidences can help generate better responses and that low-relevance knowledge is harmful, which is consistent with the findings in Section 4.5.2. Also, there are still some low-relevance evidences left after the retrieval step. This indicates that open-domain dialogue generation is still a difficult task, and better retrieval methods are required to further improve our generation performance.
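The overlap statistic used in this analysis can be computed directly from word counts. A minimal sketch of the per-instance max and sum setups; whitespace tokenisation and multiset counting are our assumptions:

    from collections import Counter

    def word_overlap(evidence: str, ground_truth: str) -> int:
        """Number of overlapping words between one evidence and the ground-truth response."""
        e, g = Counter(evidence.lower().split()), Counter(ground_truth.lower().split())
        return sum((e & g).values())

    def overlap_stats(evidences, ground_truth):
        """overlap(E, r_i) under the two setups of Section 4.5.5: max and sum of individual overlaps."""
        overlaps = [word_overlap(e, ground_truth) for e in evidences]
        return {"max": max(overlaps, default=0), "sum": sum(overlaps)}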
5 Conclusion

In this paper, we propose a self-retrieval training framework for open-domain dialogue generation. Different from other knowledge-intensive tasks, our framework only retrieves relevant dialogue instances from the training data (which can be extended with a retrieval set) and does not need to train on them in the generation model. Importantly, we demonstrate that traditional training baselines underutilise the training data and that our method can exploit more of its potential. We show that our method improves the robustness and generality of generative models, as well as generating proper responses for complicated human conversations. In future work, we would like to study better ways of evidence retrieval and evidence-aware training, and we believe our approach can benefit other NLP tasks, such as classification.

References

Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Yang Bai, Xiaoguang Li, Gang Wang, Chaoliang Zhang, Lifeng Shang, Jun Xu, Zhaowei Wang, Fangshan Wang, and Qun Liu. 2020. SparTerm: Learning term-based sparse representation for fast text retrieval. arXiv preprint arXiv:2010.00768.

Siqi Bao, Huang He, Fan Wang, Hua Wu, and Haifeng Wang. 2019. PLATO: Pre-trained dialogue generation model with discrete latent variable. arXiv preprint arXiv:1910.07931.

Siqi Bao, Huang He, Fan Wang, Hua Wu, Haifeng Wang, Wenquan Wu, Zhen Guo, Zhibin Liu, and Xinchao Xu. 2020. PLATO-2: Towards building an open-domain chatbot via curriculum learning. arXiv preprint arXiv:2006.16779.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. ArXiv, abs/2005.14165.

Deng Cai, Yan Wang, Wei Bi, Zhaopeng Tu, Xiaojiang Liu, Wai Lam, and Shuming Shi. 2019. Skeleton-to-response: Dialogue generation guided by retrieval memory. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1219–1228.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241.

Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, and Jason Weston. 2015. Evaluating prerequisite qualities for learning end-to-end dialog systems. arXiv preprint arXiv:1511.06931.

Joseph L Fleiss and Jacob Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33(3):613–619.

Gautier Izacard and Edouard Grave. 2020. Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282.

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.

Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2021. Internet-augmented dialogue generation. arXiv preprint arXiv:2107.07566.
Jonathan K Kummerfeld, Sai R Gouravajhala, Joseph Peper, Vignesh Athreya, Chulaka Gunasekara, Jatin Ganhotra, Siva Sankalp Patel, Lazaros Polymenakos, and Walter S Lasecki. 2018. A large-scale corpus for conversation disentanglement. arXiv preprint arXiv:1810.11118.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.

Linxiao Li, Can Xu, Wei Wu, Yufan Zhao, Xueliang Zhao, and Chongyang Tao. 2020. Zero-resource knowledge-grounded dialogue generation. arXiv preprint arXiv:2008.12918.

Andrea Madotto, Zhaojiang Lin, Genta Indra Winata, and Pascale Fung. 2021. Few-shot bot: Prompt-based learning for dialogue systems. arXiv preprint arXiv:2110.08118.

Massimo Melucci. 2016. Impact of query sample selection bias on information retrieval system ranking. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 341–350.
Alexander H. Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. 2017. ParlAI: A dialog research software platform. In EMNLP.

Jahna Otterbacher. 2018. Addressing social bias in information retrieval. In CLEF.

Gaurav Pandey, Danish Contractor, Vineet Kumar, and Sachindra Joshi. 2018. Exemplar encoder-decoder for neural conversation generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1329–1338.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication Sp, 109:109.

Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M Smith, et al. 2020. Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637.
Haoyu Song, Yan Wang, Kaiyan Zhang, Wei-Nan Zhang, and Ting Liu. 2021. BoB: BERT over BERT for training persona-based dialogue models from limited personalized data. arXiv preprint arXiv:2106.06169.

Yiping Song, Cheng-Te Li, Jian-Yun Nie, Ming Zhang, Dongyan Zhao, and Rui Yan. 2018. An ensemble of retrieval-based and generation-based human-computer conversation systems.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Yida Wang, Pei Ke, Yinhe Zheng, Kaili Huang, Yong Jiang, Xiaoyan Zhu, and Minlie Huang. 2020. A large-scale Chinese short-text conversation dataset. In NLPCC.

Jason Weston, Emily Dinan, and Alexander Miller. 2018. Retrieve and refine: Improved sequence generation models for dialogue. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pages 87–92.

Yu Wu, Furu Wei, Shaohan Huang, Yunli Wang, Zhoujun Li, and Ming Zhou. 2019. Response generation by context-aware prototype editing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7281–7288.

Jing Xu, Arthur Szlam, and Jason Weston. 2021. Beyond goldfish memory: Long-term open-domain conversation. arXiv preprint arXiv:2107.07567.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.
A Appendix

A.1 Full Results of Retrieving Extra Data

We present the full results of enlarging the retrieval set to (100+600k) for both Reddit and Movie Dialog in Table 6. The training set is 100k, the same as in the self-retrieval setup of the main results.

Reddit                      PPL↓  F1↑   BLEU↑
GPT-2        BASELINE       31.3  5.3   3.4
GPT-2 w. DR  C2C            28.0  6.2   3.8
             C2R            28.2  6.0   3.7
             MIX            26.9  6.8   4.3
T5           BASELINE       25.5  5.2   3.7
FiD w. DR    C2C            23.6  9.6   7.2
             C2R            23.8  9.4   7.1
             MIX            21.9  12.0  9.0

Movie                       PPL↓  F1↑   BLEU↑
GPT-2        BASELINE       25.6  5.4   3.3
GPT-2 w. DR  C2C            22.5  6.0   3.7
             C2R            22.6  5.9   3.5
             MIX            21.7  7.3   4.7
T5           BASELINE       20.5  5.2   3.7
FiD w. DR    C2C            19.2  9.1   6.9
             C2R            19.4  9.0   6.7
             MIX            17.7  11.5  8.5

Table 6: Automatic evaluations of the in-domain setups on the Reddit and Movie Dialog datasets with 8 evidences for retrieval. The best results are in bold.