Unsupervised Knowledge Selection for Dialogue Generation
Xiuyi Chen, Feilong Chen, Fandong Meng, Peng Li, Jie Zhou
Pattern Recognition Center, WeChat AI, Tencent Inc, Beijing, China
{hugheren.chan,ivess.chan}@gmail.com, {fandongmeng,patrickpli,withtomzhou}@tencent.com

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1230–1244, August 1–6, 2021. ©2021 Association for Computational Linguistics

Abstract

Knowledge selection is an important and challenging task which could provide the appropriate knowledge for informative dialogue generation. However, the needed gold knowledge label is difficult to collect in reality. In this paper, we study knowledge selection for dialogue generation in the unsupervised scenario and propose a novel Distilled Distant Supervision Loss (DDSL) to supervise knowledge selection when the gold knowledge label is unknown. Specifically, we first obtain an oracle knowledge label via distant supervision and then leverage knowledge distillation to alleviate the noisy labeling problem of distant supervision. Furthermore, we propose a pretraining-finetuning strategy to deal with the mismatch knowledge selection problem, i.e., models tend to select mismatched knowledge for dialogue generation in the unsupervised setting, which causes the degeneration of the knowledge-aware decoder. Experiments on two knowledge-grounded dialogue datasets show that our approach manages to select knowledge more accurately in the unsupervised setting and generates more informative responses, even outperforming many strong supervised baselines.¹

¹ The codes and model checkpoints will be available at https://github.com/ErenChan/UKSDG

1 Introduction

To avoid general and dull dialogue generation (Li et al., 2016), knowledge-grounded dialogue, which equips dialogue systems with external knowledge, has become a popular research topic. Thanks to the hand-collected knowledge-grounded dialogue datasets which align each dialogue (or even each utterance) with a pre-identified document (Zhang et al., 2018; Zhou et al., 2018; Moghe et al., 2018; Zhou et al., 2020), many researchers focus on injecting the given knowledge to generate informative responses and achieve promising results (Yavuz et al., 2019; Tang and Hu, 2019; Qin et al., 2019a; Li et al., 2019b; Zheng and Zhou, 2019; Meng et al., 2019; Ren et al., 2019; Ye et al., 2020; Lin et al., 2020). However, these methods usually need the pre-identified knowledge, and the knowledge access task, which is the precursor to knowledge-grounded dialogue generation in reality (Zheng et al., 2020), is less studied (Kim et al., 2020b).

It is natural to extract the external knowledge via information retrieval technology. Several works regard the retrieved knowledge sentences as the pre-identified document (Ghazvininejad et al., 2018; Michel Galley, 2018; Yang et al., 2019b; Gopalakrishnan et al., 2019). However, the retrieved document contains redundant and irrelevant information which is harmful for dialogue generation (Zhao et al., 2020b). Hence, knowledge selection, which chooses an appropriate sentence from the pre-retrieved knowledge pool, gains much attention, and it plays the role of knowledge access for generation. Dinan et al. (2019) first propose knowledge selection for dialogue generation as two sequential subtasks, where generation is based on the selected knowledge. Several works follow their setting and achieve improvements with latent variable models (Kim et al., 2020a; Chen et al., 2020b) or more complex selection mechanisms (Niu et al., 2019; Meng et al., 2020; Zheng et al., 2020; Meng et al., 2021). Although those works show promising performance with explicit use of knowledge in open-domain dialogue, they still need gold knowledge labels to train the selection module well (Kim et al., 2020a). It is still less studied how to make knowledge selection work well without gold knowledge labels, which is valuable and challenging.
In this paper, we explore knowledge selection for dialogue generation in the unsupervised scenario and propose a novel Distilled Distant Supervision Loss (DDSL) to supervise knowledge selection when there is no gold knowledge label. Specifically, we first obtain an oracle knowledge label via distant supervision (Mintz et al., 2009) to substitute the gold one in the unsupervised setting. However, distant supervision inevitably suffers from the noisy labeling problem due to literal matching (Yang et al., 2019a). Therefore, to train knowledge selection well without the gold label, we leverage knowledge distillation to reduce the noise of the oracle label. Furthermore, we find that models tend to select mismatched knowledge for dialogue generation in the unsupervised setting, and forcing the knowledge-aware decoder to leverage the selected knowledge at training will cause the decoder to degenerate into a knowledge-independent decoder. To deal with this problem, we propose to pretrain the knowledge selection and response generation modules independently and then finetune the decoder with the selected knowledge using different sample weighting scores. We demonstrate the effectiveness of our unsupervised approach on two knowledge-grounded dialogue datasets, i.e., Wizard of Wikipedia (Dinan et al., 2019) and Holl-E (Moghe et al., 2018), in comparison with various supervised and unsupervised baselines.

Our contributions are summarized as follows:
• We propose the Distilled Distant Supervision Loss to make knowledge selection work well in the unsupervised scenario where the gold knowledge label is not available.
• We propose a pretraining-finetuning strategy to alleviate the degeneration of the knowledge-aware decoder caused by the mismatch knowledge selection problem.
• Results on two datasets show that our approach manages to select knowledge more accurately in the unsupervised setting and even generates more informative responses than many strong supervised baselines.

2 Approach

2.1 Task Formulation

Given the utterance x_t at each turn t and the associated knowledge pool K_t = {k_t^i} = {k_t^1, · · ·, k_t^L}² containing L retrieved candidate sentences k_t^i, the final goal is to generate an informative response y_t. Following (Dinan et al., 2019), we first learn to select the appropriate knowledge k_t^s from the knowledge pool K_t and then generate the response y_t by incorporating the selected knowledge. In the conventional supervised setting, there exists a gold knowledge label to supervise knowledge selection. However, manually labeled knowledge is difficult to obtain in reality (Lian et al., 2019). As a result, we study unsupervised knowledge selection for dialogue generation in this paper, which is very valuable and challenging.

² k_t^i, k_t^s, k_t and k_t^0 indicate any, selected, gold and oracle knowledge sentences or labels, respectively.

2.2 Architecture Overview

In the following subsections, we first introduce the three major components (Sections 2.3 ∼ 2.5): the Encoder, Knowledge Selection (KS) and the Decoder, which are trained jointly with the gold knowledge loss and the response generation loss in the conventional supervised setting, as Figure 1 (a) shows. Then we introduce our Distilled Distant Supervision Loss (DDSL) in Section 2.6 to train knowledge selection well in the unsupervised setting, which consists of distant supervision and knowledge distillation, as Figure 1 (b) shows. Finally, we detail the mismatch knowledge selection problem and how to make the decoder leverage the selected knowledge k_t^s well in Section 2.7.

2.3 Encoder

For each sentence s_t, we obtain the context-aware word representations H_{s_t} via BERT (Devlin et al., 2019) and the corresponding sentence representation h_{s_t} via Mean Pooling (Cer et al., 2018):

H_{s_t} = \mathrm{BERT}(s_t) \in \mathbb{R}^{N_{s_t} \times d}, \qquad h_{s_t} = \mathrm{Mean}(H_{s_t}) \in \mathbb{R}^{d},    (1)

where N_{s_t} is the sentence length and d is the hidden size. Specifically, we represent the utterance x_t with H_{x_t} and h_{x_t}, and represent each knowledge sentence k_t^i ∈ K_t with H_{k_t^i} and h_{k_t^i}.
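For illustration, the following is a minimal NumPy sketch of the pooling step in Equation 1, assuming H_s is the [N_s, d] matrix of token states already produced by the shared BERT encoder; the padding mask is our assumption, since the paper only states that mean pooling is used.

```python
import numpy as np

def mean_pool(H_s: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Equation 1: h_s = Mean(H_s), pooling only over real (non-padding) tokens.

    H_s:  [N_s, d] token representations from the shared BERT encoder.
    mask: [N_s] with 1 for real tokens and 0 for padding (an assumption;
          the paper does not specify how padding is handled).
    """
    mask = mask[:, None].astype(H_s.dtype)          # [N_s, 1]
    return (H_s * mask).sum(axis=0) / mask.sum()    # [d]

# Toy usage: a 4-token "sentence" with hidden size 8 and one padding position.
H = np.random.randn(4, 8)
h = mean_pool(H, np.array([1, 1, 1, 0]))
print(h.shape)  # (8,)
```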
Figure 1: The framework of knowledge selection for dialogue generation. (a) In the supervised setting. (b) In the unsupervised setting. Specifically, we replace the gold knowledge loss with our Distilled Distant Supervision Loss (DDSL), which contains Distant Supervision (DS) and knowledge distillation (see the dotted red lines).

2.4 Knowledge Selection

In this paper, we mainly focus on knowledge selection in the unsupervised setting and adopt the standard dot-product attention over the knowledge candidates to select knowledge (Dinan et al., 2019).

Selection Query: The selection query consists of the current utterance, the dialogue history dh_t = [x_1, y_1, · · ·, x_{t−1}, y_{t−1}] and the history of selected knowledge kh_t = [k_1^0, · · ·, k_{t−1}^0], as they help the knowledge selection (Kim et al., 2020a). Formally, we use two GRUs (Cho et al., 2014) to summarize the dialogue and knowledge selection history as the corresponding state vectors s_t^{dh} and s_t^{kh}:

s_t^{dh} = \mathrm{GRU}^{dh}([h_{x_t}; h_{y_t}], s_{t-1}^{dh}) \in \mathbb{R}^{d}, \qquad s_t^{kh} = \mathrm{GRU}^{kh}(h_{k_t^0}, s_{t-1}^{kh}) \in \mathbb{R}^{d},    (2)

where s_0^{dh} and s_0^{kh} are zero vectors, h_{x_t}, h_{y_t} and h_{k_t^0} are the sentence vectors of the utterance x_t, the response y_t and the oracle knowledge k_t^0 (described in Equation 8), and [·; ·] denotes concatenation. Then the selection query q_t is obtained:

q_t = W^{q} [s_{t-1}^{kh}; s_{t-1}^{dh}; h_{x_t}] \in \mathbb{R}^{d}.    (3)

Knowledge Selection: The knowledge selection distribution S ∈ R^L over the knowledge pool K_t is obtained by the dot-product attention:

S_{k_t^i} = \frac{\exp(h_{k_t^i} \cdot q_t)}{\sum_{k_t^j \in K_t} \exp(h_{k_t^j} \cdot q_t)}.    (4)

Finally, the knowledge k_t^s with the highest attention score is selected for further response generation. If the gold knowledge k_t exists, we could train this task via the Cross Entropy (CE) loss:

L_{KS} = L_{CE}(S, k_t) = -\log S_{k_t}.    (5)
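A minimal NumPy sketch of Equations 3–5 follows; it assumes the GRU summarization of Equation 2 has already produced the two history states, and the random tensors and the name W_q are stand-ins for the learned parameters, not values from the paper.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def select_knowledge(s_kh_prev, s_dh_prev, h_x, W_q, H_K):
    """Equations 3-4: build the selection query and attend over the pool.

    s_kh_prev, s_dh_prev, h_x: [d] knowledge-history state, dialogue-history
        state and current-utterance vector.
    W_q: [d, 3d] learned projection (a stand-in for the paper's W^q).
    H_K: [L, d] sentence vectors of the L knowledge candidates.
    Returns the selection distribution S (Eq. 4) and the argmax index k_s.
    """
    q = W_q @ np.concatenate([s_kh_prev, s_dh_prev, h_x])  # Eq. 3, [d]
    S = softmax(H_K @ q)                                   # Eq. 4, [L]
    return S, int(S.argmax())

def selection_loss(S, gold_idx):
    """Equation 5: cross entropy against the gold label (supervised case)."""
    return -np.log(S[gold_idx] + 1e-12)

d, L = 8, 5
rng = np.random.default_rng(0)
S, k_s = select_knowledge(rng.normal(size=d), rng.normal(size=d),
                          rng.normal(size=d), rng.normal(size=(d, 3 * d)),
                          rng.normal(size=(L, d)))
print(k_s, selection_loss(S, gold_idx=2))
```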
2.5 Decoder

Following (Dinan et al., 2019; Kim et al., 2020a), our Transformer-based decoder takes the representation concatenation H_{rc} = [H_{x_t}; H_{k_t^s}] ∈ R^{N_{rc} × d} of the current utterance x_t and the selected knowledge k_t^s as input, and uses the copy mechanism (Gu et al., 2016) to generate responses. The process of generating a word can be formulated as follows:

s_t^{n} = \mathrm{TD}(H_{rc}, y_t^{<n}),    (6)

where TD denotes the Transformer decoder, s_t^n is the hidden vector for the n-th word in the response y_t at the t-th turn, p^{cp} is the attention weight of the first head in the additional multi-head self-attention layer for the copy mechanism, which is short for MultiHead^{cp} (Vaswani et al., 2017), V is the vocabulary, and p(V) is the final generation distribution. Finally, we generate the word y_t^n with the highest probability, and we keep generating by feeding y_t^n to the decoder until the "⟨eos⟩" token is generated. We train the generation task with the Negative Log-Likelihood (NLL) loss:

L_{G} = L_{NLL} = -\log p(y_t \mid x_t, k_t^{s}).    (7)

The model is trained with the loss L = L_{KS} + L_{G}, where L_{KS} is the knowledge selection loss, i.e., Equation 5 in the supervised setting.

2.6 Distilled Distant Supervision Loss

In this section, we introduce our Distilled Distant Supervision Loss (DDSL) to train knowledge selection well in the unsupervised setting, which consists of distant supervision, label weighting and knowledge distillation. We first obtain a noisy oracle knowledge label via distant supervision, and then our DDSL tries to reduce the noise via label weighting and knowledge distillation.

Distant Supervision: Suppose that we have the utterance x_t, the response y_t and the retrieved knowledge pool K_t without knowing the gold knowledge label. We first calculate a confidence score W_{k_t^i} measuring whether each knowledge sentence k_t^i is matched up with this dialogue flow (Equation 8), where τ is the temperature used to reshape the confidence probability distribution W ∈ R^L over the knowledge pool. Then, we obtain the oracle knowledge k_t^0 with the highest confidence score, assuming that the gold knowledge should contribute most tokens to the response generation. This is possible because 1) the causal modeling specified conditions (Selltiz et al., 1976) hold between knowledge selection and response generation (Tuan et al., 2020), and 2) it is common for humans to (involuntarily) produce utterances which are copied or suitably modified from background knowledge sentences (Moghe et al., 2018).

Although we can directly replace the gold label k_t in Equation 5 with the alternative one k_t^0, this label is noisy. Therefore, we modify Equation 5 with label weighting via the confidence score:

L_{KS} = L_{WCE}(S) = L_{CE}(S, k_t^{0}) \cdot W_{k_t^{0}}.    (9)
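Equation 8 itself is not legible in this transcription, so the scoring function below is only an assumed stand-in: it counts response tokens that literally appear in each candidate and turns the counts into a confidence distribution with a temperature softmax, which matches the description ("the gold knowledge should contribute most tokens to the response generation") but is not necessarily the paper's exact metric.

```python
import numpy as np

def oracle_confidence(response_tokens, pool_tokens, tau=0.1):
    """Assumed stand-in for Eq. 8: literal token overlap + temperature softmax.

    response_tokens: list of tokens of the gold response y_t.
    pool_tokens: list of L token lists, one per knowledge candidate k_t^i.
    Returns the confidence distribution W over the pool and the oracle index.
    """
    resp = set(response_tokens)
    overlap = np.array([len(resp & set(k)) for k in pool_tokens], dtype=float)
    scores = overlap / tau                      # temperature reshapes W
    scores -= scores.max()
    W = np.exp(scores) / np.exp(scores).sum()   # confidence distribution W
    return W, int(W.argmax())                   # oracle label k_t^0 = argmax

def weighted_ce(S, oracle_idx, W):
    """Equation 9: cross entropy against the oracle label, scaled by its confidence."""
    return -np.log(S[oracle_idx] + 1e-12) * W[oracle_idx]

W, k0 = oracle_confidence(
    "he is the king of rock and roll".split(),
    ["he is often called the king of rock and roll".split(),
     "bowling is a sport or leisure activity".split()])
S_student = np.array([0.2, 0.8])  # toy student distribution over the 2 candidates
print(k0, W.round(3), weighted_ce(S_student, k0, W))
```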
Knowledge Distillation: We further alleviate the noisy labeling problem of distant supervision via Knowledge Distillation (KD), as shown in Figure 1 (b). Following (Tian et al., 2020; Chen et al., 2020b), the teacher takes the context and the response as input and generates a knowledge selection distribution as the soft target. Compared with the student, i.e., the standard knowledge selection module described in Section 2.4, the teacher has the gold response as an additional input. Specifically, we make up the teacher query as q_t^{tea} = W^{tea}[s_t^{dh}; s_{t-1}^{kh}] ∈ R^d, which contains more information (i.e., the response) than q_t in Equation 3. Then we use this query q_t^{tea} to obtain the teacher's knowledge selection distribution T ∈ R^L by Equation 4. Finally, online response-based knowledge distillation³ is formulated as:

L_{KD} = L_{WCE}(T) + D_{KL}(T \,\|\, S) = L_{CE}(T, k_t^{0}) \cdot W_{k_t^{0}} + \sum_{k_t^{i} \in K_t} T_{k_t^{i}} \log \frac{T_{k_t^{i}}}{S_{k_t^{i}}},    (10)

where the first term is used for teacher training and the second term transfers the knowledge from teacher to student based on the Kullback-Leibler (KL) divergence between the teacher and student distributions (i.e., T ∈ R^L and S ∈ R^L).

³ As surveyed in (Gou et al., 2020), response-based knowledge usually refers to the neural response of the last output layer of the teacher model, and online distillation means we train the teacher and student together.

Although the teacher is also trained with the noisy label, the teacher produces an independent source of variance that can be used to cancel out the variance introduced by the label noise (Li et al., 2017). Moreover, previous works have proved that the student can still be enhanced by a noisy teacher (Sau and Balasubramanian, 2016; Xie et al., 2020; Yuan et al., 2020). Therefore, we believe that the student trained by the two supervision signals benefits from the regularization of the soft target (Yuan et al., 2020).

To sum up, our DDSL loss is as follows:

L_{KS} = L_{DDSL} = L_{WCE}(S) + L_{KD},    (11)

which consists of distant supervision, label weighting in Equation 9 and knowledge distillation in Equation 10 for unsupervised knowledge selection.
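As a minimal sketch of how Equations 9–11 combine, given a student distribution S, a teacher distribution T and the distant-supervision confidences W from above; the small epsilon inside the logarithms is our numerical-stability addition, not part of the paper's formulation.

```python
import numpy as np

EPS = 1e-12

def kd_loss(T, S, oracle_idx, W):
    """Equation 10: teacher training term plus KL(T || S) transferred to the student."""
    teacher_ce = -np.log(T[oracle_idx] + EPS) * W[oracle_idx]   # L_WCE(T)
    kl = np.sum(T * (np.log(T + EPS) - np.log(S + EPS)))        # D_KL(T || S)
    return teacher_ce + kl

def ddsl(S, T, oracle_idx, W):
    """Equation 11: L_DDSL = L_WCE(S) + L_KD."""
    student_ce = -np.log(S[oracle_idx] + EPS) * W[oracle_idx]   # Eq. 9
    return student_ce + kd_loss(T, S, oracle_idx, W)

S = np.array([0.10, 0.60, 0.30])   # student selection distribution (Eq. 4)
T = np.array([0.05, 0.80, 0.15])   # teacher distribution (response-aware query)
W = np.array([0.05, 0.90, 0.05])   # distant-supervision confidences (Eq. 8)
print(ddsl(S, T, oracle_idx=1, W=W))
```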
2.7 Training

The Mismatch Knowledge Selection Problem: There are chances that the selected knowledge is not the gold one, due to 1) the diversity of knowledge selection in conversation (Kim et al., 2020a) and 2) the under-optimized knowledge selection at the early training stage (Zhao et al., 2020b). This is more serious in the unsupervised setting, where it is hard to train knowledge selection well. The mismatch knowledge selection problem occurs due to the training paradigm in Figure 1, where the decoder is trained to generate the gold response with mismatched knowledge. This mismatch problem causes the knowledge-aware decoder to take the selected knowledge as noise and degenerate into a knowledge-independent decoder.

Our Pretraining-Finetuning Strategy: Although training the decoder with the matched knowledge⁴ solves the mismatch problem, it cannot deal with wrong knowledge selection at inference, which is often the case yet never seen at training. We take this plain idea as our pretraining stage, and then train the decoder to adapt to the selected knowledge using different weighting scores in the finetuning stage.

⁴ Note that the oracle knowledge is most literally matched with the gold response by the metric defined in Equation 8.

Figure 2: To deal with the mismatch problem, we first pretrain the Knowledge Selection (KS) and the decoder with the matched knowledge⁴ in parallel along the red lines. Then we finetune the decoder with the selected knowledge, weighted, along the green line.

In the pretraining stage, we train knowledge selection and response generation in parallel, as Figure 2 shows. The pretraining loss is as follows:

L'_{NLL} = -\log p(y_t \mid x_t, k_t^{0}), \qquad L = L_{KS} + L'_{NLL},    (12)

where we use L_{KS} = L_{DDSL} in the unsupervised setting and the decoder is trained to generate the gold response with the oracle knowledge k_t^0 instead of the selected one k_t^s. Therefore, the decoder learns how to incorporate the matched knowledge⁴ k_t^0 into the response generation, because k_t^0 is much more accurate than the selected one k_t^s, as Table 4 shows. We can also alleviate the mismatch problem through the pretraining process because 1) we get a fully optimized knowledge selection module and 2) the pretrained decoder provides a good initialization for finetuning.

In the finetuning stage, we continue to train the pretrained decoder to adapt to the pretrained knowledge selection module with the sample weighting idea. The finetuning loss is defined as follows:

L_{G} = L_{NLL} \cdot (1 + W_{k_t^{s}}),    (13)

where L_{NLL} is defined in Equation 7 and W_{k_t^s} is the confidence or weighting score of the selected knowledge k_t^s defined in Equation 8. As mentioned above, the mismatch knowledge selection problem is caused by the training paradigm in which we may train the decoder to generate the gold response with mismatched knowledge from the knowledge selection module. Here, we finetune the pretrained decoder with the selected knowledge k_t^s, giving higher importance weights if the selected knowledge is suitable for the gold response generation. In this way, we further alleviate the mismatch problem, because we highlight the matched samples by assigning an importance weight to each instance (x_t, k_t^s, y_t) to reform the training data (Cai et al., 2020; Dong et al., 2020).
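A short sketch of the two training stages under the same assumptions as above: in pretraining the decoder NLL is computed against the oracle knowledge (Eq. 12), and in finetuning the NLL against the selected knowledge is reweighted by 1 + W_{k_t^s} (Eq. 13). The nll arguments stand for whatever per-example negative log-likelihood the decoder produces; this is an illustration of the loss arithmetic, not the released training loop.

```python
def pretrain_loss(selection_loss_ddsl, nll_with_oracle_knowledge):
    """Equation 12: L = L_KS + L'_NLL, decoder conditioned on the oracle knowledge."""
    return selection_loss_ddsl + nll_with_oracle_knowledge

def finetune_loss(nll_with_selected_knowledge, confidence_of_selected):
    """Equation 13: L_G = L_NLL * (1 + W_{k_t^s}).

    Samples whose selected knowledge matches the response well (high confidence)
    are emphasized; mismatched selections fall back toward the plain NLL weight.
    """
    return nll_with_selected_knowledge * (1.0 + confidence_of_selected)

print(finetune_loss(nll_with_selected_knowledge=3.2, confidence_of_selected=0.9))
print(finetune_loss(nll_with_selected_knowledge=3.2, confidence_of_selected=0.05))
```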
3 Experiments

3.1 Dataset

We evaluate our model on two public knowledge-grounded dialogue datasets, Wizard of Wikipedia (WoW) (Dinan et al., 2019) and Holl-E (Moghe et al., 2018), both of which provide knowledge candidates with gold knowledge labels for knowledge selection. To test our approach in the unsupervised setting, we do not use the gold knowledge labels provided in these datasets.

WoW contains dialogues between two participants on open-domain topics, where one is a curious learner while the other plays the role of a knowledgeable expert with access to the knowledge pool. Each knowledge pool contains about 61 sentences on average per turn, which are retrieved from Wikipedia based on the dialogue context via an IR system. There are 18430, 1948 and 1933 dialogues for training, validation and test, respectively. According to topic overlap, the test set is split into two subsets: 965 Test Seen and 968 Test Unseen dialogues, where Test Unseen consists of 58 topics never seen in training or validation.

Holl-E contains 7228, 930 and 913 dialogues for training, validation and test, respectively. Two test versions are provided: one with a single gold reference, the other with multiple gold references (more than one gold knowledge sentence and corresponding response for each given conversation context). Each dialogue is assigned a document of about 60 sentences on average as the knowledge pool. Here, we use the modified version (Kim et al., 2020a) which fits knowledge selection.

3.2 Models for Comparison

We compare our method with a set of baselines:⁵

⁵ See Appendix A.1 for more baselines, from knowledge-aware generation (Huang et al., 2020) to knowledge selection in the supervised, semi-supervised and unsupervised settings.

3.2.1 No Knowledge

S2S_Transformer is a Seq2Seq model based on the Transformer (Vaswani et al., 2017) that does not leverage external knowledge.
S2S_BERT replaces the Transformer encoder with a pretrained BERT (Devlin et al., 2019).

3.2.2 Supervised Knowledge Selection

TMN is short for End-to-End Transformer MemNet (Dinan et al., 2019), which selects knowledge based on the Transformer memory network and generates responses via the Transformer decoder.
TMN_BERT+PostKS+CP, implemented by (Kim et al., 2020a), enhances the encoder with BERT, the knowledge selection module with PostKS (Lian et al., 2019) and the decoder with the copy mechanism (CP).
SKT (Kim et al., 2020a), short for Sequential Knowledge Transformer, uses the posterior distribution by sequential latent modeling and achieves promising results in the supervised setting.

3.2.3 Unsupervised Knowledge Selection

TMN0 is TMN, trained only via the generation loss in Equation 7 without the knowledge loss in Equation 5.
SKT0 is SKT optimized without the knowledge loss.
PostKS (Lian et al., 2019) takes the benefits of latent variable models and leverages the posterior knowledge distribution as a pseudo label for knowledge selection without the knowledge loss. Here, we use the results provided by (Kim et al., 2020a).

3.2.4 Our Unsupervised Methods

We implement our model in the unsupervised setting, namely Unsupervised Knowledge Selection for Dialogue Generation (UKSDG), which is optimized with our DDSL in Equation 11 for unsupervised knowledge selection and the generation loss in Equation 7. UKSDG_PF indicates that we adopt our Pretraining-Finetuning strategy to alleviate the mismatch knowledge selection problem in Section 2.7. Furthermore, we remove several components for ablation study:
(1) UKSDG w/o DDSL is only optimized by the generation loss in Equation 7 without our DDSL.
(2) UKSDG_vec w/o DDSL further replaces the decoder input H_rc (in Section 2.5) with the averaged-knowledge-vector-enhanced context H_vec = H_{x_t} + Σ_{k_t^i ∈ K_t} S_{k_t^i} · h_{k_t^i}.
(3) UKSDG_PF w/o SW does not use the Sample Weighting in Equation 13.

Table 1: Quantitative results on WoW. Our approach manages to select knowledge more accurately in the unsupervised setting and generate more informative responses than the strong baselines in the supervised setting. Note that models with "†" are implemented by ourselves and other models with a citation are from the original paper.

Setting | Row | Method | Acc ↑ (Seen) | PPL ↓ (Seen) | R-1 ↑ (Seen) | R-2 ↑ (Seen) | Acc ↑ (Unseen) | PPL ↓ (Unseen) | R-1 ↑ (Unseen) | R-2 ↑ (Unseen)
No Knowledge | 00 | S2S_Transformer† (Vaswani et al., 2017) | N/A | 57.4 | 16.5 | 4.4 | N/A | 102.0 | 14.6 | 2.9
No Knowledge | 01 | S2S_BERT† | N/A | 53.9 | 17.2 | 4.8 | N/A | 93.3 | 14.9 | 3.3
Supervised | 10 | TMN (Dinan et al., 2019) | 21.1 | 63.5 | 16.9 | N/A | 14.3 | 97.3 | 14.4 | N/A
Supervised | 11 | TMN_BERT+PostKS+CP (Kim et al., 2020a) | 25.5 | 52.2 | 19.0 | 6.5 | 14.4 | 83.4 | 15.6 | 3.9
Supervised | 12 | SKT (Kim et al., 2020a) | 26.8 | 52.0 | 19.3 | 6.8 | 18.3 | 81.4 | 16.1 | 4.2
Unsupervised | 20 | TMN0 (Dinan et al., 2019) | 13.4 | 66.5 | 15.9 | N/A | 11.8 | 103.6 | 14.3 | N/A
Unsupervised | 21 | PostKS (Lian et al., 2019) | 4.8 | 79.1 | 13.0 | 1.0 | 4.2 | 193.8 | 13.1 | 1.0
Unsupervised | 22 | SKT0 (Kim et al., 2020a) | 0.3 | 54.7 | 17.1 | 4.6 | 0.1 | 88.2 | 15.5 | 3.4
Unsupervised | 0 | UKSDG w/o DDSL | 5.2 | 49.3 | 17.4 | 5.1 | 5.1 | 82.8 | 15.1 | 3.3
Unsupervised | 1 | UKSDG_vec w/o DDSL | 13.5 | 46.1 | 17.8 | 5.2 | 13.1 | 82.3 | 15.5 | 3.3
Unsupervised | 2 | UKSDG (ours) | 23.8 | 51.8 | 19.5 | 6.8 | 16.2 | 76.3 | 16.3 | 4.4
Unsupervised | 3 | UKSDG_PF w/o SW (ours) | 26.4 | 44.1 | 20.3 | 7.4 | 20.8 | 64.5 | 17.8 | 5.6
Unsupervised | 4 | UKSDG_PF (ours) | 26.4 | 45.0 | 20.6 | 7.7 | 20.8 | 65.5 | 18.2 | 5.9

3.3 Implementation Details

We use TensorFlow 2.0 to implement our approach based on SKT.⁶ All sentences are encoded by the shared BERT_BASE (Devlin et al., 2019), and the response is greedily generated via a 5-layer Transformer decoder with the copy mechanism. The hidden size d is 768 and the vocabulary size |V| is 30,522. The knowledge selection module contains two separate one-layer GRUs and one projection layer. Our proposed DDSL contains no trainable parameters except one projection layer in the teacher selection module. The temperature τ is 0.1.

⁶ Thanks for their processed datasets, models and evaluation codes at https://github.com/bckim92/sequential-knowledge-transformer.

We use the Adam optimizer (Kingma and Ba, 2014) with gradient clipping at 0.4 to train our models on a single GPU (TITAN Xp). The learning rate is 2e−5 and the batch⁷ size is 1. Moreover, we apply label smoothing (Pereyra et al., 2017) and set it to 0.1 for knowledge selection and 0.05 for response generation. In the pretraining-finetuning strategy, we use 5 and 20 epochs in the pretraining and finetuning stage, respectively. The pretrained models are selected according to the accuracy score, and other models are selected according to the R-1 score, since knowledge selection aims to serve high-quality generation.

⁷ Each example is a dialogue rather than an individual turn.

It takes almost the same time for UKSDG and KSDG to converge, as we only replace our DDSL with the gold knowledge selection loss. The convergence of SKT is a bit slower, as it is hard to optimize the latent variable model. It takes about 2.5 times as long for UKSDG_PF and KSDG_PF to converge. The computation for confidence calculation during inference is the same for UKSDG, KSDG (w/ and w/o PF) and SKT because we adopt the same backbone. All of our models contain about 174M parameters.
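A hedged TensorFlow 2 sketch of the training configuration in Section 3.3 follows; the exact gradient-clipping variant (per-variable vs. global norm) and the use of Keras loss objects for label smoothing are our assumptions, since the released code may implement these differently.

```python
import tensorflow as tf

# Hyperparameters reported in Section 3.3.
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5,
                                     clipnorm=0.4)  # clipping variant assumed

# Label smoothing: 0.1 for knowledge selection, 0.05 for response generation.
ks_loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)
gen_loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.05)

config = {
    "hidden_size": 768, "vocab_size": 30522, "decoder_layers": 5,
    "temperature_tau": 0.1, "batch_size_dialogues": 1,
    "pretrain_epochs": 5, "finetune_epochs": 20,
}
```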
Table 2: Quantitative results on Holl-E. The method with "‡" is reported by rerunning the released code, and models with "†" are implemented by ourselves.

Setting | Row | Method | Acc ↑ (Single Ref.) | PPL ↓ (Single Ref.) | R-1 ↑ (Single Ref.) | R-2 ↑ (Single Ref.) | Acc ↑ (Multi Ref.) | PPL ↓ (Multi Ref.) | R-1 ↑ (Multi Ref.) | R-2 ↑ (Multi Ref.)
No Knowledge | 00 | S2S_Transformer† (Vaswani et al., 2017) | N/A | 105.8 | 19.1 | 9.8 | N/A | 72.4 | 24.1 | 12.9
No Knowledge | 01 | S2S_BERT† | N/A | 89.4 | 19.0 | 9.6 | N/A | 61.9 | 23.5 | 12.1
Supervised | 10 | TMN (Dinan et al., 2019) | 22.7 | 140.6 | 20.1 | 10.3 | 32.3 | 83.6 | 24.3 | 12.8
Supervised | 11 | TMN_BERT+PostKS+CP (Kim et al., 2020a) | 27.8 | 47.4 | 29.2 | 22.3 | 37.8 | 27.9 | 35.9 | 29.0
Supervised | 12 | SKT (Kim et al., 2020a) | 29.2 | 48.9 | 29.8 | 23.1 | 39.2 | 28.5 | 36.5 | 29.7
Unsupervised | 21 | PostKS (Lian et al., 2019) | 1.5 | 196.6 | 15.2 | 6.0 | 3.2 | 114.1 | 19.2 | 7.9
Unsupervised | 22 | SKT0‡ (Kim et al., 2020a) | 0.1 | 77.1 | 19.2 | 9.9 | 0.1 | 75.8 | 19.4 | 10.1
Unsupervised | 0 | UKSDG w/o DDSL | 2.9 | 76.5 | 20.0 | 10.5 | 4.3 | 52.0 | 25.0 | 13.6
Unsupervised | 1 | UKSDG_vec w/o DDSL | 19.1 | 78.1 | 20.9 | 10.9 | 28.8 | 53.2 | 25.4 | 13.3
Unsupervised | 2 | UKSDG (ours) | 24.2 | 55.4 | 29.7 | 23.1 | 33.5 | 31.9 | 36.4 | 29.7
Unsupervised | 3 | UKSDG_PF w/o SW (ours) | 25.1 | 38.7 | 30.8 | 23.2 | 35.0 | 22.7 | 37.9 | 30.0
Unsupervised | 4 | UKSDG_PF (ours) | 25.1 | 39.4 | 31.0 | 24.0 | 35.0 | 22.9 | 38.1 | 31.1

3.4 Evaluation

Automatic Evaluation. We automatically evaluate knowledge selection with accuracy (Acc) and response generation with perplexity (PPL), unigram F1 (R-1) and bigram F1 (R-2), which are commonly used in this task (Dinan et al., 2019; Kim et al., 2020a; Chen et al., 2020b). We also remove all punctuation and the articles (a, an, the) when computing the R-1 and R-2 scores, as (Kim et al., 2020a) do. Note that lower PPL and higher R-1 and R-2 scores indicate better generation quality.
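For illustration, a minimal sketch of the unigram F1 (R-1) computation as described, with punctuation and the articles a/an/the removed before scoring; the exact tokenization and any further normalization in the official evaluation scripts are assumptions here.

```python
import re
from collections import Counter

ARTICLES = {"a", "an", "the"}

def normalize(text):
    """Lowercase, strip punctuation and articles before scoring (Sec. 3.4)."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return [t for t in tokens if t not in ARTICLES]

def unigram_f1(hypothesis, reference):
    """R-1: harmonic mean of unigram precision and recall with clipped counts."""
    hyp, ref = normalize(hypothesis), normalize(reference)
    if not hyp or not ref:
        return 0.0
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("Yes, I love him! He is the king of rock and roll.",
                 "You mean the king of rock and roll. Actually yes I am."))
```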
Human Evaluation. We first select 100 samples from each test set on WoW for human evaluation. Then we follow (Kim et al., 2020a; Li et al., 2019a) and ask three annotators to evaluate the generation quality according to Engagingness and Knowledgeability on a scale from 1 to 4, where 1 means not at all, 2 is a little, 3 is somewhat, and 4 is a lot. Engagingness measures how much you like the response, and Knowledgeability measures the informativeness of the responses.

4 Results and Analysis

4.1 Main Results

Quantitative Results. We report the automatic results on WoW and Holl-E in Table 1 and Table 2, respectively, and we have the following consistent observations:⁸ (1) Comparing SKT and SKT0, we see that the gold knowledge loss plays an important role in training knowledge selection well. (2) Comparing rows 0 and 2, we see that our proposed DDSL is the key to training knowledge selection well in the unsupervised setting and can be an alternative to the gold knowledge loss. As a result, our approach significantly outperforms the other unsupervised methods on all metrics (significance tests (Koehn, 2004), p-value < 0.01). (3) Although UKSDG_vec w/o DDSL could learn some patterns of knowledge selection from the gradient of the generation loss, we see that compressing the knowledge into a vector loses much information for dialogue generation. (4) Although our UKSDG_PF usually makes knowledge selection worse than the supervised SKT in row 12, we achieve higher generation quality, which indicates that our pretraining-finetuning strategy could alleviate the mismatch knowledge selection problem and emphasizes the importance of leveraging the selected knowledge properly for future study on this task. (5) Comparing rows 3 and 4, we see that the sample weighting also helps in the finetuning stage, though the PPL score is slightly worse due to the difficulty of injecting the knowledge into responses. (6) Moreover, our approach demonstrates a stronger ability to generalize, with a smaller performance gap between Test Seen and Test Unseen in Table 1, which we attribute to our DDSL because it works in the unsupervised setting and is thus more suitable for Test Unseen. For example, compared with SKT in row 12, UKSDG_PF in row 4 achieves the highest selection accuracy on Test Unseen with only a slightly lower accuracy on Test Seen.

⁸ More observations with other baselines can be found in Appendix A.2.
Table 3: Results of human evaluations on WoW. EGG and KLD denote Engagingness and Knowledgeability, respectively.

Method | EGG (Test Seen) | KLD (Test Seen) | EGG (Test Unseen) | KLD (Test Unseen)
SKT | 2.70±0.05 | 2.64±0.05 | 2.46±0.05 | 2.50±0.06
UKSDG (ours) | 2.49±0.05 | 2.37±0.06 | 2.33±0.06 | 2.27±0.05
UKSDG_PF (ours) | 2.72±0.05 | 2.71±0.05 | 2.49±0.06 | 2.60±0.05
Gold Response | 3.25±0.04 | 3.37±0.05 | 3.16±0.05 | 3.24±0.05

Table 4: Ablation study. Knowledge selection accuracy of UKSDG_PF trained with different losses.

Method | WoW Seen | WoW Unseen | Holl-E Single | Holl-E Multi
Gold label (k_t) | 100.0 | 100.0 | 100.0 | 100.0
Oracle label (k_t^0) | 70.1 | 69.0 | 90.3 | 90.4
L_DDSL in Eq. 11 | 26.4 | 20.8 | 25.1 | 35.0
w/o L_KD in Eq. 10 | 25.8 | 19.1 | 24.1 | 33.5
w/o L_KD & W_{k_t^0} in Eq. 9 | 25.5 | 18.8 | 24.0 | 34.2

Qualitative Results. We report the human evaluation results of the generated responses according to Engagingness and Knowledgeability in Table 3. Comparing UKSDG and UKSDG_PF, we see the effectiveness of our pretraining-finetuning strategy, which could alleviate the mismatch problem described in Section 2.7. Our UKSDG_PF in the unsupervised setting generates responses slightly better than SKT in the supervised setting. Moreover, the improvement is more obvious according to Knowledgeability, which also indicates the importance of using the selected knowledge properly.

4.2 Ablation Study

We have introduced the components of our DDSL in Section 2.6 and shown its effectiveness in Section 4.1. Here, we analyse our DDSL via an ablation study in Table 4. Specifically, we train UKSDG_PF on the knowledge selection task, using our DDSL with components removed. We have the following observations: (1) We see that most of the oracle knowledge from distant supervision is the same as the gold knowledge labeled by humans. Therefore, it is acceptable to directly use this oracle knowledge label to substitute the gold label as the supervision signal (see the last row of Table 4). (2) However, distant supervision inevitably suffers from the noisy labeling problem due to literal matching. For example, about 30% of the oracle knowledge labels differ from the gold ones on WoW, where the responses convey knowledge much more flexibly. (3) Loss weighting in Equation 9 helps on WoW, where the noisy labeling problem is serious. (4) Knowledge distillation in Equation 10 could further alleviate the label noise, and our method manages to select knowledge accurately in the unsupervised setting.

4.3 Case Study

Table 5 presents several examples of generated responses on WoW, and we have the following observations: (1) The oracle knowledge from distant supervision contains several informative tokens which could help dialogue generation, since they appear in the gold response as defined by Equation 8. In particular, the oracle knowledge in case 3 is also appropriate, and the gold response contains much more information than the gold knowledge, which indicates the diversity of knowledge selection and that selecting one sentence may not be enough for informative generation. (2) However, some oracle knowledge labels are different from the gold ones (i.e., the ones selected by humans). Our model still learns to select knowledge as humans do, which we attribute to our DDSL, since knowledge distillation alleviates the noisy labeling problem of distant supervision. (3) SKT does not leverage the selected knowledge well and generates dull responses. For example, SKT repeats the context in case 1, generates a verbose and contradictory response in case 2 and does not provide new information in case 3. (4) In contrast, our UKSDG_PF first selects the appropriate knowledge as humans do, and then generates fluent and informative responses by alleviating the mismatch knowledge selection problem with the help of the pretraining-finetuning strategy. This indicates the importance of leveraging the selected knowledge properly for future study.
Table 5: Examples of generated responses on WoW. KS, DG and OK denote Knowledge Selection, Dialogue Generation and Oracle Knowledge, respectively.

Case 1
Context: Are you a fan of elvis presley?
Human KS: Regarded as one of the most significant cultural icons of the 20th century, he is often referred to as the "king of rock and ..."
Human DG: You mean the king of rock and roll. Actually yes I am. Best of all time. Don't you agree?
SKT KS: Elvis Aaron Presley (January 8, 1935 - August 16, 1977) was an American singer, musician, and actor.
SKT DG: Yes, I love Elvis Presley!
OK: Thornton's recording of "hound dog" is listed as one of the rock and roll hall of fame's ...
UKSDG_PF KS: Regarded as one of the most significant cultural icons of the 20th century, he is often referred to as the "king of rock and ..."
UKSDG_PF DG: Yes, I love him! He is regarded as the king of rock and roll.

Case 2
Context: I love to go bowling with my family, but i'm a horrible bowler. Do you like it?
Human KS: Bowling is a sport or leisure activity in which a player rolls or throws a bowling ball towards a target.
Human DG: Oh, yes, I love bowling. Rolling that ball down the lane and knocking down the pins gives me a charge.
SKT KS: It is one of the major forms of throwing sports.
SKT DG: I do! I love bowling! I like to play, but i'm not a fan of bowling.
OK: In pin bowling variations, the target is usually to knock over pins at the end of a lane.
UKSDG_PF KS: Bowling is a sport or leisure activity in which a player rolls or throws a bowling ball towards a target.
UKSDG_PF DG: I love bowling! It is a sport or leisure activity in which a player rolls a bowling ball towards a target.

Case 3
Context: Have you ever seen the Walking Dead?
Human KS: The Walking Dead is an American post-apocalyptic horror television series developed by Frank Darabont for AMC that ...
Human DG: The American post-apocalyptic series? The one based on the comic book series by Robert Kirkman? Yes I love it! How about you?
SKT KS: The Walking Dead is an American post-apocalyptic horror television series developed by Frank Darabont for AMC that ...
SKT DG: Yes, I have seen the Walking Dead.
OK: It is based on the comic book series "The Walking Dead" by Robert Kirkman, ...
UKSDG_PF KS: The Walking Dead is an American post-apocalyptic horror television series developed by Frank Darabont for AMC that ...
UKSDG_PF DG: Yes, I have seen the Walking Dead. It was developed by Frank Darabont for AMC.

5 Related Work

External knowledge has been widely explored to enhance dialogue understanding and/or improve dialogue generation (Zhu et al., 2017; Liu et al., 2018; Chen et al., 2019; Wu et al., 2019; Chen et al., 2020a; Tuan et al., 2020; Sun et al., 2020; Zhang et al., 2020; Yu et al., 2020; Ni et al., 2021). To make use of knowledge, knowledge access is very important. Therefore, knowledge selection, which selects appropriate knowledge given the dialogue context, gains much attention (Zhang et al., 2019; Meng et al., 2019; Ren et al., 2019; Dinan et al., 2019; Kim et al., 2020a; Meng et al., 2020; Chen et al., 2020b; Zheng et al., 2020; Huang et al., 2020). In this paper, we focus on knowledge selection in the unsupervised setting where there is no gold knowledge label. Lian et al. (2019) and Kim et al. (2020a) attempt to deal with this problem via latent models, yet their performance is less than satisfactory. Differently, we design our DDSL to make knowledge selection work well in the unsupervised setting. There is a very recent work (Zhao et al., 2020b) that finetunes GPT-2 (Radford et al., 2019) with an unsupervised pretrained knowledge selection module on unlabeled dialogues. We differ in two aspects: (1) our DDSL leverages knowledge distillation to alleviate the label noise at the pretraining stage; (2) we adopt the sample weighting idea at the finetuning stage. We will leverage GPT-2 in future work.

Our work is inspired by Distant Supervision (DS), an effective method to generate labeled data with an external knowledge base (KB) for information extraction (Mintz et al., 2009; Min et al., 2013; Zeng et al., 2015; Wang et al., 2018). Following this idea, Gopalakrishnan et al. (2019) use the oracle knowledge from DS to construct the Topical-Chat dataset. Similarly, Qin et al. (2019b) obtain weakly labeled data to train a KB retriever in the task-oriented dialogue system. Ren et al. (2019) propose a distantly supervised learning schema at the segment level to effectively learn the topic transition vector. Although inspired by a similar idea, we are devoted to knowledge selection in the unsupervised setting, which is a different application of DS. Moreover, rather than just using distant supervision, we design our DDSL with label weighting and knowledge distillation to deal with the noisy labeling problem of DS.

6 Conclusion

We study unsupervised knowledge selection for dialogue generation, where the gold knowledge label is not available. We design the Distilled Distant Supervision Loss, a novel and effective solution to train knowledge selection well in the unsupervised setting. Furthermore, we propose the pretraining-finetuning strategy to deal with the mismatch knowledge selection problem, i.e., models tend to select mismatched knowledge for dialogue generation in the unsupervised setting, which causes the degeneration of the knowledge-aware decoder. Experiments show that our approach manages to select knowledge more accurately in the unsupervised setting and generates more informative responses than many strong supervised baselines.
References Jiahua Dong, Yang Cong, Gan Sun, Bineng Zhong, and Xiaowei Xu. 2020. What can be transferred: Unsu- Hengyi Cai, Hongshen Chen, Yonghao Song, Cheng pervised domain adaptation for endoscopic lesions Zhang, Xiaofang Zhao, and Dawei Yin. 2020. Data segmentation. In IEEE/CVF Conference on Com- Manipulation: Towards Effective Instance Learn- puter Vision and Pattern Recognition (CVPR), pages ing for Neural Dialogue Generation via Learning 4022–4031. to Augment and Reweight. In Proceedings of the 58th Annual Meeting of the Association for Compu- Marjan Ghazvininejad, Chris Brockett, Ming-Wei tational Linguistics, pages 6334–6343, Online. As- Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and sociation for Computational Linguistics. Michel Galley. 2018. A knowledge-grounded neural conversation model. In Thirty-Second AAAI Confer- Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, ence on Artificial Intelligence. Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Karthik Gopalakrishnan, Behnam Hedayatnia, Qin- et al. 2018. Universal Sentence Encoder. arXiv lang Chen, Anna Gottardi, Sanjeev Kwatra, Anu preprint arXiv:1803.11175. Venkatesh, Raefer Gabriel, and Dilek Hakkani- Tür. 2019. Topical-Chat: Towards Knowledge- Feilong Chen, Fandong Meng, Jiaming Xu, Peng Li, Grounded Open-Domain Conversations. Proc. In- Bo Xu, and Jie Zhou. 2020a. DMRM: A dual- terspeech 2019, pages 1891–1895. channel multi-hop reasoning model for visual dialog. In The Thirty-Fourth AAAI Conference on Artificial Jianping Gou, Baosheng Yu, Stephen John Maybank, Intelligence, AAAI 2020, The Thirty-Second Inno- and Dacheng Tao. 2020. Knowledge Distillation: A vative Applications of Artificial Intelligence Confer- Survey. CoRR, abs/2006.05525. ence, IAAI 2020, The Tenth AAAI Symposium on Ed- ucational Advances in Artificial Intelligence, EAAI Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. 2020, New York, NY, USA, February 7-12, 2020, Li. 2016. Incorporating Copying Mechanism in pages 7504–7511. AAAI Press. Sequence-to-Sequence Learning. In Proceedings Xiuyi Chen, Fandong Meng, Peng Li, Feilong Chen, of the 54th Annual Meeting of the Association for Bo Xu, Shuang Xu, and Jie Zhou. 2020b. Bridging Computational Linguistics (Volume 1: Long Papers), the Gap between Prior and Posterior Knowledge Se- pages 1631–1640, Berlin, Germany. Association for lection for Knowledge-Grounded Dialogue Genera- Computational Linguistics. tion. In Proceedings of the 2020 Conference on Em- pirical Methods in Natural Language Processing. Minlie Huang, Xiaoyan Zhu, and Jianfeng Gao. 2020. Challenges in Building Intelligent Open-Domain Di- Xiuyi Chen, Jiaming Xu, and Bo Xu. 2019. A Work- alog Systems. ACM Trans. Inf. Syst., 38(3). ing Memory Model for Task-oriented Dialog Re- sponse Generation. In Proceedings of the 57th An- Byeongchang Kim, Jaewoo Ahn, and Gunhee Kim. nual Meeting of the Association for Computational 2020a. Sequential Latent Knowledge Selection for Linguistics, pages 2687–2693, Florence, Italy. Asso- Knowledge-Grounded Dialogue. arXiv: Computa- ciation for Computational Linguistics. tion and Language. Kyunghyun Cho, Bart van Merriënboer, Caglar Gul- Seokhwan Kim, Mihail Eric, Karthik Gopalakrishnan, cehre, Dzmitry Bahdanau, Fethi Bougares, Hol- Behnam Hedayatnia, Yang Liu, and Dilek Hakkani- ger Schwenk, and Yoshua Bengio. 2014. Learn- Tur. 2020b. 
Beyond Domain APIs: Task-oriented ing Phrase Representations using RNN Encoder– Conversational Modeling with Unstructured Knowl- Decoder for Statistical Machine Translation. In edge Access. arXiv preprint arXiv:2006.03533. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Diederik P Kingma and Jimmy Ba. 2014. Adam: A pages 1724–1734. Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Philipp Koehn. 2004. Statistical significance tests for Deep Bidirectional Transformers for Language Un- machine translation evaluation. In Proceedings of derstanding. In Proceedings of the 2019 Conference the 2004 conference on empirical methods in natural of the North American Chapter of the Association language processing, pages 388–395. for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, pages 4171–4186. and Bill Dolan. 2016. A Diversity-Promoting Objec- tive Function for Neural Conversation Models. In Emily Dinan, Stephen Roller, Kurt Shuster, Angela Proceedings of the 2016 Conference of the North Fan, Michael Auli, and Jason Weston. 2019. Wizard American Chapter of the Association for Computa- of Wikipedia: Knowledge-Powered Conversational tional Linguistics: Human Language Technologies, Agents. In International Conference on Learning pages 110–119, San Diego, California. Association Representations. for Computational Linguistics. 1239
Linxiao Li, Can Xu, Wei Wu, Yufan Zhao, Xueliang Information Retrieval, SIGIR ’20, pages 1151–1160, Zhao, and Chongyang Tao. 2020. Zero-Resource New York, NY, USA. Association for Computing Knowledge-Grounded Dialogue Generation. CoRR, Machinery. abs/2008.12918. Xiang Gao Bill Dolan Jianfeng Gao Michel Galley, Margaret Li, Jason Weston, and Stephen Roller. 2019a. Chris Brockett. 2018. Grounded Response Gener- ACUTE-EVAL: Improved Dialogue Evaluation with ation Task at DSTC7. Proceedings of the AAAI-19 Optimized Questions and Multi-turn Comparisons. Workshop on Dialog System Technology Challenges. arXiv preprint arXiv:1909.03087. Bonan Min, Ralph Grishman, Li Wan, Chang Wang, Yuncheng Li, Jianchao Yang, Yale Song, Liangliang and David Gondek. 2013. Distant Supervision for Cao, Jiebo Luo, and Li-Jia Li. 2017. Learning from Relation Extraction with an Incomplete Knowledge Noisy Labels with Distillation. 2017 IEEE Interna- Base. In Proceedings of the 2013 Conference of the tional Conference on Computer Vision (ICCV). North American Chapter of the Association for Com- putational Linguistics: Human Language Technolo- Zekang Li, Cheng Niu, Fandong Meng, Yang Feng, gies, pages 777–782, Atlanta, Georgia. Association Qian Li, and Jie Zhou. 2019b. Incremental Trans- for Computational Linguistics. former with Deliberation Decoder for Document Grounded Conversations. In Proceedings of the Mike Mintz, Steven Bills, Rion Snow, and Daniel Ju- 57th Annual Meeting of the Association for Compu- rafsky. 2009. Distant supervision for relation ex- tational Linguistics, pages 12–21. traction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of Rongzhong Lian, Min Xie, Fan Wang, Jinhua Peng, the ACL and the 4th International Joint Conference and Hua Wu. 2019. Learning to Select Knowledge on Natural Language Processing of the AFNLP, for Response Generation in Dialog Systems. In Pro- pages 1003–1011, Suntec, Singapore. Association ceedings of the 28th International Joint Conference for Computational Linguistics. on Artificial Intelligence, pages 5081–5087. AAAI Press. Nikita Moghe, Siddhartha Arora, Suman Banerjee, and Mitesh M Khapra. 2018. Towards Exploiting Back- Xiexiong Lin, Weiyu Jian, Jianshan He, Taifeng Wang, ground Knowledge for Building Conversation Sys- and Wei Chu. 2020. Generating Informative Con- tems. In Proceedings of the 2018 Conference on versational Response using Recurrent Knowledge- Empirical Methods in Natural Language Processing, Interaction and Knowledge-Copy. In Proceedings of pages 2322–2332. the 58th Annual Meeting of the Association for Com- putational Linguistics, pages 41–52, Online. Associ- Jinjie Ni, Tom Young, Vlad Pandelea, Fuzhao Xue, ation for Computational Linguistics. Vinay Adiga, and Erik Cambria. 2021. Recent advances in deep learning-based dialogue systems. Shuman Liu, Hongshen Chen, Zhaochun Ren, Yang CoRR, abs/2105.04387. Feng, Qun Liu, and Dawei Yin. 2018. Knowledge Diffusion for Neural Dialogue Generation. In Pro- Zheng-Yu Niu, Hua Wu, Haifeng Wang, et al. 2019. ceedings of the 56th Annual Meeting of the Associa- Knowledge Aware Conversation Generation with tion for Computational Linguistics (Volume 1: Long Explainable Reasoning over Augmented Graphs. In Papers), pages 1489–1498, Melbourne, Australia. Proceedings of the 2019 Conference on Empirical Association for Computational Linguistics. 
Methods in Natural Language Processing and the 9th International Joint Conference on Natural Lan- Chuan Meng, Pengjie Ren, Zhumin Chen, Christof guage Processing (EMNLP-IJCNLP), pages 1782– Monz, Jun Ma, and Maarten de Rijke. 2019. RefNet: 1792. A Reference-aware Network for Background Based Conversation. arXiv preprint arXiv:1908.06449. Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. 2017. Regular- Chuan Meng, Pengjie Ren, Zhumin Chen, Zhaochun izing Neural Networks by Penalizing Confident Out- Ren, Tengxiao Xi, and Maarten de Rijke. 2021. put Distributions. arXiv preprint arXiv:1701.06548. Initiative-Aware Self-Supervised learning for Knowledge-Grounded Conversations. In Pro- Lianhui Qin, Michel Galley, Chris Brockett, Xiaodong ceedings of the 44rd International ACM SIGIR Liu, Xiang Gao, Bill Dolan, Yejin Choi, and Jian- Conference on Research and Development in feng Gao. 2019a. Conversing by Reading: Content- Information Retrieval, SIGIR ’21. Association for ful Neural Conversation with On-demand Machine Computing Machinery. Reading. In Proceedings of the 57th Annual Meet- ing of the Association for Computational Linguistics, Chuan Meng, Pengjie Ren, Zhumin Chen, Weiwei Sun, pages 5427–5436. Zhaochun Ren, Zhaopeng Tu, and Maarten de Ri- jke. 2020. DukeNet: A Dual Knowledge Interac- Libo Qin, Yijia Liu, Wanxiang Che, Haoyang Wen, tion Network for Knowledge-Grounded Conversa- Yangming Li, and Ting Liu. 2019b. Entity- tion. In Proceedings of the 43rd International ACM Consistent End-to-end Task-Oriented Dialogue Sys- SIGIR Conference on Research and Development in tem with KB Retriever. In Proceedings of the 1240
2019 Conference on Empirical Methods in Natu- Wenquan Wu, Zhen Guo, Xiangyang Zhou, Hua Wu, ral Language Processing and the 9th International Xiyuan Zhang, Rongzhong Lian, and Haifeng Wang. Joint Conference on Natural Language Process- 2019. Proactive Human-Machine Conversation with ing (EMNLP-IJCNLP), pages 133–142, Hong Kong, Explicit Conversation Goal. In Proceedings of the China. Association for Computational Linguistics. 57th Annual Meeting of the Association for Com- putational Linguistics, pages 3794–3804, Florence, Alec Radford, Jeff Wu, Rewon Child, David Luan, Italy. Association for Computational Linguistics. Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation Pengjie Ren, Zhumin Chen, Christof Monz, Jun Ma, Networks: Sequence Generation Beyond One-Pass and Maarten de Rijke. 2019. Thinking Globally, Decoding. In Advances in Neural Information Pro- Acting Locally: Distantly Supervised Global-to- cessing Systems, pages 1784–1794. Local Knowledge Selection for Background Based Conversation. arXiv preprint arXiv:1908.09528. Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. 2020. Self-training with Noisy Student Bharat Bhusan Sau and Vineeth N Balasubrama- improves ImageNet classification. In Proceedings of nian. 2016. Deep Model Compression: Distilling the IEEE/CVF Conference on Computer Vision and Knowledge from Noisy Teachers. arXiv preprint Pattern Recognition, pages 10687–10698. arXiv:1610.09650. Yan Xu, Etsuko Ishii, Zihan Liu, Genta Indra C. Selltiz, L.S. Wrightsman, S.W. Cook, and Society Winata, Dan Su, Andrea Madotto, and Pascale for the Psychological Study of Social Issues. 1976. Fung. 2021. Retrieval-free knowledge-grounded di- Research Methods in Social Relations. Holt, Rine- alogue response generation with adapters. CoRR, hart and Winston. abs/2105.06232. Yajing Sun, Yue Hu, Luxi Xing, Jing Yu, and Yuqiang Kaijia Yang, Liang He, Xin-yu Dai, Shujian Huang, Xie. 2020. History-adaption Knowledge Incorpora- and Jiajun Chen. 2019a. Exploiting Noisy Data in tion Mechanism for Multi-turn Dialogue System. In Distant Supervision Relation Classification. In Pro- AAAI. ceedings of the 2019 Conference of the North Amer- ican Chapter of the Association for Computational Xiangru Tang and Po Hu. 2019. Knowledge-Aware Linguistics: Human Language Technologies, Vol- Self-Attention Networks for Document Grounded ume 1 (Long and Short Papers), pages 3216–3225, Dialogue Generation. In International Conference Minneapolis, Minnesota. Association for Computa- on Knowledge Science, Engineering and Manage- tional Linguistics. ment, pages 400–411. Springer. Liu Yang, Junjie Hu, Minghui Qiu, Chen Qu, Jian- Zhiliang Tian, Wei Bi, Dongkyu Lee, Lanqing Xue, feng Gao, W. Bruce Croft, Xiaodong Liu, Yelong Yiping Song, Xiaojiang Liu, and Nevin L. Zhang. Shen, and Jingjing Liu. 2019b. A Hybrid Retrieval- 2020. Response-Anticipated Memory for On- Generation Neural Conversation Model. In Pro- Demand Knowledge Integration in Response Gen- ceedings of the 28th ACM International Conference eration. In Proceedings of the 58th Annual Meet- on Information and Knowledge Management, CIKM ing of the Association for Computational Linguis- 2019, Beijing, China, November 3-7, 2019, pages tics, pages 650–659, Online. Association for Com- 1341–1350. ACM. putational Linguistics. Semih Yavuz, Abhinav Rastogi, Guan-Lin Chao, Dilek Yi-Lin Tuan, Wei Wei, and William Yang Wang. 
2020. Hakkani-Tür, and Amazon Alexa AI. 2019. DEEP- Unsupervised Injection of Knowledge into Dialogue COPY: Grounded Response Generation with Hierar- Generation via Language Models. arXiv preprint chical Pointer Networks. In 20th Annual Meeting arXiv:2004.14614. of the Special Interest Group on Discourse and Dia- logue, page 122. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Hao-Tong Ye, Kai-Lin Lo, Shang-Yu Su, and Yun- Kaiser, and Illia Polosukhin. 2017. Attention is all Nung Chen. 2020. Knowledge-grounded response you need. In Advances in neural information pro- generation with deep attentional latent-variable cessing systems, pages 5998–6008. model. Computer Speech & Language, page 101069. Guanying Wang, Wen Zhang, Ruoxu Wang, Yalin Zhou, Xi Chen, Wei Zhang, Hai Zhu, and Huajun Wenhao Yu, Chenguang Zhu, Zaitang Li, Zhiting Hu, Chen. 2018. Label-Free Distant Supervision for Qingyun Wang, Heng Ji, and Meng Jiang. 2020. Relation Extraction via Knowledge Graph Embed- A survey of knowledge-enhanced text generation. ding. In Proceedings of the 2018 Conference on CoRR, abs/2010.04389. Empirical Methods in Natural Language Processing, pages 2246–2255, Brussels, Belgium. Association Li Yuan, Francis EH Tay, Guilin Li, Tao Wang, and for Computational Linguistics. Jiashi Feng. 2020. Revisiting Knowledge Distilla- 1241
tion via Label Smoothing Regularization. In Pro- Wenya Zhu, Kaixiang Mo, Yu Zhang, Zhangbin ceedings of the IEEE/CVF Conference on Computer Zhu, Xuezheng Peng, and Qiang Yang. 2017. Vision and Pattern Recognition, pages 3903–3911. Flexible End-to-End Dialogue System for Knowl- edge Grounded Conversation. arXiv preprint Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. arXiv:1709.04264. 2015. Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks. In A Additional Experiments Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages A.1 Models for Comparison 1753–1762, Lisbon, Portugal. Association for Com- A.1.1 Our Unsupervised Methods putational Linguistics. We implement our model in the unsupervised set- Duzhen Zhang, Xiuyi Chen, Shuang Xu, and Bo Xu. ting, namely Unsupervised Knowledge Selection 2020. Knowledge aware emotion recognition in tex- for Dialogue Generation (UKSDG), which is opti- tual conversations via multi-task incremental trans- former. In COLING. mized with our DDSL in Equation 11 for unsuper- vised knowledge selection and generation loss in Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Equation 7. UKSDGPF indicates that we adopt our Szlam, Douwe Kiela, and Jason Weston. 2018. Per- sonalizing Dialogue Agents: I have a dog, do you Pretraining-Finetuning strategy to alleviate the mis- have pets too? In Proceedings of the 56th An- match knowledge selection problem in Section 2.7. nual Meeting of the Association for Computational Furthermore, we remove several components for Linguistics (Volume 1: Long Papers), pages 2204– ablation study: 2213. (1) UKSDG w/o DDSL does not use our DDSL. Yangjun Zhang, Pengjie Ren, and Maarten de Rijke. (2) UKSDGvec w/o DDSL further replaces the de- 2019. Improving background based conversation coder input Hrc (in Section 2.5) with the aver- with context-aware knowledge pre-selection. arXiv aged knowledge vector enhanced context Hvec = preprint arXiv:1906.06685. P Hxt + ki ∈Kt Skti ·hkti . t Xueliang Zhao, Wei Wu, Chongyang Tao, Can Xu, (3) UKSDGPF w/o SW does not use the Sample Dongyan Zhao, and Rui Yan. 2020a. Low-Resource Weighting in Equation 13. Knowledge-Grounded Dialogue Generation. In International Conference on Learning Representa- A.1.2 Our Supervised Methods tions. Here, we also report the Knowledge Selection for Xueliang Zhao, Wei Wu, Can Xu, Chongyang Dialogue Generation (KSDG) in the supervised Tao, Dongyan Zhao, and Rui Yan. 2020b. setting as described in Section 2.3, 2.4 and 2.5. And Knowledge-Grounded Dialogue Generation with Pre-trained Language Models. arXiv preprint KSDGPF alleviates the mismatch problem using arXiv:2010.08824. the pretrain-finetuning strategy in Section 2.7. We use the gold knowledge loss in Equation 5 to train Chujie Zheng, Yunbo Cao, Daxin Jiang, and Minlie Huang. 2020. Difference-aware Knowledge Selec- KSDG and KSDGPF in the supervised setting while tion for Knowledge-grounded Conversation Genera- we train UKSDG and UKSDGPF with our DDSL tion. CoRR, abs/2009.09378. when the gold knowledge is not available. We compare our method with a set of baselines: Wen Zheng and Ke Zhou. 2019. Enhancing Conver- sational Dialogue Models with Grounded Knowl- A.1.3 No Knowledge edge. In Proceedings of the 28th ACM International Conference on Information and Knowledge Manage- S2STransformer is a Seq2Seq model based on Trans- ment, pages 709–718. former (Vaswani et al., 2017) that does not leverage the external knowledge. 
Hao Zhou, Chujie Zheng, Kaili Huang, Minlie Huang, and Xiaoyan Zhu. 2020. KdConv: A Chinese S2SBERT replaces the Transformer Encoder with a Multi-domain Dialogue Dataset Towards Multi-turn pre-trained BERT (Devlin et al., 2019). Knowledge-driven Conversation. In Proceedings KnowExpert (Xu et al., 2021) avoids the knowl- of the 58th Annual Meeting of the Association for edge retrieval process and attempts to inject prior Computational Linguistics, pages 7098–7108, On- line. Association for Computational Linguistics. knowledge into the pre-trained language models for knowledge-grounded dialogue generation task. Kangyan Zhou, Shrimai Prabhumoye, and Alan W Essentially, KnowExpert stores knowledge in its Black. 2018. A Dataset for Document Grounded Conversations. In Proceedings of the 2018 Con- parameters with lightweight adapters. Therefore, ference on Empirical Methods in Natural Language KnowExpert does not use knowledge explicitly at Processing, pages 708–713. inference. 1242