Unsupervised Knowledge Selection for Dialogue Generation


                Xiuyi Chen, Feilong Chen, Fandong Meng, Peng Li, Jie Zhou
               Pattern Recognition Center, WeChat AI, Tencent Inc, Beijing, China
                       {hugheren.chan,ivess.chan}@gmail.com
              {fandongmeng,patrickpli,withtomzhou}@tencent.com

Abstract

Knowledge selection is an important and challenging task which could provide the appropriate knowledge for informative dialogue generation. However, the needed gold knowledge label is difficult to collect in reality. In this paper, we study knowledge selection for dialogue generation in the unsupervised scenario and propose a novel Distilled Distant Supervision Loss (DDSL) to supervise knowledge selection when the gold knowledge label is unknown. Specifically, we first obtain an oracle knowledge label via distant supervision and then leverage knowledge distillation to alleviate the noisy labeling problem of distant supervision. Furthermore, we propose a pretraining-finetuning strategy to deal with the mismatch knowledge selection problem: in the unsupervised setting, models tend to select mismatched knowledge for dialogue generation, which causes the degeneration of the knowledge-aware decoder. Experiments on two knowledge-grounded dialogue datasets show that our approach manages to select knowledge more accurately in the unsupervised setting and generates more informative responses, even outperforming many strong supervised baselines. [1]

[1] The codes and model checkpoints will be available at https://github.com/ErenChan/UKSDG

1 Introduction

To avoid general and dull dialogue generation (Li et al., 2016), knowledge-grounded dialogue, which equips dialogue systems with external knowledge, has become a popular research topic. Thanks to the hand-collected knowledge-grounded dialogue datasets which align each dialogue (even each utterance) with a pre-identified document (Zhang et al., 2018; Zhou et al., 2018; Moghe et al., 2018; Zhou et al., 2020), many researchers focus on injecting the given knowledge to generate informative responses and achieve promising results (Yavuz et al., 2019; Tang and Hu, 2019; Qin et al., 2019a; Li et al., 2019b; Zheng and Zhou, 2019; Meng et al., 2019; Ren et al., 2019; Ye et al., 2020; Lin et al., 2020). However, they usually need the pre-identified knowledge, and the knowledge access task, which is the precursor to knowledge-grounded dialogue generation in reality (Zheng et al., 2020), is less studied (Kim et al., 2020b).

It is natural to extract the external knowledge via information retrieval technology. Several works regard the retrieved knowledge sentences as the pre-identified document (Ghazvininejad et al., 2018; Michel Galley, 2018; Yang et al., 2019b; Gopalakrishnan et al., 2019). However, the retrieved document contains redundant and irrelevant information which is harmful to dialogue generation (Zhao et al., 2020b). Hence, knowledge selection, which chooses an appropriate sentence from the pre-retrieved knowledge pool, gains much attention, and it plays the role of knowledge access for generation. Dinan et al. (2019) first propose knowledge selection for dialogue generation as two sequential subtasks, where the generation is based on the selected knowledge. Several works follow their setting and achieve improvements with latent variable models (Kim et al., 2020a; Chen et al., 2020b) or more complex selection mechanisms (Niu et al., 2019; Meng et al., 2020; Zheng et al., 2020; Meng et al., 2021). Although those works show promising performance with explicit use of knowledge in open-domain dialogue, they still need gold knowledge labels to train the selection module well (Kim et al., 2020a). Making knowledge selection work well without gold knowledge labels, which is valuable and challenging, is still less studied.

In this paper, we explore knowledge selection for dialogue generation in the unsupervised scenario and propose a novel Distilled Distant Supervision Loss (DDSL) to supervise knowledge selection when there is no gold knowledge label.
Specifically, we first obtain an oracle knowledge label via distant supervision (Mintz et al., 2009) to substitute the gold one in the unsupervised setting. However, distant supervision inevitably suffers from the noisy labeling problem due to literal matching (Yang et al., 2019a). Therefore, to train knowledge selection well without the gold label, we leverage knowledge distillation to reduce the noise of the oracle label. Furthermore, we find that models tend to select mismatched knowledge for dialogue generation in the unsupervised setting, and forcing the knowledge-aware decoder to leverage the selected knowledge at training will cause the decoder to degenerate into a knowledge-independent decoder. To deal with this problem, we propose to pretrain knowledge selection and response generation independently and then finetune the decoder with the selected knowledge using different sample weighting scores. We demonstrate the effectiveness of our unsupervised approach on two knowledge-grounded dialogue datasets, i.e., Wizard of Wikipedia (Dinan et al., 2019) and Holl-E (Moghe et al., 2018), in comparison with various supervised and unsupervised baselines.

Our contributions are summarized as follows:

• We propose the Distilled Distant Supervision Loss to make knowledge selection work well in the unsupervised scenario where the gold knowledge label is not available.

• We propose a pretraining-finetuning strategy to alleviate the degeneration of the knowledge-aware decoder caused by the mismatch knowledge selection problem.

• Results on two datasets show that our approach manages to select knowledge more accurately in the unsupervised setting and even generates more informative responses than many strong supervised baselines.

2 Approach

2.1 Task Formulation

Given the utterance x_t at each turn t and the associated knowledge pool K_t = {k_t^i} = {k_t^1, ..., k_t^L} [2] containing L retrieved candidate sentences k_t^i, the final goal is to generate an informative response y_t. Following (Dinan et al., 2019), we first learn to select the appropriate knowledge k_t^s from the knowledge pool K_t and then generate the response y_t by incorporating the selected knowledge. In the conventional supervised setting, there exists a gold knowledge label to supervise knowledge selection. However, the manually labeled knowledge is difficult to obtain in reality (Lian et al., 2019). As a result, we study unsupervised knowledge selection for dialogue generation in this paper, which is very valuable and challenging.

[2] k_t^i, k_t^s, k_t and k_t^0 indicate any, selected, gold and oracle knowledge sentences or labels, respectively.

2.2 Architecture Overview

In the following subsections, we first introduce the three major components (Sections 2.3 to 2.5): Encoder, Knowledge Selection (KS) and Decoder, which are trained jointly with the gold knowledge loss and the response generation loss in the conventional supervised setting, as Figure 1 (a) shows. Then we introduce our Distilled Distant Supervision Loss (DDSL) in Section 2.6 to train knowledge selection well in the unsupervised setting, which consists of distant supervision and knowledge distillation, as Figure 1 (b) shows. Finally, we detail the mismatch knowledge selection problem and how to make the decoder leverage the selected knowledge k_t^s well in Section 2.7.

2.3 Encoder

For each sentence s_t, we obtain the context-aware word representations H_{s_t} via BERT (Devlin et al., 2019) and the corresponding sentence representation h_{s_t} via Mean Pooling (Cer et al., 2018):

    H_{s_t} = BERT(s_t) ∈ R^{N_{s_t} × d},   h_{s_t} = Mean(H_{s_t}) ∈ R^d    (1)

where N_{s_t} is the sentence length and d is the hidden size. Specifically, we represent the utterance x_t with H_{x_t} and h_{x_t}, and represent each knowledge sentence k_t^i ∈ K_t with H_{k_t^i} and h_{k_t^i}.
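
Equation 1 amounts to running BERT over a sentence and averaging its token states. The minimal sketch below illustrates only the pooling step: the token states stand in for BERT's output (here they are random tensors), and the padding-mask handling is our assumption rather than a detail given in the paper.

    import tensorflow as tf

    def mean_pool(token_states: tf.Tensor, mask: tf.Tensor) -> tf.Tensor:
        """Equation 1: h_{s_t} = Mean(H_{s_t}), ignoring padded positions.

        token_states: [N_st, d] contextual word representations H_{s_t}.
        mask:         [N_st]    1.0 for real tokens, 0.0 for padding (assumed).
        """
        mask = tf.cast(mask, token_states.dtype)[:, None]       # [N_st, 1]
        summed = tf.reduce_sum(token_states * mask, axis=0)     # [d]
        count = tf.maximum(tf.reduce_sum(mask), 1.0)            # avoid divide-by-zero
        return summed / count                                   # h_{s_t} in R^d

    # Toy usage with random stand-ins for BERT(s_t): 12 tokens, 3 of them padding.
    h_st = mean_pool(tf.random.normal([12, 768]),
                     tf.concat([tf.ones(9), tf.zeros(3)], axis=0))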

Figure 1: The framework of knowledge selection for dialogue generation. (a) In the supervised setting. (b) In the
unsupervised setting. Specifically, we replace the gold knowledge loss with our Distilled Distant Supervision Loss
(DDSL) which contains Distant Supervision (DS) and knowledge distillation (see the dotted red lines).

2.4 Knowledge Selection

In this paper, we mainly focus on knowledge selection in the unsupervised setting and adopt the standard dot-product attention over the knowledge candidates to select knowledge (Dinan et al., 2019).

Selection Query: The selection query consists of the current utterance, the dialogue history dh_t = [x_1, y_1, ..., x_{t-1}, y_{t-1}] and the history of selected knowledge kh_t = [k_1^0, ..., k_{t-1}^0], as they help knowledge selection (Kim et al., 2020a). Formally, we use two GRUs (Cho et al., 2014) to summarize the dialogue and knowledge selection history as the corresponding state vectors s_{dh_t} and s_{kh_t}:

    s_{dh_t} = GRU_{dh}([h_{x_t}; h_{y_t}], s_{dh_{t-1}}) ∈ R^d,   s_{kh_t} = GRU_{kh}(h_{k_t^0}, s_{kh_{t-1}}) ∈ R^d    (2)

where s_{dh_0} and s_{kh_0} are zero vectors, h_{x_t}, h_{y_t} and h_{k_t^0} are the sentence vectors of the utterance x_t, the response y_t and the oracle knowledge k_t^0 (will be described in Equation 8), and [·; ·] denotes concatenation. Then, the selection query q_t is obtained:

    q_t = W_q [s_{kh_{t-1}}; s_{dh_{t-1}}; h_{x_t}] ∈ R^d    (3)

Knowledge Selection: The knowledge selection distribution S ∈ R^L over the knowledge pool K_t is obtained by the dot-product attention:

    S_{k_t^i} = exp(h_{k_t^i} · q_t) / Σ_{k_t^j ∈ K_t} exp(h_{k_t^j} · q_t)    (4)

Finally, the knowledge k_t^s with the highest attention score is selected for further response generation. If the gold knowledge k_t exists, we could train this task via the Cross Entropy (CE) loss:

    L_{KS} = L_{CE}(S, k_t) = -log S_{k_t}    (5)

2.5 Decoder

Following (Dinan et al., 2019; Kim et al., 2020a), our Transformer-based decoder takes the representation concatenation H_{rc} = [H_{x_t}; H_{k_t^s}] ∈ R^{N_{rc} × d} of the current utterance x_t and the selected knowledge k_t^s as input, and uses the copy mechanism (Gu et al., 2016) to generate responses. The process of generating a word can be formulated as follows:

    s_t^n = TD(H_{rc}, y_t^{<n}), ...    (6)

where TD denotes the Transformer decoder, s_t^n is the hidden vector for the n-th word in the response y_t at the t-th turn, p^{cp} is the attention weight of the first head in the additional multi-head self-attention layer for the copy mechanism, which is short for MultiHead_{cp} (Vaswani et al., 2017), V is the vocabulary, and p(V) is the final generation distribution. Finally, we generate the word y_t^n with the highest probability, and we keep generating by feeding y_t^n to the decoder until the “⟨eos⟩” token is generated. We train the generation task with the Negative Log-Likelihood (NLL) loss:

    L_G = L_{NLL} = -log p(y_t | x_t, k_t^s)    (7)

The model is trained with the loss L = L_{KS} + L_G, where L_{KS} is the knowledge selection loss, i.e., Equation 5 in the supervised setting.
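
The supervised objective described above can be made concrete with a small sketch: dot-product attention over the candidate sentence vectors (Equation 4), cross entropy against a gold index (Equation 5), and a generation term added as in L = L_{KS} + L_G. This is a toy illustration with random tensors, not the released implementation; the decoder NLL of Equation 7 is passed in as a precomputed scalar rather than produced by a Transformer decoder.

    import tensorflow as tf

    def knowledge_selection_loss(h_k: tf.Tensor, q: tf.Tensor, gold_idx: int):
        """Equations 4-5: dot-product attention over L candidates, then CE."""
        logits = tf.einsum("ld,d->l", h_k, q)               # h_{k_t^i} . q_t for each candidate
        selection = tf.nn.softmax(logits)                   # S in R^L (Equation 4)
        selected = int(tf.argmax(selection))                # index of k_t^s, fed to the decoder
        l_ks = -tf.math.log(selection[gold_idx] + 1e-12)    # L_CE(S, k_t) (Equation 5)
        return selection, selected, l_ks

    # Toy usage: L = 61 candidates, hidden size d = 768, gold candidate index 3.
    h_k, q = tf.random.normal([61, 768]), tf.random.normal([768])
    _, k_s, l_ks = knowledge_selection_loss(h_k, q, gold_idx=3)
    l_g = tf.constant(2.3)      # stand-in for the decoder NLL of Equation 7
    loss = l_ks + l_g           # L = L_KS + L_G in the supervised setting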

2.6 Distilled Distant Supervision Loss

In this section, we introduce our Distilled Distant Supervision Loss (DDSL) to train knowledge selection well in the unsupervised setting, which consists of distant supervision, label weighting and knowledge distillation. Specifically, we first obtain a noisy oracle knowledge label via distant supervision, and then our DDSL tries to reduce the noise via label weighting and knowledge distillation.

Distant Supervision: Suppose that we have the utterance x_t, the response y_t and the retrieved knowledge pool K_t without knowing the gold knowledge label. We first calculate the confidence score W_{k_t^i} of whether each knowledge k_t^i is matched up with this dialogue flow (Equation 8), where τ is the temperature to reshape the confidence probability distribution W ∈ R^L over the knowledge pool. Then, we obtain the oracle knowledge k_t^0 with the highest confidence score, assuming that the gold knowledge should contribute the most tokens to the response generation. This is possible because 1) the conditions specified for causal modeling (Selltiz et al., 1976) hold between knowledge selection and response generation (Tuan et al., 2020), and 2) it is common for humans to (involuntarily) produce utterances which are copied or suitably modified from background knowledge sentences (Moghe et al., 2018).

Although we can directly replace the gold label k_t in Equation 5 with the alternative one k_t^0, there is some noise. Therefore, we modify Equation 5 with label weighting via the confidence score:

    L_{KS} = L_{WCE}(S) = L_{CE}(S, k_t^0) · W_{k_t^0}    (9)

Knowledge Distillation: We further alleviate the noisy labeling problem of distant supervision via Knowledge Distillation (KD), as shown in Figure 1 (b). Following (Tian et al., 2020; Chen et al., 2020b), the teacher takes the context and the response as input and generates the knowledge selection distribution as a soft target. Compared with the student, i.e., the standard knowledge selection module described in Section 2.4, the teacher has the gold response as an additional input.

Specifically, we make up the teacher query as q_t^{tea} = W_{tea}[s_{dh_t}; s_{kh_{t-1}}] ∈ R^d, which contains more information (i.e., the response) than q_t in Equation 3. Then we use this query q_t^{tea} to obtain the teacher's knowledge selection distribution T ∈ R^L by Equation 4. Finally, online response-based knowledge distillation [3] is formulated as:

    L_{KD} = L_{WCE}(T) + D_{KL}(T ‖ S) = L_{CE}(T, k_t^0) · W_{k_t^0} + Σ_{k_t^i ∈ K_t} T_{k_t^i} log(T_{k_t^i} / S_{k_t^i})    (10)

where the first term is used for teacher training and the second term transfers the knowledge from the teacher to the student based on the Kullback-Leibler (KL) divergence between the teacher and student distributions (i.e., T ∈ R^L and S ∈ R^L).

[3] As surveyed in (Gou et al., 2020), response-based knowledge usually refers to the neural response of the last output layer of the teacher model, and online distillation means we train the teacher and student together.

Although the teacher is also trained with the noisy label, the teacher produces an independent source of variance that can be used to cancel out the variance introduced by the label noise (Li et al., 2017). Moreover, previous works have proved that the student can still be enhanced by a noisy teacher (Sau and Balasubramanian, 2016; Xie et al., 2020; Yuan et al., 2020). Therefore, we believe that the student trained with the two supervision signals benefits from the regularization of the soft target (Yuan et al., 2020).

To sum up, our DDSL loss is as follows:

    L_{KS} = L_{DDSL} = L_{WCE}(S) + L_{KD}    (11)

which consists of distant supervision, label weighting in Equation 9 and knowledge distillation in Equation 10 for unsupervised knowledge selection.
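
A rough sketch of the DDSL pieces follows. Since Equation 8 is only referenced above, the confidence score here uses a simple token-recall overlap between each candidate and the gold response as a stand-in for the paper's literal-matching metric; the temperature softmax, the weighted cross entropy of Equation 9 and the KL-based distillation of Equation 10 follow the text, but the exact scoring function and the teacher architecture are simplified assumptions.

    import tensorflow as tf

    def ds_confidence(candidates, response, tau=0.1):
        """Distant supervision: confidence W over the pool via literal matching.
        The token-recall overlap is only a stand-in for the metric of Equation 8."""
        resp = set(response.lower().split())
        overlap = [len(resp & set(c.lower().split())) / max(len(resp), 1) for c in candidates]
        w = tf.nn.softmax(tf.constant(overlap, tf.float32) / tau)   # temperature-reshaped W
        return w, int(tf.argmax(w))                                 # W and the oracle index k_t^0

    def ddsl(student_logits, teacher_logits, w, oracle_idx):
        """Equations 9-11: weighted CE for student and teacher, plus KL(T || S)."""
        s = tf.nn.softmax(student_logits)                           # student distribution S
        t = tf.nn.softmax(teacher_logits)                           # teacher distribution T (sees the response)
        l_wce_s = -tf.math.log(s[oracle_idx] + 1e-12) * w[oracle_idx]   # Equation 9
        l_wce_t = -tf.math.log(t[oracle_idx] + 1e-12) * w[oracle_idx]   # teacher term of Equation 10
        kl = tf.reduce_sum(t * (tf.math.log(t + 1e-12) - tf.math.log(s + 1e-12)))  # KL(T || S)
        return l_wce_s + l_wce_t + kl                               # L_DDSL (Equation 11)

    # Toy usage over a 3-sentence pool.
    pool = ["the walking dead is a horror tv series", "bowling is a sport", "elvis was a singer"]
    w, k0 = ds_confidence(pool, "yes i have seen the walking dead horror series")
    loss = ddsl(tf.random.normal([3]), tf.random.normal([3]), w, k0)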

Figure 2: To deal with the mismatch problem, we first pretrain the Knowledge Selection (KS) and the decoder with the matched knowledge [4] in parallel along the red lines. Then we finetune the decoder with the selected knowledge in a weighted manner along the green line.

2.7 Training

The Mismatch Knowledge Selection Problem: As we know, there are chances that the selected knowledge is not the gold one, due to 1) the diversity of knowledge selection in conversation (Kim et al., 2020a) and 2) the under-optimized knowledge selection at the early training stage (Zhao et al., 2020b). This is more serious in the unsupervised setting, where it is hard to train knowledge selection well. The mismatch knowledge selection problem occurs due to the training paradigm in Figure 1, where the decoder is trained to generate the gold response with mismatched knowledge. This mismatch problem causes the knowledge-aware decoder to take the selected knowledge as noise and degenerate into a knowledge-independent decoder.

Our Pretraining-Finetuning Strategy: Although training the decoder with the matched knowledge [4] solves the mismatch problem, it cannot deal with wrong knowledge selection at inference, which is often the case, yet never seen at training. We take this plain idea as our pretraining stage, and then train the decoder to adapt to the selected knowledge using different weighting scores in the finetuning stage.

[4] Note that the oracle knowledge is the most literally matched with the gold response by the metric defined in Equation 8.

In the pretraining stage, we train knowledge selection and response generation in parallel, as Figure 2 shows. The pretraining loss is as follows:

    L'_{NLL} = -log p(y_t | x_t, k_t^0),   L = L_{KS} + L'_{NLL}    (12)

where we use L_{KS} = L_{DDSL} in the unsupervised setting and the decoder is trained to generate the gold response with the oracle knowledge k_t^0 instead of the selected one k_t^s. Therefore, the decoder learns how to incorporate the matched knowledge k_t^0 into the response generation, because k_t^0 is much more accurate than the selected one k_t^s, as Table 4 shows. And we could alleviate the mismatch problem from the pretraining process because 1) we get a fully optimized knowledge selection module and 2) the pretrained decoder provides a good initialization for finetuning.

In the finetuning stage, we continually train the pretrained decoder to adapt to the pretrained knowledge selection module with the sample weighting idea. The finetuning loss is defined as follows:

    L_G = L_{NLL} · (1 + W_{k_t^s})    (13)

where L_{NLL} is defined in Equation 7 and W_{k_t^s} is the confidence or weighting score of the selected knowledge k_t^s defined in Equation 8. As mentioned above, the mismatch knowledge selection problem is caused by the training paradigm in which we may train the decoder to generate the gold response with mismatched knowledge from the knowledge selection module. Here, we finetune the pretrained decoder with the selected knowledge k_t^s by giving higher importance weights if the selected knowledge is suitable for the gold response generation. In this way, we further alleviate the mismatch problem because we highlight the matched samples by assigning an importance weight to each instance (x_t, k_t^s, y_t) to reform the training data (Cai et al., 2020; Dong et al., 2020).
Setting          Row  Method                                        Test Seen                        Test Unseen
                                                                    Acc↑   PPL↓    R-1↑   R-2↑      Acc↑   PPL↓    R-1↑   R-2↑
0 No Knowledge   00   S2S_Transformer† (Vaswani et al., 2017)       N/A    57.4    16.5   4.4       N/A    102.0   14.6   2.9
                 01   S2S_BERT†                                     N/A    53.9    17.2   4.8       N/A    93.3    14.9   3.3
1 Supervised     10   TMN (Dinan et al., 2019)                      21.1   63.5    16.9   N/A       14.3   97.3    14.4   N/A
                 11   TMN_BERT+PostKS+CP (Kim et al., 2020a)        25.5   52.2    19.0   6.5       14.4   83.4    15.6   3.9
                 12   SKT (Kim et al., 2020a)                       26.8   52.0    19.3   6.8       18.3   81.4    16.1   4.2
2 Unsupervised   20   TMN0 (Dinan et al., 2019)                     13.4   66.5    15.9   N/A       11.8   103.6   14.3   N/A
                 21   PostKS (Lian et al., 2019)                    4.8    79.1    13.0   1.0       4.2    193.8   13.1   1.0
                 22   SKT0 (Kim et al., 2020a)                      0.3    54.7    17.1   4.6       0.1    88.2    15.5   3.4
                 0    UKSDG w/o DDSL                                5.2    49.3    17.4   5.1       5.1    82.8    15.1   3.3
                 1    UKSDG_vec w/o DDSL                            13.5   46.1    17.8   5.2       13.1   82.3    15.5   3.3
                 2    UKSDG (ours)                                  23.8   51.8    19.5   6.8       16.2   76.3    16.3   4.4
                 3    UKSDG_PF w/o SW (ours)                        26.4   44.1    20.3   7.4       20.8   64.5    17.8   5.6
                 4    UKSDG_PF (ours)                               26.4   45.0    20.6   7.7       20.8   65.5    18.2   5.9

Table 1: Quantitative results on WoW. Our approach manages to select knowledge more accurately in the unsupervised setting and generates more informative responses than the strong baselines in the supervised setting. Note that models with “†” are implemented by ourselves and the other cited models are reported from the original papers.

3 Experiments

3.1 Dataset

We evaluate our model on two public knowledge-grounded dialogue datasets: Wizard of Wikipedia (WoW) (Dinan et al., 2019) and Holl-E (Moghe et al., 2018), both of which provide the knowledge candidates with gold knowledge labels for knowledge selection. To test our approach in the unsupervised setting, we do not use the gold knowledge labels provided in those datasets.

WoW contains dialogues between two participants on some open-domain topics, where one is a curious learner while the other plays the role of a knowledgeable expert with access to the knowledge pool. Each knowledge pool contains about 61 sentences on average per turn, which are retrieved from Wikipedia based on the dialogue context via the IR system. There are 18430, 1948 and 1933 dialogues for training, validation and test, respectively. According to topic overlap, the test set is split into two subsets: 965 Test Seen and 968 Test Unseen dialogues, where Test Unseen consists of 58 topics never seen in training or validation.

Holl-E contains 7228, 930 and 913 dialogues for training, validation and test, respectively. Two test versions are provided: one with a single gold reference, the other with multiple gold references (more than one gold knowledge sentence and corresponding response for each given conversation context). Each dialogue is assigned a document of about 60 sentences on average as the knowledge pool. Here, we use the modified version (Kim et al., 2020a) which fits knowledge selection.

3.2 Models for Comparison

We compare our method with a set of baselines: [5]

[5] See Appendix A.1 for more baselines, from knowledge-aware generation (Huang et al., 2020) to knowledge selection in the supervised, semi-supervised and unsupervised settings.

3.2.1 No Knowledge

S2S_Transformer is a Seq2Seq model based on the Transformer (Vaswani et al., 2017) that does not leverage external knowledge.

S2S_BERT replaces the Transformer encoder with a pretrained BERT (Devlin et al., 2019).

3.2.2 Supervised Knowledge Selection

TMN is short for End-to-End Transformer MemNet (Dinan et al., 2019), which selects knowledge based on the Transformer memory network and generates responses via the Transformer decoder.

TMN_BERT+PostKS+CP, implemented by (Kim et al., 2020a), enhances the encoder with BERT, the knowledge selection module with PostKS (Lian et al., 2019) and the decoder with the copy mechanism (CP).

SKT (Kim et al., 2020a), short for Sequential Knowledge Transformer, uses the posterior distribution via sequential latent modeling and achieves promising results in the supervised setting.

3.2.3 Unsupervised Knowledge Selection

TMN0 is TMN trained only via the generation loss in Equation 7, without the knowledge loss in Equation 5.

SKT0 is SKT optimized without the knowledge loss.

PostKS (Lian et al., 2019) takes the benefits of latent variable models and leverages the posterior knowledge distribution as a pseudo label for knowledge selection without the knowledge loss. Here, we use the results provided by (Kim et al., 2020a).

3.2.4 Our Unsupervised Methods

We implement our model in the unsupervised setting, namely Unsupervised Knowledge Selection for Dialogue Generation (UKSDG), which is optimized with our DDSL in Equation 11 for unsupervised knowledge selection and the generation loss in Equation 7. UKSDG_PF indicates that we adopt our Pretraining-Finetuning strategy to alleviate the mismatch knowledge selection problem in Section 2.7. Furthermore, we remove several components for the ablation study:

(1) UKSDG w/o DDSL is only optimized by the generation loss in Equation 7, without our DDSL.

(2) UKSDG_vec w/o DDSL further replaces the decoder input H_{rc} (in Section 2.5) with the averaged-knowledge-vector-enhanced context H_{vec} = H_{x_t} + Σ_{k_t^i ∈ K_t} S_{k_t^i} · h_{k_t^i}.

(3) UKSDG_PF w/o SW does not use the Sample Weighting in Equation 13.

3.3 Implementation Details

We use TensorFlow 2.0 to implement our approach based on SKT [6]. All sentences are encoded by the shared BERT_BASE (Devlin et al., 2019), and the response is greedily generated via a 5-layer Transformer decoder with the copy mechanism. The hidden size d is 768 and the vocabulary size |V| is 30,522. The knowledge selection module contains two separate one-layer GRUs and one projection layer. Our proposed DDSL contains no trainable parameters except one projection layer in the teacher selection module. The temperature τ is 0.1.

[6] Thanks for their processed datasets, models and evaluation codes at https://github.com/bckim92/sequential-knowledge-transformer.

We use the Adam optimizer (Kingma and Ba, 2014) with gradient clipping at 0.4 to train our models on a single GPU (TITAN Xp). The learning rate is 2e-5 and the batch size is 1 [7]. Moreover, we apply label smoothing (Pereyra et al., 2017) and set it to 0.1 for knowledge selection and 0.05 for response generation. In the pretraining-finetuning strategy, we use 5 and 20 epochs in the pretraining and finetuning stages, respectively. The pretrained models are selected according to the accuracy score and the other models are selected according to the R-1 score, since knowledge selection aims to serve high-quality generation.

[7] Each example is a dialogue rather than an individual turn.

It takes almost the same time for UKSDG and KSDG to converge, as we only replace our DDSL with the gold knowledge selection loss. The convergence of SKT is a bit slower, as it is hard to optimize the latent variable model. It takes about 2.5 times as long for UKSDG_PF and KSDG_PF to converge. The computation for confidence calculation during inference is the same for UKSDG, KSDG (w/ and w/o PF) and SKT because we adopt the same backbone. All of our models contain about 174M parameters.
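
For reference, the optimization setup above maps onto standard TensorFlow 2 components roughly as follows. Whether the gradient clipping at 0.4 is by norm or by value is not stated (clipnorm is assumed here), and the Keras loss objects are generic stand-ins for the paper's selection and generation losses rather than the released code.

    import tensorflow as tf

    # Adam with learning rate 2e-5 and gradient clipping at 0.4 (clipnorm assumed).
    optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5, clipnorm=0.4)

    # Label smoothing: 0.1 for knowledge selection, 0.05 for response generation.
    selection_loss = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.10)
    generation_loss = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.05)

    # Each batch is one dialogue (batch size 1); 5 pretraining epochs and
    # 20 finetuning epochs are used in the pretraining-finetuning strategy.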

Setting          Row  Method                                        Single Reference                 Multi Reference
                                                                    Acc↑   PPL↓    R-1↑   R-2↑      Acc↑   PPL↓    R-1↑   R-2↑
0 No Knowledge   00   S2S_Transformer† (Vaswani et al., 2017)       N/A    105.8   19.1   9.8       N/A    72.4    24.1   12.9
                 01   S2S_BERT†                                     N/A    89.4    19.0   9.6       N/A    61.9    23.5   12.1
1 Supervised     10   TMN (Dinan et al., 2019)                      22.7   140.6   20.1   10.3      32.3   83.6    24.3   12.8
                 11   TMN_BERT+PostKS+CP (Kim et al., 2020a)        27.8   47.4    29.2   22.3      37.8   27.9    35.9   29.0
                 12   SKT (Kim et al., 2020a)                       29.2   48.9    29.8   23.1      39.2   28.5    36.5   29.7
2 Unsupervised   21   PostKS (Lian et al., 2019)                    1.5    196.6   15.2   6.0       3.2    114.1   19.2   7.9
                 22   SKT0‡ (Kim et al., 2020a)                     0.1    77.1    19.2   9.9       0.1    75.8    19.4   10.1
                 0    UKSDG w/o DDSL                                2.9    76.5    20.0   10.5      4.3    52.0    25.0   13.6
                 1    UKSDG_vec w/o DDSL                            19.1   78.1    20.9   10.9      28.8   53.2    25.4   13.3
                 2    UKSDG (ours)                                  24.2   55.4    29.7   23.1      33.5   31.9    36.4   29.7
                 3    UKSDG_PF w/o SW (ours)                        25.1   38.7    30.8   23.2      35.0   22.7    37.9   30.0
                 4    UKSDG_PF (ours)                               25.1   39.4    31.0   24.0      35.0   22.9    38.1   31.1

Table 2: Quantitative results on Holl-E. The method with “‡” is reported by rerunning the released code, and models with “†” are implemented by ourselves.

3.4 Evaluation

Automatic Evaluation. We automatically evaluate knowledge selection with accuracy (Acc) and response generation with perplexity (PPL), unigram F1 (R-1) and bigram F1 (R-2), which are commonly used in this task (Dinan et al., 2019; Kim et al., 2020a; Chen et al., 2020b). We also remove all the punctuation and the articles (a, an, the) to compute the R-1 and R-2 scores, as (Kim et al., 2020a) do. Note that a lower PPL and higher R-1 and R-2 scores indicate better generation quality.

Human Evaluation. We first select 100 samples from each test set on WoW for human evaluation. Then we follow (Kim et al., 2020a; Li et al., 2019a) and ask three annotators to evaluate the generation quality according to Engagingness and Knowledgeability from 1 to 4, where 1 means not at all, 2 is a little, 3 is somewhat, and 4 is a lot. Engagingness measures how much you like the response and Knowledgeability measures the informativeness of the responses.
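
As a concrete reading of the R-1 metric, the sketch below computes unigram F1 between a generated and a reference response after stripping punctuation and the articles a, an, the; the exact tokenization of the official evaluation script may differ.

    import string

    def unigram_f1(hypothesis: str, reference: str) -> float:
        """R-1: unigram F1 after removing punctuation and the articles a/an/the."""
        def tokens(text):
            text = text.lower().translate(str.maketrans("", "", string.punctuation))
            return [w for w in text.split() if w not in {"a", "an", "the"}]
        hyp, ref = tokens(hypothesis), tokens(reference)
        common = sum(min(hyp.count(w), ref.count(w)) for w in set(hyp))
        if common == 0:
            return 0.0
        precision, recall = common / len(hyp), common / len(ref)
        return 2 * precision * recall / (precision + recall)

    print(unigram_f1("Yes, I love the Walking Dead.", "I have seen the Walking Dead."))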

Method            Test Seen                  Test Unseen
                  EGG         KLD            EGG         KLD
SKT               2.70±0.05   2.64±0.05      2.46±0.05   2.50±0.06
UKSDG (ours)      2.49±0.05   2.37±0.06      2.33±0.06   2.27±0.05
UKSDG_PF (ours)   2.72±0.05   2.71±0.05      2.49±0.06   2.60±0.05
Gold Response     3.25±0.04   3.37±0.05      3.16±0.05   3.24±0.05

Table 3: Results of human evaluations on WoW. EGG and KLD denote Engagingness and Knowledgeability, respectively.

Method                            WoW                    Holl-E
                                  Seen     Unseen        Single   Multi
Gold label (k_t)                  100.0    100.0         100.0    100.0
Oracle label (k_t^0)              70.1     69.0          90.3     90.4
L_DDSL in Eq. 11                  26.4     20.8          25.1     35.0
w/o L_KD in Eq. 10                25.8     19.1          24.1     33.5
w/o L_KD & W_{k_t^0} in Eq. 9     25.5     18.8          24.0     34.2

Table 4: Ablation study. Knowledge selection accuracy of UKSDG_PF trained with different losses.

4 Results and Analysis

4.1 Main Results

Quantitative Results. We report the automatic results on WoW and Holl-E in Table 1 and Table 2, respectively, and we have the following consistent observations [8]: (1) Comparing SKT and SKT0, we see that the gold knowledge loss plays an important role in training knowledge selection well. (2) Comparing rows 0 and 2, we see that our proposed DDSL is the key to training knowledge selection well in the unsupervised setting and can be an alternative to the gold knowledge loss. As a result, our approach significantly outperforms the other unsupervised methods on all metrics (significance tests (Koehn, 2004), p-value < 0.01). (3) Although UKSDG_vec w/o DDSL could learn some patterns of knowledge selection from the gradient of the generation loss, we see that compressing the knowledge into a vector loses much information for dialogue generation. (4) Although our UKSDG_PF usually makes knowledge selection worse than the supervised SKT in row 12, we achieve higher generation quality, which indicates that our pretraining-finetuning strategy could alleviate the mismatch knowledge selection problem and emphasizes the importance of leveraging the selected knowledge properly for future study on this task. (5) Comparing rows 3 and 4, we see that the sample weighting also helps in the finetuning stage, though the PPL score is slightly worse due to the difficulty of injecting the knowledge into responses. (6) Moreover, our approach demonstrates a stronger ability of generalization, with a smaller performance gap between Test Seen and Test Unseen in Table 1, which we attribute to our DDSL because it works in the unsupervised setting and is more suitable for Test Unseen. For example, compared with SKT in row 12, UKSDG_PF in row 4 achieves the highest selection accuracy on Test Unseen with a slightly lower accuracy on Test Seen.

[8] More observations with other baselines can be found in Appendix A.2.

Qualitative Results. We report human evaluation results of the generated responses according to Engagingness and Knowledgeability in Table 3. Comparing UKSDG and UKSDG_PF, we see the effectiveness of our pretraining-finetuning strategy, which could alleviate the mismatch problem described in Section 2.7. Our UKSDG_PF in the unsupervised setting generates responses slightly better than SKT in the supervised setting. Moreover, the improvement is more obvious according to Knowledgeability, which also indicates the importance of using the selected knowledge properly.

4.2 Ablation Study

We have introduced the components of our DDSL in Section 2.6 and shown its effectiveness in Section 4.1. Here, we analyse our DDSL via an ablation study in Table 4: we train UKSDG_PF on the knowledge selection task, using our DDSL with components removed. We have the following observations: (1) We see that most of the oracle knowledge from distant supervision is the same as the gold knowledge label given by humans. Therefore, it is acceptable to directly use this oracle knowledge label to substitute the gold label as the supervision signal (see the last row of Table 4). (2) However, distant supervision inevitably suffers from the noisy labeling problem due to literal matching. For example, about 30% of the oracle knowledge labels differ from the gold ones on WoW, where the responses convey knowledge much more flexibly. (3) Loss weighting in Equation 9 helps on WoW, where the noisy labeling problem is serious. (4) Knowledge distillation in Equation 10 could further alleviate the label noise, and our method manages to select knowledge accurately in the unsupervised setting.

4.3 Case Study

Table 5 presents several examples of generated responses on WoW, and we have the following observations: (1) The oracle knowledge from distant supervision contains several informative tokens which could help dialogue generation, since they appear in the gold response as defined by Equation 8. In particular, the oracle knowledge in case 3 is also appropriate, and the gold response contains much more information than the gold knowledge, which indicates the diversity of knowledge selection and that selecting one sentence may not be enough for informative generation. (2) However, some oracle knowledge labels are different from the gold ones (i.e., the ones selected by humans). Our model still learns to select knowledge as humans do, which we attribute to our DDSL, since knowledge distillation alleviates the noisy labeling problem of distant supervision. (3) SKT does not leverage the selected knowledge well and generates dull responses. For example, SKT repeats the context in case 1, generates a verbose and contradictory response in case 2 and does not provide new information in case 3. (4) In contrast, our UKSDG_PF first selects the appropriate knowledge as humans do, and then generates fluent and informative responses by alleviating the mismatch knowledge selection problem with the help of the pretraining-finetuning strategy. This indicates the importance of leveraging the selected knowledge properly for future study.

5 Related Work

External knowledge has been widely explored to enhance dialogue understanding and/or improve dialogue generation (Zhu et al., 2017; Liu et al., 2018; Chen et al., 2019; Wu et al., 2019; Chen et al., 2020a; Tuan et al., 2020; Sun et al., 2020; Zhang et al., 2020; Yu et al., 2020; Ni et al., 2021). To make use of knowledge, knowledge access is very important. Therefore, knowledge selection, which selects appropriate knowledge given the dialogue context, gains much attention (Zhang et al., 2019; Meng et al., 2019; Ren et al., 2019; Dinan et al., 2019; Kim et al., 2020a; Meng et al., 2020; Chen et al., 2020b; Zheng et al., 2020; Huang et al., 2020).

Case 1
  Context:      Are you a fan of elvis presley?
  Human     KS: Regarded as one of the most significant cultural icons of the 20th century, he is often referred to as the “king of rock and ...”
            DG: You mean the king of rock and roll. Actually yes I am. best of all time. Don’t you agree?
  SKT       KS: Elvis Aaron Presley (January 8, 1935 - August 16, 1977) was an American singer, musician, and actor.
            DG: Yes, I love Elvis Presley!
  UKSDG_PF  OK: Thornton’s recording of “hound dog” is listed as one of the rock and roll hall of fame’s
            KS: Regarded as one of the most significant cultural icons of the 20th century, he is often referred to as the “king of rock and ...”
            DG: Yes, I love him! He is regarded as the king of rock and roll.

Case 2
  Context:      I love to go bowling with my family, but i’m a horrible bowler. Do you like it?
  Human     KS: Bowling is a sport or leisure activity in which a player rolls or throws a bowling ball towards a target.
            DG: Oh, yes, I love bowling. Rolling that ball down the lane and knocking down the pins gives me a charge.
  SKT       KS: It is one of the major forms of throwing sports.
            DG: I do! I love bowling! I like to play, but i’m not a fan of bowling.
  UKSDG_PF  OK: In pin bowling variations, the target is usually to knock over pins at the end of a lane.
            KS: Bowling is a sport or leisure activity in which a player rolls or throws a bowling ball towards a target.
            DG: I love bowling! It is a sport or leisure activity in which a player rolls a bowling ball towards a target.

Case 3
  Context:      Have you ever seen the Walking Dead?
  Human     KS: The Walking Dead is an American post-apocalyptic horror television series developed by Frank Darabont for AMC that ...
            DG: The American post-apocalyptic series? The one based on the comic book series by Robert Kirkman? Yes I love it! How about you?
  SKT       KS: The Walking Dead is an American post-apocalyptic horror television series developed by Frank Darabont for AMC that ...
            DG: Yes, I have seen the Walking Dead.
  UKSDG_PF  OK: It is based on the comic book series “The Walking Dead” by Robert Kirkman, ...
            KS: The Walking Dead is an American post-apocalyptic horror television series developed by Frank Darabont for AMC that ...
            DG: Yes, I have seen the Walking Dead. It was developed by Frank Darabont for AMC.

Table 5: Examples of generated responses on WoW. KS, DG and OK denote Knowledge Selection, Dialogue Generation and Oracle Knowledge, respectively.

In this paper, we focus on knowledge selection in the unsupervised setting, where there is no gold knowledge label. Lian et al. (2019) and Kim et al. (2020a) attempt to deal with this problem via latent models, yet their performance is less than satisfactory. Differently, we design our DDSL to make knowledge selection work well in the unsupervised setting. There is a very recent work (Zhao et al., 2020b) that finetunes GPT-2 (Radford et al., 2019) with an unsupervised pretrained knowledge selection module on unlabeled dialogues. We are different in two aspects: (1) our DDSL leverages knowledge distillation to alleviate the label noise at the pretraining stage; (2) we adopt the sample weighting idea at the finetuning stage. And we will leverage GPT-2 in future work.

Our work is inspired by Distant Supervision (DS), an effective method to generate labeled data with an external knowledge base (KB) for information extraction (Mintz et al., 2009; Min et al., 2013; Zeng et al., 2015; Wang et al., 2018). Following this idea, Gopalakrishnan et al. (2019) use the oracle knowledge from DS to construct the Topical-Chat dataset. Similarly, Qin et al. (2019b) obtain weakly labeled data to train a KB retriever in the task-oriented dialogue system. Ren et al. (2019) propose a distantly supervised learning schema at the segment level to effectively learn the topic transition vector. Although inspired by a similar idea, we are devoted to knowledge selection in the unsupervised setting, which is a different application of DS. Moreover, rather than just using distant supervision, we design our DDSL with label weighting and knowledge distillation to deal with the noisy labeling problem from DS.

6 Conclusion

We study unsupervised knowledge selection for dialogue generation, where the gold knowledge label is not available. We design the Distilled Distant Supervision Loss, a novel and effective solution to train knowledge selection well in the unsupervised setting. Furthermore, we propose the pretraining-finetuning strategy to deal with the mismatch knowledge selection problem: in the unsupervised setting, models tend to select mismatched knowledge for dialogue generation, which causes the degeneration of the knowledge-aware decoder. Experiments show that our approach manages to select knowledge more accurately in the unsupervised setting and generates more informative responses than many strong supervised baselines.

References

Hengyi Cai, Hongshen Chen, Yonghao Song, Cheng Zhang, Xiaofang Zhao, and Dawei Yin. 2020. Data Manipulation: Towards Effective Instance Learning for Neural Dialogue Generation via Learning to Augment and Reweight. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6334–6343, Online. Association for Computational Linguistics.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal Sentence Encoder. arXiv preprint arXiv:1803.11175.

Feilong Chen, Fandong Meng, Jiaming Xu, Peng Li, Bo Xu, and Jie Zhou. 2020a. DMRM: A dual-channel multi-hop reasoning model for visual dialog. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7504–7511. AAAI Press.

Xiuyi Chen, Fandong Meng, Peng Li, Feilong Chen, Bo Xu, Shuang Xu, and Jie Zhou. 2020b. Bridging the Gap between Prior and Posterior Knowledge Selection for Knowledge-Grounded Dialogue Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.

Xiuyi Chen, Jiaming Xu, and Bo Xu. 2019. A Working Memory Model for Task-oriented Dialog Response Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2687–2693, Florence, Italy. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of Wikipedia: Knowledge-Powered Conversational Agents. In International Conference on Learning Representations.

Jiahua Dong, Yang Cong, Gan Sun, Bineng Zhong, and Xiaowei Xu. 2020. What can be transferred: Unsupervised domain adaptation for endoscopic lesions segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4022–4031.

Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. In Thirty-Second AAAI Conference on Artificial Intelligence.

Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. 2019. Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations. Proc. Interspeech 2019, pages 1891–1895.

Jianping Gou, Baosheng Yu, Stephen John Maybank, and Dacheng Tao. 2020. Knowledge Distillation: A Survey. CoRR, abs/2006.05525.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.

Minlie Huang, Xiaoyan Zhu, and Jianfeng Gao. 2020. Challenges in Building Intelligent Open-Domain Dialog Systems. ACM Trans. Inf. Syst., 38(3).

Byeongchang Kim, Jaewoo Ahn, and Gunhee Kim. 2020a. Sequential Latent Knowledge Selection for Knowledge-Grounded Dialogue. arXiv: Computation and Language.

Seokhwan Kim, Mihail Eric, Karthik Gopalakrishnan, Behnam Hedayatnia, Yang Liu, and Dilek Hakkani-Tur. 2020b. Beyond Domain APIs: Task-oriented Conversational Modeling with Unstructured Knowledge Access. arXiv preprint arXiv:2006.03533.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A Diversity-Promoting Objective Function for Neural Conversation Models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.

Linxiao Li, Can Xu, Wei Wu, Yufan Zhao, Xueliang Zhao, and Chongyang Tao. 2020. Zero-Resource Knowledge-Grounded Dialogue Generation. CoRR, abs/2008.12918.

Margaret Li, Jason Weston, and Stephen Roller. 2019a. ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons. arXiv preprint arXiv:1909.03087.

Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. 2017. Learning from Noisy Labels with Distillation. In 2017 IEEE International Conference on Computer Vision (ICCV).

Zekang Li, Cheng Niu, Fandong Meng, Yang Feng, Qian Li, and Jie Zhou. 2019b. Incremental Transformer with Deliberation Decoder for Document Grounded Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 12–21.

Rongzhong Lian, Min Xie, Fan Wang, Jinhua Peng, and Hua Wu. 2019. Learning to Select Knowledge for Response Generation in Dialog Systems. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 5081–5087. AAAI Press.

Xiexiong Lin, Weiyu Jian, Jianshan He, Taifeng Wang, and Wei Chu. 2020. Generating Informative Conversational Response using Recurrent Knowledge-Interaction and Knowledge-Copy. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 41–52, Online. Association for Computational Linguistics.

Shuman Liu, Hongshen Chen, Zhaochun Ren, Yang Feng, Qun Liu, and Dawei Yin. 2018. Knowledge Diffusion for Neural Dialogue Generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489–1498, Melbourne, Australia. Association for Computational Linguistics.

Chuan Meng, Pengjie Ren, Zhumin Chen, Christof Monz, Jun Ma, and Maarten de Rijke. 2019. RefNet: A Reference-aware Network for Background Based Conversation. arXiv preprint arXiv:1908.06449.

Chuan Meng, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tengxiao Xi, and Maarten de Rijke. 2021. Initiative-Aware Self-Supervised Learning for Knowledge-Grounded Conversations. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21. Association for Computing Machinery.

Chuan Meng, Pengjie Ren, Zhumin Chen, Weiwei Sun, Zhaochun Ren, Zhaopeng Tu, and Maarten de Rijke. 2020. DukeNet: A Dual Knowledge Interaction Network for Knowledge-Grounded Conversation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, pages 1151–1160, New York, NY, USA. Association for Computing Machinery.

Michel Galley, Chris Brockett, Xiang Gao, Bill Dolan, and Jianfeng Gao. 2018. Grounded Response Generation Task at DSTC7. In Proceedings of the AAAI-19 Workshop on Dialog System Technology Challenges.

Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. 2013. Distant Supervision for Relation Extraction with an Incomplete Knowledge Base. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 777–782, Atlanta, Georgia. Association for Computational Linguistics.

Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011, Suntec, Singapore. Association for Computational Linguistics.

Nikita Moghe, Siddhartha Arora, Suman Banerjee, and Mitesh M Khapra. 2018. Towards Exploiting Background Knowledge for Building Conversation Systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2322–2332.

Jinjie Ni, Tom Young, Vlad Pandelea, Fuzhao Xue, Vinay Adiga, and Erik Cambria. 2021. Recent advances in deep learning-based dialogue systems. CoRR, abs/2105.04387.

Zheng-Yu Niu, Hua Wu, Haifeng Wang, et al. 2019. Knowledge Aware Conversation Generation with Explainable Reasoning over Augmented Graphs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1782–1792.

Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. 2017. Regularizing Neural Networks by Penalizing Confident Output Distributions. arXiv preprint arXiv:1701.06548.

Lianhui Qin, Michel Galley, Chris Brockett, Xiaodong Liu, Xiang Gao, Bill Dolan, Yejin Choi, and Jianfeng Gao. 2019a. Conversing by Reading: Contentful Neural Conversation with On-demand Machine Reading. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5427–5436.

Libo Qin, Yijia Liu, Wanxiang Che, Haoyang Wen, Yangming Li, and Ting Liu. 2019b. Entity-Consistent End-to-end Task-Oriented Dialogue System with KB Retriever. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 133–142, Hong Kong, China. Association for Computational Linguistics.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners.

Pengjie Ren, Zhumin Chen, Christof Monz, Jun Ma, and Maarten de Rijke. 2019. Thinking Globally, Acting Locally: Distantly Supervised Global-to-Local Knowledge Selection for Background Based Conversation. arXiv preprint arXiv:1908.09528.

Bharat Bhusan Sau and Vineeth N Balasubramanian. 2016. Deep Model Compression: Distilling Knowledge from Noisy Teachers. arXiv preprint arXiv:1610.09650.

C. Selltiz, L.S. Wrightsman, S.W. Cook, and Society for the Psychological Study of Social Issues. 1976. Research Methods in Social Relations. Holt, Rinehart and Winston.

Yajing Sun, Yue Hu, Luxi Xing, Jing Yu, and Yuqiang Xie. 2020. History-adaption Knowledge Incorporation Mechanism for Multi-turn Dialogue System. In AAAI.

Xiangru Tang and Po Hu. 2019. Knowledge-Aware Self-Attention Networks for Document Grounded Dialogue Generation. In International Conference on Knowledge Science, Engineering and Management, pages 400–411. Springer.

Zhiliang Tian, Wei Bi, Dongkyu Lee, Lanqing Xue, Yiping Song, Xiaojiang Liu, and Nevin L. Zhang. 2020. Response-Anticipated Memory for On-Demand Knowledge Integration in Response Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 650–659, Online. Association for Computational Linguistics.

Yi-Lin Tuan, Wei Wei, and William Yang Wang. 2020. Unsupervised Injection of Knowledge into Dialogue Generation via Language Models. arXiv preprint arXiv:2004.14614.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Guanying Wang, Wen Zhang, Ruoxu Wang, Yalin Zhou, Xi Chen, Wei Zhang, Hai Zhu, and Huajun Chen. 2018. Label-Free Distant Supervision for Relation Extraction via Knowledge Graph Embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2246–2255, Brussels, Belgium. Association for Computational Linguistics.

Wenquan Wu, Zhen Guo, Xiangyang Zhou, Hua Wu, Xiyuan Zhang, Rongzhong Lian, and Haifeng Wang. 2019. Proactive Human-Machine Conversation with Explicit Conversation Goal. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3794–3804, Florence, Italy. Association for Computational Linguistics.

Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation Networks: Sequence Generation Beyond One-Pass Decoding. In Advances in Neural Information Processing Systems, pages 1784–1794.

Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. 2020. Self-training with Noisy Student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687–10698.

Yan Xu, Etsuko Ishii, Zihan Liu, Genta Indra Winata, Dan Su, Andrea Madotto, and Pascale Fung. 2021. Retrieval-free knowledge-grounded dialogue response generation with adapters. CoRR, abs/2105.06232.

Kaijia Yang, Liang He, Xin-yu Dai, Shujian Huang, and Jiajun Chen. 2019a. Exploiting Noisy Data in Distant Supervision Relation Classification. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3216–3225, Minneapolis, Minnesota. Association for Computational Linguistics.

Liu Yang, Junjie Hu, Minghui Qiu, Chen Qu, Jianfeng Gao, W. Bruce Croft, Xiaodong Liu, Yelong Shen, and Jingjing Liu. 2019b. A Hybrid Retrieval-Generation Neural Conversation Model. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019, pages 1341–1350. ACM.

Semih Yavuz, Abhinav Rastogi, Guan-Lin Chao, Dilek Hakkani-Tür, and Amazon Alexa AI. 2019. DEEPCOPY: Grounded Response Generation with Hierarchical Pointer Networks. In 20th Annual Meeting of the Special Interest Group on Discourse and Dialogue, page 122.

Hao-Tong Ye, Kai-Lin Lo, Shang-Yu Su, and Yun-Nung Chen. 2020. Knowledge-grounded response generation with deep attentional latent-variable model. Computer Speech & Language, page 101069.

Wenhao Yu, Chenguang Zhu, Zaitang Li, Zhiting Hu, Qingyun Wang, Heng Ji, and Meng Jiang. 2020. A survey of knowledge-enhanced text generation. CoRR, abs/2010.04389.

Li Yuan, Francis EH Tay, Guilin Li, Tao Wang, and Jiashi Feng. 2020. Revisiting Knowledge Distillation via Label Smoothing Regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3903–3911.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1753–1762, Lisbon, Portugal. Association for Computational Linguistics.

Duzhen Zhang, Xiuyi Chen, Shuang Xu, and Bo Xu. 2020. Knowledge aware emotion recognition in textual conversations via multi-task incremental transformer. In COLING.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing Dialogue Agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213.

Yangjun Zhang, Pengjie Ren, and Maarten de Rijke. 2019. Improving background based conversation with context-aware knowledge pre-selection. arXiv preprint arXiv:1906.06685.

Xueliang Zhao, Wei Wu, Chongyang Tao, Can Xu, Dongyan Zhao, and Rui Yan. 2020a. Low-Resource Knowledge-Grounded Dialogue Generation. In International Conference on Learning Representations.

Xueliang Zhao, Wei Wu, Can Xu, Chongyang Tao, Dongyan Zhao, and Rui Yan. 2020b. Knowledge-Grounded Dialogue Generation with Pre-trained Language Models. arXiv preprint arXiv:2010.08824.

Chujie Zheng, Yunbo Cao, Daxin Jiang, and Minlie Huang. 2020. Difference-aware Knowledge Selection for Knowledge-grounded Conversation Generation. CoRR, abs/2009.09378.

Wen Zheng and Ke Zhou. 2019. Enhancing Conversational Dialogue Models with Grounded Knowledge. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 709–718.

Hao Zhou, Chujie Zheng, Kaili Huang, Minlie Huang, and Xiaoyan Zhu. 2020. KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7098–7108, Online. Association for Computational Linguistics.

Kangyan Zhou, Shrimai Prabhumoye, and Alan W Black. 2018. A Dataset for Document Grounded Conversations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 708–713.

Wenya Zhu, Kaixiang Mo, Yu Zhang, Zhangbin Zhu, Xuezheng Peng, and Qiang Yang. 2017. Flexible End-to-End Dialogue System for Knowledge Grounded Conversation. arXiv preprint arXiv:1709.04264.
A     Additional Experiments

A.1    Models for Comparison

A.1.1  Our Unsupervised Methods

We implement our model in the unsupervised setting, namely Unsupervised Knowledge Selection for Dialogue Generation (UKSDG), which is optimized with our DDSL (Equation 11) for unsupervised knowledge selection together with the generation loss (Equation 7). UKSDG_PF indicates that we additionally adopt our Pretraining-Finetuning strategy to alleviate the mismatch knowledge selection problem described in Section 2.7. Furthermore, we remove several components for the ablation study (variant (2) is sketched after this list):

(1) UKSDG w/o DDSL does not use our DDSL.

(2) UKSDG_vec w/o DDSL further replaces the decoder input $H_{rc}$ (Section 2.5) with the averaged-knowledge-vector-enhanced context $H_{vec} = H_{x_t} + \sum_{k_t^i \in K_t} S_{k_t^i} \cdot h_{k_t^i}$.

(3) UKSDG_PF w/o SW does not use the Sample Weighting in Equation 13.
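The following is a minimal sketch (our own illustration, not the released implementation) of how the knowledge-vector-enhanced context in variant (2) can be computed; the tensor shapes and the function name are assumptions.

```python
import torch

def knowledge_vector_enhanced_context(
    h_context: torch.Tensor,        # (batch, d): context representation H_{x_t}
    h_knowledge: torch.Tensor,      # (batch, num_candidates, d): candidate vectors h_{k_t^i}
    selection_probs: torch.Tensor,  # (batch, num_candidates): selection distribution S_{k_t^i}
) -> torch.Tensor:
    # Weight each candidate knowledge vector by its selection probability,
    # sum them, and add the result to the context representation.
    weighted_knowledge = (selection_probs.unsqueeze(-1) * h_knowledge).sum(dim=1)
    return h_context + weighted_knowledge  # (batch, d): H_vec
```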
A.1.2  Our Supervised Methods

Here, we also report Knowledge Selection for Dialogue Generation (KSDG) in the supervised setting, as described in Sections 2.3, 2.4, and 2.5. KSDG_PF alleviates the mismatch problem using the pretraining-finetuning strategy in Section 2.7. We use the gold knowledge loss in Equation 5 to train KSDG and KSDG_PF in the supervised setting, while we train UKSDG and UKSDG_PF with our DDSL when the gold knowledge label is not available; this loss switch is sketched below.
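A minimal sketch (our own illustration; the DDSL of Equation 11 is abstracted behind `ddsl_loss_fn`, an assumed callable) of how the knowledge-selection loss is chosen per setting and combined with the generation loss of Equation 7:

```python
import torch.nn.functional as F

def training_loss(selection_logits, generation_loss, gold_label=None, ddsl_loss_fn=None):
    if gold_label is not None:
        # Supervised setting (KSDG / KSDG_PF): cross-entropy against the
        # gold knowledge label (Equation 5).
        ks_loss = F.cross_entropy(selection_logits, gold_label)
    else:
        # Unsupervised setting (UKSDG / UKSDG_PF): Distilled Distant
        # Supervision Loss (Equation 11), assumed to be defined elsewhere.
        ks_loss = ddsl_loss_fn(selection_logits)
    # The generation loss (Equation 7) is used in both settings.
    return ks_loss + generation_loss
```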
We compare our method with a set of baselines:
                                                         for knowledge-grounded dialogue generation task.
Kangyan Zhou, Shrimai Prabhumoye, and Alan W             Essentially, KnowExpert stores knowledge in its
  Black. 2018. A Dataset for Document Grounded
  Conversations. In Proceedings of the 2018 Con-
                                                         parameters with lightweight adapters. Therefore,
  ference on Empirical Methods in Natural Language       KnowExpert does not use knowledge explicitly at
  Processing, pages 708–713.                             inference.
