Selecting the optimal dialogue response once for all from a panoramic view

Chiyu Song∗ 1,2, Hongliang He∗ 1,2, Haofei Yu 1, Huachuan Qiu 1,2, Zhenzhong Lan† 2
1 Zhejiang University
2 Westlake University
{songchiyu, hehongliang, qiuhuachuan, lanzhenzhong}

As an essential component of dialogue systems, response selection aims to pick out the
optimal response among candidates to continue the dialogue. In existing studies, this task
is usually regarded as a binary classification problem, where every candidate is scored
separately for appropriateness. To improve its performance, we reformulate this task as a
multiple-choice problem that allows the best selection to be made in one-shot inference.
This new view inspires us to propose an architecture called Panoramic-encoder¹ with a novel
Candidates Attention Mechanism (CAM), which allows context-wise attention between responses
and leads to fine-grained comparisons. Furthermore, we investigate and incorporate several
techniques that have been proven effective for improving response selection. Experiments on
three benchmarks show that our method pushes the state-of-the-art while achieving
approximately 3X faster inference speed.

∗ Equal contribution.
† Corresponding author.
¹ Our work will be open-source for reproducibility and future research.

Figure 1: Panoramic-encoder reformulates the response selection problem in existing studies.

1   Introduction

Nowadays, dialogue systems have gained increasing attention in the natural language
processing community. Depending on the implementation, they can be categorized as
retrieval-based (Lowe et al., 2015; Tao et al., 2019; Yuan et al., 2019) or generation-based
(Vinyals and Le, 2015; Serban et al., 2016). The former converses with people by selecting
the optimal response from a candidate pool, while the latter generates proper responses
autoregressively via a sequence-to-sequence model. Response selection plays a vital role in
both methods, with the rise of an approach called "sample-and-rank" (Adiwardana et al., 2020)
in advanced generation-based chatbots (Zhang et al., 2019; Roller et al., 2020; Bao et al.,
2021). This approach produces more coherent responses in the following way: (i) generating
and decoding multiple candidates from a generator, (ii) passing them to a selector to find
the best one, and (iii) returning the best response to the user. In this paper, we pay more
attention to the usage of response selection in generation-based chatbots.

After the advent of transformers (Vaswani et al., 2017) and pre-trained models (Devlin et
al., 2018; Liu et al., 2019; Lan et al., 2019), various works have been proposed to advance
the response selection task. Humeau et al. (2019) compare two commonly used architectures,
Cross-encoder and Bi-encoder, and show that the former is more effective than the latter.
Lu et al. (2020) and Gu et al. (2020) reveal the importance of speaker change information by
incorporating speaker segmentation or speaker embeddings. Li et al. (2021) leverage
contrastive learning and in-batch negative training to construct better matching
representations. Whang et al. (2021) and Xu et al. (2020) design multiple auxiliary tasks to
fetch more knowledge from a fixed-size corpus. Whang et al. (2020) and Han et al. (2021)
explore the effects of domain-specific post-training on improving multi-turn response
selection. We combine these techniques in our work with the necessary modifications and
discuss them comprehensively in Section 3.

Figure 2: Model architectures of Cross- and Bi-encoder.
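
The three sample-and-rank steps listed in the introduction can be sketched as a small driver function. This is only an illustrative sketch: `sample_and_rank`, `toy_generate`, and `toy_select` are hypothetical names, the generator returns canned strings, and the selector uses plain word overlap instead of a trained model.

```python
from typing import Callable, List


def sample_and_rank(
    context: str,
    generate: Callable[[str, int], List[str]],
    select: Callable[[str, List[str]], int],
    num_candidates: int = 4,
) -> str:
    """Return the candidate that the selector prefers for this context."""
    candidates = generate(context, num_candidates)  # (i) sample candidates
    best_index = select(context, candidates)        # (ii) pick the best one
    return candidates[best_index]                   # (iii) reply to the user


# Hypothetical stand-in for a generator: returns canned responses.
def toy_generate(context: str, n: int) -> List[str]:
    canned = [
        "I like pizza.",
        "Try reinstalling the graphics driver.",
        "What is the weather today?",
        "Nice to meet you.",
    ]
    return canned[:n]


# Hypothetical stand-in for a selector: counts words shared with the context.
def toy_select(context: str, candidates: List[str]) -> int:
    context_words = set(context.lower().split())
    overlaps = [len(context_words & set(c.lower().split())) for c in candidates]
    return overlaps.index(max(overlaps))


reply = sample_and_rank("my graphics driver keeps crashing", toy_generate, toy_select)
```

In a real chatbot, the selector slot is where a response selection model such as the one studied in this paper would be plugged in.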
Despite a wide range of studies in this field, response selection has always been viewed as
a binary classification problem. A given context is paired with every candidate response,
and each pair is classified separately as True if the match is appropriate and False
otherwise. To improve effectiveness and efficiency, the proposed Panoramic-encoder
reformulates this task as a multiple-choice problem, where all candidates can be assessed
simultaneously, as shown in Figure 1. It exploits the Candidates Attention Mechanism (CAM)
to conduct fine-grained comparisons between responses and yield the global optimal choice in
one-shot inference. We conduct experiments on three benchmark datasets: PersonaChat (Zhang
et al., 2018), Ubuntu Dialogue Corpus V1 (Lowe et al., 2015), and Ubuntu Dialogue Corpus V2
(Lowe et al., 2017). Results show that our work achieves a new state of the art with a
significant speed-up at inference time.

2   Motivation

The multi-turn response selection task has long been formulated as follows. Given a dialogue
context $c = \{u_1, u_2, \dots, u_N\}$, where each $u_k$ ($k = 1, \dots, N$) is a single
utterance from either speaker, and a candidate pool $p = \{r_1, r_2, \dots, r_M\}$ that
contains $M$ individual responses, every candidate $r_i$ is paired separately with $c$, and
$m(c, r_i)$ is optimized to 1 if the pair is a proper match and to 0 otherwise. Since the
fundamental goal of this task is to find the best response to continue the conversation,
there are two obvious drawbacks in the existing approach: (i) Processing every candidate
separately slows down training and inference by a large margin. (ii) It only considers the
relatedness between the context and individual responses without directly comparing
candidates. Thus, it is hard to distinguish the ground truth from some strong distractors,
as illustrated in Figure 3.

Figure 3: Example of how the Panoramic-encoder distinguishes strong distractors. The bold
value represents a correction in prediction confidence.

To address the aforementioned issues, we propose a new formulation of the response selection
task. Given a dialogue context $c = \{u_1, u_2, \dots, u_N\}$, where each $u_k$
($k = 1, \dots, N$) is a single utterance from either speaker, a candidate pool
$p = \{r_1, r_2, \dots, r_M\}$ contains $M$ individual responses, among which $r_i^c$ is the
optimal choice for context $c$. A selector model is trained to identify the gold label by
fitting $s(c, p) = i$. One explicit benefit of this approach is that it only requires
one-shot predictions to select the global optimal response, increasing training and
inference efficiency. It can also improve accuracy by employing the attention mechanism
between candidates to distinguish them more subtly.

In addition, we revisit some recent advances (Gu et al., 2020; Li et al., 2021; Xu et al.,
2020) in this field and find that numerous techniques have proven helpful in boosting the
performance of response selection. Unfortunately, different designs may be incompatible
with each other (Humeau et al., 2019). This motivates the development of our new framework,
which incorporates various techniques.

Figure 4: Model architecture of Panoramic-encoder.

3   Methods

In this section, we first discuss several useful techniques for implementing a good response
selector. Then we explain the proposed Panoramic-encoder with the Candidates Attention
Mechanism in detail and present how to combine these techniques with the necessary
modifications.

3.1   Architecture

To the best of our knowledge, a model architecture called Cross-encoder is widely used as a
response selector in many advanced chatbots (Bao et al., 2019). Like the typical BERT design
(Devlin et al., 2018), such an architecture jointly encodes the concatenated context and
response to make a prediction. Another popular architecture, the Bi-encoder (Reimers and
Gurevych, 2019), encodes the context and the candidate separately, then scores the
relatedness between their representations. Lowe et al. (2015), Yoshino et al. (2019), and
Dinan et al. (2020) conduct corresponding experiments on the response selection task.
Figure 2 illustrates both architectures.

Urbanek et al. (2019) and Humeau et al. (2019) show that the Cross-encoder generally yields
better results because tokens in the context and the candidate can directly attend to each
other. The Bi-encoder is more efficient because candidate representations can be cached and
reused once they are created. However, it is impossible to precompute embeddings for freshly
generated responses. Considering its usage in generation-based chatbots, we choose the
Cross-encoder as the basic architecture for our implementation.
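
The caching argument for the Bi-encoder can be made concrete with a small sketch. Everything here is an assumption made for illustration: `encode` is a toy bag-of-words function over a hypothetical `VOCAB` standing in for a pre-trained encoder, and `BiEncoderSelector` is not taken from any library. The sketch counts encoder calls to show that a fixed pool is encoded once, whereas a freshly generated response has no cached embedding and must be encoded at query time.

```python
import numpy as np

# Hypothetical toy vocabulary; a real system would use a pre-trained encoder.
VOCAB = ["ubuntu", "install", "driver", "reboot", "weather", "sunny", "pizza"]
encoder_calls = 0


def encode(text: str) -> np.ndarray:
    """Toy encoder: bag-of-words counts over VOCAB (stand-in for a transformer)."""
    global encoder_calls
    encoder_calls += 1
    tokens = text.lower().split()
    return np.array([tokens.count(word) for word in VOCAB], dtype=float)


class BiEncoderSelector:
    def __init__(self, fixed_pool):
        # Candidate embeddings are computed once and reused for every context.
        self.cache = {cand: encode(cand) for cand in fixed_pool}

    def score(self, context, candidates):
        ctx = encode(context)
        scores = []
        for cand in candidates:
            emb = self.cache.get(cand)
            if emb is None:  # freshly generated response: nothing to reuse
                emb = encode(cand)
            scores.append(float(ctx @ emb))  # dot-product relatedness
        return scores


pool = ["run the driver install script then reboot", "the weather is sunny"]
selector = BiEncoderSelector(pool)
calls_after_indexing = encoder_calls          # one call per pooled candidate

query = "how do i install the driver on ubuntu"
cached_scores = selector.score(query, pool)   # only the context is encoded
calls_after_cached = encoder_calls

fresh_scores = selector.score(query, ["i like pizza"])  # context + new candidate
calls_after_fresh = encoder_calls
```

With generated candidates, every response is new, so the cache never helps; this is the situation in which the efficiency advantage of the Bi-encoder disappears.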
3.2   Speaker Change Information

Being aware of speaker change information is essential for training a good model on dialogue
data. There are two commonly used strategies to achieve this: adding speaker-aware
embeddings to the token representations and adding special tokens to segment utterances from
different speakers. Wolf et al. (2019) and Wang et al. (2020) equip dialogue generation with
these approaches, while Lu et al. (2020) and Gu et al. (2020) verify their necessity for the
response selection task. We can integrate this technique directly into the
Panoramic-encoder.

3.3   In-batch Negative Training

In studies related to contrastive learning, in-batch negative training is able to generate
representations with better uniformity and alignment (Wang and Isola, 2020; Gao et al.,
2021; Fang et al., 2020). Humeau et al. (2019) and Li et al. (2021) adapt this technique to
the response selection task, but as described in Humeau et al. (2019), it is problematic to
recycle the in-batch negative labels in the Cross-encoder architecture because the context
and the response are jointly processed. However, since we concatenate candidates in the
Panoramic-encoder, it is natural to use the other labels in the same batch as negatives. Our
ablation study in Section 4.4 demonstrates that this is an effective technique for enhancing
performance.

3.4   Auxiliary Training Tasks

Models pre-trained on a large corpus have general abilities to understand text. To fetch
more in-domain knowledge, Xu et al. (2020) and Whang et al. (2021) investigate several
self-supervised learning objectives that are trained jointly with response selection in a
multi-task fashion. To keep our work simple, we do not explore various auxiliary tasks. We
find that adding a single masked language model (MLM) loss to our model is straightforward
and empirically powerful.

3.5   Domain Post-training

Similar to training with auxiliary tasks, post-training aims to further improve the domain
adaptation of pre-trained models in a self-supervised manner. It takes place between
pre-training and fine-tuning with the help of additional domain-specific data. Since it is
an independent stage, it is compatible with all other methods. Whang et al. (2020) and Han
et al. (2021) validate the usefulness of post-training on response selection, but we do not
delve into this method in this paper.

3.6   Panoramic-encoder

As depicted in Figure 4, our Panoramic-encoder uses a pre-trained transformer encoder
(Vaswani et al., 2017) as its basis. It resembles the Cross-encoder (Humeau et al., 2019)
but has several crucial distinctions: (i) Most apparently, all the candidates are
concatenated and jointly encoded with the input context. (ii) We reuse the positional
embeddings for different candidates to comply with the length limit. (iii) Each candidate is
surrounded by [CLS] and [SEP] tokens, and two special [SPK] tokens are used to segment the
sentences from alternating speakers. (iv) We explore and utilize the Candidates Attention
Mechanism (CAM) instead of the default self-attention.

We analyze three different types of CAM, as exhibited in Figure 5. Type (a) is identical to
the all-to-all attention in Transformers. However, it has a position confusion problem in
the current design. For illustration, the first token in candidate i cannot distinguish its
own second token from those of the other candidates because they share the same positional
embeddings. To address this problem, we forbid explicit attention between candidates in type
(b) CAM, but they can still exchange information indirectly through common connections with
the context. In type (c) CAM, we further prohibit the attention from context to candidates,
while the opposite direction is still allowed. We study the effects of the three CAM
versions on PersonaChat and report the results in Table 1. In the subsequent experiments, we
adopt type (b) as the default setting because of its superior and stable performance.

Figure 5: Three types of the Candidates Attention Mechanism (CAM). Attention is prohibited
in the unfilled areas.

                       Persona-Chat
            R20@1            R20@5            MRR
Type (a)    0.809 ± 0.004    0.975 ± 0.002    0.882 ± 0.002
Type (b)    0.869 ± 0.001    0.989 ± 0.000    0.922 ± 0.000
Type (c)    0.870 ± 0.004    0.987 ± 0.001    0.922 ± 0.002

Table 1: Performance of three types of Candidates Attention Mechanisms on Persona-Chat.
Average and standard deviation are calculated on three runs with different seeds.

As mentioned in Section 2, instead of assessing each response separately, the
Panoramic-encoder compares all candidates simultaneously to find the global optimum in one
shot. The given dialogue context $c = \{u_1, u_2, \dots, u_N\}$ and the candidate pool
$p = \{r_1, r_2, \dots, r_M\}$ are jointly encoded to yield output representations $H$. As
discussed earlier, the candidate pool in our implementation consists of the gold response
and the other in-batch negative samples.

    $$H = \mathrm{encode}(c, p)$$

We then obtain an aggregated embedding $E_i$ for each candidate by averaging all token
representations belonging to it in $H$. After aggregation, every $E_i$ is reduced to a
single logit by $w(\cdot)$, and the logits of all candidates are fed into a softmax
operation.

    $$Y_{pred} = \mathrm{softmax}(\{w(E_1), \dots, w(E_M)\})$$

The ground truth label is one-hot at the index of the only positive candidate. The model is
then optimized by minimizing the cross-entropy loss between the prediction and the ground
truth. We also add an auxiliary MLM loss to the original training objective.

    $$\ell_{ce} = \mathrm{cross\_entropy}(Y_{pred}, Y_{label})$$

    $$\ell = \ell_{ce} + \ell_{mlm}$$

4   Experiments

4.1   Dataset

    • PersonaChat (Zhang et al., 2018) is a crowd-sourced dataset with two-speaker talks
      conditioned on their given persona, containing short descriptions of characters they
      will imitate in the dialogue.

    • Ubuntu Dialogue Corpus V1 (Lowe et al., 2015) contains 1 million conversations about
      technical support for the Ubuntu system. We use the clean version proposed by (), which
      has numbers, URLs, and system paths replaced by special placeholders.

    • Ubuntu Dialogue Corpus V2 (Lowe et al., 2017) has several updates and bug fixes
      compared to V1. The major one is that the training, validation, and test sets are
      split into different time periods.

4.2   Results

We initialize our implementation with the checkpoint provided by Huggingface. All reported
results are obtained by fine-tuning the BERT-base model (Devlin et al., 2018) without any
post-training. We measure $R_c@k$ and mean reciprocal rank (MRR) on three benchmark
datasets. Table 2 demonstrates the superiority of the Panoramic-encoder over the other
state-of-the-art methods.

4.3   Inference Speed

The Panoramic-encoder has a significant advantage over the baseline in terms of efficiency.
This is because the Panoramic-encoder can find the optimal response among candidates in one
shot rather than rank each candidate in turn. This feature remarkably reduces the number of
inferences the Panoramic-encoder requires during evaluation. For the sake of fair
comparison, we therefore control the peak GPU memory usage of all models to the same value
by assigning them different batch sizes. Results in Table 3 verify that our model achieves
an approximately 3X speed-up at inference time.
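
The one-shot evaluation depends on packing the context and all candidates into a single sequence, as described in Section 3.6. The NumPy sketch below illustrates one possible layout under stated assumptions: each candidate's position ids restart right after the context (the text only says positional embeddings are reused), the reduction $w(\cdot)$ is a plain linear map with weight vector `w`, and special tokens are ignored. The function names `pack_layout`, `cam_mask`, and `select` are hypothetical.

```python
import numpy as np


def pack_layout(context_len, cand_lens):
    """Segment ids (0 = context, i = candidate i) and position ids of the packed input."""
    seg = [0] * context_len
    pos = list(range(context_len))
    for i, n in enumerate(cand_lens, start=1):
        seg += [i] * n
        # Assumption: every candidate restarts right after the context positions.
        pos += list(range(context_len, context_len + n))
    return np.array(seg), np.array(pos)


def cam_mask(seg, cam_type="b"):
    """Boolean matrix; entry [q, k] is True when query token q may attend to key token k."""
    q, k = seg[:, None], seg[None, :]
    if cam_type == "a":                 # all-to-all attention
        return np.ones((len(seg), len(seg)), dtype=bool)
    allowed = (q == k) | (k == 0)       # own segment, plus candidate -> context
    if cam_type == "b":
        allowed = allowed | (q == 0)    # context -> candidate is also allowed
    elif cam_type != "c":
        raise ValueError("cam_type must be 'a', 'b', or 'c'")
    return allowed


def select(H, seg, w):
    """Mean-pool each candidate's rows of H, map to a logit, softmax over candidates."""
    n_cand = int(seg.max())
    E = np.stack([H[seg == i].mean(axis=0) for i in range(1, n_cand + 1)])
    logits = E @ w
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()


seg, pos = pack_layout(context_len=3, cand_lens=[2, 2])
mask_b = cam_mask(seg, "b")
mask_c = cam_mask(seg, "c")

H = np.zeros((len(seg), 4))   # stand-in for encoder outputs
H[seg == 2] = 1.0             # make candidate 2 stand out
probs = select(H, seg, w=np.ones(4))
loss = -np.log(probs[1])      # cross-entropy when candidate 2 is the gold label
```

Under type (b), tokens of candidate 1 never attend to tokens of candidate 2 even though both occupy the same position ids, which is how the position confusion of type (a) is avoided in this sketch.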
                                Persona-Chat      Ubuntu Dialogue Corpus V1    Ubuntu Dialogue Corpus V2
Model                           R20@1    MRR      R10@1    R10@2    R10@5      R10@1    R10@2    R10@5
BERT (Devlin et al., 2018)      0.707    0.808    0.808    0.897    0.975      0.781    0.890    0.980
BERT-CRA (Gu et al., 2021)      0.843    0.903    N/A      N/A      N/A        N/A      N/A      N/A
SA-BERT (Gu et al., 2020)       N/A      N/A      0.855    0.928    0.983      0.830    0.919    0.985
BERT-SL (Xu et al., 2020)       N/A      N/A      0.884    0.946    0.990      N/A      N/A      N/A
UMS-BERT (Whang et al., 2021)   N/A      N/A      0.843    0.920    0.982      N/A      N/A      N/A
BERT+FGC (Li et al., 2021)      N/A      N/A      0.829    0.910    0.980      N/A      N/A      N/A
Panoramic-encoder               0.869    0.922    0.886    0.946    0.989      0.859    0.938    0.990
                                ±0.001   ±0.000   ±0.001   ±0.001   ±0.000     ±0.000   ±0.001   ±0.000

Table 2: Evaluation on three benchmark datasets. All results reported in the table are
fine-tuned on the plain BERT-base (Devlin et al., 2018) model without any post-training.
Average and standard deviation are calculated on three runs with different seeds.
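
For readers unfamiliar with the metrics in Table 2, the sketch below shows how $R_c@k$ and MRR are commonly computed from the rank of the gold response among $c$ candidates. The helper names and the toy scores are illustrative assumptions, and ties are resolved optimistically (only strictly higher scores push the gold response down), which may differ from the authors' evaluation script.

```python
def gold_rank(scores, gold_index):
    """1-based rank of the gold candidate; only strictly higher scores rank above it."""
    gold_score = scores[gold_index]
    return 1 + sum(score > gold_score for score in scores)


def recall_at_k(ranks, k):
    """Fraction of examples whose gold response is ranked within the top k."""
    return sum(rank <= k for rank in ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank of the gold response."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Toy examples: (candidate scores, index of the gold response).
examples = [
    ([0.9, 0.1, 0.3], 0),  # gold ranked 1st
    ([0.2, 0.5, 0.4], 2),  # gold ranked 2nd
    ([0.7, 0.6, 0.1], 2),  # gold ranked 3rd
]
ranks = [gold_rank(scores, gold) for scores, gold in examples]
r_at_1 = recall_at_k(ranks, 1)
r_at_2 = recall_at_k(ranks, 2)
mrr = mean_reciprocal_rank(ranks)
```

In the table, $R_{20}@1$ uses pools of 20 candidates (Persona-Chat) and $R_{10}@k$ uses pools of 10 candidates (Ubuntu).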

Model                Peak GPU Memory Allocated    Candidates    Inference Time
BERT Baseline        1.017 GB                     189200        213.27s
Panoramic-encoder    1.017 GB                     189200        74.62s

Table 3: Comparisons of the efficiency between Panoramic-encoder and baseline method on
Ubuntu V2.
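
The "approximately 3X" figure can be checked directly from the numbers in Table 3. The short calculation below is only arithmetic on the reported values; the variable names are ours.

```python
baseline_seconds = 213.27   # BERT baseline inference time (Table 3)
panoramic_seconds = 74.62   # Panoramic-encoder inference time (Table 3)
num_candidates = 189200     # candidates processed by both models (Table 3)

speedup = baseline_seconds / panoramic_seconds
baseline_throughput = num_candidates / baseline_seconds    # candidates per second
panoramic_throughput = num_candidates / panoramic_seconds  # candidates per second

print(f"speed-up: {speedup:.2f}x")
# prints "speed-up: 2.86x"
```

Because both runs use the same peak GPU memory and the same candidate set, the ratio of inference times is also the ratio of candidate throughput.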

4.4   Ablation Studies

Besides the novel architecture change of concatenating candidates, the Panoramic-encoder
also benefits from the other techniques discussed above. In this part, we closely analyze
their effects on model performance. Table 4 contains ablation studies conducted on the
Ubuntu Dialogue Corpus V2.

                                      Ubuntu Dialogue Corpus V2
                              dev       test
                              R10@1     R10@1     R10@2     R10@5     MRR
Panoramic-encoder             85.99*    85.92*    93.77*    98.98*    91.51*
w/o. MLM Loss                 81.76     82.00     91.14     98.37     88.89
w/o. Speaker Segmentation     84.11     84.45     92.65     98.40     90.40
w/o. Concatenation            84.38     84.43     92.98     98.66     90.48
w/o. In-batch Negatives       78.28     78.69     91.29     98.72     87.34

Table 4: Ablation studies on Ubuntu Dialogue Corpus V2.

5   Conclusion

We propose a new problem formulation for the dialogue response selection task. Based on
this, we present the Panoramic-encoder architecture with the Candidates Attention Mechanism,
which allows interactions between all candidates and selects the global optimal response in
a single inference. We investigate several existing techniques and effectively unify them to
push the state-of-the-art and achieve faster inference speed. Experiments on three
benchmarks demonstrate the superiority of our work over other advanced studies.

References

Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan,
  Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like
  open-domain chatbot. arXiv preprint arXiv:2001.09977.

Siqi Bao, Huang He, Fan Wang, Hua Wu, and Haifeng Wang. 2019. PLATO: Pre-trained dialogue
  generation model with discrete latent variable. arXiv preprint arXiv:1910.07931.

Siqi Bao, Huang He, Fan Wang, Hua Wu, Haifeng Wang, Wenquan Wu, Zhihua Wu, Zhen Guo, Hua Lu,
  Xinxian Huang, et al. 2021. PLATO-XL: Exploring the large-scale pre-training of dialogue
  generation. arXiv preprint arXiv:2109.09519.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training
  of deep bidirectional transformers for language understanding. arXiv preprint
  arXiv:1810.04805.

Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack
  Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. 2020. The second
  conversational intelligence challenge (ConvAI2). In The NeurIPS'18 Competition, pages
  187–208. Springer.

Hongchao Fang, Sicheng Wang, Meng Zhou, Jiayuan Ding, and Pengtao Xie. 2020. Cert: Contrastive
