A Unified Pre-training Framework for Conversational AI
Siqi Bao*, Bingjin Chen*, Huang He*, Xin Tian*, Han Zhou*, Fan Wang, Hua Wu, Haifeng Wang, Wenquan Wu, Yingzhan Lin
Baidu Inc., China
{baosiqi, chenbingjin, hehuang, tianxin06, zhouhan05, wang.fan, wu hua, wanghaifeng, wuwenquan01, linyingzhan01}@baidu.com

arXiv:2105.02482v2 [cs.CL] 27 May 2021

Abstract

In this work, we explore the application of PLATO-2 on various dialogue systems, including open-domain conversation, knowledge grounded dialogue, and task-oriented conversation. PLATO-2 is initially designed as an open-domain chatbot, trained via two-stage curriculum learning. In the first stage, a coarse-grained response generation model is learned to fit the simplified one-to-one mapping relationship. This model is applied to the task-oriented conversation, given that the semantic mappings tend to be deterministic in task completion. In the second stage, another fine-grained generation model and an evaluation model are further learned for diverse response generation and coherence estimation, respectively. With superior capability on capturing the one-to-many mapping, such models are suitable for open-domain conversation and knowledge grounded dialogue. For the comprehensive evaluation of PLATO-2, we have participated in multiple tasks of DSTC9, including interactive evaluation of open-domain conversation (Track3-task2), static evaluation of knowledge grounded dialogue (Track3-task1), and end-to-end task-oriented conversation (Track2-task1). PLATO-2 has obtained the 1st place in all three tasks, verifying its effectiveness as a unified framework for various dialogue systems.

[Figure 1: Toy example to show the one-to-one mapping (gray line) and one-to-many mapping (blue dashed lines) in open-domain conversations. Left: dialogue context ("It's so cold. I really miss summer."); Right: candidate responses ("How about making a snowman?", "It is snowing outside.", "That is nice.", ...).]

1 Introduction

The neural models in conversational AI can be roughly divided into three categories: open-domain chatbot, knowledge grounded dialogue agent, and task-oriented dialogue system (Gao, Galley, and Li 2018). Due to the significant differences among these tasks, it is usually necessary to customize the modeling and training for each task. Recently, pre-trained language models have gained tremendous success in natural language processing (Devlin et al. 2019; Brown et al. 2020), and pioneering efforts have been made to pre-train dialogue generation models (Bao et al. 2020a; Zhang et al. 2020). However, a unified pre-training framework that can effectively handle all three of these conversational tasks is still lacking.

In this work, we explore the application of PLATO-2 (Bao et al. 2020b) on the aforementioned tasks, including open-domain conversation, knowledge grounded dialogue, and task-oriented conversation. PLATO-2 is initially designed as an open-domain chatbot¹, trained via two-stage curriculum learning. In the first stage, a coarse-grained model is trained for general response generation under the simplified relationship of one-to-one mapping. In fact, one dialogue context might have multiple appropriate responses in open-domain conversations, as shown in the toy example of Figure 1. The one-to-one mapping network can only capture the common response patterns, resulting in general and dull responses during inference. As such, the curriculum learning continues to the next stage for high-quality response generation, as illustrated in Figure 2. In the second stage, the discrete latent variable is encoded into the network for one-to-many relationship modeling. Another fine-grained generation model and an evaluation model are further learned for diverse response generation and coherence estimation, respectively. The combination of fine-grained generation and evaluation helps PLATO-2 obtain new state-of-the-art results in open-domain conversations.

Similar to the open-domain conversation, the one-to-many mapping relationship also exists in knowledge grounded dialogue: given a dialogue context, multiple pieces of knowledge might be applicable for the response generation. Therefore, the one-to-many mapping models of the second stage can also be adapted for knowledge grounded dialogue. By expanding the network input with the knowledge segment, the background knowledge is encoded and grounded for response generation. Distinct from the open-domain conversation and knowledge grounded dialogue, there is a specific goal to be accomplished in task-oriented

* Equal contribution, listed in alphabetical order.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
¹ Source code and pre-trained models are released at https://github.com/PaddlePaddle/Knover/tree/develop/projects/PLATO-2.
[Figure 2: PLATO-2 illustration. (a) Network overview with the details of the transformer blocks (pre-normalization: layer norm, multi-head attention, and feed-forward sublayers, with token, segment, and position embeddings). (b) Curriculum learning process with self-attention visualization and training objectives: Stage 1 coarse-grained generation (one-to-one mapping, NLL loss) for general response generation, applied to task-oriented conversation; Stage 2.1 fine-grained generation (one-to-many mapping, NLL and BOW losses) for diverse response generation; Stage 2.2 evaluation (RCE and MLM losses) for response coherence estimation. The stage-2 models are applied to open-domain conversation and knowledge grounded dialogue.]

conversation. Accordingly, the conversation flow would become less diverse and concentrated on task completion. Therefore, the one-to-one mapping generation model of the first stage is employed for task-oriented conversation.

2 Open-domain Conversation

In the first stage, the coarse-grained generation model is trained under the simplified one-to-one mapping relationship, minimizing the negative log-likelihood (NLL) loss:

L_NLL^Baseline = -E log p(r|c) = -E Σ_{t=1}^{T} log p(r_t | c, r_{<t})

where c is the dialogue context, r is the target response, and T is the length of r.
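As a toy illustration of this objective (not the actual PLATO-2 training code, which computes the same quantity from transformer logits over a large vocabulary), the per-response NLL can be sketched as:

```python
import math

def nll_loss(token_probs):
    """Token-level negative log-likelihood for one (context, response) pair.

    `token_probs` holds the model's probability p(r_t | c, r_<t) assigned to
    each target token of the response; the loss sums -log p over the sequence.
    """
    return -sum(math.log(p) for p in token_probs)

# A response whose tokens the model predicts with certainty incurs zero loss;
# less confident predictions are penalized.
confident = nll_loss([1.0, 1.0, 1.0])
uncertain = nll_loss([0.5, 0.25, 0.5])
```

Minimizing this loss over many (context, response) pairs drives the model toward the most common response patterns, which is exactly why a second, latent-variable stage is needed for diversity.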
In the second stage, the discrete latent variable is encoded into the network for one-to-many relationship modeling, and the fine-grained generation model is trained with the latent-conditioned NLL loss:

L_NLL^Generation = -E_{z∼p(z|c,r)} log p(r|c,z) = -E_{z∼p(z|c,r)} Σ_{t=1}^{T} log p(r_t | c, z, r_{<t})   (1)

where z is one K-way categorical variable, z ∈ {1, ···, K}. Given that the sampling operation is not differentiable, we approximate it with Gumbel-Softmax (Jang, Gu, and Poole 2017). Besides the classical NLL loss, the bag-of-words (BOW) loss (Zhao, Zhao, and Eskenazi 2017) is also employed to facilitate the training of the latent variable:

L_BOW^Generation = -E_{z∼p(z|c,r)} Σ_{t=1}^{T} log p(r_t | c, z) = -E_{z∼p(z|c,r)} Σ_{t=1}^{T} log ( e^{f_{r_t}} / Σ_{v∈V} e^{f_v} )   (2)

where V refers to the vocabulary, and the function f tries to predict all the words in the target response using the output embedding h_z. As compared with the NLL loss, the BOW loss ignores word order and forces the final latent embedding to capture the global information of the response. To sum up, the training objective of the fine-grained generation model is to minimize the integrated loss:

L^Generation = L_NLL^Generation + L_BOW^Generation   (3)

By assigning distinct values to the latent variable, the fine-grained generation model is able to produce multiple diverse responses. For selecting the most appropriate one from these candidates, the evaluation model is trained to estimate the coherence between each response and the given dialogue context. During training, the evaluation model needs to distinguish the golden response r from a randomly selected negative response r^-:

L_RCE^Evaluation = -log p(l_r = 1 | c, r) - log p(l_{r^-} = 0 | c, r^-)

Besides the response coherence estimation (RCE) loss, the conventional masked language model (MLM) loss (Devlin et al. 2019) is also included to retain the representation ability of the network. To sum up, the training objective of the evaluation model is to minimize the integrated loss:

L^Evaluation = L_RCE^Evaluation + L_MLM^Evaluation   (4)

During inference, conditioned on each latent value z ∈ {1, ···, K}, a corresponding candidate response is produced by the fine-grained generation model p(r|c,z). The most coherent response is then selected as:

r* = argmax_{z ∈ {1, ···, K}} p(l_r = 1 | c, r)   (5)

In addition to the above coherence estimation, two other approaches are commonly adopted for response selection: length-averaged log-likelihood and maximum mutual information. The length-averaged log-likelihood considers the forward probability p(r|c), which tends to select common and general responses. The maximum mutual information considers the backward probability p(c|r), which favors responses with high overlap with the dialogue context. In comparison, the evaluation model p(l_r = 1|c,r) considers the bi-directional information flow between the dialogue context and response, achieving better performance at selecting coherent responses.

PLATO-2 learns gradually from coarse-grained general response generation to fine-grained diverse response generation via this curriculum learning process. Besides, the evaluation model further selects the most coherent response from multiple candidates. This combination of fine-grained generation and evaluation helps PLATO-2 obtain high-quality responses in open-domain conversations.

3 Knowledge Grounded Dialogue

Another common conversational task is knowledge grounded dialogue, where the response is generated based on both the dialogue context and background knowledge. The background knowledge can come in a variety of forms, such as persona profiles (Zhang et al. 2018), Wikipedia (Dinan et al. 2018), news articles (Gopalakrishnan et al. 2019), and so on. One example from Persona-Chat is given in Figure 3. It can be observed that the response generation relies not only on the dialogue context but also on the persona profiles.

[Figure 3: One example of knowledge grounded dialogue from Persona-Chat. Persona profiles: "I have four children", "I like to ski", "I love watching Game of Thrones", "I hate Mexican food", ... Dialogue: "Namaste. How are you today?" / "I am doing great. How are you?" / "Great, thanks. My children and I were just about to watch Game of Thrones." / ...]

Similar to the open-domain conversation, there also exists the one-to-many mapping relationship in knowledge grounded dialogue (Kim, Ahn, and Kim 2019): given a dialogue context, multiple pieces of knowledge might be applicable for the response generation. Within the PLATO-2 framework, the background knowledge can be encoded into the fine-grained generation and evaluation networks straightforwardly by adding a segment of knowledge before the dialogue context. The learning of response generation becomes p(r|k,c,z), where k refers to the background knowledge, and the evaluation model p(l_r = 1|k,c,r) considers the coherence with the dialogue context and the consistency with the background knowledge simultaneously. The fine-grained generation model produces diverse knowledge grounded responses, and the evaluation model further selects the most appropriate one from these candidates.

Model Name | Rank | Overall Human Rating | Coherent | Consistent | Error Recovery | Diversity | Topic Depth | Likeable | Understanding | Flexible | Informative | Inquisitive
PLATO-2 | 1 | 4.15 | 2.8017 | 2.7518 | 0.9390 | 2.7441 | 2.7678 | 2.7878 | 2.8285 | 2.8000 | 2.7881 | 2.7949
concise-and-comprehensive | 2 | 4.14 | 2.8157 | 2.7644 | 0.9573 | 2.7235 | 2.6530 | 2.7709 | 2.8225 | 2.7645 | 2.7065 | 2.7799
DPKS-v0-w | 3 | 4.08 | 2.7686 | 2.6538 | 0.9222 | 2.6791 | 2.6909 | 2.7348 | 2.7733 | 2.7073 | 2.7162 | 2.6943
DPKS-v1-w | 4 | 4.03 | 2.7327 | 2.6490 | 0.9153 | 2.6875 | 2.6362 | 2.7271 | 2.7348 | 2.6824 | 2.7280 | 2.7264
most-diverse-model | 5 | 3.93 | 2.7161 | 2.6000 | 0.9068 | 2.6550 | 2.5978 | 2.7006 | 2.7072 | 2.6361 | 2.6921 | 2.6760

Table 1: Human evaluation results on Track3-task2 interactive open-domain conversations, with the best value written in bold.

4 End-to-end Task-oriented Conversation

Conventional task-oriented dialogue systems usually adopt the pipeline architecture, including natural language understanding (NLU), dialogue state tracking (DST), dialogue policy, and natural language generation (NLG) modules. Recently, some works (Ham et al. 2020; Peng et al. 2020) have
been introduced for end-to-end task-oriented dialogue generation with pre-trained language models. In this section, we discuss how to apply PLATO-2 to end-to-end task-oriented conversations.

Distinct from open-domain conversation and knowledge grounded dialogue, the task-oriented conversation is supposed to accomplish a particular goal. Therefore, the semantic mapping between dialogue context and response is less diverse. To this end, the one-to-one mapping generation model of stage 1 is employed for task-oriented conversation. Even with this powerful pre-trained generation model, it is still challenging to carry out end-to-end task-completion conversations. Firstly, the generation model needs an effective way to interact with the external database: it is necessary to retrieve relevant information from the database for response generation, such as the candidates meeting the current user's criteria. Secondly, it is crucial for task completion to extract the entity precisely from the conversation. However, the entity name is non-categorical, and the user might mention it in various forms. Thirdly, the user's requests might be ambiguous, and the model has difficulties in capturing the user's real needs. Taking the utterance "I want to find a hotel to stay" as an example, it is hard to tell whether the user wants to find any place to stay or specifically a hotel instead of a guesthouse.

To tackle the above problems, several techniques are employed in our work. Firstly, the interaction with the external database is enabled through dialogue state estimation (Ham et al. 2020), and a flexible two-phase generation process is adopted to produce the final response. In the first phase, the model generates the dialogue state, system action, and system response simultaneously. The dialogue state is used as a constraint for the database query, and the system action can be refreshed according to the queried results. If there is any update to the system action, or no candidate is found from the queried results, the second-phase generation is carried out to produce the final response. Secondly, to boost the extraction of entity names, we employ fuzzy matching between the dialogue context and the database, where special tokens are added around the candidate entity names. Through this enhanced representation, our model achieves better accuracy and generalization in entity detection. Thirdly, to deal with ambiguous requests, active clarification is introduced by raising a clarifying question towards the user, such as "would you like a guesthouse or a hotel". With active clarification, the model can capture the user's real needs under ambiguous scenarios. The above three techniques – effective interaction with an external database, improved entity representation, and active clarification – help PLATO-2 achieve a better success rate and user experience in task-oriented conversations.

5 Experiments

For the comprehensive evaluation of PLATO-2, we have enrolled in multiple tasks of DSTC9 (Gunasekara et al. 2020):
• Track3-task2: interactive evaluation of open-domain conversation;
• Track3-task1: static evaluation of knowledge grounded dialogue;
• Track2-task1: end-to-end task-oriented conversation.

PLATO-2 is pre-trained with 684M (context, response) samples extracted from Reddit, and the vocabulary has 8k BPE subwords. All the models have 32 transformer blocks and 32 attention heads, with a hidden embedding dimension of 2048. For open-domain conversation and knowledge grounded dialogue, responses are generated with top-k sampling (Fan, Lewis, and Dauphin 2018), where k is set to 20. For task-oriented conversation, responses are produced with

Model Name | Rank | Overall Human Rating | Interesting | Engaging | Specific | Relevant | Correct | Semantically Appropriate | Understandable | Fluent
PLATO-2 w/o explicit knowledge | 1 | 4.2814 | 2.8746 | 2.8542 | 2.5627 | 2.8814 | 2.6678 | 2.8780 | 0.9932 | 2.9119
PLATO-2 w/ explicit knowledge | 1 | 4.2804 | 2.8142 | 2.8480 | 2.5439 | 2.8682 | 2.6520 | 2.8682 | 0.9966 | 2.8885
final-results-test-plato-best-1004 | 1 | 4.2799 | 2.8396 | 2.8703 | 2.5631 | 2.8976 | 2.6860 | 2.9147 | 0.9966 | 2.9147
nofact-data-finetune | 4 | 4.2603 | 2.8185 | 2.8082 | 2.6062 | 2.8493 | 2.6849 | 2.8048 | 1.0000 | 2.9075
dialogpt-fact-dialogrpt-v0 | 5 | 4.2526 | 2.8601 | 2.8771 | 2.5768 | 2.8703 | 2.7167 | 2.9044 | 0.9966 | 2.9078

Table 2: Human evaluation results on Track3-task1 static knowledge grounded dialogues, with the best value written in bold.
Rank | Team ID | Best Spec # | Success Rate | Complete Rate | Book Rate | Inform P/R/F1 | Turn (succ/all)
1 | 1 (Ours) | Submission3 | 93.0 | 95.2 | 94.6 | 84.1 / 96.2 / 88.1 | 12.5 / 12.7
2 | 2 | Submission5 | 91.4 | 96.9 | 96.2 | 80.2 / 97.3 / 86.0 | 15.3 / 15.7
3 | 3 | Submission1 | 90.8 | 94.4 | 96.7 | 81.0 / 95.4 / 85.9 | 13.4 / 13.6
4 | 4 | Submission2 | 89.8 | 94.6 | 96.3 | 72.4 / 96.0 / 80.1 | 15.1 / 15.8
5 | 5 | Submission2 | 83.3 | 88.5 | 89.1 | 81.1 / 90.3 / 83.5 | 13.5 / 13.8
N/A | Baseline | Baseline | 85.0 | 92.4 | 91.4 | 79.3 / 94.9 / 84.5 | 13.8 / 14.9

Table 3: Automatic evaluation results on Track2-task1 end-to-end task-oriented conversations, with the best value written in bold.
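The two-phase generation loop of Section 4 can likewise be sketched in outline. The paper gives no pseudo-code for this flow, so the function names (`generate_phase1`, `generate_phase2`, `query_db`, `refresh_action`) are hypothetical stand-ins for the fine-tuned model and the database interface:

```python
def respond(context, generate_phase1, generate_phase2, query_db, refresh_action):
    """Two-phase task-oriented response generation (schematic).

    Phase 1 emits the dialogue state, system action, and a draft response;
    the state constrains a database query. If the queried results change
    the system action, or no candidate is found, phase 2 regenerates the
    response conditioned on the refreshed action and results.
    """
    state, action, draft = generate_phase1(context)
    results = query_db(state)                  # state constrains the DB query
    new_action = refresh_action(action, results)
    if new_action != action or not results:    # action updated or no candidate
        return generate_phase2(context, state, new_action, results)
    return draft                               # phase-1 draft is already consistent

# Toy stand-ins to exercise the control flow (not real models):
phase1 = lambda c: ({"area": "north"}, "recommend", "draft reply")
phase2 = lambda c, s, a, r: f"phase2:{a}"
keep_or_no_offer = lambda a, r: a if r else "no_offer"

hit = respond("ctx", phase1, phase2, lambda s: ["acorn guest house"], keep_or_no_offer)
miss = respond("ctx", phase1, phase2, lambda s: [], keep_or_no_offer)
```

The key design point this sketch captures is that the second, more expensive generation pass only runs when the database contradicts the model's first guess.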
Rank | Team ID | Best Spec # | Average Success Rate | Success Rate w/ DB Grounding | Success Rate w/o DB Grounding | Language Understanding Score | Response Appropriateness Score | Turns
1 | 1 (Ours) | Submission5 | 74.8 | 70.2 | 79.4 | 4.54 | 4.47 | 18.5
1 | 2 | Submission1 | 74.8 | 68.8 | 80.8 | 4.51 | 4.45 | 19.4
3 | 7 | Submission4 | 72.3 | 62.0 | 82.6 | 4.53 | 4.41 | 17.1
4 | 6 | Submission1 | 70.6 | 60.8 | 80.4 | 4.41 | 4.41 | 20.1
5 | 3 | Submission3 | 67.8 | 60.0 | 75.6 | 4.56 | 4.42 | 21.0
N/A | Baseline | Baseline | 69.6 | 56.8 | 82.4 | 4.34 | 4.18 | 18.5

Table 4: Human evaluation results on Track2-task1 end-to-end task-oriented conversations, with the best value written in bold.

beam search, where the beam size is set to 5. Experimental details on each task will be discussed below.

5.1 Open-domain Conversation

Interactive open-domain conversation is the most challenging direction in dialogue systems. The users are free to talk about any topic, and the system's replies are expected to meet a high standard on many aspects, including coherence, consistency, informativeness, engagingness, etc. Since PLATO-2 is initially designed as an open-domain chatbot, it can be applied directly in open-domain conversations. In DSTC9 Track3-task2, real internet users are attracted through Facebook advertising and communicate with the backend dialogue systems through DialPort (Zhao, Lee, and Eskenazi 2016). The collected logs are then distributed to AMT workers for assessment. For each system, 200 interactive dialogues are collected for human evaluation, and for each dialogue, three crowd-sourcing workers are asked to annotate it from multiple aspects and provide an overall score. The human evaluation results are summarized in Table 1. PLATO-2 achieves the highest overall human rating and performs well on many evaluation metrics.

5.2 Knowledge Grounded Dialogue

In DSTC9 Track3-task1, experiments are carried out on Topical-Chat (Gopalakrishnan et al. 2019), which is a large-scale dataset on knowledge grounded dialogue. For background knowledge, there are 300 entities in Topical-Chat, and each entity is associated with several short facts or articles. For each conversational turn, several relevant facts are provided, and the system can leverage these facts for response generation. As large-scale pre-trained models are capable of packing knowledge into their parameters (Roberts, Raffel, and Shazeer 2020), we test two experimental settings: PLATO-2 with and without explicit knowledge. In the first setting, the given relevant facts are appended before the dialogue context, and the model learns response generation based on explicit knowledge, p(r|k,c,z). In the second setting, the model tries to encode the knowledge into the network implicitly and learns knowledge grounded response generation directly, p(r|c,z).

In this task, systems need to produce the response given the dialogue context and relevant facts. During the evaluation, 100 randomly selected samples are distributed to AMT workers for assessment. For each conversational turn, three crowd-sourcing workers are asked to annotate it from multiple aspects and provide an overall score. The human evaluation results are summarized in Table 2. Three approaches are tied for first place, where the top two are our submitted PLATO-2 without and with explicit knowledge. Given the name of the third approach, PLATO-2 might dominate the leaderboard in knowledge grounded dialogue.

5.3 End-to-end Task-oriented Conversation

In DSTC9 Track2-task1, the end-to-end task-oriented conversation is carried out on MultiWOZ 2.1 (Eric et al. 2020), which is a multi-domain dialogue dataset on travel planning. In the experiments, we employ the one-to-one mapping model of PLATO-2 and fine-tune it with the aforementioned three techniques. In this task, the ConvLab-2 platform (Zhu et al. 2020) provides the user simulator for automatic evaluation. The evaluation results are summarized in Table 3 and
That would be awful. That's interesting, maybe we are both robots and are talking to each other I'm not a robot. Sometime I wish I were. Well i disagree, robots are more efficient. Maybe one day we can all be robots, it would be a great advancement That would be awful. Well i disagree, robots are more efficient. [P1] do you have a twitter account ? [P2] i do n't , do you ? [P1] nope do n't have one either . i here they covered the florida shooting . did you hear about that ? Context [P2] i did ! i read a few of the student 's tweets , but i did n't realize teachers were involved in tweeting about it as well . its interesting to see how younger generations are making their voices heard though [P1] yes it is , technology is in know way slowing down and today 's youth know how to use it well . Knowledge Taylor Swift : Taylor Swift once accidentally released 8 seconds of white noise and topped the Canadian iTunes charts. i had a random factual section about taylor swift- apparently at one point she accidentally released a few seconds of white noise and it topped itunes Golden Response charts in canada . did you hear about that ? Baseline i think that is a great way to put it . i think it 's funny that she released 8 seconds of white noise on itunes and it topped the canadian itunes charts . PLATO-2 w/o explicit knowledge I agree. I think the younger generation has the potential to make a real difference in the world though. Do you ever listen to music on your phone? PLATO-2 w/ explicit knowledge i wonder if it is slowing down though , as technology has advanced and people are more connected to each other . have you ever heard the white noise taylor swift releases on social media ? Table 5: Case analysis on knowledge grounded dialogue. Yes, on the job site it does. In the office, I just have to use my brain. Task Goal: • You are looking for a place to stay. The hotel should be in the north. 
[Figure 4: Case analysis on open-domain conversation.]
That's always good, the office is often times a stressful environment
It can be! It can also be very boring. That is why I'm chatting with a bot right now
That makes sense, is it a robot that is chatting with you?
I'm not sure. What are you?
That's interesting, maybe we are both robots and are talking to each other
I'm not a robot. Sometime I wish I were.
Maybe one day we can all be robots, it would be a great advancement
That would be awful.
Well i disagree, robots are more efficient.

[Table 5: Case analysis on knowledge grounded dialogue.]
Context:
[P1] do you have a twitter account ?
[P2] i do n't , do you ?
[P1] nope do n't have one either . i here they covered the florida shooting . did you hear about that ?
[P2] i did ! i read a few of the student 's tweets , but i did n't realize teachers were involved in tweeting about it as well . its interesting to see how younger generations are making their voices heard though
[P1] yes it is , technology is in know way slowing down and today 's youth know how to use it well .
Knowledge: Taylor Swift : Taylor Swift once accidentally released 8 seconds of white noise and topped the Canadian iTunes charts.
Golden Response: i had a random factual section about taylor swift- apparently at one point she accidentally released a few seconds of white noise and it topped itunes charts in canada . did you hear about that ?
Baseline: i think that is a great way to put it . i think it 's funny that she released 8 seconds of white noise on itunes and it topped the canadian itunes charts .
PLATO-2 w/o explicit knowledge: I agree. I think the younger generation has the potential to make a real difference in the world though. Do you ever listen to music on your phone?
PLATO-2 w/ explicit knowledge: i wonder if it is slowing down though , as technology has advanced and people are more connected to each other . have you ever heard the white noise taylor swift releases on social media ?

[Figure 5: Case analysis on task-oriented conversation.]
Goal:
• The hotel should include free parking.
• The hotel should include free wifi.
• Once you find a hotel, make sure you get postcode, stars.
Dialogue:
User: I need a hotel to stay in the north , it should have free parking
System: What price range would you like ?
User: I don't care the pricerange . it must have free wifi
System: Would you like a guesthouse or a hotel ?
User: Any one is ok
System: I would recommend the acorn guest house . It is in the moderate price range and has 4 stars . It also has free parking and wifi . Would you like more information ?
User: yes , please tell me the postcode
System: The postcode is cb41da . Is there anything else I can help you with ?
User: that's all . thank you
System: You 're welcome , have a great day .

[Table 4: human evaluation results; our approach ranks the 1st with the highest success rate.]

Aside from the automatic evaluation, AMT workers are asked to communicate with the systems for task completion. When the conversation is finished, AMT workers need to give evaluation scores on several aspects. The human evaluation results are summarized in Table 4. The average success rate is calculated as the average value of the following two metrics: the success rate without database grounding and the success rate with database grounding. The success rate without database grounding is based on the AMT worker's annotation during communication (success or fail). In fact, AMT workers do not know whether the provided values from the system are consistent with the database or not. In comparison, the success rate with database grounding is a stricter and more practical metric. The dialogue is considered a success if and only if: 1) the AMT worker marks the dialogue as a success; 2) the provided request slot values plus inform slot values from the system can be found in the database. Our approach achieves the highest score on success rate with database grounding. The first two approaches are placed as co-champion based on the average success rate in the final ranking.

5.4 Discussions
To further dissect the performance of PLATO-2, several cases from various conversations are provided for analysis. As shown in Figure 4, one dialogue snippet is selected from the interaction between a real user and our system. In the DialPort platform, users are informed in advance that they will communicate with AI bots. This dialogue snippet demonstrates that PLATO-2 is able to produce coherent and engaging responses in open-domain conversation. For knowledge grounded dialogue, one example is displayed in Table 5. As compared with the golden and baseline responses, the responses generated by PLATO-2 are more coherent with the dialogue context. Instead of changing the topic suddenly or copying the given facts directly, PLATO-2 absorbs the knowledge and conveys the information in a natural way. For task-oriented conversation, one dialogue snippet with the corresponding goal is selected and shown in Figure 5. The user is asked to interact with the system to accomplish a specific goal. The system needs to find out the entity that satisfies the user's requirements. As exhibited in the case, the system actively communicates with the user to narrow down the scope of candidates and successfully returns the required information.
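The success-rate criterion used in the human evaluation above is mechanical enough to sketch in code. The following is an illustrative sketch, not the authors' evaluation script; the dictionary keys and the (slot, value)-pair data layout are assumptions made for the example.

```python
# Illustrative sketch of the two success-rate metrics described above.
# A dialogue counts as a success *with database grounding* only if
# (1) the AMT worker marked it as a success AND (2) every request/inform
# slot value provided by the system can be found in the database.

def success_with_db(dialog, database):
    """dialog: dict with 'marked_success' (bool) and 'provided_slots'
    (list of (slot, value) pairs); database: set of (slot, value) pairs."""
    if not dialog["marked_success"]:
        return False
    return all(pair in database for pair in dialog["provided_slots"])

def average_success_rate(dialogs, database):
    # Success rate without grounding: the worker annotation alone.
    without_db = sum(d["marked_success"] for d in dialogs) / len(dialogs)
    # Success rate with grounding: annotation plus the database check.
    with_db = sum(success_with_db(d, database) for d in dialogs) / len(dialogs)
    # The final ranking metric is the mean of the two rates.
    return (without_db + with_db) / 2
```

As the sketch makes explicit, a system can score well on the annotation-only metric while hallucinating slot values, which is why the database-grounded variant is the stricter of the two.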
Despite its effectiveness on multiple conversational tasks, PLATO-2 still suffers from several limitations common to general dialogue models, including factual errors, logical inconsistency, and toxic or biased language. Recently, some pioneering works have been proposed to alleviate these problems. For example, retrieval-augmented generation (Lewis et al. 2020) provides knowledge provenance from Wikipedia, and several recipes have been explored to increase the safety of open-domain chatbots (Xu et al. 2020). Future work will be carried out along these directions to boost the model's capacity.
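To make the division of labor between the two PLATO-2 stages concrete, the second-stage inference used for open-domain and knowledge grounded dialogue can be viewed as generate-then-rank: the fine-grained generation model produces one candidate response per discrete latent value, and the evaluation model picks the most coherent candidate. The sketch below is illustrative only, with stand-in generator and scorer callables rather than PLATO-2's actual networks.

```python
# Illustrative generate-then-rank inference for the one-to-many mapping:
# one candidate response per discrete latent value, ranked by a coherence
# scorer. `generate` and `score_coherence` are stand-in callables here.

def respond(context, generate, score_coherence, num_latent=20):
    """Return the candidate response judged most coherent with the context.

    generate(context, z) -> a response string conditioned on latent value z
    score_coherence(context, response) -> float, higher means more coherent
    """
    candidates = [generate(context, z) for z in range(num_latent)]
    return max(candidates, key=lambda r: score_coherence(context, r))
```

By contrast, the coarse-grained first-stage model used for task-oriented conversation decodes a single response directly, since the semantic mapping in task completion is close to deterministic.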
6 Related Work
Related works are discussed on two fronts: pre-trained dialogue generation models and task-oriented dialogue systems.

Pre-trained language models have brought significant breakthroughs in natural language processing (Devlin et al. 2019; Radford et al. 2019; Brown et al. 2020). To boost the performance of dialogue generation, DialoGPT (Zhang et al. 2020) is trained on the basis of GPT-2 (Radford et al. 2019) using Reddit comments. To obtain a human-like chatbot, Meena (Adiwardana et al. 2020) utilizes more social media conversations and scales the network up to 2.6B parameters. To strengthen desirable conversational skills, Blender (Roller et al. 2020) further fine-tunes the pre-trained model with human-annotated conversations. To tackle the one-to-many mapping problem, PLATO-2 (Bao et al. 2020b) encodes a discrete latent variable into the network and achieves new state-of-the-art results in open-domain conversation. In this work, we demonstrate that the one-to-many mapping models of PLATO-2 can be applied effectively to both open-domain conversation and knowledge grounded dialogue.

For task-oriented dialogue systems, conventional approaches (Young et al. 2013; Henderson, Thomson, and Williams 2014; Wen et al. 2015) usually adopt pipeline modules, including natural language understanding (NLU), dialogue state tracking (DST), dialogue policy, and natural language generation (NLG). Recently, end-to-end neural models (Wen et al. 2017; Ham et al. 2020; Peng et al. 2020) have been introduced for task-oriented dialogue systems. The end-to-end system of Ham et al. (2020) retains the core concepts of the pipeline and generates the dialogue state, system action, and system response simultaneously. In this work, we demonstrate that the one-to-one mapping model of PLATO-2 can be adopted as a powerful basis. With enhanced entity representation and active clarification, PLATO-2 achieves new state-of-the-art results in task-oriented conversation.

7 Conclusion
In this work, we explore the application of PLATO-2 to various dialogue systems, including open-domain chit-chat, knowledge grounded dialogue, and task-oriented conversation. The training of PLATO-2 is carried out via two-stage curriculum learning. In the first stage, the network tries to fit the simplified one-to-one mapping between the dialogue context and response. In the second stage, a discrete latent variable is encoded into the network for one-to-many mapping modeling, and one fine-grained generation model and one evaluation model are further learned for diverse response generation and coherence estimation. The model from the first stage is applicable to task-oriented conversation, while the models from the second stage are suitable for open-domain conversation and knowledge grounded dialogue. Comprehensive evaluations in DSTC9 demonstrate that PLATO-2 is an effective unified pre-training framework for conversational AI.

Acknowledgments
We would like to thank the reviewers for their constructive suggestions; Jingzhou He and Tingting Li for their help on resource coordination; and Gaopeng Yong, Liankai Huang, and Hua Lu for their generous help. This work was supported by the National Key Research and Development Project of China (No. 2018AAA0101900).

References
Adiwardana, D.; Luong, M.-T.; So, D. R.; Hall, J.; Fiedel, N.; Thoppilan, R.; Yang, Z.; Kulshreshtha, A.; Nemade, G.; Lu, Y.; et al. 2020. Towards a Human-like Open-Domain Chatbot. arXiv preprint arXiv:2001.09977.
Bao, S.; He, H.; Wang, F.; Wu, H.; and Wang, H. 2020a. PLATO: Pre-trained Dialogue Generation Model with Discrete Latent Variable. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 85–96.
Bao, S.; He, H.; Wang, F.; Wu, H.; Wang, H.; Wu, W.; Guo, Z.; Liu, Z.; and Xu, X. 2020b. PLATO-2: Towards Building an Open-Domain Chatbot via Curriculum Learning. arXiv preprint arXiv:2006.16779.
Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186.
Dinan, E.; Roller, S.; Shuster, K.; Fan, A.; Auli, M.; and Weston, J. 2018. Wizard of Wikipedia: Knowledge-Powered Conversational Agents. In International Conference on Learning Representations.
Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W. 2019. Unified Language Model Pre-training for Natural Language Understanding and Generation. In Advances in Neural Information Processing Systems, 13063–13075.
Eric, M.; Goel, R.; Paul, S.; Sethi, A.; Agarwal, S.; Gao, S.; Kumar, A.; Goyal, A.; Ku, P.; and Hakkani-Tur, D. 2020. MultiWOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines. In Proceedings of The 12th Language Resources and Evaluation Conference, 422–428.
Fan, A.; Lewis, M.; and Dauphin, Y. 2018. Hierarchical Neural Story Generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 889–898.
Gao, J.; Galley, M.; and Li, L. 2018. Neural Approaches to Conversational AI. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 1371–1374.
Gopalakrishnan, K.; Hedayatnia, B.; Chen, Q.; Gottardi, A.; Kwatra, S.; Venkatesh, A.; Gabriel, R.; and Hakkani-Tür, D. 2019. Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, 1891–1895.
Gunasekara, C.; Kim, S.; D'Haro, L. F.; Rastogi, A.; Chen, Y.-N.; Eric, M.; Hedayatnia, B.; Gopalakrishnan, K.; Liu, Y.; Huang, C.-W.; Hakkani-Tür, D.; Li, J.; Zhu, Q.; Luo, L.; Liden, L.; Huang, K.; Shayandeh, S.; Liang, R.; Peng, B.; Zhang, Z.; Shukla, S.; Huang, M.; Gao, J.; Mehri, S.; Feng, Y.; Gordon, C.; Alavi, S. H.; Traum, D.; Eskenazi, M.; Beirami, A.; Cho, E.; Crook, P. A.; De, A.; Geramifard, A.; Kottur, S.; Moon, S.; Poddar, S.; and Subba, R. 2020. Overview of the Ninth Dialog System Technology Challenge: DSTC9. arXiv preprint arXiv:2011.06486.
Ham, D.; Lee, J.-G.; Jang, Y.; and Kim, K.-E. 2020. End-to-End Neural Pipeline for Goal-Oriented Dialogue Systems using GPT-2. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 583–592.
Henderson, M.; Thomson, B.; and Williams, J. D. 2014. The Second Dialog State Tracking Challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 263–272.
Jang, E.; Gu, S.; and Poole, B. 2017. Categorical Reparameterization with Gumbel-Softmax. In International Conference on Learning Representations.
Kim, B.; Ahn, J.; and Kim, G. 2019. Sequential Latent Knowledge Selection for Knowledge-Grounded Dialogue. In International Conference on Learning Representations.
Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems.
Peng, B.; Li, C.; Li, J.; Shayandeh, S.; Liden, L.; and Gao, J. 2020. SOLOIST: Few-shot Task-Oriented Dialog with A Single Pre-trained Auto-regressive Model. arXiv preprint arXiv:2005.05298.
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners. Technical report, OpenAI.
Roberts, A.; Raffel, C.; and Shazeer, N. 2020. How Much Knowledge Can You Pack Into the Parameters of a Language Model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 5418–5426.
Roller, S.; Dinan, E.; Goyal, N.; Ju, D.; Williamson, M.; Liu, Y.; Xu, J.; Ott, M.; Shuster, K.; Smith, E. M.; et al. 2020. Recipes for Building an Open-Domain Chatbot. arXiv preprint arXiv:2004.13637.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, 5998–6008.
Wen, T.-H.; Gasic, M.; Mrkšić, N.; Su, P.-H.; Vandyke, D.; and Young, S. 2015. Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1711–1721.
Wen, T.-H.; Vandyke, D.; Mrkšić, N.; Gasic, M.; Barahona, L. M. R.; Su, P.-H.; Ultes, S.; and Young, S. 2017. A Network-based End-to-End Trainable Task-oriented Dialogue System. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 438–449.
Xu, J.; Ju, D.; Li, M.; Boureau, Y.-L.; Weston, J.; and Dinan, E. 2020. Recipes for Safety in Open-domain Chatbots. arXiv preprint arXiv:2010.07079.
Young, S.; Gašić, M.; Thomson, B.; and Williams, J. D. 2013. POMDP-Based Statistical Spoken Dialog Systems: A Review. In Proceedings of the IEEE, volume 101, 1160–1179.
Zhang, S.; Dinan, E.; Urbanek, J.; Szlam, A.; Kiela, D.; and Weston, J. 2018. Personalizing Dialogue Agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2204–2213.
Zhang, Y.; Sun, S.; Galley, M.; Chen, Y.-C.; Brockett, C.; Gao, X.; Gao, J.; Liu, J.; and Dolan, B. 2020. DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 270–278.
Zhao, T.; Lee, K.; and Eskenazi, M. 2016. DialPort: Connecting the Spoken Dialog Research Community to Real User Data. In 2016 IEEE Spoken Language Technology Workshop, 83–90.
Zhao, T.; Zhao, R.; and Eskenazi, M. 2017. Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 654–664.
Zhu, Q.; Zhang, Z.; Fang, Y.; Li, X.; Takanobu, R.; Li, J.; Peng, B.; Gao, J.; Zhu, X.; and Huang, M. 2020. ConvLab-2: An Open-Source Toolkit for Building, Evaluating, and Diagnosing Dialogue Systems. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 142–149.