Summary Grounded Conversation Generation

Chulaka Gunasekara, Guy Feigenblat, Benjamin Sznajder, Sachindra Joshi, David Konopnicki
IBM Research AI
chulaka.gunasekara@ibm.com, {guyf, benjams}@il.ibm.com, {jsachind@in, davidko@il*}.ibm.com
* Current address: david.konopnicki@booking.com

Abstract

Many conversation datasets have been constructed in recent years using crowd-sourcing. However, the data collection process can be time consuming and presents many challenges to ensure data quality. Since language generation has improved immensely in recent years with the advancement of pre-trained language models, we investigate how such models can be utilized to generate entire conversations, given only a summary of a conversation as the input. We explore three approaches to generate summary grounded conversations, and evaluate the generated conversations using automatic measures and human judgements. We also show that the accuracy of conversation summarization can be improved by augmenting a conversation summarization dataset with generated conversations.

1 Introduction

Automatic conversation systems require large quantities of data to learn task-specific language patterns and underlying conversation policies. Such data either come from human-to-human conversation logs (Lowe et al., 2015; Hardalov et al., 2018) or are collected in crowd-sourced environments, where two or more crowd-workers play specific roles under some guidelines (Zhang et al., 2018; Budzianowski et al., 2018). Since real human-to-human conversation logs are scarce, many datasets have been created using the latter approach. However, crowd-sourced conversation data collection is time consuming, costly, and presents multiple challenges to ensure data quality (Kang et al., 2018).

Conversation summarization is an emerging research area that has been ill-studied due to the lack of large-scale datasets. Most existing public datasets in this domain are small: the AMI meeting corpus (McCowan et al., 2005) contains 137 summary transcripts, and CRD3 (Rameshkumar and Bailey, 2020) is a spoken conversation dataset that consists of 159 conversations and summaries. Samsum (Gliwa et al., 2019), the only large-scale dataset for conversation summarization, contains over 16,000 open-domain conversations and summaries created artificially by humans.
Large-scale pre-trained language models (PLMs) (Lewis et al., 2020; Brown et al., 2020; Raffel et al., 2020) have been used in various text generation tasks (Budzianowski and Vulić, 2019; Min et al., 2020; Cachola et al., 2020). In recent studies, PLMs are used to generate training data for natural language processing (NLP) applications. For example, Anaby-Tavor et al. (2020) and Yang et al. (2020) use PLMs to create paraphrases for intent classifiers in conversation systems, and show that performance improves when the original datasets are augmented with the generated data. More recently, Mohapatra et al. (2020) generated entire conversations grounded on the instructions that are provided to crowd-workers, using a modular approach where different PLMs are trained for different roles.

Our Contributions: We investigate how PLMs can be utilized to generate entire conversations that are grounded on a given summary. We explore three approaches: (1) Supervised Learning (SL) based conversation generation (SL-Gen), where a PLM is trained to generate an entire conversation, taking the summary of a conversation as input; (2) Reinforced Learning (RL) based conversation generation (RL-Gen), where we further improve the SL-Gen method using the quality of the generated conversations as a reward; and (3) Controlled turn-by-turn conversation generation (CN-Gen), which allows us to generate conversations turn-by-turn, constrained on the summary and a set of pre-defined control parameters. We evaluate the quality of the generated conversations by conducting automatic and human evaluation. We also show that once a conversation summarization dataset is augmented with the generated conversations, the performance of the downstream summarization task is improved.
2 Summary grounded conversation generation

In the conversation summarization task, a model takes a conversation as input and learns to generate a summary. We study the inverse of that problem, where the input to our model is a summary and the model generates a conversation. In this section, we propose three models for this task; the hyper-parameters used in training the models are available in Section A of the appendix.

2.1 SL based generation (SL-Gen)

A seq2seq model can be trained for this task by providing a summary as the input and generating a conversation token-by-token. As PLMs have shown significant improvement over the traditional seq2seq architecture for text generation, we use a GPT-2 model and fine-tune it to generate a conversation given a summary as the input. The input to the model is the summary text followed by the conversation text, and we use different token-type-ids to indicate the summary and the conversation segments. The model is trained to optimize a cross-entropy loss.
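As an illustration, a minimal sketch of how such a summary-to-conversation training example could be assembled for GPT-2 fine-tuning is shown below. The <bos>/<sep>/<eos> delimiter tokens and the helper name are assumptions made for this sketch, not the paper's exact serialization or training code.

# Minimal sketch (not the authors' code): one training example for
# summary-grounded conversation generation with GPT-2.
# The <bos>/<sep>/<eos> markers and the 0/1 token-type ids are illustrative.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.add_special_tokens(
    {"bos_token": "<bos>", "eos_token": "<eos>",
     "additional_special_tokens": ["<sep>"]}
)
# After adding tokens, the model embeddings would need resizing:
# model.resize_token_embeddings(len(tokenizer))

def build_example(summary, conversation):
    summary_ids = tokenizer.encode(f"<bos> {summary} <sep>")
    conv_ids = tokenizer.encode(f" {conversation} <eos>")
    input_ids = summary_ids + conv_ids
    # Token-type ids mark the summary (0) and conversation (1) segments.
    token_type_ids = [0] * len(summary_ids) + [1] * len(conv_ids)
    # Standard causal LM fine-tuning: labels equal the inputs; the model
    # shifts them internally and computes the cross-entropy loss.
    return {"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "labels": list(input_ids)}

example = build_example(
    "person0 will be late. person1 will order pasta for her.",
    "person0: I'll be late. person1: I'll order some pasta for you.",
)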
2.2 RL based generation (RL-Gen)

Many studies train text generation models with RL (Paulus et al., 2018; Li et al., 2016), where the generator network is optimized with a task-specific reward. We investigate how the quality of the generated conversation can be used as a reward to improve the generation network. To this end, we train a summary generator network, which generates a summary given a conversation. We measure the quality of the generated conversation by the similarity between the summary of the generated conversation (generated, in turn, by the summary generator network) and the ground truth summary. The similarity score is used as a reward to train the conversation generation model. Our RL based generation framework is shown in Figure 1, and its critical components are described below.

[Figure 1: The RL based conversation generation framework]

Conversation Generator: A trained SL-Gen model is used as the conversation generator, which, given a summary, can generate a conversation.

Summary Generator: We use a lightweight variant of BART (Lewis et al., 2019), named DistilBART, which is fine-tuned on the extreme summarization task (Narayan et al., 2018). We further fine-tune this instance on the conversation summarization data by providing the conversations as the input and training the model to output summaries.

Reward Model: Once the Summary Generator produces an output summary for the generated conversation, the reward model compares it with the ground truth summary that was used to ground the conversation generation. As in Paulus et al. (2018), we use the ROUGE-2 F1-score as the reward.

Policy training: We use proximal policy optimization (Schulman et al., 2017) as the optimizer for the policy training, as it prevents the generator from deviating far away from the pretrained LM (Wu et al., 2020).
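To make the reward signal concrete, the sketch below computes a ROUGE-2 F1 reward between the summary of a generated conversation and the ground-truth summary. It is a simplified stand-in that assumes the rouge_score package and a generic summarize() callable in place of the fine-tuned DistilBART summary generator.

# Simplified sketch of the RL-Gen reward: ROUGE-2 F1 between the summary of
# the generated conversation and the ground-truth summary used for grounding.
# `summarize` is a placeholder for the fine-tuned DistilBART summary generator.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

def conversation_reward(generated_conversation, ground_truth_summary, summarize):
    predicted_summary = summarize(generated_conversation)
    # The ROUGE-2 F-measure is the scalar reward passed to the PPO optimizer.
    return scorer.score(ground_truth_summary, predicted_summary)["rouge2"].fmeasure

# Example with a trivial stand-in summarizer:
reward = conversation_reward(
    "person0: I'll be late. person1: I'll order pasta for you.",
    "person0 will be late. person1 will order pasta for her.",
    summarize=lambda conversation: conversation,  # placeholder summarizer
)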
2.3 Controlled conversation generation

We propose another approach, CN-Gen, for conversation generation, which grants more control over the properties of the generated conversations. Here, we generate one utterance of the conversation at a time, as opposed to RL-Gen, where we generate the whole conversation at once. The properties of the generated conversations are controlled by adding several components to the input sequence of the model. The following three variables are used as control parameters: (1) Number of remaining turns to generate in the conversation (Num turns): during the generation of a turn, we indicate the remaining number of turns in the conversation; when generating an n-turn conversation, this starts at n for the first turn and reduces by 1 after the generation of each turn. (2) The speaker of the next turn (Speaker): this indicates to the model who speaks the next turn. (3) The length of the next turn (Turn length): we define 3 categories of lengths, Short (≤ 3 tokens), Long (> 10 tokens) and Medium (otherwise).

To fine-tune a GPT-2 model, we use an input representation that concatenates the summary text, the dialog context so far, the Num turns, Speaker and Turn length control values, and the target utterance. Changing these parameters allows us to generate different variants of conversations which are grounded on the same summary. During training, we obtain the values of the control parameters from the ground truth conversations; at inference, we randomly select the next speaker, the number of turns of the conversation to be generated (in a range of 4-15 turns), and the next turn length. In Table 1 we show conversations of different lengths that were generated by the CN-Gen approach, grounded on the same summary, by changing the control parameters.

Table 1: Multiple conversations (2-, 3-, 6- and 10-turn variants) generated by the CN-Gen approach grounded on the same summary: "person0 will be late. person1 will order pasta with salmon and basil for her." For example, the 2-turn variant is: "I'll be late" / "I'll order some pasta with salmon and basil for you."

A summary and a conversation from the Samsum dataset (Gliwa et al., 2019), along with the conversations generated by the three aforementioned algorithms, are shown in Figure 2. More examples are provided in Section B of the Appendix.
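To make the turn-by-turn control concrete, the following sketch assembles one controlled generation prompt. The bracketed field tags, their ordering, and the helper names are assumptions made for illustration; they are not the paper's actual serialization.

# Illustrative sketch (not the authors' exact format): building one CN-Gen
# prompt. Control tags and field order are assumptions.
def length_bucket(num_tokens):
    # Turn length categories from the paper: Short (<= 3 tokens),
    # Long (> 10 tokens), Medium otherwise.
    if num_tokens <= 3:
        return "short"
    if num_tokens > 10:
        return "long"
    return "medium"

def build_cn_gen_input(summary, dialog_context, remaining_turns,
                       next_speaker, next_turn_length):
    context = " ".join(dialog_context)
    return (f"[summary] {summary} "
            f"[context] {context} "
            f"[turns_left] {remaining_turns} "
            f"[speaker] {next_speaker} "
            f"[length] {next_turn_length} "
            f"[utterance]")

prompt = build_cn_gen_input(
    summary="person0 will be late. person1 will order pasta for her.",
    dialog_context=["person0: I'll be late."],
    remaining_turns=2,            # decreases by 1 after each generated turn
    next_speaker="person1",
    next_turn_length="medium",
)
# `prompt` would be fed to the fine-tuned GPT-2, which completes the next
# utterance; the loop repeats until the remaining turn count reaches zero.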
3 Experiments

We experiment on the Samsum (Gliwa et al., 2019) dataset, which, to the best of our knowledge, is the only public large-scale conversation summarization dataset. We pre-process the dataset by replacing personal names (e.g., John) with unique tags (e.g., person0). First, we evaluate the quality of the generated conversations using automatic measures and human judgments, and then assess the performance of the generated conversations in a downstream summarization task after augmentation.

3.1 Quality of the generated conversations

We evaluate the quality of the conversations generated by the three approaches introduced in Section 2. In Table 2 we show the properties of the generated conversations and of the ground truth conversations in the test set of the Samsum dataset.

Table 2: Properties of the generated conversations.
Model          Ave. Turns      Ave. Tokens/Turn
Ground truth   11.55 ± 6.48    7.10 ± 6.29
SL-Conv-Gen    10.54 ± 6.80    5.69 ± 4.40
RL-Conv-Gen     8.40 ± 4.78    5.14 ± 3.64
CN-Conv-Gen     9.70 ± 5.67    5.62 ± 4.05

Automatic Evaluation: We trained the conversation generation models on the Samsum training set and generated conversations on the test set. We compare the generated conversations with the ground truth conversations using the measures used by Sharma et al. (2017) to evaluate conversation system responses. The results shown in Table 3 suggest that CN-Gen outperforms SL-Gen and RL-Gen on all measures.

Table 3: Evaluation of generated conversations against ground truth conversations.
Model    BLEU-4   METEOR   ROUGE-L
SL-Gen    2.81     12.06    21.53
RL-Gen    3.53     12.29    25.40
CN-Gen    4.94     15.64    26.22
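For reference, word-overlap scores of this kind can be computed as in the sketch below, which approximates BLEU-4 and ROUGE-L with the nltk and rouge_score packages. The paper itself uses the evaluation setup of Sharma et al. (2017), so this is only an illustrative stand-in.

# Rough stand-in for the automatic conversation evaluation (BLEU-4, ROUGE-L).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def evaluate_conversation(generated, reference):
    hyp, ref = generated.split(), reference.split()
    bleu4 = sentence_bleu([ref], hyp, weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=SmoothingFunction().method1)
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, generated)
    return {"bleu4": bleu4, "rougeL_f1": rouge_l["rougeL"].fmeasure}

scores = evaluate_conversation(
    "person0: I'll be late. person1: I'll order pasta for you.",
    "person0: I'll be late. person1: Ok, I'll order some pasta for you.",
)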
We also compare the summaries of the generated conversations (produced by the Summary Generator) with the ground truth summaries; the results are shown in Table 4. We believe that this is a semantic evaluation of the conversations, as the summaries capture the crux of the conversations. According to the results, CN-Gen outperforms the other two methods. This, along with the previous result, suggests that the conversations produced by CN-Gen are the most similar to the ground truth conversations.

Table 4: ROUGE F1 evaluation of summaries of generated conversations against the ground truth summaries.
Model    ROUGE-1   ROUGE-2   ROUGE-L
SL-Gen    46.85     25.29     45.97
RL-Gen    52.51     31.23     51.68
CN-Gen    53.46     32.52     52.93

[Figure 2: Examples of conversations grounded on the same summary. The key terms are highlighted in colors.]

Human Evaluation: To evaluate the quality of the generated conversations, we randomly selected 50 summaries from the Samsum test dataset and generated conversations using the three models. Three NLP experts were then asked to read the ground truth summary and rank the four conversations (3 generated and the ground truth conversation) on a [1-5] scale according to Grammaticality, Coherency, and Informativeness with respect to the ground truth summary. Results are shown in Table 5. As expected, the ground-truth conversations obtained the highest scores on all three aspects and can be considered an upper bound for this task. RL-Gen and CN-Gen obtained higher scores than SL-Gen, and relatively good scores compared to the ground truth conversations. This corroborates the assumption that our proposed models generate high quality conversations. The Welch two-sample t-test (Welch, 1947) shows that both RL-Gen and CN-Gen outperform SL-Gen statistically significantly, with p < 0.0001; however, there is no statistically significant difference between RL-Gen and CN-Gen. We report in Table 6 the average quadratic Cohen's Kappa calculated over the three possible combinations of two judges (Toledo et al., 2019).

Table 5: Human evaluation of generated conversations (Informativeness, Grammaticality, Coherency).
Model          Info   Gram   Cohe
Ground-Truth   4.56   4.46   4.47
SL-Gen         2.22   2.85   2.37
RL-Gen         3.20   3.50   3.14
CN-Gen         3.10   3.43   3.09

Table 6: Average Cohen's Kappa for the human evaluation of generated conversations.
Model          Info   Gram   Cohe
Ground-Truth   0.04   0.22   0.25
SL-Gen         0.35   0.26   0.42
RL-Gen         0.47   0.35   0.45
CN-Gen         0.60   0.40   0.60

CN-Gen obtained the best scores in the automatic evaluation, while RL-Gen got the best scores in the human evaluation. The CN-Gen conversations are longer than the RL-Gen conversations by 1.3 turns on average (see Table 2), and hence contain more word overlap with the ground truth. This results in better automatic evaluation scores for CN-Gen, while the human judges prefer the short, targeted conversations generated by RL-Gen.

3.2 Evaluation on the summarization task

To further evaluate the quality of the generated conversations, we augmented a conversation summarization dataset with generated conversations and evaluated the resulting summarization model. We followed this process: (1) we randomly selected x% of the summaries of the dataset and trained our conversation generation models on the corresponding pairs; (2) the trained models were applied to the remaining y = (100 - x)% of the summaries to generate conversations; (3) those generated conversations, along with their original summaries, were added to the training data, adding an extra y% of (summary, conversation) pairs; (4) the conversation summarization model (discussed in Section 2 under 'Summary Generator') was trained on the augmented data. We compare the performance of the conversation summarization model on the original dataset and with augmentation.
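The sketch below outlines this augmentation loop at a high level. The splitting logic and the train_generator, generate_conversation and train_summarizer callables are hypothetical placeholders standing in for the paper's actual training code.

# High-level sketch of the augmentation experiment in Section 3.2.
# All helper callables are hypothetical placeholders, not the authors' code.
import random

def augment_and_train(dataset, x_percent, train_generator,
                      generate_conversation, train_summarizer, seed=0):
    random.seed(seed)
    data = list(dataset)                    # items: (summary, conversation) pairs
    random.shuffle(data)
    cut = int(len(data) * x_percent / 100)
    generator_split, held_out = data[:cut], data[cut:]   # x% vs. y = 100 - x %

    # (1) Train the conversation generator on x% of the pairs.
    generator = train_generator(generator_split)

    # (2) + (3) Generate conversations for the remaining y% of summaries and
    # pair them with their original summaries as extra training examples.
    synthetic = [(summary, generate_conversation(generator, summary))
                 for summary, _ in held_out]

    # (4) Train the summarization model on the original data plus the
    # synthetic (summary, conversation) pairs.
    return train_summarizer(data + synthetic)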
Automatic Evaluation: We compare the three conversation generation methods at different augmentation percentages; the results are shown in Table 7. At all augmentation levels, the summarization models trained with augmented data outperform the summarization model trained on the original dataset (without augmentation). CN-Gen based augmentation produces the best accuracy compared to the other two methods. One prevalent pattern is that, as the amount of augmentation data increases, the accuracy increases up to a certain point and then starts to decrease; the best accuracies were found around 30% data augmentation. We believe that more augmentation eventually hurts performance because augmenting with more data leaves less data to train the conversation generation models (for 10% augmentation, the conversation generation models are trained on 90% of the data, while for 50% augmentation, the models are trained on only 50% of the data). Therefore, as augmentation increases, the quality of the generated conversations goes down, which leads to smaller gains in the summarization task beyond some point. To neutralize the effect of the increased number of training instances during augmentation, we experimented with a baseline that over-samples the original training data at different percentages to obtain the same number of training instances as the augmented datasets. While the ROUGE-2 obtained with the original training data is 30.98, oversampling at 10%, 20%, 30%, 40% and 50% only changes the ROUGE-2 to 30.55, 30.38, 30.74, 30.99 and 30.27 respectively. Hence, oversampling hardly changes the ROUGE scores obtained by training on the original dataset, while augmentation with our algorithms shows significantly improved scores (as shown in Table 7).

Table 7: ROUGE F1 evaluation on the Samsum test set.
Method        Augmentation %   ROUGE-1   ROUGE-2   ROUGE-L
(Original)     0%              51.84     30.98     43.98
SL-Gen        10%              52.82     31.99     44.89
SL-Gen        20%              52.90     32.01     44.97
SL-Gen        30%              52.88     32.02     45.01
SL-Gen        40%              52.61     31.98     44.96
SL-Gen        50%              52.55     31.98     44.80
RL-Gen        10%              52.93     32.05     44.92
RL-Gen        20%              53.30     32.15     45.20
RL-Gen        30%              53.81     32.21     45.77
RL-Gen        40%              52.86     32.06     44.99
RL-Gen        50%              52.64     32.07     44.88
CN-Gen        10%              53.29     32.36     45.08
CN-Gen        20%              53.36     32.53     45.27
CN-Gen        30%              54.02     33.28     46.06
CN-Gen        40%              52.14     31.76     44.14
CN-Gen        50%              52.36     31.75     44.85

Human Evaluation: We recruited 3 NLP experts to evaluate 50 instances of summaries generated with data augmentation (RL-Gen, CN-Gen) and the respective summaries generated without augmentation (No-Aug). Here we consider two aspects with respect to a ground-truth summary: Coherency (whether the summary is easy to read) and Focus (whether the summary represents the ground-truth summary). Following Amplayo and Lapata (2020), we use the Best-Worst Scaling method: the score of each system is computed as the percentage of times it was chosen as the best system minus the percentage of times it was chosen as the worst. On the Coherency question, RL-Gen, CN-Gen and No-Aug obtained scores of 12.6, 6.6 and -4.0 respectively. On the Focus question, RL-Gen, CN-Gen and No-Aug obtained scores of 14.6, 6.0 and -2.6 respectively. These results confirm that the use of augmentation improves the quality of the summaries.
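For reference, a minimal sketch of this Best-Worst Scaling score (percentage chosen as best minus percentage chosen as worst) is given below; the (best, worst) judgment format is an assumption made for the example.

# Minimal sketch of Best-Worst Scaling system scores:
# score = % of judgments where the system is best - % where it is worst.
from collections import Counter

def best_worst_scores(judgments, systems):
    # `judgments` is assumed to be a list of (best_system, worst_system) pairs.
    best = Counter(b for b, _ in judgments)
    worst = Counter(w for _, w in judgments)
    n = len(judgments)
    return {s: 100.0 * (best[s] - worst[s]) / n for s in systems}

scores = best_worst_scores(
    [("RL-Gen", "No-Aug"), ("RL-Gen", "CN-Gen"), ("CN-Gen", "No-Aug")],
    systems=["RL-Gen", "CN-Gen", "No-Aug"],
)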
4 Conclusion

We investigated how PLMs can be utilized to generate entire conversations that are grounded on a summary. We proposed three approaches for conversation generation, SL-Gen, RL-Gen and CN-Gen, and conducted multiple automatic and human evaluations to assess the quality of the generated conversations. Both the automatic and the human evaluations show that, when compared to the ground truth conversations, RL-Gen and CN-Gen obtain high scores, suggesting that the proposed models generate high quality conversations. When a conversation summarization dataset is augmented with the generated conversations, the performance of conversation summarization improves (over 7% improvement in ROUGE-2 F1), which also suggests that the proposed methods generate high quality conversations.

5 Ethics

We have used the publicly available Samsum dataset (https://huggingface.co/datasets/samsum). For the human evaluation of both conversations and summaries, we recruited 3 NLP researchers who have graduate degrees in NLP and Machine Learning. The annotation task itself was executed on the Appen.com platform. Before the official annotation, we sampled 10 tasks to get an estimate of the duration of the task and to make sure the instructions were clear enough.
References

Reinald Kim Amplayo and Mirella Lapata. 2020. Unsupervised opinion summarization with noising and denoising. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1934–1945.

Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama Zwerdling. 2020. Do not have enough data? Deep learning to the rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7383–7390.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Paweł Budzianowski and Ivan Vulić. 2019. Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 15–22.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026.

Isabel Cachola, Kyle Lo, Arman Cohan, and Daniel S. Weld. 2020. TLDR: Extreme summarization of scientific documents. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4766–4777.

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. 2019. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. EMNLP-IJCNLP 2019, page 70.

Momchil Hardalov, Ivan Koychev, and Preslav Nakov. 2018. Towards automated customer support. In International Conference on Artificial Intelligence: Methodology, Systems, and Applications, pages 48–59. Springer.

Yiping Kang, Yunqi Zhang, Jonathan K. Kummerfeld, Lingjia Tang, and Jason Mars. 2018. Data collection for dialogue system: A startup perspective. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 33–40.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.

Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202.

Ryan Lowe, Nissan Pow, Iulian Vlad Serban, and Joelle Pineau. 2015. The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 285–294.

Iain McCowan, Jean Carletta, Wessel Kraaij, Simone Ashby, S. Bourban, M. Flynn, M. Guillemot, Thomas Hain, J. Kadlec, Vasilis Karaiskos, et al. 2005. The AMI meeting corpus. In Proceedings of the 5th International Conference on Methods and Techniques in Behavioral Research, volume 88, page 100. Citeseer.
Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. AmbigQA: Answering ambiguous open-domain questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5783–5797.

Biswesh Mohapatra, Gaurav Pandey, Danish Contractor, and Sachindra Joshi. 2020. Simulated chats for task-oriented dialog: Learning to generate conversations from instructions. arXiv preprint arXiv:2010.10216.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Revanth Rameshkumar and Peter Bailey. 2020. Storytelling with dialogue: A Critical Role Dungeons and Dragons dataset. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5121–5134.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. 2017. Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. CoRR, abs/1706.09799.

Assaf Toledo, Shai Gretz, Edo Cohen-Karlik, Roni Friedman, Elad Venezian, Dan Lahav, Michal Jacovi, Ranit Aharonov, and Noam Slonim. 2019. Automatic argument quality assessment - new datasets and methods. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5629–5639.

Bernard L. Welch. 1947. The generalization of 'Student's' problem when several different population variances are involved. Biometrika, 34(1/2):28–35.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Qingyang Wu, Lei Li, and Zhou Yu. 2020. TextGAIL: Generative adversarial imitation learning for text generation. arXiv preprint arXiv:2004.13796.

Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bhagavatula, Yejin Choi, and Doug Downey. 2020. G-DAUG: Generative data augmentation for commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 1008–1025.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213.

A Model Training and Hyperparameter Details

A.1 Supervised Conversation Generation (SL-Conv-Gen)

We fine-tune a GPT-2 language model using the implementation available at HuggingFace (Wolf et al., 2019). The hyper-parameters used during training and inference are shown below. The model takes around 6 hours to train on 2 V100 GPUs (single machine).

model_name_or_path: gpt2
per_gpu_train_batch_size: 4
per_gpu_eval_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 6.25e-5
adam_epsilon: 1e-8
max_grad_norm: 1.0
num_train_epochs: 10
warmup_steps: 500
min_length: 20
max_length: 512
top_k: 0
top_p: 0.95

A.2 Summary Generator

We use a DistilBART instance [1] fine-tuned on the extreme summarization (XSum) task, and we fine-tune this model further on the Samsum dataset. The model takes around 12 hours to train on 2 V100 GPUs (single machine). The hyperparameters used for training the DistilBART model are as follows:

train_batch_size: 4
eval_batch_size: 4
num_train_epochs: 10
model_name_or_path: sshleifer/distilbart-xsum-12-6
learning_rate: 3e-5
val_check_interval: 0.1
max_source_length: 512
max_target_length: 80

[1] https://huggingface.co/sshleifer/distilbart-cnn-12-6
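As an illustration of the summary generator setup, the snippet below loads the XSum-fine-tuned DistilBART checkpoint named in the configuration with the HuggingFace summarization pipeline. It only sketches inference; the paper additionally fine-tunes this model on Samsum.

# Sketch: loading a DistilBART checkpoint as the summary generator and
# summarizing one conversation (inference only).
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-xsum-12-6")

conversation = ("person0: I'll be late.\n"
                "person1: Ok, I'll order some pasta with salmon and basil for you.\n"
                "person0: Thanks a lot!")
summary = summarizer(conversation, max_length=80, min_length=10)[0]["summary_text"]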
A.3 Reinforced Learning based conversation generation (RL-Conv-Gen)

To train the RL based conversation generation model, we adapted a publicly available Proximal Policy Optimization (PPO) implementation [2]. The model takes around 12 hours to train on 2 V100 GPUs (single machine). The following hyper-parameters were used to train the model.

steps: 10000
batch_size: 16
forward_batch_size: 4
learning_rate: 1.41e-5
init_kl_coef: 0.2
target: 6
horizon: 10000
gamma: 1
lam: 0.95
cliprange: 0.2
cliprange_value: 0.2
vf_coef: 0.1

[2] https://github.com/lvwerra/trl
B Sample summaries with corresponding ground-truth

Figure 3 shows some samples of dialogs with their corresponding summaries, both ground-truth and automatically generated ones.
Figure 3: Samples of dialogs with their corresponding summaries (ground-truth and automatically generated). Each example shows a ground-truth summary, the ground-truth dialog, and the dialogs generated by SL-Gen, RL-Gen and CN-Gen. The grounding summaries are:
- "Person0 closed some deals today. Person1 didn't manage to do it."
- "Person0 bought a table, six chairs, a vase and a pile of clothes and the second hand shop downtown. She paid 70 euros for everything."
- "Person1 is not at home. Person0 wants Person1 to keep her pasta in the microwave."
- "Person0 needs Person1's help as he cannot get the application running."
- "Person0 and Person1 will meet the new person in an hour."