Summary Grounded Conversation Generation
Chulaka Gunasekara, Guy Feigenblat, Benjamin Sznajder, Sachindra Joshi,
David Konopnicki
IBM Research AI
chulaka.gunasekara@ibm.com, {guyf, benjams}@il.ibm.com
{jsachind@in, davidko@il∗}.ibm.com
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3748–3756, August 1–6, 2021. ©2021 Association for Computational Linguistics

∗ Current address: david.konopnicki@booking.com

Abstract

Many conversation datasets have been constructed in recent years using crowd-sourcing. However, the data collection process can be time consuming and presents many challenges to ensure data quality. Since language generation has improved immensely in recent years with the advancement of pre-trained language models, we investigate how such models can be utilized to generate entire conversations, given only a summary of a conversation as the input. We explore three approaches to generate summary grounded conversations, and evaluate the generated conversations using automatic measures and human judgements. We also show that the accuracy of conversation summarization can be improved by augmenting a conversation summarization dataset with generated conversations.

1 Introduction

Automatic conversation systems require large quantities of data to learn task specific language patterns and underlying conversation policies. Such data either come from human-to-human conversation logs (Lowe et al., 2015; Hardalov et al., 2018) or are collected in crowd-sourced environments, where two or more crowd-workers play specific roles under some guidelines (Zhang et al., 2018; Budzianowski et al., 2018). Since real human-to-human conversation logs are scarce, many datasets have been created using the latter approach. However, crowd-sourced conversation data collection is time consuming, costly and presents multiple challenges to ensure data quality (Kang et al., 2018).

Conversation summarization is an emerging research area that has been ill-studied due to the lack of large-scale datasets. Most existing public datasets in this domain are small; for example, the AMI meeting corpus (McCowan et al., 2005) contains 137 summary transcripts. CRD3 (Rameshkumar and Bailey, 2020) is a spoken conversation dataset that consists of 159 conversations and summaries. Samsum (Gliwa et al., 2019), the only large scale dataset for conversation summarization, contains over 16,000 open-domain conversations and summaries created artificially by humans.

Large scale pre-trained language models (PLMs) (Lewis et al., 2020; Brown et al., 2020; Raffel et al., 2020) have been used in various text generation tasks (Budzianowski and Vulić, 2019; Min et al., 2020; Cachola et al., 2020). In recent studies, PLMs are used to generate training data for natural language processing (NLP) applications. For example, Anaby-Tavor et al. (2020) and Yang et al. (2020) use PLMs to create paraphrases for intent classifiers in conversation systems, and show that, when the original datasets are augmented with the generated data, performance improves. More recently, Mohapatra et al. (2020) generated entire conversations grounded on the instructions that are provided to crowd-workers, using a modular approach where different PLMs are trained for different roles.

Our Contributions: We investigate how PLMs can be utilized to generate entire conversations that are grounded on a given summary. We explore three approaches: (1) Supervised Learning (SL) based conversation generation (SL-Gen), where a PLM is trained to generate an entire conversation, taking the summary of a conversation as input; (2) Reinforced Learning (RL) based conversation generation (RL-Gen), where we further improve the SL-Gen method using the quality of the generated conversations as a reward; and (3) Controlled turn-by-turn conversation generation (CN-Gen), which allows us to generate conversations turn-by-turn, constrained on the summary and a set of pre-defined control parameters. We evaluate the quality of the generated conversations by conducting automatic and human evaluation. We also show that once a conversation summarization dataset is augmented with the generated conversations, the performance of the downstream summarization task is improved.
2 Summary grounded conversation generation

In the conversation summarization task, a model takes a conversation as input and learns to generate a summary. We study the inverse of that problem, where the input to our model is a summary and the model generates a conversation. In this section, we propose three models for this task; the hyper-parameters used in training the models are available in Section A of the appendix.

2.1 SL based generation (SL-Gen)

A seq2seq model can be trained for this task by providing a summary as the input and generating a conversation token-by-token. As PLMs have shown significant improvement over the traditional seq2seq architecture for text generation, we use a GPT-2 model and fine-tune it to generate a conversation given a summary as the input. The input to the model is the summary text followed by the conversation text, delimited by special tokens. We also use different token-type-ids to indicate the summary and the conversation text. The model is trained to optimize a cross entropy loss.
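As a concrete illustration, the sketch below builds such an input for GPT-2 with HuggingFace Transformers. The delimiter tokens <summary> and <conversation>, the example texts, and the single-example batch are illustrative assumptions; the paper's exact special tokens and training loop are not reproduced here.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Illustrative delimiters; the actual special tokens used in the paper are not
# shown in this transcription.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<summary>", "<conversation>"]})
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

def build_example(summary, conversation):
    # Summary text followed by the conversation text, separated by special tokens.
    text = f"<summary> {summary} <conversation> {conversation}{tokenizer.eos_token}"
    enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    # token-type-ids mark which positions belong to the summary (0) vs. the
    # conversation (1).
    sep_id = tokenizer.convert_tokens_to_ids("<conversation>")
    sep_pos = (enc["input_ids"][0] == sep_id).nonzero()[0].item()
    token_type_ids = enc["input_ids"].new_zeros(enc["input_ids"].shape)
    token_type_ids[0, sep_pos:] = 1
    enc["token_type_ids"] = token_type_ids
    enc["labels"] = enc["input_ids"].clone()  # cross entropy over the sequence
    return enc

batch = build_example("Person0 will be late.",
                      "Person0: I'll be late\nPerson1: Ok, I'll order for you.")
loss = model(**batch).loss  # fine-tuning minimizes this loss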
2.2 RL based generation (RL-Gen)

Many studies train text generation models with RL (Paulus et al., 2018; Li et al., 2016), where the generator network is optimized with a task specific reward. We investigate how the quality of the generated conversation can be used as a reward to improve the generation network. To this end, we train a summary generator network, which generates a summary given a conversation. We measure the quality of the generated conversation by identifying the similarity between the summary of the generated conversation (generated, in turn, by the summary generator network) and the ground truth summary. The similarity score is used as a reward to train the conversation generation model. Our RL based generation framework is shown in Figure 1, and the critical components are described below.

Figure 1: The RL based conversation generation framework.

Conversation Generator: A trained SL-Gen model is used as the conversation generator, which, given a summary, can generate a conversation.

Summary Generator: We use a lightweight variant of BART (Lewis et al., 2019), named DistilBART, which is fine-tuned on the extreme summarization task (Narayan et al., 2018). We further fine-tune this instance on the conversation summarization data by providing the conversations as the input and training the model to output summaries.

Reward Model: Once the Summary Generator generates an output summary for the generated conversation, the reward model compares it with the ground truth summary, which was used to ground the conversation generation. Following Paulus et al. (2018), we use the ROUGE-2 F1-score as the reward.

Policy training: We use proximal policy optimization (Schulman et al., 2017) as the optimizer for the policy training, as it prevents the generator from deviating far away from the pretrained LM (Wu et al., 2020).
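A minimal sketch of this reward computation is given below. The checkpoint name is a placeholder for the fine-tuned DistilBART summarizer, and the rouge-score package is an assumption for illustration; the paper does not specify which ROUGE implementation it uses.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from rouge_score import rouge_scorer

# Placeholder name for the DistilBART Summary Generator fine-tuned on the
# conversation summarization data (Section 2.2 / Appendix A.2).
SUMMARIZER = "distilbart-finetuned-samsum"
sum_tokenizer = AutoTokenizer.from_pretrained(SUMMARIZER)
sum_model = AutoModelForSeq2SeqLM.from_pretrained(SUMMARIZER)
scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

def reward(generated_conversation, ground_truth_summary):
    # Summarize the generated conversation with the Summary Generator.
    inputs = sum_tokenizer(generated_conversation, truncation=True,
                           max_length=512, return_tensors="pt")
    output_ids = sum_model.generate(**inputs, max_length=80)
    predicted_summary = sum_tokenizer.decode(output_ids[0],
                                             skip_special_tokens=True)
    # ROUGE-2 F1 between the predicted summary and the grounding summary.
    return scorer.score(ground_truth_summary, predicted_summary)["rouge2"].fmeasure

This scalar reward is what the PPO update in the Policy training step maximizes.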
2.3 Controlled conversation generation

We propose another approach, CN-Gen, for conversation generation, which grants more control over the properties of the generated conversations. Here, we generate one utterance of the conversation at a time, as opposed to RL-Gen, where we generate the whole conversation at once. The properties of the generated conversations are controlled by adding several components to the input sequence of the model. The following three variables are used as the control parameters: (1) Number of remaining turns to generate in the conversation (Num turns): during the generation of a turn, we indicate the remaining number of turns in the conversation. When generating an n turn conversation, this starts at n for the first turn and reduces by 1 after the generation of each turn. (2) The speaker of the next turn (Speaker): this indicates to the model the speaker of the next turn. (3) The length of the next turn (Turn length): we define 3 categories of lengths: Short (≤ 3 tokens), Long (> 10 tokens) and Medium (otherwise).

To fine-tune a GPT-2 model, the input representation concatenates the summary text, the dialog context, the Num turns, Speaker and Turn length control values, and the next utterance, delimited by special tokens.
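The sketch below shows one way to assemble such a per-turn input string. The bracketed field markers and the helper names are illustrative placeholders, not the exact special tokens or code used in the paper.

def turn_length_bucket(utterance):
    # Short (<= 3 tokens), Long (> 10 tokens), Medium otherwise.
    n_tokens = len(utterance.split())
    if n_tokens <= 3:
        return "short"
    if n_tokens > 10:
        return "long"
    return "medium"

def cn_gen_input(summary, dialog_context, num_remaining_turns, next_speaker,
                 next_turn_length):
    # Summary, dialog so far, and the three control parameters; the model is
    # asked to generate the next utterance conditioned on this prefix.
    context = " ".join(dialog_context)
    return (f"[summary] {summary} [context] {context} "
            f"[turns_left] {num_remaining_turns} [speaker] {next_speaker} "
            f"[length] {next_turn_length} [utterance]")

# Control values are read from the ground-truth turns during training and
# sampled at inference.
prompt = cn_gen_input("person0 will be late.",
                      ["person0: I'll be late"],
                      num_remaining_turns=2,
                      next_speaker="person1",
                      next_turn_length="medium")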
Changing these parameters allows us to generate different variants of conversations which are grounded on the same summary. During training, we obtain the values for the control parameters from the ground truth conversations; at inference, we randomly select the next speaker, the number of turns of the conversation to be generated (in a range of 4-15 turns), and the next turn length. In Table 1 we show conversations of different lengths that were generated by the CN-Gen approach, grounded on the same summary, by changing the control parameters.

Summary: person0 will be late. person1 will order pasta with salmon and basil for her.

2 turn conversation:
  I'll be late
  I'll order some pasta with salmon and basil for you.

3 turn conversation:
  I'll be late.
  I'll order some pasta with salmon and basil for you.
  Thanks a lot!

(The table also shows 6 turn and 10 turn conversations generated for the same summary.)

Table 1: Multiple conversations generated by the CN-Gen approach grounded on the same summary.

A summary and a conversation from the Samsum dataset (Gliwa et al., 2019), along with the conversations generated by the three aforementioned algorithms, are shown in Figure 2. More examples are provided in Section B of the Appendix.

3 Experiments

We experiment on the Samsum (Gliwa et al., 2019) dataset, which, to the best of our knowledge, is the only public large-scale conversation summarization dataset. We pre-process the dataset by replacing the personal names (ex: John) with unique tags (ex: Person0). First, we evaluate the quality of the generated conversations using automatic measures and human judgments, and then assess the performance of the generated conversations in a downstream summarization task after augmentation.

3.1 Quality of the generated conversations

We evaluate the quality of the conversations generated by the three approaches that were introduced in Section 2. In Table 2 we show the properties of the generated conversations and the ground truth conversations in the test set of the Samsum dataset.

Model          Ave. Turns      Ave. Tokens/Turn
Ground truth   11.55 ± 6.48    7.10 ± 6.29
SL-Conv-Gen    10.54 ± 6.80    5.69 ± 4.40
RL-Conv-Gen     8.40 ± 4.78    5.14 ± 3.64
CN-Conv-Gen     9.70 ± 5.67    5.62 ± 4.05

Table 2: Properties of the generated conversations.

Automatic Evaluation: We trained the conversation generation models on the Samsum training set and generated conversations on the test set. We compare the generated conversations with the ground truth conversations using the measures used by Sharma et al. (2017) to evaluate conversation system responses. The results shown in Table 3 suggest that CN-Gen outperforms SL-Gen and RL-Gen on all measures.

We also compare the summaries of the generated conversations (generated by the Summary Generator) with the ground truth summaries; the results are shown in Table 4. We believe that this is a semantic evaluation of the conversations, as the summaries capture the crux of the conversations. According to the results, CN-Gen outperforms the other two methods. This, along with the previous result, suggests that the conversations produced by CN-Gen are the most similar to the ground truth conversations.

Human Evaluation: To evaluate the quality of the generated conversations, we randomly selected 50 summaries from the Samsum test dataset and generated conversations using the three models. Three NLP experts were then asked to read the ground truth summary and rank the four conversations (3 generated and the ground truth conversation) on a [1-5] scale according to Grammaticality, Coherency, and Informativeness, with respect to the ground truth summary. Results are shown in Table 5. As expected, the ground-truth conversations obtained the highest scores on all three aspects and can be considered as an upper bound for this task.
Figure 2: Examples of conversations grounded on the same summary. The key terms are highlighted in colors.

Model    BLEU-4   METEOR   ROUGE-L
SL-Gen   2.81     12.06    21.53
RL-Gen   3.53     12.29    25.40
CN-Gen   4.94     15.64    26.22

Table 3: Evaluation of generated conversations against ground truth conversations.

Model    ROUGE-1   ROUGE-2   ROUGE-L
SL-Gen   46.85     25.29     45.97
RL-Gen   52.51     31.23     51.68
CN-Gen   53.46     32.52     52.93

Table 4: ROUGE F1 evaluation of summaries of conversations against the ground truth summaries.
RL-Gen and CN-Gen obtained higher scores than SL-Gen and relatively good scores compared to the ground truth conversations. This corroborates the assumption that our proposed models generate high quality conversations. The Welch two sample t-test (Welch, 1947) shows that both RL-Gen and CN-Gen outperform the SL-Gen model statistically significantly with p < 0.0001. However, there is no statistically significant difference between the results obtained from RL-Gen and CN-Gen. We report in Table 6 the average quadratic Cohen's Kappa calculated over the three possible combinations of two judges (Toledo et al., 2019).

Model         Info   Gram   Cohe
Ground-Truth  4.56   4.46   4.47
SL-Gen        2.22   2.85   2.37
RL-Gen        3.20   3.50   3.14
CN-Gen        3.10   3.43   3.09

Table 5: Human evaluation of generated conversations.

Model         Info   Gram   Cohe
Ground-Truth  0.04   0.22   0.25
SL-Gen        0.35   0.26   0.42
RL-Gen        0.47   0.35   0.45
CN-Gen        0.60   0.40   0.60

Table 6: Average Cohen's Kappa for human evaluation of generated conversations.

CN-Gen obtained the best scores in the automatic evaluation, while RL-Gen got the best scores in the human evaluation. The CN-Gen conversations are longer than the RL-Gen conversations by 1.3 turns on average (see Table 2), and hence contain more word overlap with the ground truth. This results in better automatic evaluation scores for CN-Gen, while the humans prefer the short, targeted conversations generated by RL-Gen.

3.2 Evaluation on the summarization task

To further evaluate the quality of the generated conversations, we augmented a conversation summarization dataset with generated conversations and evaluated the summarization model. We followed this process (sketched below): (1) we randomly selected x% of the summaries of the dataset and trained our conversation generation models; (2) the trained models were applied on the other y=(100-x)% of the summaries to generate conversations; (3) those generated conversations, along with the original summaries, were added to the data. Using this approach, we can add an extra y% of (summary, conversation) pairs to the training data; (4) the conversation summarization model (discussed in Section 2 under 'Summary Generator') was trained on the augmented data. We compare the performance of the conversation summarization model on the original dataset and with augmentation.
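A compact sketch of steps (1)-(3) is shown below. The train_generator and generate callables are caller-supplied stand-ins for the conversation-generation training and decoding code; they are placeholders, not functions from the paper.

import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (summary, conversation)

def augment(dataset: List[Pair], x_percent: int,
            train_generator: Callable[[List[Pair]], object],
            generate: Callable[[object, str], str],
            seed: int = 0) -> List[Pair]:
    data = list(dataset)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * x_percent / 100)
    gen_train, held_out = data[:cut], data[cut:]
    # (1) train SL-Gen / RL-Gen / CN-Gen on x% of the (summary, conversation) pairs
    generator = train_generator(gen_train)
    # (2) generate conversations for the remaining y = 100 - x % of the summaries
    synthetic = [(summary, generate(generator, summary)) for summary, _ in held_out]
    # (3) original data plus the extra y% synthetic pairs
    return data + synthetic

The summarization model from step (4) is then trained on the returned, augmented list.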
Automatic Evaluation: We compare the three conversation generation methods at different augmentation percentages; the results are shown in Table 7. At all augmentation levels, the summarization models trained with augmented data outperform the summarization model trained on the original dataset (without augmentation). CN-Gen based augmentation produces the best accuracy compared to the other two methods. One prevalent pattern is that, as the augmentation data increases, the accuracy increases up to a certain point and then starts to decrease. The best accuracies were found around 30% data augmentation. We believe that more augmentation causes performance to drop for the following reason: to augment with more data, we are left with less data to train the conversation generation models (for 10% augmentation, the conversation generation models are trained on 90% of the data, while for 50% augmentation, the models are trained on only 50% of the data). Therefore, as the augmentation increases, the quality of the generated conversations goes down. This leads to overall smaller gains in the summarization task with increased augmentation after some point.

Method   Augmentation %   ROUGE-1   ROUGE-2   ROUGE-L
         0% (Original)    51.84     30.98     43.98
SL-Gen   10%              52.82     31.99     44.89
SL-Gen   20%              52.90     32.01     44.97
SL-Gen   30%              52.88     32.02     45.01
SL-Gen   40%              52.61     31.98     44.96
SL-Gen   50%              52.55     31.98     44.80
RL-Gen   10%              52.93     32.05     44.92
RL-Gen   20%              53.30     32.15     45.20
RL-Gen   30%              53.81     32.21     45.77
RL-Gen   40%              52.86     32.06     44.99
RL-Gen   50%              52.64     32.07     44.88
CN-Gen   10%              53.29     32.36     45.08
CN-Gen   20%              53.36     32.53     45.27
CN-Gen   30%              54.02     33.28     46.06
CN-Gen   40%              52.14     31.76     44.14
CN-Gen   50%              52.36     31.75     44.85

Table 7: ROUGE F-1 evaluation on the Samsum test set.

To neutralize the effect of increasing the number of data points during augmentation, we experimented with a baseline which over-samples the original training data at different percentages to obtain the same number of training instances as the augmented datasets. While the ROUGE-2 obtained with the original training data is 30.98, oversampling at 10%, 20%, 30%, 40% and 50% only changes the ROUGE-2 to 30.55, 30.38, 30.74, 30.99 and 30.27 respectively. This suggests that oversampling hardly changes the ROUGE scores obtained by training with the original dataset, while augmentation according to our algorithms shows significantly improved scores (as shown in Table 7).

Human Evaluation: We recruited 3 NLP experts to evaluate 50 instances of summaries generated with data augmentation (RL-Gen, CN-Gen) and the respective summaries generated without augmentation (No-Aug). Here we consider two aspects with respect to a ground-truth summary: Coherency (whether the summary is easy to read) and Focus (whether the summary represents the ground-truth summary). Following Amplayo and Lapata (2020), we use the Best-Worst Scaling method. The score of each system is computed as the percentage of times it was chosen as the Best system minus the percentage of times it was chosen as Worst. On the Coherency question, RL-Gen, CN-Gen and No-Aug obtained scores of 12.6, 6.6 and -4.0 respectively. On the Focus question, RL-Gen, CN-Gen and No-Aug obtained scores of 14.6, 6.0 and -2.6 respectively. These results confirm that the use of augmentation improves the quality of the summaries.
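For clarity, the Best-Worst Scaling score described above can be computed as in the sketch below; the example judgements are made up for illustration.

from collections import Counter

def bws_scores(judgements):
    # judgements: one (best_system, worst_system) pair per annotated comparison.
    best = Counter(b for b, _ in judgements)
    worst = Counter(w for _, w in judgements)
    n = len(judgements)
    systems = set(best) | set(worst)
    # % of times chosen as Best minus % of times chosen as Worst.
    return {s: 100.0 * (best[s] - worst[s]) / n for s in systems}

# Hypothetical example with five comparisons.
print(bws_scores([("RL-Gen", "No-Aug"), ("RL-Gen", "No-Aug"),
                  ("RL-Gen", "CN-Gen"), ("CN-Gen", "No-Aug"),
                  ("No-Aug", "RL-Gen")]))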
4 Conclusion

We investigated how PLMs can be utilized to generate entire conversations that are grounded on a summary. We proposed three approaches for conversation generation, SL-Gen, RL-Gen and CN-Gen, and conducted multiple automatic and human evaluations to assess the quality of the generated conversations. Both automatic and human evaluations show that, when compared to the ground truth conversations, RL-Gen and CN-Gen obtain high scores, suggesting that the proposed models generate high quality conversations. When a conversation summarization dataset is augmented with the generated conversations, the performance of conversation summarization is improved (over 7% improvement in ROUGE-2 F-1), which also suggests that the proposed methods generate high quality conversations.

5 Ethics

We have used the publicly available Samsum dataset (https://huggingface.co/datasets/samsum). For the human evaluation of both conversations and summaries, we recruited 3 NLP researchers who have graduate degrees in NLP and Machine Learning. The annotation task itself was executed on the Appen.com platform. Before the official annotation, we sampled 10 tasks to get an estimate of the duration of the task, and to make sure the instructions are clear enough.
References

Reinald Kim Amplayo and Mirella Lapata. 2020. Unsupervised opinion summarization with noising and denoising. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1934–1945.

Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama Zwerdling. 2020. Do not have enough data? Deep learning to the rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7383–7390.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Paweł Budzianowski and Ivan Vulić. 2019. Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 15–22.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026.

Isabel Cachola, Kyle Lo, Arman Cohan, and Daniel S Weld. 2020. TLDR: Extreme summarization of scientific documents. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4766–4777.

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. 2019. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. EMNLP-IJCNLP 2019, page 70.

Momchil Hardalov, Ivan Koychev, and Preslav Nakov. 2018. Towards automated customer support. In International Conference on Artificial Intelligence: Methodology, Systems, and Applications, pages 48–59. Springer.

Yiping Kang, Yunqi Zhang, Jonathan K Kummerfeld, Lingjia Tang, and Jason Mars. 2018. Data collection for dialogue system: A startup perspective. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 33–40.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.

Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202.

Ryan Lowe, Nissan Pow, Iulian Vlad Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 285–294.

Iain McCowan, Jean Carletta, Wessel Kraaij, Simone Ashby, S Bourban, M Flynn, M Guillemot, Thomas Hain, J Kadlec, Vasilis Karaiskos, et al. 2005. The AMI meeting corpus. In Proceedings of the 5th International Conference on Methods and Techniques in Behavioral Research, volume 88, page 100. Citeseer.

Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. AmbigQA: Answering ambiguous open-domain questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5783–5797.

Biswesh Mohapatra, Gaurav Pandey, Danish Contractor, and Sachindra Joshi. 2020. Simulated chats for task-oriented dialog: Learning to generate conversations from instructions. arXiv preprint arXiv:2010.10216.

Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Revanth Rameshkumar and Peter Bailey. 2020. Storytelling with dialogue: A Critical Role Dungeons and Dragons dataset. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5121–5134.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. 2017. Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. CoRR, abs/1706.09799.

Assaf Toledo, Shai Gretz, Edo Cohen-Karlik, Roni Friedman, Elad Venezian, Dan Lahav, Michal Jacovi, Ranit Aharonov, and Noam Slonim. 2019. Automatic argument quality assessment - new datasets and methods. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5629–5639.

Bernard L Welch. 1947. The generalization of 'Student's' problem when several different population variances are involved. Biometrika, 34(1/2):28–35.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, pages arXiv–1910.

Qingyang Wu, Lei Li, and Zhou Yu. 2020. TextGAIL: Generative adversarial imitation learning for text generation. arXiv preprint arXiv:2004.13796.

Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bhagavatula, Yejin Choi, and Doug Downey. 2020. G-DAUG: Generative data augmentation for commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 1008–1025.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213.

A Model Training and Hyperparameter Details

A.1 Supervised Conversation Generation (SL-Conv-Gen)

We fine-tune a GPT-2 language model using the implementation available at HuggingFace (Wolf et al., 2019). The hyper-parameters used during training and inference are shown below. The model takes around 6 hours to train on 2 V100 GPUs (single machine).

model_name_or_path: gpt2
per_gpu_train_batch_size: 4
per_gpu_eval_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 6.25e-5
adam_epsilon: 1e-8
max_grad_norm: 1.0
num_train_epochs: 10
warmup_steps: 500
min_length: 20
max_length: 512
top_k: 0
top_p: 0.95
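The last four values are decoding parameters: top_k set to 0 disables top-k filtering, so generation uses nucleus (top-p) sampling. The sketch below shows how these settings map onto the HuggingFace generate API; the checkpoint path and the prompt are placeholders for the fine-tuned SL-Conv-Gen model and a test summary.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# "./sl-conv-gen" is a placeholder path for the fine-tuned GPT-2 checkpoint.
tokenizer = GPT2Tokenizer.from_pretrained("./sl-conv-gen")
model = GPT2LMHeadModel.from_pretrained("./sl-conv-gen")

prompt = tokenizer("person0 will be late.", return_tensors="pt")
output = model.generate(
    **prompt,
    do_sample=True,          # sample instead of greedy decoding
    top_k=0,                 # disable top-k filtering
    top_p=0.95,              # nucleus sampling
    min_length=20,
    max_length=512,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))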
A.2 Summary Generator

We use a DistilBART instance[1] fine-tuned on the extreme summarization (XSum) task, and we fine-tune this model further on the Samsum dataset. The model takes around 12 hours to train on 2 V100 GPUs (single machine). The hyperparameters used for training the DistilBART model are as follows:

train_batch_size: 4
eval_batch_size: 4
num_train_epochs: 10
model_name_or_path: sshleifer/distilbart-xsum-12-6
learning_rate: 3e-5
val_check_interval: 0.1
max_source_length: 512
max_target_length: 80

A.3 Reinforced Learning based conversation generation (RL-Conv-Gen)

To train the RL based conversation generation model, we adapted a publicly available Proximal Policy Optimization (PPO) implementation[2]. The model takes around 12 hours to train on 2 V100 GPUs (single machine). The following hyper-parameters were used to train the model.

steps: 10000
batch_size: 16
forward_batch_size: 4
learning_rate: 1.41e-5
init_kl_coef: 0.2
target: 6
horizon: 10000
gamma: 1
lam: 0.95
cliprange: 0.2
cliprange_value: 0.2
vf_coef: 0.1

[1] https://huggingface.co/sshleifer/distilbart-cnn-12-6
[2] https://github.com/lvwerra/trl
B Sample summaries with corresponding ground-truth

Figure 3 shows samples of dialogs with their corresponding summaries: ground-truth dialogs and automatically generated ones.
Figure 3: Samples of dialogs with their corresponding summaries - ground-truth and automatically generated ones. Each example shows the ground truth dialog alongside the SL-Gen, RL-Gen and CN-Gen outputs for the following summaries:

Summary: Person0 closed some deals today. Person1 didn't manage to do it.
Summary: Person0 bought a table, six chairs, a vase and a pile of clothes and the second hand shop downtown. She paid 70 euros for everything.
Summary: Person1 is not at home. Person0 wants Person1 to keep her pasta in the microwave.
Summary: Person0 needs Person1's help as he cannot get the application running.
Summary: Person0 and Person1 will meet the new person in an hour.