Effectiveness of French Language Models on Abstractive Dialogue Summarization Task
Effectiveness of French Language Models on Abstractive Dialogue Summarization Task

Yongxin Zhou, François Portet, and Fabien Ringeval
Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France

arXiv:2207.08305v1 [cs.CL] 17 Jul 2022

Abstract

Pre-trained language models have established the state of the art on various natural language processing tasks, including dialogue summarization, which allows the reader to quickly access key information from long conversations in meetings, interviews or phone calls. However, such dialogues are still difficult to handle with current models because the spontaneity of the language involves expressions that are rarely present in the corpora used for pre-training the language models. Moreover, the vast majority of the work accomplished in this field has focused on English. In this work, we present a study on the summarization of spontaneous oral dialogues in French using several language-specific pre-trained models (BARThez and BelGPT-2) as well as multilingual pre-trained models (mBART, mBARThez, and mT5). Experiments were performed on the DECODA (call center) dialogue corpus, whose task is to generate abstractive synopses from call center conversations between a caller and one or several agents, depending on the situation. Results show that the BARThez models offer the best performance, far above the previous state of the art on DECODA. We further discuss the limits of such pre-trained models and the challenges that must be addressed for summarizing spontaneous dialogues.

1 Introduction

The task of automatic text summarization consists of presenting textual content in a condensed version that retains the essential information. Recently, the document summarization task has seen a sharp increase in performance due to pre-trained contextualized language models [22]. Unlike documents, conversations involve multiple speakers, are less structured, and are composed of more informal linguistic usage [31].
Research on dialogue summarization has considered different domains, such as the summarization of meetings [3, 38, 28] and of phone calls / customer service interactions [34, 33, 19]. The introduction of the Transformer architecture [37] has encouraged the development of large-scale pre-trained language models. Compared to RNN-based models, transformer-based models have deeper structures and more parameters; they use a self-attention mechanism to process all tokens in parallel and to represent the relations between words. Such models have pushed performance forward in automatic summarization. For instance, on the popular benchmark corpus CNN/Daily Mail [11], [22] explored fine-tuning BERT [6] to achieve state-of-the-art performance for extractive news summarization, and BART [16] has also improved generation quality in abstractive summarization. However, for dialogue summarization, the impact of pre-trained models is still not sufficiently documented. In particular, it is unclear whether the mismatch with the pre-training data will negatively impact the summarization capability of these models.
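As an illustration of the self-attention computation mentioned above, the following toy sketch (pure Python, with invented two-dimensional token vectors and no learned projections) shows how every token attends to every other token in parallel; it is a didactic sketch, not the actual Transformer implementation:

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of floats
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Toy single-head self-attention: queries, keys and values are the
    token vectors themselves (real models add learned projections)."""
    d = len(X[0])
    out = []
    for q in X:  # every token attends to every token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # output is a weighted average of all token (value) vectors
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# three "token" vectors of dimension 2, processed in one parallel step
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextualized = self_attention(tokens)
```

Each output vector is a convex combination of all input vectors, which is how each token's representation comes to depend on the whole sequence.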
In terms of resources, the dialogue summarization task has relied on well-known corpora such as CNN/Daily Mail [11], XSum [27], the SAMSum chat-dialogue corpus [10] and the AMI Meeting Corpus [4]. However, most of the available datasets are English-language corpora. In this paper, we present a study that evaluates the impact of pre-trained language models on a dialogue summarization task for the French language. We evaluated several French pre-trained models (BARThez, BelGPT-2) and multilingual pre-trained models (mBART, mBARThez, mT5). The experiments were performed on the DECODA (call center) dialogue corpus, whose task is to generate abstractive synopses from call center conversations between a caller and one or more agents, depending on the situation. Our experimental results establish a new state of the art with the BARThez models on DECODA. The results show that all BART-based pre-trained models advance performance, while the performance of mT5 and BelGPT-2 remains unsatisfactory.

2 Related Work

2.1 Dialogue Summarization

Summarization tasks are addressed with two families of methods: extractive and abstractive. Extractive methods copy content, usually sentences, directly from the original text. Abstractive methods are not restricted to selecting and rearranging content from the original text; they are able to generate new words and sentences. With the development of dialogue systems and natural language generation techniques, dialogue summarization, which aims to condense the original dialogue into a shorter version covering the salient information, has attracted significant research attention [8]. Contrary to well-formed documents, transcripts of multi-person meetings or conversations have many dialogue turns and possibly extended topics; summaries generated by NLG models may thus lose focus on the discussion topics.
[8] provide an overview of publicly available research datasets, summarize existing work, and organize leaderboards under unified metrics, for example leaderboards of the meeting summarization task on AMI [4] and ICSI [12], as well as a leaderboard of the chat summarization task on SAMSum [10]. By their count, 12 out of 15 major datasets for dialogue summarization are in English. Taking the chat summarization task on the SAMSum dataset as an example, the results showed that pre-trained language model-based methods are skilled at transforming the original chat into a simple summary realization [8]: their best ROUGE-2 score is 28.79 [9], compared to a best ROUGE-2 score of 19.15 for abstractive methods [41] and of 10.27 for extractive methods (LONGEST-3). Deep neural models, and especially pre-trained language models, have thus significantly improved performance. In terms of input domain, summarization of DECODA call-center conversations belongs to customer service summarization and is task-oriented. Previous work on the DECODA dialogue summarization task includes extractive methods [35, 36] and abstractive approaches [33]; [33] used domain knowledge to fill hand-written templates from entities detected in the transcript of the conversation, using topic-dependent rules. Although pre-trained models have shown superior performance on English, we have little information about how such models behave on other languages.

2.2 French Language Models and Multilingual Models

While the majority of language models were pre-trained on English-language texts, many French-oriented pre-trained language models have recently been released. Furthermore, multilingual pre-trained models have also emerged and have shown very good generalization capability, in particular for under-resourced languages.
A brief summary of the French language models and multilingual models that we used can be found in Table 1. Regarding French pre-trained models, shortly after the release of the BERT-like models CamemBERT [25] and FlauBERT [15], BARThez [13] was proposed. It is the French equivalent of the BART model [16], which uses a full transformer (encoder-decoder) architecture. Being based on BART, BARThez is particularly well suited for generative tasks.
For generation, there are also French GPT-like models, for instance BelGPT-2 [24], a GPT-2 model pre-trained on a very large and heterogeneous French corpus (~60 GB). Other French GPT-like models have been proposed in the literature, such as PAGnol [14], a collection of large French language models, GPT-fr [32], which uses the OpenAI GPT and GPT-2 architectures [29], or even the commercial model Cedille [26]. We used BelGPT-2 in our experiments since it is open source and offers a good trade-off between model size and training data size. Another way to address non-English language tasks is to benefit from multilingual models that were pre-trained on datasets covering a wide range of languages. For example, mBERT [6], mBART [20], XLM-R [5], and mT5 [39] are popular models of this type; they are respectively the multilingual variants of BERT [6], BART [16], RoBERTa [23] and T5 [30]. In this study, we consider mBART and mT5, as well as mBARThez. The latter was designed by the same authors as BARThez [13] by fine-tuning a multilingual BART with the BARThez training objective, which boosted its performance on both discriminative and generative tasks. mBARThez is thus the most French-oriented multilingual model, which is why we included it in our experiments.
Table 1: Summary of French language models and multilingual models.

- BARThez [13]: sequence-to-sequence pre-trained model (BASE architecture, 12 layers); pre-training corpus: French part of CommonCrawl, NewsCrawl, Wikipedia and other smaller corpora (GIGA, ORTOLANG, MultiUn, EUBookshop); 165M parameters.
- BelGPT-2 [24]: a GPT-2 model pre-trained on a very large and heterogeneous French corpus (~60 GB); pre-training corpus: CommonCrawl, NewsCrawl, Wikipedia, Wikisource, Project Gutenberg, EuroParl, NewsCommentary; Small (124M) parameters.
- mbart.CC25 [21]: mBART model with 12 encoder and 12 decoder layers; trained on the monolingual corpora of 25 languages from Common Crawl (CC25); 610M parameters.
- mBARThez [13]: continued pre-training of a multilingual BART on BARThez' corpus (LARGE architecture, 24 layers); pre-training corpus: the same as BARThez'; 458M parameters.
- mT5 [39]: a multilingual encoder-decoder variant of T5; trained on a new Common Crawl-based dataset covering 101 languages (mC4); 300M-13B parameters.

3 Summarizing Call Center Dialogues in French

3.1 Motivation

As pre-trained models showed advanced performance on summarization tasks in English, we would like to explore the behavior of French pre-trained models on a dialogue summarization task. To do so, we chose the DECODA corpus [1] and the Call Centre Conversation Summarization task, which was set up during MultiLing 2015 [7]. The objective is to generate abstractive synopses from call center conversations between a caller and one or more agents.
Previous work on DECODA dialogue summarization reported results for both extractive and abstractive methods; our hypothesis is that by using pre-trained language models, we could obtain better results on this task. Besides, in call center conversation scenarios, the caller (or user) might express impatience, complaints and other emotions, which are especially interesting and make the DECODA corpus a rare resource for further research on the emotional aspects of dialogue summarization.

3.2 DECODA Corpus and Data Partitioning

DECODA call center conversations are a kind of task-oriented dialogue. Conversations take place between the caller (the user) and one agent, sometimes more than one. The synopses should therefore contain information on the user's need and on whether the problem has been answered or solved by the agent. The ten most frequent topics in the DECODA corpus are: traffic info (22.5%), directions (17.2%), lost and found (15.9%), package subscriptions (11.4%), timetable (4.6%), tickets (4.5%), specialized calls (4.5%), no particular subject (3.6%), new registration (3.4%) and fare information (3.0%) [35]. Calls last between 55 seconds for the shortest and 16 minutes for the longest. Table 2 shows an excerpt of a dialogue; the prefix A (resp. B) indicates the employee's (resp. customer's) turn.

Table 2: Excerpt FR_20091112_RATP_SCD_0826 of DECODA conversations.

French: A: bonjour B: oui bonjour madame B: je vous appelle pour avoir des horaires de train en+fait c' est pas pour le métro je sais pas si vous pouvez me les donner ou pas A: trains B: oui trains oui A: vous prenez quelle ligne monsieur . . .

Translated: A: hello B: yes hello Mrs B: I'm calling you to get train timetables in fact it's not for the metro I don't know if you can give them to me or not A: trains B: yes trains yes A: which line do you take sir . . .
We used the same data as in the MultiLing 2015 CCCS (Call Centre Conversation Summarization) task [7]: the organizers provided 1000 dialogues without synopses (auto_no_synopses) and 100 dialogues with corresponding synopses for training. The test data comprise 100 dialogues in total. We randomly divided the original 100 training dialogues with synopses into training and validation sets of 90 and 10 samples, respectively. Note that for each dialogue, the corresponding synopsis file may contain up to 5 synopses written by different annotators. Data statistics are presented in Table 3.

Table 3: Number of dialogues and synopses for the train/dev/test and unannotated sets of the DECODA corpus.

        # dialogues   # synopses   # dialogues w/o synopsis
train        90           208              1000
dev          10            23                -
test        100           212                -

3.3 Processing Pipeline

The overall process for dialogue summarization using pre-trained language models is depicted in Figure 1. Pre-trained language models have limits on their maximum input length. For instance, BART- and GPT-2-based models only accept a maximum input length of 1024 tokens, which poses problems when input sequences are longer than this limit. How to handle long dialogue summarization is an ongoing research topic [40]. In this work, since the objective is to compare pre-trained language models, we use the simple technique of truncation in order to meet the length requirement of pre-trained models. Although truncation is not the most advanced approach, using the first part of the document is known to be a strong summarization method and hard to beat, in particular in news
where the main information is conveyed by the first line. In the case of call center conversations, we can also assume that the most important information is present at the beginning of the conversation, since the client initiates the call by stating his or her problem.

Figure 1: Processing pipeline for dialogue summarization.

3.3.1 Pre-processing

Since DECODA was acquired in a real and noisy environment, the transcripts contain various labels related to the speech context, such as markers for noise and other non-speech events, as well as overlap labels indicating that two speakers talk at the same time. Since the models process only the transcript and not the speech, we filtered out all these labels. Each speaker turn in the dialogue transcript is preceded by a prefix (e.g. "A:"). Since the target synopsis is always written in a way that makes the problem explicit (e.g. what the user wants or what the user's problem is) and how it has been solved (by the agent), we wanted to check whether the model is able to summarize without knowing the prefix. Hence, two versions of the input were made: Spk presents conversations with the speakers' prefixes, and NoSpk presents conversations without them. For this, we used special tokens to mark each change of speaker turn.

3.3.2 Truncation

For the BART and T5 architectures, the DECODA data files are prepared in JSONLINES format, which means that each line corresponds to a sample. For each line, the first value is the id, the second value is the synopsis, used as the summary record, and the third value is the dialogue, used as the dialogue record. Since BART [16] combines a bi-directional (BERT-like) encoder with an autoregressive (GPT-2-like) decoder, the BART-based models take the dialogues as input to the encoder, while the decoder generates the corresponding output autoregressively.
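The label filtering and JSONLINES preparation described above can be sketched as follows (a minimal sketch: the regular expression and the example label are illustrative assumptions, and the actual DECODA label set is richer than this):

```python
import json
import re

def clean_transcript(text):
    # strip speech-context labels such as "<noise>" from the transcript
    # (illustrative pattern; the real label inventory differs)
    return " ".join(re.sub(r"<[^>]+>", " ", text).split())

def to_jsonl(samples):
    # JSONLINES: one JSON object per line, with id, synopsis (summary
    # record) and dialogue (dialogue record) fields
    return "\n".join(json.dumps(s, ensure_ascii=False) for s in samples)

sample = {
    "id": "FR_20091112_RATP_SCD_0826",
    "synopsis": "Demande d'horaires de train.",
    "dialogue": clean_transcript("A: bonjour <noise> B: oui bonjour madame"),
}
jsonl = to_jsonl([sample])
```

Each resulting line can then be fed to a standard sequence-to-sequence fine-tuning script that reads the dialogue field as the source and the synopsis field as the target.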
For BART-based models (BARThez, mBART and mBARThez), the maximum total length of the input sequence after tokenization is 1024 tokens; longer sequences are truncated and shorter sequences are padded. For mT5, the maximum total input length used is also 1024 tokens, and we used the source_prefix "summarize: " as a prompt so that the model generates summarized synopses. When fine-tuning BelGPT-2 for the summarization task, our input sequences consist of the dialogue followed by a separator token and the synopsis at the end. For long dialogues, we truncated the dialogue and ensured that the truncated dialogue, separator token and synopsis together do not exceed the limit of 1024 tokens. This means that BelGPT-2 is more constrained than BARThez with respect to the dialogue length for summarization tasks.

3.3.3 Summarization

As mentioned before, we split the original 100 training dialogues into training and validation sets, leaving 90 dialogue examples with annotated synopses for training. Each dialogue may have up to 5 synopses written by different annotators; we paired each dialogue with each of its synopses to make full use of the
available data for training. In this way, we obtained 208 training examples, 23 validation examples and 212 test examples. In the experiments, we fine-tuned each pre-trained model separately and then used the models fine-tuned on the DECODA summarization task to generate synopses for the test examples. When input sequences exceeded the maximum length of a model, we performed truncation as described above.

4 Experiments

4.1 Experimental Setup

For all pre-trained models, we used the checkpoints available on Hugging Face. Experiments for each model were run separately on the two versions of the data (Spk and NoSpk), on a server with one NVIDIA Quadro RTX 8000 48 GB GPU. For BARThez, mBART, mBARThez and mT5, we used the default parameters for summarization tasks: an initial learning rate of 5e-5, train_batch_size and eval_batch_size of 4, and a seed of 42. We used the Adam optimizer with a linear learning-rate scheduler. In detail, we fine-tuned BARThez (base architecture, 6 encoder and 6 decoder layers) for 10 epochs and saved the checkpoint with the lowest loss on the dev set; each training run took approximately 25 minutes. mBART (large architecture, 12 encoder and 12 decoder layers) was also fine-tuned for 10 epochs; each training run took approximately 46 minutes. mBARThez, also a large architecture with 12 layers in both encoder and decoder, was fine-tuned for 10 epochs; each training run took approximately 38 minutes. For mT5 (mT5-Small, 300M parameters), we used the source_prefix "summarize: " as a prompt to make the model generate synopses; mT5 was fine-tuned for 40 epochs and each training run took approximately 65 minutes. Finally, BelGPT-2 [24] was fine-tuned for 14 epochs, with a learning rate of 5e-5 and gradient_accumulation_steps of 32. To introduce summarization behavior, we add a separator token after the dialogue and set the maximum generated synopsis length to 80 tokens. Each training run took approximately 4 hours.
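The BelGPT-2 length constraint described above can be sketched as follows; the whitespace "tokens" and the separator token name <sep> are placeholders we chose for illustration, not the actual subword tokenizer or token used:

```python
MAX_LEN = 1024   # input-length limit of GPT-2-style models
SEP = "<sep>"    # placeholder separator token (illustrative assumption)

def build_gpt_input(dialogue_tokens, synopsis_tokens, max_len=MAX_LEN):
    """Truncate the dialogue so that dialogue + separator + synopsis
    never exceeds the model's maximum input length."""
    budget = max_len - 1 - len(synopsis_tokens)  # reserve room for SEP + synopsis
    if budget < 0:
        raise ValueError("synopsis alone exceeds the length limit")
    return dialogue_tokens[:budget] + [SEP] + synopsis_tokens

dialogue = ["w"] * 2000   # a dialogue far longer than the limit
synopsis = ["s"] * 80     # 80 tokens, the maximum synopsis length we used
sequence = build_gpt_input(dialogue, synopsis)
```

Because the synopsis must fit inside the same 1024-token window as the dialogue, the usable dialogue length shrinks as the synopsis grows, which is the extra constraint on BelGPT-2 compared to the encoder-decoder models.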
During synopsis generation, we tried nucleus sampling and beam search; we report results obtained with top_k = 10, top_p = 0.5 and temperature = 1, which were slightly better. The model checkpoints used are: https://huggingface.co/moussaKam/barthez, https://huggingface.co/facebook/mbart-large-cc25, https://huggingface.co/moussaKam/mbarthez and https://huggingface.co/google/mt5-small.

4.2 Evaluation

During CCCS 2015 [7], the different systems were evaluated using the CCCS evaluation kit, with ROUGE-1.5.5 and Python 2. In our work, models were evaluated with the standard ROUGE score (with stemming) [17]. We report ROUGE-1, ROUGE-2 and ROUGE-L, which measure respectively the word overlap, bigram overlap and longest common subsequence between generated summaries and references. (The ROUGE metric used in Hugging Face is a wrapper around the Google Research reimplementation of ROUGE, https://github.com/google-research/google-research/tree/master/rouge; we used it to report all our results.)

4.3 Quantitative Results

Results for the different pre-trained language models are shown in Table 4. The table also reports the ROUGE-2 scores (recall only) of the baseline and submitted systems on French DECODA for the MultiLing challenge [7]. The first baseline was based on Maximal Marginal Relevance (Baseline-MMR); the second baseline was the first words of the longest turn in the conversation, up to the length limit (Baseline-L); the third baseline was the words of the longest
turn in the first 25% of the conversation, which usually corresponds to the description of the caller's problem (Baseline-LB). These baselines are described in more detail in [35]. Regarding previous computational models, the LIA-RAG system [18] was based on Jensen-Shannon (JS) divergence and a TF-IDF approach; we did not find any information about the NTNU system, nor more recent results or studies on the DECODA dataset.

Table 4: ROUGE-2 performance of the baseline and submitted systems for the MultiLing 2015 challenge on French DECODA, and ROUGE-1, ROUGE-2 and ROUGE-L scores of the pre-trained language models. The best results are boldfaced.

  Models                  ROUGE-1   ROUGE-2   ROUGE-L
  [7] Baseline-MMR           -        4.5        -
  [7] Baseline-L             -        4.0        -
  [7] Baseline-LB            -        4.6        -
  [7] NTNU:1                 -        3.5        -
  [18] LIA-RAG:1             -        3.7        -
  [13] BARThez #Spk        35.45     16.85     29.45
  [13] BARThez #NoSpk      33.01     15.30     27.92
  [24] BelGPT-2 #Spk       17.73      5.15     13.82
  [24] BelGPT-2 #NoSpk     18.06      4.55     13.15
  [20] mBART #Spk          33.76     14.13     27.66
  [20] mBART #NoSpk        31.60     13.69     26.20
  [13] mBARThez #Spk       34.95     15.86     28.85
  [13] mBARThez #NoSpk     35.31     15.52     28.61
  [39] mT5 #Spk            23.90      8.69     20.06
  [39] mT5 #NoSpk          27.82     11.08     23.36

Overall, in the challenge, the baselines were very difficult to beat, with ROUGE-2 scores of only about 4. This quite low score shows the great difficulty of the task. Indeed, the transcriptions contain overlapping turns, fillers, pauses, noise, partial words, etc. Furthermore, translators were required to translate speech phenomena such as disfluencies as closely as possible to the source language while maintaining 'naturalness' in the target language. Comparing the performance of the different pre-trained models fine-tuned on the DECODA summarization task, BARThez outperformed the other models, with a ROUGE-2 score of 16.85 on the Spk data version. In general, all pre-trained models outperform the previous results, and BART-based pre-trained models (BARThez, mBART, mBARThez) perform better than mT5 and BelGPT-2.
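As an illustration of what these metrics measure, a simplified recall-oriented ROUGE-n can be computed from clipped n-gram overlap (a bare-bones sketch that omits stemming, ROUGE-L and the aggregation details of the official implementation used to report our results):

```python
from collections import Counter

def ngrams(tokens, n):
    # multiset of n-grams in a token sequence
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Fraction of the reference's n-grams that also appear in the
    candidate (counts clipped so repetitions are not over-rewarded)."""
    ref = ngrams(reference, n)
    cand = ngrams(candidate, n)
    if not ref:
        return 0.0
    overlap = sum(min(c, ref[g]) for g, c in cand.items() if g in ref)
    return overlap / sum(ref.values())

ref = "demande d' horaires de train".split()
cand = "demande d' horaires de bus".split()
r1 = rouge_n_recall(cand, ref, 1)  # unigram recall -> 0.8
r2 = rouge_n_recall(cand, ref, 2)  # bigram recall  -> 0.75
```

ROUGE-2 is stricter than ROUGE-1 because a single substituted word breaks two bigrams, which is why ROUGE-2 scores in Table 4 are much lower than ROUGE-1 scores.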
The difference between Spk and NoSpk is not significant. For some models, such as BARThez and mBART, Spk (with the speakers' prefixes) performs slightly better than NoSpk, while for mT5, NoSpk (without the speakers' prefixes) performs better than Spk.

4.4 Qualitative Analysis

We observed that the BARThez pre-trained model fine-tuned on the Spk version of DECODA led to the best ROUGE scores. We then randomly selected 20 test samples and performed an error analysis of the generated predictions, comparing them to the references. The references were written by different annotators, and not all of them summarized the dialogue in the same way. Furthermore, because synopses are used to quickly convey information about the conversation between the caller and the RATP agent, they are even more abstractive than typical dialogue summaries. After comparing the synopses generated by BARThez (Spk) to the references written by annotators, we observed several common types of errors:
Table 5: (Translation in English) Doc FR_20091112_RATP_SCD_0826 from DECODA's test set, with the associated references and the summaries generated by fine-tuning the different models. Errors are marked in red.

References:
- Request for train schedules from Meaux station to Gare de l'Est at a given time. (1)
- schedules RER E from Meaux to Gare de l'Est (166)

Spk:
- BARThez: Request for train schedules to go from Meaux station to Paris.
- BelGPT-2: Request for information on the destination of the Meaux-Chappes-Saint-Lazare subway line. Bus transfer at Saint-Lazare station. Is the call possible. Is the call possible to do it. Is the call possible. Is the call possible. Is the call
- mBART: Request for SNCF train schedules to go to Paris from the station of Meaux from the station of Gare de Lyon in Paris. Request for timetables to go to the station of Meaux at sixteen hours forty-nine by leaving the station of Meaux at sixteen hours forty-nine by leaving the station of Gare de Lyon.
- mBARThez: train schedules in Île-de-France
- mT5: Request for timetable to go to the station of Meaux.

NoSpk:
- BARThez: Request for train schedules to go to Paris.
- BelGPT-2: , the journey from the station of Meaux to the Gare de Lyon, passing by the station of Bercy ) and finally the journey from the station of Bercy to the station of Bercy, passing by the station of Bercy, the journey from the station of Bercy to the station of Bercy, and finally the journey from the station of Bercy to the station of Montparnasse, the journey from the station of Bercy
- mBART: Request for schedules for an RER line between Meaux Meaux station and Gare de Lyon station. Contact with the concerned service.
- mBARThez: RER B schedules to go to Paris from Meaux station
- mT5: Request for timetable to go to the station of Meaux.

• Hallucination: This is a common type of error in neural NLG and is widely present in our generated synopses; it means that the generated synopses are inconsistent with or unrelated to the system input.
• Omission: The other most common type of error: some of the information present in the references is missing from the generated synopses.
• Grammatical error: In our case, the generated synopses are generally fluent, but they may contain incorrect syntax or semantics.

We categorized the errors and counted their frequencies in Table 6.

Table 6: The most common error types of the BARThez (Spk) model compared to the gold references over 20 sampled dialogues, with the number of occurrences for each error type.

  Type of error       Occurrences
  Hallucination         15 / 20
  Omission              15 / 20
  Grammatical error      5 / 20

We noticed that 7 of the 20 generated synopses end with a Communication of the relevant service number, which is present neither in the reference nor in the input dialogue. Since we truncated the input sequences to a maximum of 1024 tokens, this could be a consequence of truncation: the lack of information from the final part of the dialogue causes the system to sometimes generate a generic sentence. In addition, for each generated synopsis, we counted the number of errors. Note that the same type of error can be counted more than once: for omission, for example, if two different pieces of
information are missing from the generated synopsis, we count two errors. The error counts are reported in Table 7; the total number of errors ranges from 1 to 5 over the 20 randomly selected synopses.

Table 7: Number of synopses generated by BARThez (Spk) with a total number of errors from 1 to 5.

  Number of errors   Synopses (total: 20)
  1                          4
  2                          3
  3                         10
  4                          2
  5                          1

We noticed that half of the generated synopses contain 3 errors, 1 sample contains 5 errors, and 4 synopses contain only 1 error. For example, compared with the reference demande de précisions sur la tarification du métro, mais l'appelant, un particulier, est passé directement par le numéro VGC, refus du conseiller de le prendre en charge comme grand compte, indication du numéro 3246 pour qu'il rappelle, the generated synopsis Demande de renseignement sur les tarifs RATP sur Paris ou la zone parisienne. Communication du numéro de téléphone. has 3 errors:

• omission - mais l'appelant, un particulier, est passé directement par le numéro VGC;
• omission - refus du conseiller de le prendre en charge comme grand compte;
• omission - indication du numéro 3246 pour qu'il rappelle.

Besides, while doing the error analysis, we also noticed the different writing styles of the different annotators. For example, some gold references always begin with ask for (Demande de), while some other references are much shorter, for example lost phone > but not found (perdu téléphone > mais pas retrouvé). Taking FR_20091112_RATP_SCD_0826 from DECODA's test set as an example, we compared the summaries generated by fine-tuning the different pre-trained models with the references in Table 5 (translated into English; the original French version and the dialogue are in the Appendix). Errors are marked in red. For BARThez (Spk), which obtained the highest ROUGE-2 score, the synopsis generated for this test example is quite good; however, BARThez (NoSpk) does not mention from Meaux station.
In general, for fine-tuned BelGPT-2, the generated synopses are correct in some respects but suffer from factual inconsistency, and the model quickly starts to generate repetitive word sequences. Its poor performance may be due to the type of model (decoder only) and its small size. mBART (Spk) starts well but then produces incorrect and unrelated information, while mBART (NoSpk) incorrectly predicts Gare de Lyon station instead of Gare de l'Est. Fine-tuned mBARThez (Spk) seems correct but misses some information, such as from Meaux to Gare de l'Est; mBARThez (NoSpk) wrongly predicts RER B instead of RER E. mT5 gets the direction wrong: it should be from Meaux station to Gare de l'Est, but it generates to go to the station of Meaux. In addition, its generated synopses lack other information, such as request for train schedules, at a given time and to Gare de l'Est.

5 Discussion and Conclusion

Our experimental results show that the BARThez models provide the best performance, far beyond the previous state of the art on DECODA. However, even though the pre-trained models perform better than traditional extractive or abstractive methods, their outputs are difficult to control and suffer from hallucination and omission, as well as some grammatical errors. On the one hand, the maximum token length of pre-trained models poses challenges for summarizing such non-trivial dialogues. For long dialogue summarization, one can use a retrieve-then-summarize pipeline or end-to-end summarization models [40]. [42] designed a hierarchical
structure named HMNet to accommodate long meeting transcripts, together with a role vector to capture the differences between speakers. [2] introduced the Longformer, with an attention mechanism that scales linearly with sequence length, making it possible to process documents of thousands of tokens or more. However, these end-to-end summarization models capable of dealing with long sequences were both pre-trained on English. On the other hand, pre-trained models were mostly trained on news corpora, which are far from spontaneous conversation (repetitions, hesitations, etc.). Dialogue summarization therefore remains a challenging task. Besides, in call center conversation scenarios, the caller (or user) might express impatience, complaints or other emotions; we would like to further explore the DECODA corpus for emotion-related dialogue summarization.

6 Acknowledgements

This research has been conducted within the THERADIA project supported by the Banque Publique d'Investissement (BPI) and the Association Nationale de la Recherche et de la Technologie (ANRT), under grant agreement No. 2019/0729, and has been partially supported by MIAI@Grenoble-Alpes (ANR-19-P3IA-0003) and by COST Action Multi3Generation CA18231, supported by COST (European Cooperation in Science and Technology).

Appendix

Table 8 presents the references and the summaries generated by fine-tuning the different models for the test example FR_20091112_RATP_SCD_0826; the Spk and NoSpk versions of the dialogue are also shown below.

Spk version: This version presents the conversation with the speakers' information.
bonjour oui bonjour madame je vous appelle pour avoir des horaires de train en+fait c’ est pas pour le métro je sais pas si vous pouvez me les donner ou pas trains SNCF oui trains SNCF oui vous prenez quelle ligne monsieur euh la ligne euh enfin en+fait c’ est pas SNCF enfin c’ est Île-de-France quoi je sais pas comment ils appellent ça RER voilà c’ est pour les RER voilà et euh je prends la ligne euh Meaux de Meaux pour aller à Paris je sais plus c’ est laquelle c’ est la E je crois d’accord et vous voulez donc les horaires euh pour quel jour et à quel moment monsieur euh là pour euh tout à l’ heure euh pour euh aux environs de dix-sept heures en partant de la gare de Meaux euh vers la Gare+de+l’Est à Paris alors vous partez de Meaux et vous allez donc à la Gare+de+l’Est et vous voudriez les horaires voilà vers euh dix-sept heures alors moi je peux regarder ce+que ouais dix-sept heures ouais j’ ai comme horaires un instant monsieur s’il+vous+plaît d’accord il y a pas de souci monsieur s’il+vous+plaît oui donc voilà ce+que j’ ai comme horaires moi vous avez donc un départ à la gare de Meaux donc à seize heures quarante-neuf seize heures quarante-neuf d’accord et après vous avez celui de dix-sept heures dix-neuf
alors seize heures quarante-neuf dix-sept heures dix-neuf OK d’accord oui ben je vous remercie mais je vous en prie bonne journée madame merci à vous aussi monsieur au+revoir au+revoir

NoSpk version: This version presents the conversation without the speakers’ information.

bonjour. oui bonjour madame. je vous appelle pour avoir des horaires de train en+fait c’ est pas pour le métro je sais pas si vous pouvez me les donner ou pas. trains SNCF. oui trains SNCF oui. vous prenez quelle ligne monsieur. euh la ligne euh enfin en+fait c’ est pas SNCF enfin c’ est Île-de-France quoi je sais pas comment ils appellent ça. RER voilà c’ est pour les RER. voilà et euh je prends la ligne euh Meaux de Meaux pour aller à Paris je sais plus c’ est laquelle c’ est la. E je crois. d’accord et vous voulez donc les horaires euh pour quel jour et à quel moment monsieur. euh là pour euh tout à l’ heure euh pour euh aux environs de dix-sept heures en partant de la gare de Meaux. euh vers la Gare+de+l’Est à Paris. alors vous partez de Meaux et vous allez donc à la Gare+de+l’Est et vous voudriez les horaires voilà. vers euh dix-sept heures alors moi je peux regarder ce+que ouais dix-sept heures ouais. j’ ai comme horaires un instant monsieur s’il+vous+plaît d’accord il y a. pas de souci. monsieur s’il+vous+plaît. oui. donc voilà ce+que j’ ai comme horaires moi vous avez donc un départ à la gare de Meaux donc à seize heures quarante-neuf. seize heures quarante-neuf d’accord. et après vous avez celui de dix-sept heures dix-neuf. alors seize heures quarante-neuf dix-sept heures dix-neuf OK d’accord oui. ben je vous remercie. mais je vous en prie bonne journée madame. merci à vous aussi monsieur au+revoir. au+revoir.

FR_20091112_RATP_SCD_0826

References:
Demande d’horaires de train de la gare de Meaux à la gare de l’Est à une heure donnée. (1)
horaires RER E de Meaux à la Gare de l’Est (166)

Spk:
BARThez: Demande d’horaires de train pour aller de la gare de Meaux à Paris.
BelGPT-2: demande de renseignement sur la destination de la ligne de métro de Meaux-Chappes-Saint-Lazare. Transfert de bus à la gare de Saint-Lazare. L’appel est-il possible. L’appel est-il possible de le faire. L’appel est-il possible. L’appel est-il possible. L’appel est
mBART: Demande d’horaires de trains SNCF pour aller à Paris en partant de la gare de Meaux en partant de la gare de Gare de Lyon à Paris. Demande des horaires pour se rendre à la gare de Meaux à seize heures quarante-neuf en partant de la gare de Meaux à seize heures quarante-neuf en partant de la gare de Gare de Lyon.
mBARThez: horaires de train en Île-de-France
mT5: Demande de horaires pour aller à la gare de Meaux.

NoSpk:
BARThez: Demande d’horaires de train pour aller à Paris.
BelGPT-2: , le trajet de la gare de Meaux vers la Gare de Lyon, en passant par la gare de Bercy ) et enfin le trajet de la gare de Bercy à la gare de Bercy, en passant par la gare de Bercy, le trajet de la gare de Bercy vers la gare de Bercy, et enfin le trajet de la gare de Bercy vers la gare de Montparnasse, le trajet de la gare de Bercy
mBART: Demande d’horaires pour une ligne de RER entre la gare de Meaux de Meaux et la gare de Gare de Lyon. Mise en relation avec le service concerné.
mBARThez: horaires du RER B pour aller à Paris en partant de la gare de Meaux
mT5: Demande de horaires pour aller à la gare de Meaux.

Table 8: Doc FR_20091112_RATP_SCD_0826 from DECODA’s test set, and associated reference and generated summaries by fine-tuning different models. Errors are marked in red.
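Generated summaries such as those in Table 8 are commonly scored against the reference summaries with ROUGE [17]. The following is a minimal, self-contained sketch of a ROUGE-1 F1 score, shown only to illustrate how such a comparison works; it is a simplification (whitespace tokenization, no stemming, no multi-reference handling) and not the evaluation code used in this work.

```python
# Minimal ROUGE-1 F1 sketch: unigram-overlap F-measure between a
# reference summary and a generated (hypothesis) summary.
# This is a simplified illustration of the full ROUGE metric [17].
from collections import Counter


def rouge1_f1(reference: str, hypothesis: str) -> float:
    """F1 over unigram counts (clipped overlap via multiset intersection)."""
    ref = Counter(reference.lower().split())
    hyp = Counter(hypothesis.lower().split())
    overlap = sum((ref & hyp).values())  # clipped matching unigram count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Example with the first reference and the Spk BARThez output of Table 8.
ref = "Demande d’horaires de train de la gare de Meaux à la gare de l’Est à une heure donnée."
hyp = "Demande d’horaires de train pour aller de la gare de Meaux à Paris."
score = rouge1_f1(ref, hyp)
```

The real ROUGE package additionally reports ROUGE-2 and ROUGE-L and supports stemming and multiple references, but the precision/recall/F1 structure is the same as above.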
References

[1] Bechet, F., Maza, B., Bigouroux, N., Bazillon, T., El-Bèze, M., De Mori, R., & Arbillot, E. (2012). DECODA: a call-centre human-human spoken conversation corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12) (pp. 1343–1347). Istanbul, Turkey: European Language Resources Association (ELRA). URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/684_Paper.pdf.

[2] Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The long-document transformer. CoRR, abs/2004.05150. URL: https://arxiv.org/abs/2004.05150. arXiv:2004.05150.

[3] Buist, A. H., Kraaij, W., & Raaijmakers, S. (2004). Automatic summarization of meeting data: A feasibility study. In Proc. Meeting of Computational Linguistics in the Netherlands (CLIN).

[4] Carletta, J., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, T., Kadlec, J., Karaiskos, V., Kraaij, W., Kronenthal, M., Lathoud, G., Lincoln, M., Lisowska, A., McCowan, I., Post, W., Reidsma, D., & Wellner, P. (2006). The AMI meeting corpus: A pre-announcement. In S. Renals, & S. Bengio (Eds.), Machine Learning for Multimodal Interaction (pp. 28–39). Berlin, Heidelberg: Springer Berlin Heidelberg.

[5] Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 8440–8451). Online: Association for Computational Linguistics. URL: https://aclanthology.org/2020.acl-main.747. doi:10.18653/v1/2020.acl-main.747.

[6] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding.
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Minneapolis, Minnesota: Association for Computational Linguistics. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.

[7] Favre, B., Stepanov, E., Trione, J., Béchet, F., & Riccardi, G. (2015). Call centre conversation summarization: A pilot task at MultiLing 2015. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (pp. 232–236). Prague, Czech Republic: Association for Computational Linguistics. URL: https://aclanthology.org/W15-4633. doi:10.18653/v1/W15-4633.

[8] Feng, X., Feng, X., & Qin, B. (2021). A survey on dialogue summarization: Recent advances and new frontiers. CoRR, abs/2107.03175. URL: https://arxiv.org/abs/2107.03175. arXiv:2107.03175.

[9] Feng, X., Feng, X., Qin, L., Qin, B., & Liu, T. (2021). Language model as an annotator: Exploring DialoGPT for dialogue summarization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 1479–1491). Online: Association for Computational Linguistics. URL: https://aclanthology.org/2021.acl-long.117. doi:10.18653/v1/2021.acl-long.117.

[10] Gliwa, B., Mochol, I., Biesek, M., & Wawer, A. (2019). SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization (pp. 70–79). Hong Kong, China: Association for Computational Linguistics. URL: https://aclanthology.org/D19-5409. doi:10.18653/v1/D19-5409.

[11] Hermann, K. M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., & Blunsom, P. (2015). Teaching machines to read and comprehend. Advances in neural information processing systems, 28, 1693–1701.
[12] Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., Peskin, B., Pfau, T., Shriberg, E., Stolcke, A., & Wooters, C. (2003). The ICSI meeting corpus. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03) (pp. I–I). volume 1. doi:10.1109/ICASSP.2003.1198793.

[13] Kamal Eddine, M., Tixier, A., & Vazirgiannis, M. (2021). BARThez: a skilled pretrained French sequence-to-sequence model. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 9369–9390). Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. URL: https://aclanthology.org/2021.emnlp-main.740. doi:10.18653/v1/2021.emnlp-main.740.

[14] Launay, J., Tommasone, G. L., Pannier, B., Boniface, F., Chatelain, A., Cappelli, A., Poli, I., & Seddah, D. (2021). PAGnol: An Extra-Large French Generative Model. Research Report, LightOn. URL: https://hal.inria.fr/hal-03540159.

[15] Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L., & Schwab, D. (2020). FlauBERT: Unsupervised language model pre-training for French. In Proceedings of the 12th Language Resources and Evaluation Conference (pp. 2479–2490). Marseille, France: European Language Resources Association. URL: https://aclanthology.org/2020.lrec-1.302.

[16] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2020). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7871–7880). Online: Association for Computational Linguistics. URL: https://aclanthology.org/2020.acl-main.703. doi:10.18653/v1/2020.acl-main.703.

[17] Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (pp. 74–81).
Barcelona, Spain: Association for Computational Linguistics. URL: https://www.aclweb.org/anthology/W04-1013.

[18] Linhares, E., Torres-Moreno, J.-M., & Linhares, A. C. (2018). LIA-RAG: a system based on graphs and divergence of probabilities applied to Speech-To-Text Summarization. Working paper or preprint. URL: https://hal.archives-ouvertes.fr/hal-01779304.

[19] Liu, C., Wang, P., Xu, J., Li, Z., & Ye, J. (2019). Automatic dialogue summary generation for customer service. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19) (pp. 1957–1965). New York, NY, USA: Association for Computing Machinery. URL: https://doi.org/10.1145/3292500.3330683. doi:10.1145/3292500.3330683.

[20] Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., & Zettlemoyer, L. (2020). Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8, 726–742. URL: https://aclanthology.org/2020.tacl-1.47. doi:10.1162/tacl_a_00343.

[21] Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., & Zettlemoyer, L. (2020). Multilingual denoising pre-training for neural machine translation. arXiv, https://arxiv.org/abs/2001.08210.

[22] Liu, Y., & Lapata, M. (2019). Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3730–3740). Hong Kong, China: Association for Computational Linguistics. URL: https://aclanthology.org/D19-1387. doi:10.18653/v1/D19-1387.

[23] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.

[24] Louis, A. (2020). BelGPT-2: a GPT-2 model pre-trained on French corpora. URL: https://github.com/antoiloui/belgpt2.
[25] Martin, L., Muller, B., Ortiz Suárez, P. J., Dupont, Y., Romary, L., de la Clergerie, É., Seddah, D., & Sagot, B. (2020). CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7203–7219). Online: Association for Computational Linguistics. URL: https://aclanthology.org/2020.acl-main.645. doi:10.18653/v1/2020.acl-main.645.

[26] Müller, M., & Laurent, F. (2022). Cedille: A large autoregressive French language model. arXiv, https://arxiv.org/abs/2202.03371.

[27] Narayan, S., Cohen, S. B., & Lapata, M. (2018). Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium.

[28] Oya, T., Mehdad, Y., Carenini, G., & Ng, R. (2014). A template-based abstractive meeting summarization: Leveraging summary and source text relationships. In Proceedings of the 8th International Natural Language Generation Conference (INLG) (pp. 45–53). Philadelphia, Pennsylvania, U.S.A.: Association for Computational Linguistics. URL: https://www.aclweb.org/anthology/W14-4407. doi:10.3115/v1/W14-4407.

[29] Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.

[30] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21, 140:1–140:67. URL: http://jmlr.org/papers/v21/20-074.html.

[31] Sacks, H., Schegloff, E. A., & Jefferson, G. (1978). A simplest systematics for the organization of turn taking for conversation. In Studies in the organization of conversational interaction (pp. 7–55). Elsevier.

[32] Simoulin, A., & Crabbé, B. (2021). Un modèle transformer génératif pré-entraîné pour le français.
In Traitement Automatique des Langues Naturelles (pp. 245–254). ATALA.

[33] Stepanov, E., Favre, B., Alam, F., Chowdhury, S., Singla, K., Trione, J., Béchet, F., & Riccardi, G. (2015). Automatic summarization of call-center conversations. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2015), Scottsdale, Arizona, USA.

[34] Tamura, A., Ishikawa, K., Saikou, M., & Tsuchida, M. (2011). Extractive summarization method for contact center dialogues based on call logs. In Proceedings of 5th International Joint Conference on Natural Language Processing (pp. 500–508). Chiang Mai, Thailand: Asian Federation of Natural Language Processing. URL: https://www.aclweb.org/anthology/I11-1056.

[35] Trione, J. (2014). Extraction methods for automatic summarization of spoken conversations from call centers (Méthodes par extraction pour le résumé automatique de conversations parlées provenant de centres d’appels) [in French]. In Proceedings of TALN 2014 (Volume 4: RECITAL - Student Research Workshop) (pp. 104–111). Marseille, France: Association pour le Traitement Automatique des Langues. URL: https://aclanthology.org/F14-4010.

[36] Trione, J. (2017). Extraction methods for automatic summarization of spoken conversations from call centers (Méthodes par extraction pour le résumé automatique de conversations parlées provenant de centres d’appels) [in French]. Thèse de doctorat en informatique, Aix-Marseille Université.

[37] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc. URL: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[38] Wang, L., & Cardie, C. (2011). Summarizing decisions in spoken meetings. In Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages (pp. 16–24). Portland, Oregon: Association for Computational Linguistics. URL: https://www.aclweb.org/anthology/W11-0503.

[39] Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., & Raffel, C. (2021). mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 483–498). Online: Association for Computational Linguistics. URL: https://aclanthology.org/2021.naacl-main.41. doi:10.18653/v1/2021.naacl-main.41.

[40] Zhang, Y., Ni, A., Yu, T., Zhang, R., Zhu, C., Deb, B., Celikyilmaz, A., Awadallah, A. H., & Radev, D. (2021). An exploratory study on long dialogue summarization: What works and what’s next. In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 4426–4433). Punta Cana, Dominican Republic: Association for Computational Linguistics. URL: https://aclanthology.org/2021.findings-emnlp.377. doi:10.18653/v1/2021.findings-emnlp.377.

[41] Zhao, L., Xu, W., & Guo, J. (2020). Improving abstractive dialogue summarization with graph structures and topic words. In Proceedings of the 28th International Conference on Computational Linguistics (pp. 437–449). Barcelona, Spain (Online): International Committee on Computational Linguistics. URL: https://aclanthology.org/2020.coling-main.39. doi:10.18653/v1/2020.coling-main.39.

[42] Zhu, C., Xu, R., Zeng, M., & Huang, X. (2020). A hierarchical network for abstractive meeting summarization with cross-domain pretraining. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 194–203). Online: Association for Computational Linguistics. URL: https://aclanthology.org/2020.findings-emnlp.19. doi:10.18653/v1/2020.findings-emnlp.19.