Projection of Turn Completion in Incremental Spoken Dialogue Systems


Erik Ekstedt (erikekst@kth.se) and Gabriel Skantze (skantze@kth.se)
KTH Speech, Music and Hearing, Stockholm, Sweden

Abstract

The ability to take turns in a fluent way (i.e., without long response delays or frequent interruptions) is a fundamental aspect of any spoken dialog system. However, practical speech recognition services typically induce a long response delay, as it takes time before the processing of the user's utterance is complete. There is a considerable amount of research indicating that humans achieve fast response times by projecting what the interlocutor will say and estimating upcoming turn completions. In this work, we implement this mechanism in an incremental spoken dialog system, using a language model that generates possible futures to project upcoming completion points. In theory, this could make the system more responsive, while still having access to semantic information not yet processed by the speech recognizer. We conduct a small study which indicates that this is a viable approach for practical dialog systems, and a promising direction for future research.

1   Introduction

One of the most fundamental conversational behaviours of any spoken dialog system (SDS) is turn-taking, i.e., taking turns without long response delays or frequent interruptions (Skantze, 2021). To achieve this, the system must be able to correctly identify when the user is yielding the turn (and a response is appropriate), and when the user is simply making a mid-utterance pause.

In their seminal work, Sacks et al. (1974) describe general properties of human-human conversation in which they observe that, overwhelmingly, one speaker talks at a time and the time between consecutive turns (the response time) is minimal. For the English language, a typical response time is around 200 ms, and similar response patterns seem to be consistent across different cultures (Stivers et al., 2009). Contrary to this, current SDSs typically have response delays of around 700-1000 ms. The reason is that they typically rely solely on silence to determine when to take the turn, whereas humans also use other cues, such as prosody, gaze and syntactic completeness (Skantze, 2021). Many studies have investigated how to include such features in turn-taking models for SDSs (Ferrer et al., 2002; Sato et al., 2002; Schlangen, 2006; Raux and Eskenazi, 2008; Meena et al., 2013; Maier et al., 2017; Lala et al., 2019).

Another difference between human turn-taking and SDSs is that humans do not only react to turn-yielding cues from the interlocutor. If they were simply waiting for a cue and only then started to formulate a response, psycholinguistic research estimates that the response time would be around 600-1500 ms (Levinson and Torreira, 2015), which is substantially slower than the observed response times. This indicates that humans also project turn completions in advance, before the turn is complete (Sacks et al., 1974; Levinson and Torreira, 2015; Garrod and Pickering, 2015).

In this paper, we investigate whether the human ability to project future turn completions could be a viable option for conversational systems to achieve more fluent turn-taking. We constrain our approach to the textual domain, using a pre-trained conversational language model to project future words and turn-completions.

The projection of turn-completions in SDSs can have a number of applications. For example, the system could initiate a turn just before the end of the user's utterance to minimize response time, or even take the turn with a small overlap. It could also give the system more time to generate a response, or be used to address the problem of processing delays. For example, SDSs rely heavily on Automatic Speech Recognition (ASR) to extract the text from the user's speech. Most ASR services are associated with a certain latency (Baumann et al., 2017; Addlesee et al., 2020). For turn-taking, this means that even if the system has detected that the user has stopped speaking, it is hard to determine whether the turn has been yielded, since the final ASR result is not yet available.

There has been some previous research on predicting upcoming activity in dialog, such as recognizing NLU intents on incomplete user speech (DeVault et al., 2009), projecting prosodic information and timing (Ward et al., 2010; Baumann and Schlangen, 2011), as well as estimating future voice activity (Skantze, 2017; Roddy et al., 2018; Ward et al., 2018). However, we are not aware of any previous studies of how an SDS could predict upcoming words in the user's speech and use this for managing turn-taking.
                                                           once the VAD module is inactive and the ASR has
2   Conversational agent

For our study, we implemented an SDS that performs an interview with a user, talking about past travel memories, similar to Johansson et al. (2016). We chose this domain because the dialog manager can be implemented in a very simple way, while the turn-taking can be challenging, as pauses within the user's turn might be more frequent than in, for example, a Q/A system. An example dialog can be found in Appendix A.1.

A general first step for modelling responsive turn-taking is to use an incremental dialog architecture, where the user's speech is processed incrementally, so that decisions can be made in a more continuous fashion (Schlangen and Skantze, 2009). For this study, we build upon the recent Retico framework (Michael, 2020), implemented in Python (https://github.com/Uhlo/retico), which implements the general, abstract model of incremental dialog processing proposed by Schlangen and Skantze (2009).

The system processes incoming user speech and outputs audio. The incoming incremental audio chunks are processed by a local voice activity detection (VAD) component and streamed to a remote incremental ASR service (Google). The VAD triggers on silences of 200 ms, which defines inter-pausal units (IPUs).

A user turn is started when both the VAD detects ongoing speech and the ASR has provided its first hypothesis. If the VAD module activates during an ongoing agent utterance, an interruption component is triggered. This module checks how much of the planned audio has been transmitted and stops the ongoing utterance if less than 80% has been sent. The interrupted utterance is then repeated as the system's next response. If the agent has completed an utterance and the user is inactive for 5 seconds, a fallback is triggered and the agent continues the conversation by producing a new utterance.
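A minimal sketch of this interruption and fallback behaviour might look as follows; the agent/utterance interfaces and method names are our own illustration, not Retico's actual API:

```python
def on_user_speech_onset(agent):
    """Sketch of the interruption component: if the user starts speaking
    while the agent's utterance is still mostly unplayed, stop it and
    queue it for repetition (interfaces are illustrative, not Retico's)."""
    utt = agent.current_utterance
    if utt is not None and utt.fraction_transmitted() < 0.8:
        utt.stop_audio()
        agent.next_response = utt.text   # repeat the interrupted utterance

def on_user_inactivity(agent, silence_s):
    """Fallback: if the agent has finished its utterance and the user
    stays silent for 5 s, continue with a new utterance."""
    if agent.current_utterance is None and silence_s >= 5.0:
        agent.produce_new_utterance()
```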
For the simplicity of our experiment, the dialog manager is defined by a set of predetermined questions, where the only possible deviation occurs if the user provides an overly short utterance. If such a short utterance is recognized, the system randomly chooses from a set of paraphrased responses that encourage the user to elaborate.

In this study, we implement two different turn-taking policies: the baseline and the projection model. The baseline defines a user turn as complete once the VAD module is inactive and the ASR has produced its final hypothesis.

3   Turn-completion projection model

To make projections, we utilize the TurnGPT model by Ekstedt and Skantze (2020), which is a pre-trained GPT-2 (Radford et al., 2019) language model (LM) fine-tuned on conversational data. The model was trained on data from seven publicly available dialog datasets, listed in Appendix A.2, until the validation loss reached a minimum, resulting in an average validation perplexity of 17.6.

The model includes special tokens that encode speaker shifts, which we will refer to as turn-completions. As shown by Ekstedt and Skantze (2020), the model does not only consider the ongoing user turn, but also benefits from taking the larger dialog context into account (i.e., previous turns by the system and the user).

Given the currently recognized user words (and the dialog context), a set of N possible continuations (of length M) is generated, using a temperature τ and top-k sampling. The number of continuations that include a turn-completion is counted, which gives a ratio. This ratio approximates the probability of an "actual" turn-completion point in the near future. If the ratio is larger than a threshold R, the turn is predicted to be complete.
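To make the decision rule concrete, the following is a minimal sketch, assuming a Hugging Face-style GPT-2 interface and a dedicated turn-shift token; the token name "<ts>", the function and its signature are our illustration, not the authors' released code:

```python
import torch

@torch.no_grad()
def project_turn_completion(model, tokenizer, context_ids,
                            n=10, m=3, top_k=5, temperature=1.0, r=0.4):
    """Sample n continuations of length m and return whether the
    fraction containing a turn-shift token exceeds the threshold r.
    Defaults follow Table 1; context_ids is the tokenized dialog
    context plus the currently recognized user words, shape (1, len)."""
    ts_id = tokenizer.convert_tokens_to_ids("<ts>")  # assumed token name
    n_completions = 0
    for _ in range(n):
        ids = context_ids
        for _ in range(m):
            logits = model(ids).logits[0, -1] / temperature
            top = torch.topk(logits, top_k)            # top-k filtering
            probs = torch.softmax(top.values, dim=-1)
            next_id = top.indices[torch.multinomial(probs, 1)]
            if next_id.item() == ts_id:                # turn-completion
                n_completions += 1
                break
            ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
    ratio = n_completions / n
    return ratio > r, ratio
```

With the Table 1 settings (N = 10, M = 3), a single projection costs at most N·M forward passes, which could be batched in practice.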
In this setup we strive towards simplicity and only trigger a projection at the end of each user IPU. However, if new ASR hypotheses are received after this, new projections are made until the system decides to take the turn. The projection model uses a maximum silence threshold T as a fallback, which triggers a response regardless of the projections.
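Schematically, this policy can be seen as an event loop over VAD and ASR updates (a sketch under our own assumptions about the event interface; in a real incremental system the fallback would fire from a timer rather than being polled on events):

```python
import time

def projection_policy(events, project, fallback_t=1.25):
    """Sketch of the turn-taking policy: project at each IPU end
    (200 ms of silence), re-project on later ASR hypotheses, and use
    the silence threshold T as a fallback. Event types are hypothetical."""
    silence_start = None
    for event in events:
        if event.type == "speech_start":       # user resumed speaking
            silence_start = None
        elif event.type == "ipu_end":          # 200 ms silence detected
            silence_start = time.monotonic()
            if project(event.hypothesis):      # completion ratio > R
                yield "take_turn"
                silence_start = None
        elif event.type == "asr_update" and silence_start is not None:
            if (project(event.hypothesis)      # re-project on new words
                    or time.monotonic() - silence_start >= fallback_t):
                yield "take_turn"              # projected, or fallback T
                silence_start = None
```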
These parameters can potentially be fine-tuned for the specific application (or user). This was not done in our study; we selected values we found reasonable in preliminary tests, which are shown in Table 1. An example taken from one of the interactions is illustrated in Figure 1.
                                                            necting its microphone to the zoom speakers and
Parameter                    Value
IPU                          0.2 s
Turn-completion ratio (R)    0.4
Fallback threshold (T)       1.25 s
Sampling:
  Continuations (N)          10
  Length (M)                 3
  top-k                      5
  Temperature (τ)            1.0
  Max context                70

Table 1: The parameters for the model.

[Figure 1: Illustration of language projection. The blue box represents the agent and the green boxes the recognized user words at two projection moments. The red boxes show a subset of projections made by the LM.]
4   Experiment

To evaluate the model, we conducted an experiment over Zoom (https://zoom.us/) in which ten participants had two conversations each with the agent (testing the two turn-taking policies) about two distinct travel memories. The participants were asked to choose a memory prior to each agent interaction. We used two sets of paraphrased questions, assigned randomly between the two policies. After completing both interactions, the participants were asked to annotate the recorded dialogues by labeling moments where they felt they had been interrupted by the system. To do this, they were provided with a graphical tool in which they could see the waveforms of the dialogs and play them, as well as insert labels.

The agent interacted directly over Zoom by connecting its microphone to the Zoom speakers and vice versa. All audio was recorded directly on the agent side, in the same way as in a live setup.

5   Results

Ten subjects interacted with the system, resulting in a total of 20 interactions, with an average duration of 3 minutes and 43 seconds. The number of questions varied with the number of triggered elaboration requests: the baseline agent asked the users to elaborate 33 times, almost double the 17 of the projection model. A transcript of an interaction is shown in Appendix A.1.

The total number of agent shifts (transitions between the user and the agent) was 220 for the baseline and 210 for the projection model. Their durations (i.e., response times) are shown in the histogram in Figure 2. The average response times were 1.03 and 0.80 seconds for the baseline and projection agent, respectively. While this difference is not very large, it should be noted that the projection model has a bimodal distribution (as seen in Figure 2), representing early predicted turn shifts and fallbacks. Thus, the model is able to take the turn quickly at some points, while allowing for more time at others.

[Figure 2: A histogram over the response times for each agent.]

The users annotated 18 of the agent shifts as interruptions for the baseline, and 28 for the projection model. The estimated average cut-in rate, defined as the annotated interruptions divided by the number of agent shifts, was thus 0.08 (18/220) for the baseline and 0.13 (28/210) for the projection model.

When evaluating the performance of a turn-taking model, both response time and cut-in rate should be taken into account (i.e., both should be minimized) (Raux and Eskenazi, 2008). However, there is typically also a trade-off between these two factors. Since both values differed between the baseline and projection model, they are difficult to compare directly.
One way of doing that is to perform an analysis of what would happen if we reduce the maximum allowed response time (for the projection model this is the parameter T). As we do this, the average response time will also be reduced, while the cut-in rate will increase, since silences between user IPUs longer than T become both additional cut-ins and additional agent shifts. The result of this analysis is shown in Figure 3.
[Figure 3: Cut-in rate vs. response time. The points represent the aggregate values over the interactions and the lines the estimated performance given varying values of T.]
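In outline, the projection-model line in Figure 3 corresponds to a post-hoc computation over the logged silence durations, roughly as follows (the exact bookkeeping of the additional shifts is our reconstruction, not code from the paper):

```python
def sweep_fallback_threshold(response_gaps, within_turn_pauses,
                             annotated_cut_ins, t_max):
    """Estimate mean response time and cut-in rate if the agent had
    always taken the turn after at most t_max seconds of silence.
    Within-turn pauses longer than t_max become both additional
    cut-ins and additional agent shifts, as described above."""
    capped = [min(g, t_max) for g in response_gaps]  # earlier take-turn
    extra = sum(1 for p in within_turn_pauses if p > t_max)
    n_shifts = len(capped) + extra
    mean_rt = sum(capped + [t_max] * extra) / n_shifts
    cut_in_rate = (annotated_cut_ins + extra) / n_shifts
    return mean_rt, cut_in_rate
```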
This analysis enables a direct comparison of the agents over the range where both lines are defined. The figure shows that the projection agent is more responsive and produces fewer interruptions, as the green line is strictly below the red. The greatest difference occurs at around 0.48 s on the x-axis, with a cut-in rate difference of 0.1, given threshold values of 0.5 and 0.6 seconds for the baseline and projection agents, respectively.

6   Discussion

To our knowledge, all previous work on end-of-utterance detection in SDSs has relied on models that are specifically trained on data from the target domain. In contrast, we have used a generic LM (TurnGPT) with a set of basic parameters that were not fine-tuned on domain data. If the LM and the parameters were fine-tuned, we could expect further improvements. An analysis of the perplexity of the LM on the recorded data shows a rather high perplexity (ppl ≈ 80). Another obvious improvement would be to also include prosodic features.

An important question we have not addressed here is how good the projections are at predicting the last words exactly (i.e., not just how well the system predicts whether there will be a turn completion). Depending on the domain of the system, this might be more or less important. In this respect, the comparison of the baseline and projection models (presented in Figure 3) is somewhat unfair to the projection model, since we could not reduce the response time of the baseline model without also truncating the ASR result.

The proposed model makes turn-completion decisions exclusively in the textual domain, restricted by the latency of the ASR, at the end of user IPUs. In practice, this means that we are more likely to "project" already spoken words currently being processed by the ASR, as opposed to the actual future activity of the user. This could be mitigated by using a more reactive IPU trigger, increasing the number of projection events during a user utterance, and by using a longer continuation length that surpasses the latency of the ASR. If so, the system could potentially also start to respond before the user has stopped speaking (i.e., producing overlapping speech).

Another important aspect is that the interactions were all conducted over Zoom, which introduces added latency. This also makes the probability of cut-ins greater than it would have been in a live setup.
7   Conclusion

In conversation, humans project future turn-completion points in order to achieve faster response times. In this paper, we have investigated whether it is possible to implement this ability in an SDS. The projections are made in the textual domain by generating future dialog continuations with a conversational LM (TurnGPT). We conducted a small study and show, as a proof-of-concept, that this approach is viable. We note that there is room for improvement, such as optimizing the hyperparameters, training and using a task-specific LM, projecting turn-completions at finer increments, and adding prosodic features. However, we believe that using a text-based LM to project turn-completions, as a way to improve the turn-taking abilities of an SDS, will be common and useful in the future of conversational systems.

Acknowledgements

This work is supported by the Swedish Research Council (VR) project "Prediction and Coordination for Conversational AI" (2020-03812) and the Bank of Sweden Tercentenary Foundation (RJ) project "Understanding predictive models of turn-taking in spoken interaction" (P20-0484).
References

Angus Addlesee, Yanchao Yu, and Arash Eshghi. 2020. A comprehensive evaluation of incremental speech recognition and diarization for conversational AI. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3492–3503, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Timo Baumann, Casey Kennington, Julian Hough, and David Schlangen. 2017. Recognising Conversational Speech: What an Incremental ASR Should Do for a Dialogue System and How to Get There, pages 421–432. Springer Singapore, Singapore.

Timo Baumann and David Schlangen. 2011. Predicting the micro-timing of user input for an incremental spoken dialogue system that completes a user's ongoing turn. In Proceedings of the SIGDIAL 2011 Conference, pages 120–129, Portland, Oregon. Association for Computational Linguistics.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, Brussels, Belgium. Association for Computational Linguistics.

Bill Byrne, Karthik Krishnamoorthi, Chinnadhurai Sankar, Arvind Neelakantan, Daniel Duckworth, Semih Yavuz, Ben Goodrich, Amit Dubey, Kyu-Young Kim, and Andy Cedilnik. 2019. Taskmaster-1: Toward a realistic and diverse dialog dataset.

David DeVault, Kenji Sagae, and David Traum. 2009. Can I finish? Learning when to respond to incremental interpretation results in interactive dialogue. In Proceedings of the SIGDIAL 2009 Conference, pages 11–20, London, UK. Association for Computational Linguistics.

Erik Ekstedt and Gabriel Skantze. 2020. TurnGPT: a transformer-based language model for predicting turn-taking in spoken dialog. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2981–2990, Online. Association for Computational Linguistics.

Mihail Eric, Rahul Goel, Shachi Paul, Abhishek Sethi, Sanchit Agarwal, Shuyang Gao, and Dilek Hakkani-Tür. 2019. MultiWOZ 2.1: Multi-domain dialogue state corrections and state tracking baselines. CoRR, abs/1907.01669.

L. Ferrer, E. Shriberg, and A. Stolcke. 2002. Is the speaker done yet? Faster and more accurate end-of-utterance detection using prosody. In Proceedings of ICSLP 2002, pages 2061–2064.

Simon Garrod and Martin J. Pickering. 2015. The use of content and timing to predict turn transitions. Frontiers in Psychology, 6:751.

Martin Johansson, Tatsuro Hori, Gabriel Skantze, Anja Höthker, and Joakim Gustafson. 2016. Making turn-taking decisions for an active listening robot for memory training. In Proceedings of the International Conference on Social Robotics, volume 9979 LNAI, pages 940–949.

Divesh Lala, Koji Inoue, and Tatsuya Kawahara. 2019. Smooth turn-taking by a robot using an online continuous model to generate turn-taking cues. In 2019 International Conference on Multimodal Interaction, ICMI '19, pages 226–234, New York, NY, USA. Association for Computing Machinery.

Sungjin Lee, Hannes Schulz, Adam Atkinson, Jianfeng Gao, Kaheer Suleman, Layla El Asri, Mahmoud Adada, Minlie Huang, Shikhar Sharma, Wendy Tay, and Xiujun Li. 2019. Multi-domain task-completion dialog challenge. In Dialog System Technology Challenges 8.

Stephen C. Levinson and Francisco Torreira. 2015. Timing in turn-taking and its implications for processing models of language. Frontiers in Psychology, 6:731.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Angelika Maier, Julian Hough, and David Schlangen. 2017. Towards deep end-of-turn prediction for situated spoken dialogue systems. In Proc. Interspeech 2017, pages 1676–1680.

Raveesh Meena, Gabriel Skantze, and Joakim Gustafson. 2013. A data-driven model for timing feedback in a map task dialogue system. In Proceedings of the SIGDIAL 2013 Conference, pages 375–383, Metz, France. Association for Computational Linguistics.

Thilo Michael. 2020. Retico: An incremental framework for spoken dialogue systems. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 49–52, 1st virtual meeting. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Filip Radlinski, Krisztian Balog, Bill Byrne, and Karthik Krishnamoorthi. 2019. Coached conversational preference elicitation: A case study in understanding movie preferences. In Proceedings of the Annual SIGdial Meeting on Discourse and Dialogue.

Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic open-domain conversation models: a new benchmark and dataset. In ACL.

Antoine Raux and Maxine Eskenazi. 2008. Optimizing endpointing thresholds using dialogue features in a spoken dialogue system. In Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, pages 1–10, Columbus, Ohio. Association for Computational Linguistics.

Matthew Roddy, Gabriel Skantze, and Naomi Harte. 2018. Investigating speech features for continuous turn-taking prediction using LSTMs. In Proceedings of Interspeech, Hyderabad, India.

H. Sacks, Emanuel Schegloff, and G. Jefferson. 1974. A simplest systematics for the organization of turn-taking for conversation. Language, 50:696–735.

Ryo Sato, Ryuichiro Higashinaka, Masafumi Tamoto, Mikio Nakano, and Kiyoaki Aikawa. 2002. Learning decision trees to determine turn-taking by spoken dialogue systems. In 7th International Conference on Spoken Language Processing, ICSLP 2002 - INTERSPEECH 2002, Denver, Colorado, USA. ISCA.

David Schlangen. 2006. From reaction to prediction: experiments with computational models of turn-taking. In Proceedings of Interspeech 2006, pages 2010–2013, Pittsburgh, PA, USA.

David Schlangen and Gabriel Skantze. 2009. A general, abstract model of incremental dialogue processing. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 710–718, Athens, Greece. Association for Computational Linguistics.

Gabriel Skantze. 2017. Towards a general, continuous model of turn-taking in spoken dialogue using LSTM recurrent neural networks. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 220–230, Saarbrücken, Germany. Association for Computational Linguistics.

Gabriel Skantze. 2021. Turn-taking in conversational systems and human-robot interaction: A review. Computer Speech & Language, 67:101178.

Tanya Stivers, N. J. Enfield, Penelope Brown, Christina Englert, Makoto Hayashi, Trine Heinemann, Gertie Hoymann, Federico Rossano, Jan Peter de Ruiter, Kyung-Eun Yoon, and Stephen C. Levinson. 2009. Universals and cultural variation in turn-taking in conversation. Proceedings of the National Academy of Sciences, 106(26):10587–10592.

N. G. Ward, D. Aguirre, G. Cervantes, and O. Fuentes. 2018. Turn-taking predictions across languages and genres using an LSTM recurrent neural network. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 831–837.

Nigel G. Ward, Olac Fuentes, and Alejandro Vega. 2010. Dialog prediction for a general model of turn-taking. In Proceedings of Interspeech 2010, pages 2662–2665. International Speech Communication Association.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.
A   Appendix

Appendix A.1 shows a transcript of a user interaction with the projection agent. We highlight a negative turn-completion projection with [H] (hold) and a positive one with [S] (shift). Note that these transcriptions and highlights are shown from the agent's perspective (i.e., based on what the ASR has reported), and not on what was actually said. Given a continuation length of M=3, an [S] indicates that the model estimates that there are fewer than 3 tokens left in the turn.

The blue text shows words that were recognized by the ASR after the decision to take the turn was made. That is, blue text to the right of an [S] had been spoken by the user at the time of projection, but was not accessible to the model. The orange bracketed annotations are used to comment on specific events.

A.1   Example dialog

Agent: Hi there, in this session I want to talk about one of your past travels. Where did you travel?
User: I travel to the United States[H] of[H] America[H]
Agent: What mode of transportation did you use to get to your travel destination?
User: we went there by[S] airplane
Agent: Would you say that it was a summer or winter trip?
User: it was a summer[S] trip
Agent: Let's see, did you try any new foods that you could tell me about?
User: and yeah we tried a lot of foods then like[H] a[H] burritos and[H] vegan pizza[H]
Agent: What is your best memory from the trip?
User: the best memory from the Westland and me and my husband went to Central Park and it was a sunny day and we sat there and ate some[H] noodles[S]
Agent: So, did you visit any tourist attractions?
User: yeah we[H] did a lot of tourist attractions like the World Trade Center[H]
Agent: Can you tell me [interrupted by user cough]
Agent: Can you tell me anything interesting about your trip?
User: yes it was very interesting to go by car to the[H] different states we went to with a car[H] from from Denver in Colorado all the way to Los Angeles in[H] California[S]
Agent: Do you have any recommendations for people traveling there?
User: yeah I would definitely definitely recommend them to go to New York and[H] the Grand Canyon[H]
Agent: Let's see, how was the weather during your trip?
User: it was very warm[H] it was in the summer[H]
Agent: So, could you talk about who you traveled with?
User: yeah I traveled with my husband[H] Rasmus[S]
Agent: Yeah, would you recommend others to go there?
User: absolutely[H]
Agent: Please, tell me more about that. [elaboration request]
User: well because America is[H] such[H] a big country it's a lot of things to see[H] something[H] that suits[S] all yeah[H]
Agent: Let's [interrupted by "yeah"]
Agent: Let's see, do you think that such a trip is suitable for families?
User: yeah[H] I[H] think[H] so[S]
Agent: Did you do anything that you have not done before?
User: yeah we were hiking in the mountains near Grand Canyon I hadn't done[S] that before
Agent: Do you think you are going to revisit any time soon?
User: yes[H] I think so[H]
Agent: So, was it a vacation or work related trip?
User: it was a[H] vacation[H]
Agent: For how long were you away?
User: we were away for three weeks three weeks[H] I[H] think I almost[S] 4 weeks
Agent: How was the journey home?
User: it was a long journey home[H] and it was good[S]
Agent: Thank you for answering my questions. This session is over. Goodbye.

A.2   Datasets

We trained TurnGPT on seven publicly available dialog datasets, namely Taskmaster (Byrne et al., 2019), MetaLWOZ (Lee et al., 2019), MultiWOZ (Budzianowski et al., 2018; Eric et al., 2019), Coached Conversational Preference Elicitation (Radlinski et al., 2019), Persona (Zhang et al., 2018), DailyDialog (Li et al., 2017), and Empathetic Dialogues (Rashkin et al., 2019).