How "open" are the conversations with open-domain chatbots? A proposal for Speech Event based evaluation - SIGdial

Page created by Debbie Sherman
 
CONTINUE READING
How “open” are the conversations with open-domain chatbots?
               A proposal for Speech Event based evaluation

                     A. Seza Doğruöz                                 Gabriel Skantze
                  LT3 , Dept. Translation,                       KTH Speech, Music and Hearing
             Interpreting and Communication                           Stockholm, Sweden
                Ghent University, Belgium                            skantze@kth.se
               as.dogruoz@ugent.be

                      Abstract                               appealing, there is a lack of clarity on what ex-
                                                             actly “open” means. In their paper introducing the
    Open-domain chatbots are supposed to con-
                                                             Meena chatbot, Adiwardana et al. (2020) provide
    verse freely with humans without being re-
    stricted to a topic, task or domain. How-
                                                             the following definition: “Unlike closed-domain
    ever, the boundaries and/or contents of open-            chatbots, which respond to keywords or intents to
    domain conversations are not clear. To clar-             accomplish specific tasks, open-domain chatbots
    ify the boundaries of “openness”, we conduct             can engage in conversation on any topic”. Is this
    two studies: First, we classify the types of             really the case? The answer to this question will
    “speech events” encountered in a chatbot eval-           depend on the contexts in which the chatbot is put
    uation data set (i.e., Meena by Google) and              to test.
    find that these conversations mainly cover the
                                                                Based on Wittgenstein’s (1958) concept of “lan-
    “small talk” category and exclude the other
    speech event categories encountered in real              guage games”, Levinson (1979) argues that in order
    life human-human communication. Second,                  to interpret an utterance and generate a meaningful
    we conduct a small-scale pilot study to gen-             response, the “activity type” in which the exchange
    erate online conversations covering a wider              takes place is vital. The activity can be described in
    range of speech event categories between two             various terms (e.g., setting, participants, purpose,
    humans vs. a human and a state-of-the-art                norms) and it provides the necessary constraints on
    chatbot (i.e., Blender by Facebook). A human
                                                             the interpretation space (e.g., which implications
    evaluation of these generated conversations in-
    dicates a preference for human-human con-
                                                             can be made, what can be considered to be a co-
    versations, since the human-chatbot conversa-            herent and meaningful contribution to the activity).
    tions lack coherence in most speech event cat-           From this perspective, every conversation has an
    egories. Based on these results, we suggest (a)          assumed purpose or motive. This purpose could be
    using the term “small talk” instead of “open-            characterized as being more “task-oriented” (e.g.,
    domain” for the current chatbots which are not           planning a trip together or buying a ticket) or just
    that “open” in terms of conversational abilities         passing time by conversing with each other without
    yet, and (b) revising the evaluation methods
                                                             a specific task in mind (e.g., gossiping, recapping
    to test the chatbot conversations against other
    speech events.                                           the day’s events). Regardless of the activity type,
                                                             there is always some shared knowledge about the
1   Introduction                                             context and expectations of the participants (i.e.,
                                                             humans) from each other (e.g., avoid being rude un-
There has been a recent surge in the research and            less it is intentional) in real-world settings. Among
development of so-called “open-domain” chatbots              other terms (e.g., speech genre, joint action, so-
(e.g., Adiwardana et al. 2020; Roller et al. 2020;           cial episode, frame), Goldsmith and Baxter (1996)
Dinan et al. 2020). These chatbots are typically             refers to activity types as “speech events” and we
trained in an end-to-end fashion on large datasets           will use this term in our paper as well.
retrieved from different Internet sources such as               The nature of the activity or the purpose of the
publicly available online discussion forums (e.g.,           interaction is often not clear in a typical setting
Reddit). While the idea of an “open-domain” chat-            for testing open-domain chatbots. Usually, a user
bot (not engineered towards a specific task, but             (often a crowd worker) is asked to “chat about any-
just trained in an agnostic fashion from data), is           thing” with an agent they have never met before

                                                         392
     Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 392–402
                           July 29–31, 2021. ©2021 Association for Computational Linguistics
without any further instructions. In most cases,        2018). However, it is not entirely clear how the
neither the user nor the chatbot has any prior re-      term “social” should be understood in this context.
lationship with each other and they will probably       We argue that the notion of “speech events”, used
never talk again. This type of context is unusual       in this paper, provides a much richer taxonomy
for real-world conversations between humans. The        for the various forms of conversations that people
closest example is perhaps engaging in “a small         engage in.
talk” to pass time while waiting at the bus stop or        The recent trend of modelling open-domain con-
at a dinner party where we might be placed next         versations was sparked by early attempts to use
to a new acquaintance. However, even in those           the same kind of sequence-to-sequence models that
situations, we do have some shared context. Since       had been successful in machine translation (Vinyals
the context restricts the types of speech events that   and Le, 2015). Another driving force is the com-
will arise, it is hard to know whether such a chatbot   petitions to develop chatbots that can engage in an
is actually able to engage in conversations freely      open-domain conversation coherently for a certain
(i.e., on any topic).                                   period of time, such as the Alexa Prize Challenge
   Therefore, we explore how “speech events”            (Ram et al., 2018) and the Conversational Intelli-
(Goldsmith and Baxter, 1996) can be applied to          gence Challenge (ConvAI) (Dinan et al., 2020).
the analysis and evaluation of chatbots with two           In terms of evaluation criteria, task-oriented con-
studies. First, we classify the types of “speech        versations are typically evaluated with task suc-
events” encountered in a chatbot evaluation data        cess and efficiency (Walker et al., 1997). However,
set. Second, we conduct a small-scale pilot study       there is also need for other criteria to evaluate open-
to evaluate how well a state-of-the-art chatbot can     domain chatbots which do not necessarily have a
handle conversations representing a more diverse        clear task. The most famous (and earliest) form
set of speech event categories. Before describing       of evaluation is perhaps the Imitation Game (often
the studies and results in detail, we first provide     referred to as the Turing Test) proposed by Turing
an overview of the development of open-domain           (1950). However, it is not clear that being able to
chatbots in the next section.                           distinguish a chatbot from a human is a sensible
                                                        test, as this might lead to a focus on handling trick
2   Open-domain chatbots                                questions rather than modelling a human conver-
The term “chatbot” (and its predecessor “chatter-       sation. Another form of evaluation is used in the
bot”) has been used since the early 1990’s to de-       Loebner Prize Competition, where judges are asked
note systems that interact with users in the form       to rate the chatbot responses after asking a fixed
of a written chat. Early examples of such systems       set of questions to each chatbot (Mauldin, 1994).
include TINYMUD (Mauldin, 1994) and ALICE               This approach faces the problem of not testing the
(Wallace, 2009). However, the term has not been         chatbot’s ability to have a coherent and engaging
used in academia until recently (Adamopoulou and        interaction over multiple turns. To put more empha-
Moussiades, 2020). Instead, “dialogue system”           sis on the user experience, the users in the Alexa
was a more common term for systems that inter-          Prize Challenge were asked to rate the chatbot after
act with users in a (written or spoken) conversa-       the interaction, on a scale between 1 and 5, which
tion. Nowadays, the meaning of the term “chatbot”       was then used as a direct measure of performance.
seems to vary, and it is also used interchangeably      However, such a scale is difficult to interpret since
with the term “dialogue system” (Deriu et al., 2020;    it may not reflect the specific criteria utilized by
Adamopoulou and Moussiades, 2020). Deriu et al.         each user to evaluate the relative merits of each
(2020) make a distinction between task-oriented,        chatbot transparently.
conversational and question-answering chatbots,            Another method is to let the human users inter-
where conversational chatbots “display a more un-       act with the chatbot and let a third-party (i.e., other
structured conversation, as their purpose is to have    humans) rate the conversations on different dimen-
open-domain dialogues with no specific task to          sions either on a turn-by-turn basis or on the level
solve”. These chatbots are built to “emulate so-        of whole conversations. Adiwardana et al. (2020)
cial interactions” (ibid.). Another term sometimes      used a turn-by-turn assessment, and argued that the
used to differentiate these chatbots from more task-    most important factors in such an evaluation are
oriented systems is “social chatbots” (Shum et al.,     the “sensibleness” and “specificity” of responses.

                                                    393
A common problem with end-to-end chatbots is a          pendent on the contexts and participants we should
tendency to give very generic responses (e.g., “I       consider. Thus, it is important to approach the prob-
don’t know”), which might be evaluated as sensible      lem in a systematic way. In this paper, we will base
(or coherent), but not very specific (and therefore     our work on Goldsmith and Baxter (1996), who
not very engaging). A limitation of turn-by-turn        use the term “speech events” for different types of
assessments is that they do not capture the global      conversations. As a note, speech events describe
coherence of the conversation as a whole. Deriu         the conversation as a whole, and should not be con-
et al. (2020) criticise the term “high quality” for     fused with the term “speech act”, which describes
evaluating human-chatbot conversations due to its       the intention of a single conversational turn.
subjectivity. Instead, they propose appropriateness,       In a series of four studies, Goldsmith and Bax-
functionality and target audience as more objec-        ter (1996) developed a descriptive taxonomy of
tive measures. Li et al. (2019) proposed a method       speech events between humans in everyday con-
called ACUTE-Eval, in which human judges are            versations. First, they collected 903 open-ended
asked to make a binary choice between questions         diary log entries provided by 48 university students
(e.g., “Who would you prefer to talk to for a long      who monitored their daily conversations with other
conversation?”, “Which speaker sounds more hu-          people for a 1-week period. Using this data, the
man?”) for two human-chatbot conversations that         authors identified the speech events, labeled and
are displayed next to each other.                       grouped them in a systematic fashion. These cat-
   Regardless of the exact evaluation criteria used,    egories were also analyzed according to a set of
the general assumption behind open-domain chat-         dimensions, including formality, involvement and
bots seems to be a set-up where human users are         positivity.
exposed to the chatbot without much introduction           In total, 39 speech events were identified, which
or guidance on what to talk about. For example,         we group into three major categories (see the defi-
in the Alexa Prize Challenge, the users are encour-     nitions for each category in the Appendix):
aged to start chatting with their Alexa devices by
just saying “Let’s chat”. To make the conversa-           • Informal/Superficial talk: Small talk, Cur-
tion more engaging and provide some guidance,               rent events talk, Gossip, Joking around, Catch-
some chatbots are given a “persona” (Zhang et al.,          ing up, Recapping the day’s events, Getting
2018). While such personas can provide some more            to know someone, Sports talk, Morning talk,
context, they do not provide much guidance to-              Bedtime talk, Reminiscing
wards the purpose of the conversation. In order           • Involving talk: Making up, Love talk, Rela-
to compensate for “naturalness” and “relevance to           tionship talk, Conflict, Serious conversation,
real-world use cases”, Shuster et al. (2020) experi-        Talking about problems, Breaking bad news,
mented with letting the humans interact with open-          Complaining
domain chatbots in a fantasy game. Although the
authors praise the ease of data collection through        • Goal-directed talk: Group discussion, Per-
gaming, the extent to which these conversations             suading conversation, Decision-making con-
reflect real-world scenarios is questionable.               versation, Giving and getting instructions,
   Considering the vagueness around the definitions         Class information talk, Lecture, Interrogation,
of an open-domain chatbot and the variation in ap-          Making plans, Asking a favor, Asking out
plications, what does “open domain” mean for con-
versations with the chatbots? How do we expect             Our study is based on this categorization of
the chatbots to handle these conversations? We          speech events. However, we do not argue that this
will explore the answers to these questions by intro-   is the ultimate way to categorize speech events and
ducing and explaining “speech events” in the next       acknowledge that the exact set of categories are
section.                                                likely to be influenced by the demographics of the
                                                        group which was under study (university students),
3   Speech events                                       and the limited time during which the data was
                                                        collected. Bearing these restrictions in mind, this
Exploring and categorizing different types of con-
                                                        wider set of speech event categories is appropriate
versations is a daunting task, as the number of po-
                                                        to illustrate our points in this study.
tential categories is (in theory) unlimited, and de-

                                                    394
4     Study I                                           of talk to pass time and avoid being rude”) and 1
                                                        conversation was labeled as “joking around” as the
In the first study, we aimed at categorizing the
                                                        main speech events by both annotators. 14 conver-
types of speech events that occur in the current
                                                        sations were assigned “getting to know someone”
evaluations of open-domain chatbots based on the
                                                        as a second category of speech events (mostly to-
categories defined by Goldsmith and Baxter (1996)
                                                        gether with “small talk” as the main category) by at
as described above.
                                                        least one of the annotators. Only in 3 conversations,
4.1    Data                                             was there a disagreement about the main speech
                                                        event between the annotators, and 2 conversations
For this study, we use publicly available conversa-
                                                        were labeled as “N/A” due to the resemblances
tion data from the evaluation of the Meena chatbot
                                                        with the Turing test.
(Adiwardana et al., 2020) developed by Google.
                                                           Out of 50 human-human conversations, 49 cases
It uses an Evolved Transformer with 2.6B param-
                                                        were assigned the “Small Talk” label as the main
eters, simply trained to minimize perplexity on
                                                        speech event category by both annotators. The
predicting the next token in text. The model was
                                                        “Getting to know someone” category was assigned
trained on data (40B words) mined and filtered
                                                        as a secondary label either by one or both annota-
from the public domain social media conversations
                                                        tors in 30 cases. Overall, there was an agreement
(e.g., Reddit).
                                                        between the two annotators for the main speech
   To evaluate the chatbot, the authors performed
                                                        event category in 94% of all conversations.
both a static evaluation, where a snippet of a dia-
logue (including 2-3 turns) was assessed by crowd         4.4    Discussion
workers, and an interactive evaluation, where
                                                        The results of our categorization indicate that most
crowd workers were asked to interact with the chat-
                                                        human-chatbot conversations are very limited in
bot. Conversations started with an informal greet-
                                                        terms of the types of speech events. More specif-
ing (e.g., “Hi!”) by the chatbot. The crowd workers
                                                        ically, they mainly consist of conversations that
were asked to interact with it without no further ex-
                                                        correspond to the “Small Talk” category, and other
plicit instructions about the domain and/or the topic
                                                        speech event categories (see Section 3) are rarely
of the conversation. A conversation was required
                                                        observed. However, we also observe a similar find-
to last 14-28 turns.
                                                        ing for the human-human conversations (i.e., they
   The model was evaluated based on two crite-
                                                        were also limited to one speech event category,
ria (i.e., sensibleness and specificity) and it was
                                                        “Small Talk”). Thus, the dominance of “Small Talk”
also compared to other state-of-the-art chatbots
                                                        is primarily not an effect of the humans’ concep-
(XiaoIce, Mitsuku, DialoGPT, and Cleverbot). Adi-
                                                        tions about the agent and the agent’s capabilities,
wardana et al. (2020) also collected 100 human-
                                                        but rather an effect of how the conversations are ar-
human conversations, “following mostly the same
                                                        ranged. If the only instruction in the experimental
instructions as crowd workers for every other chat-
                                                        set-up is “just talk to the chatbot/each other”, the
bot”. In other words, there were no instructions
                                                        conversations will usually be limited to the “Small
about the conversation or the topic.
                                                        Talk” category and other speech event categories
4.2    Method                                           will not naturally arise (at least not given the lim-
                                                        ited number of turns in the conversation).
Two independent annotators assigned a main
speech event (a second one was optional if nec-           5     Study II
essary) to the first 50 human-chatbot (Meena) and
the first 50 human-human conversations in the pub-      Although the results of Study I indicate a tendency
licly released dataset (Adiwardana et al., 2020),       for one type of speech event (“Small Talk”), we
based on the speech event categories described by       still do not know how well chatbots could handle
Goldsmith and Baxter (1996).                            other speech event categories and a more thorough
                                                        analysis of this for various chatbots is beyond the
4.3    Results                                          scope of this paper. However, to get an idea about
Of the 50 human-chatbot (Meena) conversations,          what such an evaluation procedure could look like,
44 were assigned the “Small Talk” category (de-         we designed a small-scale pilot study with the fol-
fined by Goldsmith and Baxter (1996) as “a kind         lowing research questions in mind: What would be

                                                    395
an alternative evaluation scheme involving other         life and/or online environments. The Tester was
speech event categories? How would a state-of-           instructed to insist on pursuing the speech event,
the-art open-domain chatbot perform in such an           even if the chatbot would not provide coherent an-
evaluation?                                              swers (within the context of the given speech event
                                                         category).
5.1   Chatbot used: Blender                                 For comparison, we also set up a similar chat
Since the Meena chatbot is not currently available       experiment between the same Tester and another
for public testing, we could not include it in this      human interlocutor. The Tester and the human
study. Instead, we tested another state-of-the-art       interlocutor did not know each other and were un-
chatbot, “Blender” (Roller et al., 2020). Blender        aware of each others’ identities (e.g., gender, age,
(released by Facebook in April 2020) also uses           education, employment etc). Moreover, the hu-
a Transformer-based model (with up to 9.4B pa-           man interlocutor was asked to “erase his/her mem-
rameters), trained on public domain conversations        ory” of the previous conversations when starting a
(1.5B training examples). However, unlike Meena,         new one, to maximize the similarity with a human-
Blender is trained (in a less agnostic fashion) to       chatbot conversation. The human interlocutor was
achieve a set of conversational “skills” (e.g., to use   also not provided with the speech event category
its personality and knowledge (Wikipedia) in an          beforehand, and was not briefed about the notion of
engaging way, to display empathy, and to be able         speech events or the purpose of the study. However,
to blend these skills in the same conversation).         s/he was instructed to be cooperative.
   For Study II, we used the 2.7B parameter ver-            After all chat sessions were completed, the con-
sion of Blender through the ParlAI platform (Miller      tents of the chat conversations were normalised
et al., 2017). According the evaluation by Roller        (e.g., removing the spelling mistakes, normalising
et al. (2020), the performance of this version should    the capitalisations) so that it would not be possible
be quite close to the 9.4B version. Therefore,           to distinguish the chatbot responses from the hu-
the computationally less demanding version was           man ones based on formatting. The Tester-Blender
deemed sufficient. For Study II, we only used the        and Tester-Human chats were roughly equal in
neutral persona of Blender and muted other per-          length (18.9 vs. 18.1 turns on average). The length
sonas. We should still note that each response took      of the Tester’s turns were also similar (8.9 vs. 9.1
about 30 seconds (fairly slow) on our computer           words on average). However, Blender’s responses
during conversations with Blender.                       were somewhat longer than the responses of the
   Roller et al. (2020) evaluated Blender using the      Human interlocutor (16.4 vs. 11.6 words on aver-
ACUTE-Eval method described in Section 2 above           age).
(i.e., by asking crowd workers to compare two               The resulting conversations were then assessed
conversations, either from different versions of         based on the ACUTE-Eval method (Li et al., 2019):
Blender, or against other chatbots). The best ver-       by letting third-party human judges compare the
sion of Blender was preferred over Meena in 75%          conversations pairwise for each speech event. The
of the cases. They also compared it against the          two versions of the conversations (Tester-Blender
human-human chats that had been collected for the        vs. Tester-Human) were presented to the human
Meena evaluation (and which we used for Study            judges as if they were conversations between a hu-
I above). In that comparison, the best version of        man and two different chatbots (Chatbot A and
Blender was preferred in 49% of the cases, which         Chatbot B). To evaluate the conversations, the hu-
indicates a near human-level performance, given          man judges were asked three questions: “Which
their evaluation framework.                              chat was most coherent?”, “Which chatbot sounds
                                                         more human?” and “Which chatbot would you
5.2   Method
                                                         prefer to talk to for a long conversation?”. Un-
A (human) Tester interacted with the Blender chat-       like the binary answers used in the ACUTE-Eval
bot based on 16 categories of the speech events          method, we used a 7-grade scale (ranging from
discussed in Section 3 (as listed in Table 1). The       “Definitely A” to “Definitely B”). The association
speech events were selected based on how well they       between which version (Human or Blender) was A
could be applied to a chat conversation between          and which was B was alternated between speech
two interlocutors who did not know each other and        events. In addition, the judges were also asked to
had not interacted with each other earlier in real-

                                                     396
Speech event             Coherent   Humanlike   Prefer      a similar informal greeting and continue with an in-
 Talking about problems   3 (2.2)    2 (1.2)     3 (1.6)
 Asking for favor         3 (2.8)    3 (2.8)     3 (1.8)
                                                             troduction to the topic of the conversation (i.e., how
 Breaking bad news        3 (2.6)    3 (2.8)     3 (3.0)     to spend 1000 eur/dollars together). In Example
 Recapping                3 (2.2)    3 (2.6)     3 (2.6)     2, the two humans discuss the alternative ways of
 Complaining              2 (1.4)    3 (1.6)     2 (1.4)     spending the money through making suggestions
 Conflict                 3 (3.0)    3 (2.6)     3 (2.8)
 Giving instructions      3 (2.4)    3 (2.6)     3 (2.2)     and presenting alternative scenarios to each other.
 Gossip                   1 (1.4)    2 (1.4)     2 (1.2)     Within the given context, they exhibit a collabora-
 Joking around            3 (1.8)    3 (2.0)     3 (2.0)     tive behavior by asking each other’s opinions while
 Decision-making          3 (1.6)    2 (1.4)     2 (1.4)
                                                             presenting possible scenarios that could be applied
 Making plans             3 (2.0)    2 (1.8)     3 (1.8)
 Making up                3 (2.2)    3 (1.8)     2 (1.8)     for a solution of the given challenge. The content
 Persuading               1 (1.0)    1 (0.0)     0 (0.0)     of the conversation is coherent in general.
 Recent events            3 (3.0)    3 (2.8)     3 (3.0)        When the Tester introduces the topic of the
 Relationship talk        3 (2.8)    3 (2.8)     3 (2.4)
                                                             conversation (i.e., making a decision about how
 Small talk               3 (2.2)    3 (2.2)     3 (1.8)
                                                             to spend 1000 euros), in the conversation with
Table 1: The rating of the different speech events. The      Blender (Example 1), the chatbot responds with
scale is from -3 (Definitely the Blender chatbot) to         an enthusiastic reply, asks for the hobbies of the
3 (Definitely the human). Median values of the five          human interlocutor, and mentions that its favourite
judges are shown first, and the average in parenthe-         hobby is playing video games. This answer could
sis. Since no values are negative, almost all ratings are
                                                             be interpreted as an illocutionary speech act for
(strongly) in favor of the human interlocutor.
                                                             making a suggestion about how to spend the des-
                                                             ignated money in an indirect way. However, when
briefly motivate their ratings. Five judges per pair         the Tester insists on announcing his/her plan ex-
of dialogues were used, each of them rating eight            plicitly, Blender abandons its initial enthusiastic
pairs of dialogues (i.e., 10 different judges in total).     agreement due to a misinterpretation of context. To
                                                             keep the coherence of the conversation, the ques-
5.3   Results and Discussion
                                                             tion (“How do we split the money?”) should have
The results are shown in Table 1. Since the di-              been interpreted within the given context of the
rection of the 7-grade scale was alternated (i.e.,           current speech event instead of a more generic one.
towards the chatbot or the human) between con-               It seems like Blender does not know the context
versations representing different speech events, we          for the current speech event and misinterprets the
have here adjusted those values so that -3 indicates         question, which in turn breaks the flow and leads
a strong preference for the chatbot and 3 indicates          to an incoherent conversation.
a strong preference for the human. In general, there            Example 3 illustrates another example of an inco-
is a strong preference for the human-human conver-           herent conversation between the Tester and Blender
sations for all three questions. The only exception          which involves the “Asking a favor” speech event
to this strong tendency can be found in the cat-             focusing on grocery shopping. In line 4, Blender
egories of “Gossip” and “Persuading” as speech               misinterprets the question and assumes that the
events. This is in stark contrast with the findings of       Tester wants to know something about the chatbot.
Roller et al. (2020), where the human partner was            Tester tolerates this misunderstanding and repeats
preferred over Blender in only 51% of the chats.             the question (line 4). Between lines 5-15, Tester
When we asked about the motivation of the human              and Blender discusses the arrangements. However,
judges about their judgements, they describe the             Blender breaks the flow of the conversation (Line
human-human conversations as more coherent and               16) by providing an incoherent answer with the
having a better flow than the human-chatbot conver-          rest of the conversation and ignoring the Tester’s
sations, which seem less coherent and inconsistent           request. Example 4 illustrates a similar conversa-
due to abrupt interruptions and frequent changes of          tion around the same speech event (i.e., “Asking a
topic.                                                       favor”) between two humans (i.e., Tester-Human).
   Example 1 and 2 provide examples of the two               In comparison to Example 3, this conversation is
conversations based on the “Decision-making”                 much more coherent without any serious misunder-
speech event category. Tester is the same human              standings or communication breakdowns.
interacting with Blender in Example 1 and another
human in Example 2. Both conversations start with

                                                           397
1.Tester:   Hi!                                                     1.Tester:   Hi!
2.Blender: Hi! How are you? I just got home from work. I            2.Human: Hi! How is it going?
           work at a grocery store.                                 3.Tester:   It is ok. busy as usual. so we need to make a
3.Tester:   I am ok. We have 1000 dollars. Let’s decide                         decision about something now.
            how we spend it together.                               4.Human: Tell me what’s on your mind
4.Blender: That sounds like a good idea. Do you have any            5.Tester:   We have 1000 eur together, how should we spend
           hobbies? I like to play video games.                                 it?
5.Tester:   Let’s focus on our decision now. What is your           6.Human: Well, if we must spend it, I would suggest to
            plan?                                                            spend it either on a fun vacation or invest it in
6.Blender: I think I’m going to buy a new video game. What                   something more long-term. What do you think?
           do you like to do for fun?                               7.Tester:   Well, 1000 eur is not that much for long-term
7.Tester:   How do we split the money?                                          investment, right?
8.Blender: I am not sure. Maybe we can go to the bank and           8.Human: That depends, I suppose, on what you’d invest it
           get a cashier’s check?                                            in
9.Tester:   I don’t think we need that. How much do you             9.Tester:   Lets think about our options both for short term
            want?                                                               and long term.
10.Blender: I don’t know. I’m not sure how much it costs to         10.Human: It’s always good to put some money aside for
            get a check cashed.                                               repairs and such, invest in your own property.
11.Tester: We have already cash, no need for that.                            By short-term do you mean an investment as well,
                                                                              or something more immediately tangible?
12.Blender: That is very nice of you. Do you have any pets?
            I have a cat named mittens.
                                                                Example 2: Start of the Tester-Human chat on the
                                                                “Decision-making” speech event
Example 1: Start of the Tester-Blender chat on the
“Decision-making” speech event
                                                                    larger context.
6   General Discussion                                                 As we have discussed earlier, it is not entirely
                                                                    clear what the term “open-domain” means. It could
It is possible to interpret our findings as a problem               perhaps refer to a conversation which touches upon
of data, in the sense that the models for the chatbots              various topics (e.g., music, literature or sports).
could have been trained mostly on data resembling                   However, even under such a definition, these topics
the “Small Talk” speech event. This could then help                 can be relevant for many different speech events. It
to explain why previous evaluations of state-of-                    also seems reasonable to expect an open-domain
the-art chatbots (where primarily the “Small Talk”                  chatbot to engage in more diverse speech events
speech event has been tested) have shown human-                     than small talk only. Thus, we would argue that
level performance, while our pilot study (which                     using the term “open-domain” for the current chat-
exposed them to more diverse speech events) in-                     bots is inadequate. However, we should also stress
dicates a much weaker performance. However, it                      that modelling “small talk” is not less meaningful
could also be the case that the current models are                  than modelling conversations representing other
not very good at maintaining a coherent topic or                    speech events. For example, Bickmore and Cas-
a conversational purpose within the given speech                    sell (2000) explore the use of “small talk” in the
event category for more than a few turns, which in-                 context of an embodied conversational real-estate
dicates a problem with the memory of the chatbots                   agent, and argue that such conversation can be used
(also acknowledged by Roller et al. (2020)).                        to build initial rapport between participants before
    Small talk naturally involves brief exchanges of                proceeding to a more task-oriented interaction.
questions and expressions of opinion. In such set-
tings, jumping between topics is also tolerable. Ac-                6.1   Limitations and future work
cording to Goldsmith and Baxter (1996), it is the               One clear limitation of the pilot study (Study II)
most superficial (least involving) type of speech               was that we only had a single Tester. Therefore, the
event, with an emphasis on avoiding to be rude.                 number of interactions are limited. Nevertheless,
Thus, if one of the speakers would occasionally                 the provided examples, and the judges’ assessments
say something less coherent during the conversa-                of them, indicate clearly that the introduction of
tion, the other speaker would tolerate this grace-              more speech events poses a problem for state-of-
fully. Therefore, it should be much easier to main-             the-art chatbots. We are currently planning more
tain such a conversation without keeping track of a             thorough evaluations using this approach to verify

                                                              398
3.Tester:   I am ok, can I ask you a favor?                           5.Tester:   It is very busy for me today. Can I ask you a
4.Blender: Sure, what do you want to know? I’m a cashier                          favor?
           at a supermarket.                                          6.Human: Sure. What do you need?
5.Tester:   I don’t want to know anything specific. I was             7.Tester:   Can you do some grocery shopping for me?
            wondering if you can do me a favor?                       8.Human: You’re in luck as I have some things I need to
6.Blender: Sure! What do you need? I can do it for free. I                     get myself. Do you have a list prepared?
           love to help people.                                       9.Tester:   Let me think, 1 kg. apples, 2 bananas, a
7.Tester:   I am very busy today. Can you do shopping for                         dozen eggs, 1 bottle milk, 1 pack of chocolates,
            me?                                                                   sausages, 1 pack of spaghetti.
8.Blender: I sure can! I love helping people. What kind of            10.Human: Ok I’ll write that down. What kind of sausages
           shopping are you looking to get done?                                were you considering?
9.Tester:   Grocery shopping                                          11.Tester: They have these new vegetarian sausages with
10.Blender: I love grocery shopping. It is one of my favorite                    green packaging, those ones. Where are you
            things to do in my spare time.                                       gonna buy them from?
11.Tester: So how should we arrange it?                               12.Human: Do you mean where I’m going to get groceries?
12.Blender: You can pick me up or I can pick you up.
            Whichever is easiest for you.                         Example 4: Excerpt from Tester-Human Chat on the
                                                                  “Asking a favor” speech event
13.Tester: No, I am very busy. You will do it as a favor to
           me, remember?
14.Blender: Of course. I will make sure you have plenty of
            snacks and drinks to keep you going.
                                                                  versations are limited to isolated conversations,
15.Tester: I don’t need snacks. How do you want to arrange
                                                                  with no memory of past conversations. In real-
           the shopping for me?                                   world, some of the speech events occur only be-
16.Blender: I will take you to my favorite grocery store and      tween humans who already know each other or
            you can pick out what you want there.                 have shared some history and/or experiences to-
                                                                  gether (e.g., “Catching up”). Since the current
Example 3: Excerpt from Tester-Blender Chat on the
“Asking a favor” speech event
                                                                  chatbots do not have such a long-term memory, it
                                                                  is not easy to build a long-lasting and human-like
                                                                  relationship with them. Therefore, it may be dif-
the findings. Another limitation of our method in                 ficult to include some of these speech events in
Study II) is that the Tester was aware of whether                 the evaluation of chatbots. In that respect, both
the interlocutor was a chatbot or a human. This                   Adiwardana et al. (2020) and Roller et al. (2020)
could potentially influence the way the conversa-                 acknowledge some of these deficiencies for their
tion unfolds, and whether the chatbot and human                   chatbots as well.
are treated equally (even though care was taken
to ensure this as much as possible). For practical                    7   Conclusion
reasons (one being that the responses from Blender                The results of our two studies show that the typical
took at least 30 seconds or more), it was not pos-                setting for chatbot evaluations (where the Tester
sible to address this limitation in our study. We                 is asked to “just chat with the chatbot”) tends to
also note that previous studies in which human-                   limit the conversation to the “Small Talk” speech
human and human-chatbot interactions have been                    event category. Therefore, the reported results from
compared (e.g., Adiwardana et al. 2020; Hill et al.               such evaluations will only be valid for this type
2015) suffer from the same problem. Even if the                   of speech event. In a pilot study where a Tester
Tester would not know the identity of the interlocu-              was instructed to follow a broader set of speech
tor, s/he might be able to guess it early on, which               event categories, the performance of the state-of-
could then still influence the conversation. One                  the-art chatbot (i.e., Blender in this case) seems to
way of addressing this problem in future studies is               degrade considerably, as the chatbot struggles to
to use the human-human conversations as a basis,                  keep a coherent conversation in alignment with the
and then feed those conversation up to a certain                  purpose of the conversation.
point to the chatbot, and ask the chatbot to generate                We thus propose that developers of “open-
a response, based on the previous context. Finally,               domain chatbots” either explicitly state that their
the judges would be asked to compare that response                goal is to model small talk (and perhaps use the
to the actual human response.                                     term “small talk chatbots” instead of “open domain
   Another limitation is that current chatbot con-                chatbots”), or that they change the way these chat-

                                                                399
bots are evaluated. For the second option, these              Stephen C. Levinson. 1979. Activity types and lan-
chatbots could be tested against a wider repertoire              guage. Linguistics, 17(5-6):365–400.
of speech events, before claiming that they can               Margaret Li, Jason Weston, and Stephen Roller. 2019.
communicate like humans or have a human-level                  Acute-eval: Improved dialogue evaluation with op-
performance. In addition, there is a pressing need             timized questions and multi-turn comparisons. In
to develop better evaluation frameworks, where                 NeurIPS workshop on Conversational AI.
the Tester is provided with a context for a specific          Michael L Mauldin. 1994. Chatterbots, tinymuds, and
speech event and clear instructions. If this route is           the turing test: Entering the loebner prize competi-
followed, it would be possible to evaluate the the              tion. In AAAI, volume 94, pages 16–21.
performance of the chatbot for the specific speech            Alexander H Miller, Will Feng, Adam Fisch, Jiasen Lu,
event category.                                                 Dhruv Batra, Antoine Bordes, Devi Parikh, and Ja-
                                                                son Weston. 2017. Parlai: A dialog research soft-
Acknowledgements                                                ware platform. arXiv preprint arXiv:1705.06476.

We are very thankful for the reviewers’ encourag-             Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu
ing and helpful comments.                                       Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn,
                                                                Behnam Hedayatnia, Ming Cheng, Ashish Nagar,
                                                                et al. 2018. Conversational ai: The science behind
                                                                the alexa prize. arXiv preprint arXiv:1801.03604.
References
                                                              Stephen Roller, Emily Dinan, Naman Goyal, Da Ju,
Eleni Adamopoulou and Lefteris Moussiades. 2020.
                                                                 Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott,
  Chatbots: History, technology, and applications.
                                                                 Kurt Shuster, Eric M. Smith, Y. Lan Boureau, and
  Machine Learning with Applications, 2:100006.
                                                                 Jason Weston. 2020. Recipes for building an open-
Daniel Adiwardana, Minh-Thang Luong, David R.                    domain chatbot. arXiv preprint arXiv: 2004.13637.
  So, Jamie Hall, Noah Fiedel, Romal Thoppilan,
                                                              Heung-Yeung Shum, Xiao-dong He, and Di Li. 2018.
  Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade,
                                                                From eliza to xiaoice: challenges and opportunities
  Yifeng Lu, and Quoc V. Le. 2020. Towards a
                                                                with social chatbots. Frontiers of Information Tech-
  Human-like Open-Domain Chatbot. arXiv preprint
                                                                nology & Electronic Engineering, 19(1):10–26.
  arXiv: 2001.09977.
                                                              Kurt Shuster, Jack Urbanek, Emily Dinan, Arthur
Timothy Bickmore and Justine Cassell. 2000. ”How
                                                                Szlam, and Jason Weston. 2020. Deploying life-
  about this weather?” Social Dialogue with Embod-
                                                                long open-domain dialogue learning. arXiv preprint
  ied Conversational Agents. In Proc. AAAI Fall Sym-
                                                                arXiv: 2008.08076.
  posium on Socially Intelligent Agents.
                                                              Alan Turing. 1950. Computing machinery and intelli-
Jan Deriu, Alvaro Rodrigo, Arantxra Otegi, Guillermo
                                                                gence. Mind, 59(236):433–460.
   Echegoyen, Sophie Rosset, Eneko Agirre, and Mark
   Cielibak. 2020. Survey on evaluation methods for           Orioi Vinyals and Quoc V. Le. 2015. A Neural Conver-
   dialogue systems. Artificial Intelligence Review,            sational Model. In ICML Deep Learning Workshop
   54:755–810.                                                  2015, volume 37.
Emily Dinan, Varvara Logacheva, Valentin Ma-                  Marilyn Walker, Diane J Litman, Candace A Kamm,
  lykh, Alexander Miller, Kurt Shuster, Jack Ur-               and Alicia Abella. 1997. PARADISE: a framework
  banek, Douwe Kiela, Arthur Szlam, Iulian Serban,             for evaluating spoken dialogue agents. In Proc. of
  Ryan Lowe, Shrimai Prabhumoye, Alan W. Black,                the 35th Annual Meeting of the Association for Com-
  Alexander Rudnicky, Jason Williams, Joelle Pineau,           putational Linguistics (ACL ’98), Stroudsburg, PA,
  Mikhail Burtsev, and Jason Weston. 2020. The sec-            USA.
  ond conversational intelligence challenge (convai2).
  In The NeurIPS ’18 Competition, pages 187–208,              Richard S Wallace. 2009. The anatomy of alice. In
  Cham. Springer International Publishing.                      Parsing the Turing Test, pages 181–210. Springer.
Daena J. Goldsmith and Leslie A. Baxter. 1996. Con-           Ludwig Wittgenstein. 1958. Principal Investigations.
  stituting relationships in talk: A taxonomy of speech         Blackwell Publishing.
  events in social and personal relationships. Human
  Communication Research, 23(1):87–114.                       Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur
                                                                Szlam, Douwe Kiela, and Jason Weston. 2018. Per-
Jennifer Hill, W. Randolph Ford, and Ingrid G. Farreras.        sonalizing dialogue agents: I have a dog, do you
   2015. Real conversations with artificial intelligence:       have pets too? ACL 2018 - 56th Annual Meeting of
  A comparison between human-human online conver-               the Association for Computational Linguistics, Pro-
   sations and human-chatbot conversations. Comput-             ceedings of the Conference, 1:2204–2213.
   ers in Human Behavior, 49:245–250.

                                                        400
A     Appendices                                      Decision-making conversation: A common goal
                                                          of making a decision about some task
A.1    Speech events
                                                      Giving and getting instructions: One person
This is a comprehensive list of the speech events         gives another person information or directions
that were identified by Goldsmith and Baxter              about how to do some task
(1996).                                               Class information talk: Informal conversations
                                                          in which you find out about class assignments,
Small talk: Passing time and avoid being rude
                                                          exams, or course material
Current events talk: Talking about news and cur-
                                                      Lecture: One person tells another person how to
    rent events
                                                          act or what to do (one-way conversation)
Gossip: Exchanging opinions or information about
                                                      Interrogation: One person grills the other person
    someone else when that person isn’t present
                                                          with questions
Joking around: A playful kind of talk to have fun
                                                      Making plans: Talking to arrange a meeting or
    or release tension
                                                          arrange to do something with someone
Catching up: Talking about the events that have
                                                      Asking a favor: Getting someone to do something
    occurred since you last spoke
                                                          for you
Recapping the day’s events: Telling about what’s
                                                      Asking out: One person asks another person out
    up and what happened to each person during
                                                          on a date
    the day
Getting to know someone: Getting acquainted               A.2    Additional examples of conversations
    with someone                                                 from Study II
Sports talk: Playing or watching a sporting event
                                                          1.Blender: Hi!
Morning talk: Routine talk when waking up in the
                                                          2.Tester:    Hi! I heard that you are looking for a recipe.
    morning                                                            Shall I teach you how to make soup?
Bedtime talk : Routine talk right before going to         3.Blender: That would be great! I love soup! What kind of
    bed                                                              soup do you like to make?
Reminiscing: Sharing events you experienced to-           4.Tester:    It will be lentils soup, so first you need two cups
                                                                       of lentils.
    gether in the past
                                                          5.Blender: I’ve never had lentil soup before. I’ll have to try
Making up: One or both people apologize for vio-                     it sometime!
    lating some expectations                              6.Tester:    Do you want to learn how to make it?
Love talk: Expressing love and give attention and         7.Blender: Yes, I would love to learn. Do you have a favorite
    affection                                                        recipe you would like to share?
Relationship talk : Talking about the nature and          8.Tester:    Well, that is what I told you earlier.
    state of a relationship                               9.Blender: I’m sorry, I must have missed that. What is your
Conflict: Conversations in which the two people                      favorite soup recipe then?
    disagree                                              10.Tester: Lentils
Serious conversation: In-depth discussion or ex-          11.Blender: Lentils are so versatile. They can be used in
                                                                      soups, stews, and even as a meat substitute.
    change of feelings, opinions, or ideas about
                                                          12.Tester: But that is not our topic, do you want the recipe
    some personal and important topic                                or not?
Talking about problems: Telling about some                13.Blender: Sure, I’d love to try it. I’ve never had lentil soup
    problem, while the other person tries to help                     before.
Breaking bad news: One person tells some bad              14.Tester: Ok, do you have pen and paper?
    news the other person doesn’t know about              15.Blender: Yes, I do. Do you have any other recipes that
Complaining: Expressing negative feelings, frus-                      you would like me to try?
    trations, gripes, or complaints toward some           16.Tester: Yes, but do you want the soup recipe or not? I
                                                                     am confused.
    common experience
                                                          17.Blender: No, I don’t. I was just asking if you had any
Group discussion: Exchange information, per-                          recipes you wanted to share with me.
    suade other people, or make decisions in a
    group                                                 Example 5: Tester-Blender Chat on the “Giving and
Persuading conversation: One person is convinc-           getting instructions” speech event
    ing the other person to do something

                                                    401
1.Tester:   Hi!
2.Human: Hello!
3.Tester:   Just had some dinner.
4.Human: What did you have? Was it good?
5.Tester:   Yeah, I had some left overs and combined them.
            Had a long day so I was hungry. How about
            you?
6.Human: I haven’t had dinner yet, but I should probably
         start thinking about what to have
7.Tester:   I can give you instructions about how to cook
            something, would you like that?
8.Human: Sure, but what would your meal recommendation
         be?
9.Tester:   Do you have a preference?
10.Human: If possible something vegetarian
11.Tester: Ok great! You can make red lentil soup, very
           easy. are you taking notes?
12.Human: Sounds good! I’ll grab pen and paper
13.Tester: Let me know when you are ready
14.Human: Got it! Go ahead
15.Tester: Ok, first you need two cups of red lentils. ok?
16.Human: Two cups, noted!
17.Tester: Wash them thoroughly multiple times
18.Human: Ok
19.Tester: Then, chop 1 onion + 1 clove of garlic into pieces
           and start fring them in some olive oil in a cook-
           ing pan.. Sorry, frying!
20.Human: Noted!
21.Tester: After that add 4 cups of water, some tomato
           paste and let them boil
22.Human: Got it!
23.Tester: Finally, add the lentils to the boiled water and
           wait until they get soft. About 15 min
24.Human: Sounds simple enough. One question:how do I
          know how much tomatoe paste to use?
25.Tester: Well, just a table spoon should be ok, you can
           also add some chilly pepper as well. Oh and
           don’t forget the salt!
26.Human: Seasoning, noted!
27.Tester: Yeah! so, it is quite easy
28.Human: This shouldn’t be too difficult indeed! Thanks
          for the tip and the instructions!
29.Tester: No problem! do you have any other questions
30.Human: Not right now, everything seems clear!
31.Tester: Great! Bye. Done!

Example 6: Tester-Human Chat on the “Giving and get-
ting instructions” speech event

                                                                402
You can also read