NAACL-HLT 2021 Natural Language Processing for Medical Conversations The Proceedings of the Second Workshop June 6, 2021
©2021 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:
Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
acl@aclweb.org

ISBN 978-1-954085-24-4
Introduction

Welcome to the second workshop on natural language processing for medical conversations.

Technological advancements have been transforming healthcare rapidly in the past several years, and this transformation has been further catalyzed by the COVID-19 pandemic. Governments have made several policy changes that add flexibility to enable remote treatment of patients. COVID-19, its symptoms, and its medications are widely discussed on social media, and these discussions are being analyzed by researchers from various perspectives. Moreover, with the availability of wearable fitness devices, such interactions are not limited to a pandemic but extend much further. While medical discussions on public forums were prevalent in the past, their prevalence is now highlighted by the scale of the pandemic.

To better serve healthcare consumers, Electronic Health Record (EHR) companies have been working to make patients' health data easily available to them, and more recently technology companies have also stepped in. Healthcare providers are likewise using automatic speech recognition (ASR) and natural language understanding to understand doctor-patient conversations and generate medical documentation automatically. Finally, smart speakers are now common in households, and users interact with them about personal and public health issues. While applying NLP in the open domain is increasingly popular, medical conversations present unique challenges and opportunities for impact.

After our successful event last year, we are excited to continue the cross-pollination between NLP researchers and medical practitioners. The goal of this workshop is to discuss state-of-the-art approaches in conversational AI and to share insights and challenges that arise when they are applied in healthcare. This is critical for bridging existing gaps between research and real-world product deployments, and it will further shed light on future directions.

We received 19 submissions this year and accepted 9 reviewed papers into the workshop proceedings. This will be a one-day workshop including keynotes, spotlight talks, posters, and panel sessions.
Organizing Committee

• Chaitanya Shivade (Amazon)
• Rashmi Gangadharaiah (Amazon)
• Spandana Gella (Amazon)
• Sandeep Konam (Abridge)
• Shaoqing Yuan (Amazon)
• Yi Zhang (Amazon)
• Parminder Bhatia (Amazon)
• Byron Wallace (Northeastern University)
Table of Contents

Would you like to tell me more? Generating a corpus of psychotherapy dialogues
    Seyed Mahed Mousavi, Alessandra Cervone, Morena Danieli and Giuseppe Riccardi ............ 1

Towards Automating Medical Scribing: Clinic Visit Dialogue2Note Sentence Alignment and Snippet Summarization
    Wen-wai Yim and Meliha Yetisgen ............ 10

Gathering Information and Engaging the User ComBot: A Task-Based, Serendipitous Dialog Model for Patient-Doctor Interactions
    Anna Liednikova, Philippe Jolivet, Alexandre Durand-Salmon and Claire Gardent ............ 21

Automatic Speech-Based Checklist for Medical Simulations
    Sapir Gershov, Yaniv Ringel, Erez Dvir, Tzvia Tsirilman, Elad Ben Zvi, Sandra Braun, Aeyal Raz and Shlomi Laufer ............ 30

Assertion Detection in Clinical Notes: Medical Language Models to the Rescue?
    Betty van Aken, Ivana Trajanovska, Amy Siu, Manuel Mayrdorfer, Klemens Budde and Alexander Loeser ............ 35

Extracting Appointment Spans from Medical Conversations
    Nimshi Venkat Meripo and Sandeep Konam ............ 41

Building blocks of a task-oriented dialogue system in the healthcare domain
    Heereen Shim, Dietwig Lowet, Stijn Luca and Bart Vanrumste ............ 47

Joint Summarization-Entailment Optimization for Consumer Health Question Understanding
    Khalil Mrini, Franck Dernoncourt, Walter Chang, Emilia Farcas and Ndapa Nakashole ............ 58

Medically Aware GPT-3 as a Data Generator for Medical Dialogue Summarization
    Bharath Chintagunta, Namit Katariya, Xavier Amatriain and Anitha Kannan ............ 66
Conference Program

Sunday, June 6, 2021

9:00 – 9:15    Opening Remarks
9:15 – 9:50    Invited Talk 1
10:00 – 10:15  Paper Presentation: Gathering Information and Engaging the User ComBot: A Task-Based, Serendipitous Dialog Model for Patient-Doctor Interactions (Anna Liednikova, Philippe Jolivet, Alexandre Durand-Salmon and Claire Gardent)
10:15 – 10:30  Paper Presentation: Automatic Speech-Based Checklist for Medical Simulations (Sapir Gershov, Yaniv Ringel, Erez Dvir, Tzvia Tsirilman, Elad Ben Zvi, Sandra Braun, Aeyal Raz and Shlomi Laufer)
10:30 – 11:00  Break
11:00 – 11:35  Invited Talk 2
11:45 – 12:00  Paper Presentation: Assertion Detection in Clinical Notes: Medical Language Models to the Rescue? (Betty van Aken, Ivana Trajanovska, Amy Siu, Manuel Mayrdorfer, Klemens Budde and Alexander Loeser)
12:00 – 12:15  Paper Presentation: Medically Aware GPT-3 as a Data Generator for Medical Dialogue Summarization (Bharath Chintagunta, Namit Katariya, Xavier Amatriain and Anitha Kannan)
12:15 – 13:15  Lunch Break
13:15 – 13:50  Sponsor Spotlight Talk: 3M
14:00 – 15:00  Poster Session
15:00 – 15:30  Break
15:30 – 16:05  Invited Talk 3
16:15 – 16:30  Best Paper Awards
Would you like to tell me more? Generating a corpus of psychotherapy dialogues

Seyed Mahed Mousavi (1), Alessandra Cervone (2)∗, Morena Danieli (1), Giuseppe Riccardi (1)
(1) Signals and Interactive Systems Lab, University of Trento, Italy
(2) Amazon Alexa AI
{mahed.mousavi, giuseppe.riccardi}@unitn.it

∗ The work was done while at the University of Trento, prior to joining Amazon Alexa AI.

Abstract

The acquisition of a dialogue corpus is a key step in the process of training a dialogue model. In this context, corpora acquisitions have been designed either for open-domain information retrieval or slot-filling (e.g. restaurant booking) tasks. However, there has been scarce research in the problem of collecting personal conversations with users over a long period of time. In this paper we focus on the types of dialogues that are required for mental health applications. One of these types is the follow-up dialogue that a psychotherapist would initiate in reviewing the progress of a Cognitive Behavioral Therapy (CBT) intervention. The elicitation of the dialogues is achieved through textual stimuli presented to dialogue writers. We propose an automatic algorithm that generates textual stimuli from personal narratives collected during psychotherapy interventions. The automatically generated stimuli are presented as a seed to dialogue writers following principled guidelines. We analyze the linguistic quality of the collected corpus and compare the performances of psychotherapists and non-expert dialogue writers. Moreover, we report the human evaluation of a corpus-based response-selection model.

1 Introduction

The idea of developing conversational agents as Personal Healthcare Agents (PHA) (Riccardi, 2014) has gained growing attention in recent years for various domains including mental health (Fitzpatrick et al., 2017; Abd-alrazaq et al., 2019; Ali et al., 2020). Most of the conversational agents in the mental health domain are created using rule-based and simple predefined tree-based dialogue flows, resulting in limited understanding of the user input and repetitive responses by the agent. These limitations lead to shallow conversations and weak user engagement (Abd-Alrazaq et al., 2021).

The major reasons for such limitations are the complexity of conversations, the lack of dialogue data and domain knowledge. The conversations about mental state issues are very complex because they usually encompass personal feelings, user-specific situations, different spaces of entities, and emotions. In this domain, the state-of-the-art data-driven frameworks are not applicable and domain knowledge is very scarce. The two main approaches to collect dialogue data for the purpose of developing data-driven dialogue agents are either acquiring user interaction data via user simulators and hand-designed policies (Li et al., 2016), or to collect large sets of human-human conversations in different user-agnostic settings (Budzianowski et al., 2018; Gopalakrishnan et al., 2019; Zhang et al., 2018). These approaches have been used for goal-oriented agents (e.g. reservations of restaurants) or open-domain agents answering questions about a finite set of topics (e.g. news, music, weather, games etc.). However, neither of the above approaches can address the need for personal conversations which include user-specific recollections of events, objects, entities and their relations. Last but not least, state-of-the-art conversational agents cannot carry out engaging and appropriate single-user multi-session conversations. However, personal conversations' requirements include the ability of carrying out multi-session conversations over several weeks or months.

In this paper, we propose a novel methodology to collect corpora of follow-up dialogues for the mental health domain (or domains with the similar characteristics). Psychotherapists deliver interventions over a long period of time and need to monitor or react to patients' input. In this domain, dialogue follow-ups are a critical resource for psychotherapists to learn about the life events of the narrator as well as his/her corresponding thoughts and emotions in a timely manner. In Figure 1 we describe the proposed workflow for the acquisition of personal dialogue data aimed at training dialogue models. We first collect a dataset of personal narratives written by the users who are receiving Cognitive Behavioral Therapy (CBT) to handle their personal distress more effectively (1). In the next step, the narratives are used to generate stimuli for the follow-up conversations with an automatic algorithm. The first part of the stimulus, the common-ground statement, contains the summary of the narrative the user has previously left and the associated emotions and the second part is a follow-up question aimed at reviewing the users life events. In the last step, the stimuli are presented to writers and they are asked to generate a conversation based on the provided stimulus by impersonating themselves as both sides of the conversation, an approach introduced firstly by Krause et al. (2017), where in our setting the sides are the PHA and the patient.

Figure 1: The workflow for the elicitation of follow-up dialogues starting from the personal narratives collected during psychotherapy (left-hand side) interventions. The stimulus generation algorithm creates a textual stimulus from personal narratives as a seed to dialogue writers. Dialogue writers use the textual stimulus and principled guidelines to generate the follow-up dialogues (right-hand side). The dialogue follow-ups may be used to train dialogue models, response-selection models and natural language generators.

The main contributions of this paper can be summarized as follows:
• We present a methodology for data collection and elicitation of follow-up dialogues in the mental health domain.
• We present an algorithm for automatically generating conversation stimuli for follow-up dialogues in the mental health domain from a sequence of personal narratives and recollections, with a similar structure that psychotherapists use when reviewing the progress with the patient.
• We evaluate the collected dialogue corpus in terms of the quality of the obtained data, as well as the impact of domain expertise on writing the follow-up dialogues.
• We investigate the suitability of the collected corpus for developing conversational agents in the mental health domain by automatic and human evaluation of a baseline response-selection model.

(1) This data collection has been approved by the Ethical Committee of the University of Trento.

2 Literature Review

Knowledge grounded dialogue corpora. Previously published research have addressed the problem of collecting dialogue data starting from world knowledge facts or predefined persona descriptions. In this regard, Zhang et al. (2018) collected a dataset of conversations conditioned on synthetic persona descriptions for each side of the dialogue using Amazon Mechanical Turk (AMT) workers. Gopalakrishnan et al. (2019) collected a dataset of dialogues grounded in world knowledge by pairing AMT workers to have a conversation based on selected reading sets from Wikipedia and The Washington Post over various topics. Furthermore, Rashkin et al. (2019) have crowdsourced a dataset
of conversations with implied user feelings in the context, using AMT workers where a worker writes a personal situation associated to an emotion and in the next step is paired with another worker to have a conversation about the mentioned situation. While useful for chitchat and open-domain con- versations, unfortunately these resources are not a good fit to address the needs of the mental health support domain. Mental health support dialogue corpora The research in this domain is very recent and resources are scarce. “Counseling and Psychotherapy Tran- scripts” published by Alexander Street Press2 is a dataset of 4000 therapy session transcriptions on various topics, used as a resource for therapists- in-training. Pérez-Rosas et al. (2016) collected a dataset of 277 Motivational Interviewing (MI) ses- sion videos and obtained the transcriptions for each session either directly from the data source, or by recruiting AMT workers. Guntakandla and Nielsen (2018) conducted a data collection process of thera- peutic dialogues in Wizard of Oz manner where the therapists were impersonating a Personal Health- Figure 2: The user interface of the mobile applica- care Agent. The authors recorded 324 sessions of tion designed for collecting personal narratives (En- therapeutic dialogues which were then manually glish translations). The patients were asked to describe transcribed. Furthermore, in the physical health events, persons, situations that explained their emo- coaching domain, Gupta et al. (2020) collected a tional arousal while answering the ABC questions de- dataset of conversations where the expert imperson- signed by psychotherapists. ates a PHA that engages the users into a healthier life style. For this purpose, a certified health coach interacted with 28 patients using a messaging ap- asked to write notes about the daily events that acti- plication. vated their emotional state. CBT is a psychotherapy technique based on the intuition that it is not the 3 Dialogue Follow-Up Generation events that directly generate certain emotions but Methodology how these events are cognitively processed and evaluated and how irrational or dysfunctional be- The type of dialogues that we aim at obtaining is liefs influence this process (Oltean et al., 2017). A different from what has been reported in the litera- technique commonly used in CBT treatment is the ture. While previous works attempted to collect in- ABC (Antecedent, Belief, Consequences). In this the-field therapeutic interactions and convert them technique, the psychotherapist tends to identify the into dialogue datasets, we present an elicitation event that has caused the patient a certain emotion methodology to generate a dataset of follow-up di- by a set of questions to define A) what, when and alogues in the mental health domain, grounded in where the event happened, B) the patient’s thoughts the personal narratives and with the same conver- and beliefs about the event and C) the emotion the sational structure that the psychotherapists use in patient has experienced regarding the event. Once order to review the events with the patients in a dysfunctional thoughts are identified, the patient is timely manner. guided on how to change them or find more rational 3.1 Collection of Personal Narratives and/or functional thoughts (Sarracino et al., 2017). 
We recruited 20 users who would meet with their A group of 20 Italian native speakers who were re- human psychotherapists one session a week and ceiving Cognitive Behavioral Therapy (CBT) were asked them to write notes about the day-life events 2 https://alexanderstreet.com/ that caused them an emotional arousal between one 3
Figure 3: The heat-map of frequent nouns used by the patients in collected personal narratives (English transla- tions). The x-axis represents the nouns extracted from the 5-most frequent list used by each user while the y-axis and z-axis represent the users and the noun frequency, respectively. session and the following one. For this purpose, used in the narratives are user-specific. Figure 3 a mobile application was designed that the users plots the recurrence of the 5 most frequent nouns could interact with for a period of three months, to used by each user in the notes, translated into En- answer the questions designed by the psychothera- glish. As the figure shows, each word has been pists for the ABC technique, and assign an emotion used frequently by one user and seldom by other to the note if possible. The emotions could be se- users, indicating the personal space of entities and lected from a predefined set, equal for all users, characteristics of the conversations in the mental including the six basic emotions used in psycho- health domain since the topic of these conversa- logical experiments (Happiness, Anger, Sadness, tions, i.e. the life events and situations, varies from Fear, Disgust and Surprise) (Ekman, 1992), and one patient to the other. two other complex emotional states (Embarrass- ment and Shame) that were considered relevant for 3.2 Generation of Personal Stimuli this setting. Figure 2 shows the user interface of We extracted one sentence from each of the 92 the application designed for this purpose. selected narratives using an out-of-the-shelf extrac- By the end of this step, 224 ABC notes were ob- tive summarizer4 , and under the supervision of the tained from 20 users of which 92 notes (written by psychotherapists, designed 5 templates to convert 13 different subjects) are complete, i.e. the users each summary and its assigned emotion or automat- has answered all the questions completely, and are ically detected sentiment into a coherent stimulus selected for the generation of the stimuli. Consider- consisting of a common ground and a follow-up ing the fact that each note, that is the answers to the question. For each 18 one-line narrative summaries ABC questions, is about a unique real-life event, [Summary] with an assigned emotion [Emotion] by we concatenate the answers in each note under the the user, two templates are defined as; psychotherapists’ supervision to convert the notes into personal narratives of one piece. Out of the 92 In the notes you left previously, I read [Sum- complete narratives, 18 narratives are assigned an mary]. You told me you felt [Emotion] for that. emotion by the user, and 74 notes are not labeled Do you still feel [Emotion]? by any emotions. A lexicon-based sentiment ana- I remember you told me that you felt [Emo- lyzer developed by The OpeNER project3 is used tion] because of [Summary]. How do you feel to detect the polarity of the 74 narratives without now? any expressed emotions, which labeled 61 narra- tives as either negative or positive and 13 of them while, for the 61 one-line narrative summaries with as neutral. automatically determined polarity [Sentiment], two Lexical analysis on the selected narratives templates are defined as; demonstrates that the language and vocabulary 4 sumy Automatic text summarizer, 3 https://www.opener-project.eu/ https://pypi.org/project/sumy/ 4
Previously, you had a [Sentiment] feeling Total Stimulus Type Category Count about what I read in your note [Summary]. Count How do you feel about it now? Fear 2 I remember you had a [Sentiment] feeling Happiness 9 about what I read in your note [Summary]. Sadness 10 with Emotion 32 Do you have any new thoughts or considera- Anger 7 tions about it now? Disgust 2 Surprise 2 and, for the 13 one-line narrative summaries with- Positive 57 with Valence 107 out any assigned emotion or determined polarity, Negative 50 one template is defined as; Neutral - - 11 I read in your note about [Summary]. Do you Table 1: The distribution of the stimuli used for follow- want to tell me more about it now? up dialogue collection, obtained by the automatic ag- gregation of extracted one-line summaries, the tem- Using this methodology, we obtained 171 stim- plates and the assigned emotion or automatically de- tected sentiment valence. uli from the 92 selected narratives, of which 150 stimuli are used as the grounding and conversation context for follow-up dialogue generation while PHA. The closure turn is an important part of the 21 stimuli (approximately equal to 10% of the set) generated dialogue because these sentences play are selected by stratified sampling, as a reserved the role of the acknowledgment and grounding of subset. Table 1 shows the statistics regarding the the dialogue between the user and the PHA, and at distribution of the stimuli type used for the dialogue the same time may increase the user willingness to generation process. use the PHA. The number of turns for the dialogues was not fixed. However, the dialogue writers were 3.3 Generation of Dialogue Follow-Ups suggested to write 4 dialogue turns for each stimu- Two dialogue writer groups were recruited for the lus, resembling 2 turns for the user and 2 turns for dialogue generation. The first group included 4 the PHA (excluding the stimulus) with the last turn psychotherapists experienced in ABC therapy tech- as the closure by the PHA. Furthermore, in order nique, and the second group included 4 non-expert to minimize cognitive workload, the writers were writers. Each writer was presented with a detailed suggested to distribute the work by taking a break guideline including the task description as well as after each 10 stimuli. several examples of correct and incorrect annota- Initially, 10 stimuli were selected by stratified tion outcomes. For each provided stimulus, the sampling as the Qualification Batch and were pro- writers were asked to firstly review and validate vided to all the writers for the purpose of training the stimulus for possible “Grammatical Error” or and resolving possible misunderstandings. The out- “Inter-sentence Incoherence” and in case of an in- come of the Qualification Batch was then manually valid stimulus, to apply necessary modifications controlled and few adjustments were made with 2 to correct it. Following the validation, the writ- of the writers. Afterwards, the rest of the stimuli ers were asked to write a short dialogue follow-up were distributed such that 30% of the stimuli are based on the stimulus, assuming that the stimulus annotated by all 8 writers and the rest of the stim- was asked by a Personal Healthcare Agent (PHA) uli are annotated by two psychotherapists and two to a user about his/her previous narrative. non-expert writers. 
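A minimal Python sketch of the stimulus-generation step of Section 3.2 is shown below for illustration only. It assumes the sumy LexRank summarizer and English text; the paper's narratives are in Italian, the exact sumy algorithm is not specified, and the template strings follow the English translations given in the text.

    # Illustrative sketch; not the authors' implementation.
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.summarizers.lex_rank import LexRankSummarizer

    EMOTION_TEMPLATES = [
        "In the notes you left previously, I read {summary}. "
        "You told me you felt {emotion} for that. Do you still feel {emotion}?",
        "I remember you told me that you felt {emotion} because of {summary}. "
        "How do you feel now?",
    ]
    SENTIMENT_TEMPLATES = [
        "Previously, you had a {sentiment} feeling about what I read in your note "
        "{summary}. How do you feel about it now?",
        "I remember you had a {sentiment} feeling about what I read in your note "
        "{summary}. Do you have any new thoughts or considerations about it now?",
    ]
    NEUTRAL_TEMPLATE = ("I read in your note about {summary}. "
                        "Do you want to tell me more about it now?")

    def one_line_summary(narrative):
        # Extract a single sentence from the concatenated ABC narrative.
        parser = PlaintextParser.from_string(narrative, Tokenizer("english"))
        sentences = LexRankSummarizer()(parser.document, sentences_count=1)
        return str(sentences[0])

    def make_stimuli(narrative, emotion=None, sentiment=None):
        # Fill every applicable template; the paper defines two templates for
        # emotion-labelled and polarity-labelled summaries and one for neutral ones.
        summary = one_line_summary(narrative)
        if emotion:
            return [t.format(summary=summary, emotion=emotion) for t in EMOTION_TEMPLATES]
        if sentiment in ("positive", "negative"):
            return [t.format(summary=summary, sentiment=sentiment) for t in SENTIMENT_TEMPLATES]
        return [NEUTRAL_TEMPLATE.format(summary=summary)]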
The writers were asked to respect three manda- tory requirements while generating the dialogues as 4 Evaluation 1) The conversation must be based on and consis- Using the introduced elicitation methodology, we tent with the stimulus; 2) The flow of the conversa- collected a corpus of follow-up conversations from tion must be such that the user elaborates about the the two writer groups5 . We then performed an anal- event introduced in the stimulus and provides more ysis on the obtained conversations to evaluate the information about the event and its objects (person, 5 We are currently applying for further funds to anonymize location etc.) or his/her emotion to the PHA; and 3) the corpus and publish a version of the corpus that respects The conversation must contain a closure turn by the patients’ privacy and deontological requirements. 5
Non-Experts Therapists Dialogue Act Non-Experts Therapists # Dialogues 400 400 inform 1487 1777 # Turns 1714 1494 answer 768 925 # Unique Tokens 3146 4251 auto-positive 591 333 Avg. Turns question 396 452 4.2 3.7 per Dialogue request 217 194 suggest 162 167 Table 2: The statistics of the collected corpus of follow- offer 117 26 up dialogues using the proposed elicitation methodol- confirm 65 36 ogy per each writer group, non-experts and psychother- apists. disconfirm 56 63 address-suggest 40 17 address-request 2 9 elicitation methodology and to investigate the im- other 77 11 pact of domain expertise on the collected dialogues by comparing the performances of psychotherapists Table 3: The distribution of the Dialogue Acts in and non-expert writers. the generated follow-up conversations by each writer group using ISO standard DA tagging in Italian (Roc- 4.1 Validation of the Generated Stimuli cabruna et al., 2020). Less frequent DAs to the task as accept-apology, apology, promise, accept-offer, and In the first subtask, while 34.2% of the provided Feedback dimension DAs auto-negative, allo-negative stimuli to the non-expert writers were labeled as and allo-positive are presented as "other" in the Table invalid, this percentage by the psychotherapist (Bunt et al., 2010). group was 44.5%. Furthermore, the inter-annotator agreement measured by Fleiss κ coefficient (Fleiss, 1971) was higher in the latter group (0.26) as op- extracted summary and detected polarity. The mod- posed to the non-expert group (0.06). This dis- ifications on the summary sentence included refac- crepancy in the validation subtask suggests that the toring the structure, re-positioning sections of the assessment of the stimuli by each writer is affected summary or restoring the punctuation. As for the by their level of competence in the domain and a modifications on the detected sentiment, while the more precise assessment of the stimuli as an effect modifications done by the non-expert writers were of domain expertise. Therefore, domain expertise about changing negative and positive polarity with seems to be an important requirement for the qual- one another, the experts tended to be more con- ity of validation annotation in the mental health servative in expressing a sentiment for the stimuli domain. Nevertheless, by representing each writer as they mostly changed the stimuli with detected group by their consensus vote over the subset of sentiment to neutral ones without any polarity. stimuli for which we have a consensus decision, the In less than 10% of the cases the writers, mostly inter-group agreement over this subset of 27 stim- the psychotherapists, modified the template and uli was 0.6639, measured by Cohen’s κ coefficient specifically the follow-up question. In these cases, (Cohen, 1960), suggesting that even though domain the questions were changed to a more summary- knowledge and expertise results in a fine-grained specific ones such as "...What was the distorted assessment, it is still feasible to obtain a course- thought that came to your mind?". grained validation over the generated stimuli with a group of non-expert writers with appropriate guide- 4.2 Analysis of the Dialogue Data Collection lines. 
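Both agreement statistics referenced above (Fleiss' kappa within a writer group, Cohen's kappa between the two groups' consensus votes) can be computed with standard libraries. The sketch below is illustrative only: the rating vectors are toy data rather than the paper's annotations, and statsmodels and scikit-learn are assumed as acceptable implementations.

    # Toy illustration of the agreement statistics; not the paper's data.
    import numpy as np
    from sklearn.metrics import cohen_kappa_score
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Validity judgements (0 = valid, 1 = invalid) by 4 writers on 6 stimuli.
    ratings = np.array([[0, 0, 1, 0],
                        [1, 1, 1, 0],
                        [0, 0, 0, 0],
                        [1, 0, 1, 1],
                        [0, 0, 0, 0],
                        [1, 1, 1, 1]])
    table, _ = aggregate_raters(ratings)      # subjects x categories counts
    print("Fleiss kappa:", fleiss_kappa(table))

    # Consensus votes of the two writer groups over a shared subset of stimuli.
    experts = [0, 1, 0, 1, 0, 1]
    non_experts = [0, 1, 0, 0, 0, 1]
    print("Cohen kappa:", cohen_kappa_score(experts, non_experts))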
As the result of elicitation process, we collected a While the expert group labeled 60% of the in- dataset of follow-up dialogues in the mental health valid stimuli due to “Inter-sentence Incoherence” domain, presented in Table 2, consisting of 800 dia- with respect to the automatic generation and com- logues written by both groups. The number of turns bination of the stimuli elements (the summary, the and the number of unique tokens for each group sentiment, and the template), “Grammatical Error” indicate that the experts tended to write shorter was the assigned error in most of the stimuli labeled conversations while they used a wider range of vo- as invalid, 69%, by the non-expert group. Regard- cabulary in writing the conversations compared to ing the corrections applied to the invalid stimuli, the non-expert group. Regarding the length of the modifications were mostly about the automatically generated dialogues, in 627 conversations the writ- 6
Figure 4: The heat-map of frequent nouns used by the dialogue writers in the generated conversations (English translations). The x-axis represents the nouns extracted by merging the lists of 20 most frequent nouns used per each writer. The y-axis and z-axis represent the writers and the noun frequency per each writer respectively. ers respected the suggestion of writing 4 turns per question, request and suggest), there is a diversity dialogue, with exceptions of 90 dialogues written in in the type and the frequency of the DAs used by two turns where the user replies to the stimulus and non-expert group (such as offer, address-suggest the PHA ends the conversation with a closure turn, and other less relevant DAs to the domain) with and 83 dialogues where the user and the PHA dis- respect to the professionals, suggesting that the pro- cuss further about the event and the user’s thoughts fessionals hold a more structured conversation with before ending the conversation. respect to the other group. 4.2.1 Linguistic Analysis 4.2.2 Response-Selection Baseline In order to gain insights about the differences in the We investigated the appropriateness of the col- dialogues written be each group, we looked into lected dialogue corpus for developing conversa- the vocabulary of the nouns and entities used by tional agents in the mental health domain by train- each writer. Figure 4 shows the frequency heat- ing a TF-IDF response-selection baseline model. map of the 20 most frequent nouns used by each The model was trained on 90% of the collected con- writer in generated dialogues, translated into En- versations with a similar training setting to Lowe glish. The results indicate that the language and et al. (2015), and evaluated on the remaining 10% vocabulary used in the expert group is specific for of the data as test set using Recall@k family of each therapist and varies from one expert to the metrics, presented in Table 4. The model was then other, while non-expert writers have a more com- integrated in the application introduced in subsec- bined vocabulary with less inter-annotator novelty tion 3.1 to select the correct PHA response for each in lexicon, suggesting that the domain expertise has user turn. 10 test users were recruited to inter- an influence on language and the use of vocabulary act with our application and write narratives about in generating conversations for the mental health their life events by answering the ABC questions domain. for 50 days. Each narrative was then automatically Furthermore, we developed a Dialogue Act tag- converted to a personal dialogue stimuli after one ger to compare the conversations by their set of day, using the introduced methodology in subsec- Dialogue Acts (DA). For this purpose, we anno- tion 3.2, to initiate a follow-up dialogue with the tated 370 of the collected dialogue follow-ups test user for two exchanges (4 turns) with natural (1514 turns, approximately equal to 45% of the language responses from the users and retrieved re- dataset) with the ISO standard DA tagging in Ital- sponses from the system. Regarding the evaluation ian (Roccabruna et al., 2020) and trained an en- of the dialogues, we asked the test users to assess coder–decoder model (Zhao and Kawahara, 2019) the appropriateness and coherence of each system to segment each turn to its functional units and label turn (including the stimulus) during the conversa- them by their DAs. 
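Before the results, the TF-IDF response-selection baseline and its Recall@k evaluation of Section 4.2.2 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the candidate-sampling scheme, the preprocessing, and fitting the vectorizer on contexts and responses together are assumptions, in the spirit of the 1-in-N Recall@k setup of Lowe et al. (2015).

    # Illustrative TF-IDF response-selection baseline with 1-in-N Recall@k.
    import random
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def recall_at_k(contexts, responses, n_candidates=10, k=2, seed=0):
        # Fraction of contexts whose true response ranks in the top k among
        # one true response plus (n_candidates - 1) randomly sampled distractors.
        rng = random.Random(seed)
        vectorizer = TfidfVectorizer().fit(contexts + responses)
        hits = 0
        for i, context in enumerate(contexts):
            distractors = rng.sample(
                [r for j, r in enumerate(responses) if j != i], n_candidates - 1)
            candidates = [responses[i]] + distractors
            scores = cosine_similarity(vectorizer.transform([context]),
                                       vectorizer.transform(candidates))[0]
            rank = int(np.sum(scores > scores[0]))   # candidates scoring above the truth
            hits += rank < k
        return hits / len(contexts)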
The results, presented in Table tion with thumbs-up (appropriate) or thumbs-down 3, show that despite the similarity in the use of the (inappropriate) for each turn, and to evaluate the top 6 frequent DAs (inform, answer, auto-positive, quality of the conversation as-a-whole by voting 7
TF-IDF Through an analysis of the collected resource 1 in 2 R@1 0.49 following our proposed methodology, it emerged 1 in 10 R@1 0.21 that the task of validating responses and generat- 1 in 10 R@2 0.36 ing dialogues in the mental healthcare domain can 1 in 10 R@5 0.55 be performed both by using psychotherapists and 1 in 50 R@1 0.14 non-expert dialogue writers. Therefore, it suggests 1 in 50 R@2 0.18 the possibility of training a larger number of non- 1 in 50 R@5 0.26 expert dialogue writers using appropriate guide- lines to obtain a valid dataset with less cost while Table 4: The performance of the response-selection ensuring consistency in the results. baseline on the collected dialogue follow-ups for dif- Furthermore, we investigated the appropriate- ferent recall metrics. ness of the collected corpus for developing conver- Count sational agents in the mental health domain. We re- ported automatic and human evaluation of a corpus- # Dialogues 217 based response-selection baseline. We found that # 5-star 130 (60%) the test users who interacted with the model over a # 4-star 26 (12%) long-term period (50 days) considered on average # 3-star 41 (19%) 91% of system turns as appropriate and coherent, # 2-star 8 (3%) resulting into 72% of dialogues with acceptable # 1-star 12 (6%) quality. # PHA Turns 651 We believe the proposed methodology can be # Thumps-Up 594 (91%) used to tackle the problem of resource scarcity # Thumps-Down 57 (9%) in the mental health domain. In particular, our Table 5: The results of human evaluation of the methodology can be used to obtain corpora of dia- response-selection model in follow-up dialogues. The logues grounded in personal recollections for devel- users rated each response on a binary scale (Thumbs- oping dialogue models in the mental health domain. Up and Thumbs-Down) as well as the whole dialogue with 1-5 star score. Acknowledgements The research leading to these results has received from 1-star (very bad) to 5-stars (very good) for funding from the European Union – H2020 Pro- each dialogue. gramme under grant agreement 826266: COAD- The results of human evaluation on the baseline APT. dialogue model, shown in Table 5, indicate that 91% of the system turns were considered appro- priate and coherent by the test users, resulting in References more than 70% of the dialogues with acceptable Alaa A Abd-alrazaq, Mohannad Alajlani, Ali Abdal- quality, thus suggesting the usefulness and suitabil- lah Alalwan, Bridgette M Bewick, Peter Gardner, ity of the generated dialogues using the proposed and Mowafa Househ. 2019. An overview of the features of chatbots in mental health: A scoping re- methodology for developing PHAs in the mental view. International Journal of Medical Informatics, health domain. 132:103978. 5 Conclusions Alaa A Abd-Alrazaq, Mohannad Alajlani, Nashva Ali, Kerstin Denecke, Bridgette M Bewick, and Mowafa In this work, we address the need for suitable dia- Househ. 2021. Perceptions and opinions of patients logue corpora to train Personal Healthcare Agents about mental health chatbots: Scoping review. Jour- nal of Medical Internet Research, 23(1):e17828. in the mental health domain. We present an elicita- tion methodology for dialogues in the mental health Mohammad Rafayet Ali, Seyedeh Zahra Razavi, Raina domain grounded in personal recollections. Using Langevin, Abdullah Al Mamun, Benjamin Kane, the proposed methodology, we collected a dataset Reza Rawassizadeh, Lenhart K. Schubert, and Ehsan Hoque. 2020. 
A virtual conversational agent of follow-up dialogues that psychotherapists would for teens with autism spectrum disorder: Experimen- hold with the patients to review the personal events tal results and design lessons. In Proceedings of the and emotions during a CBT intervention. 20th ACM International Conference on Intelligent 8
Virtual Agents. Association for Computing Machin- Xiujun Li, Zachary C Lipton, Bhuwan Dhingra, Lihong ery. Li, Jianfeng Gao, and Yun-Nung Chen. 2016. A user simulator for task-completion dialogues. arXiv Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang preprint arXiv:1612.05688. Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ra- madan, and Milica Gasic. 2018. Multiwoz-a large- Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle scale multi-domain wizard-of-oz dataset for task- Pineau. 2015. The Ubuntu dialogue corpus: A large oriented dialogue modelling. In Proceedings of the dataset for research in unstructured multi-turn dia- 2018 Conference on Empirical Methods in Natural logue systems. In Proceedings of the 16th Annual Language Processing, pages 5016–5026. Meeting of the Special Interest Group on Discourse and Dialogue, pages 285–294. Association for Com- Harry Bunt, Jan Alexandersson, Jean Carletta, Jae- putational Linguistics. Woong Choe, Alex Chengyu Fang, Koiti Hasida, Kiyong Lee, Volha Petukhova, Andrei Popescu- Horea-Radu Oltean, Philip Hyland, Frédérique Val- Belis, Laurent Romary, Claudia Soria, and David lières, and Daniel Ovidiu David. 2017. An empir- Traum. 2010. Towards an iso standard for dialogue ical assessment of rebt models of psychopathology act annotation. Seventh conference on International and psychological health in the prediction of anxiety Language Resources and Evaluation (LREC’10). and depression symptoms. Behavioural and cogni- tive psychotherapy, 45(6):600–615. Jacob Cohen. 1960. A coefficient of agreement for Verónica Pérez-Rosas, Rada Mihalcea, Kenneth Resni- nominal scales. Educational and psychological mea- cow, Satinder Singh, and Lawrence An. 2016. Build- surement, 20(1):37–46. ing a motivational interviewing dataset. In Proceed- ings of the Third Workshop on Computational Lin- Paul Ekman. 1992. Are there basic emotions? Psycho- guistics and Clinical Psychology, pages 42–51. logical Review, 99(3):550–553. Hannah Rashkin, Eric Michael Smith, Margaret Li, and Kathleen Kara Fitzpatrick, Alison Darcy, and Molly Y-Lan Boureau. 2019. Towards empathetic open- Vierhile. 2017. Delivering cognitive behavior ther- domain conversation models: A new benchmark and apy to young adults with symptoms of depression dataset. In Proceedings of the 57th Annual Meet- and anxiety using a fully automated conversational ing of the Association for Computational Linguis- agent (woebot): a randomized controlled trial. JMIR tics, pages 5370–5381. Association for Computa- mental health, 4(2):e19. tional Linguistics. Joseph L Fleiss. 1971. Measuring nominal scale agree- Giuseppe Riccardi. 2014. Towards healthcare per- ment among many raters. Psychological bulletin, sonal agents. In Proceedings of the 2014 Workshop 76(5):378. on Roadmapping the Future of Multimodal Interac- tion Research including Business Opportunities and Karthik Gopalakrishnan, Behnam Hedayatnia, Challenges, pages 53–56. Qinglang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, Dilek Hakkani-Tür, Gabriel Roccabruna, Alessandra Cervone, and and Amazon Alexa AI. 2019. Topical-chat: Towards Giuseppe Riccardi. 2020. Multifunctional iso knowledge-grounded open-domain conversations. standard dialogue act tagging in italian. Seventh In INTERSPEECH, pages 1891–1895. Italian Conference on Computational Linguistics (CLiC-it). Nishitha Guntakandla and Rodney Nielsen. 2018. An- notating reflections for health behavior change ther- Diego Sarracino, Giancarlo Dimaggio, Rawezh apy. 
In Proceedings of the Eleventh International Ibrahim, Raffaele Popolo, Sandra Sassaroli, and Conference on Language Resources and Evaluation Giovanni M Ruggiero. 2017. When rebt goes (LREC 2018). difficult: applying abc-def to personality disorders. Journal of Rational-Emotive & Cognitive-Behavior Itika Gupta, Barbara Di Eugenio, Brian Ziebart, Therapy, 35(3):278–295. Aiswarya Baiju, Bing Liu, Ben Gerber, Lisa Sharp, Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Nadia Nabulsi, and Mary Smart. 2020. Human- Szlam, Douwe Kiela, and Jason Weston. 2018. Per- human health coaching via text messages: Corpus, sonalizing dialogue agents: I have a dog, do you annotation, and analysis. In Proceedings of the 21th have pets too? In Proceedings of the 56th Annual Annual Meeting of the Special Interest Group on Dis- Meeting of the Association for Computational Lin- course and Dialogue, pages 246–256. guistics (Volume 1: Long Papers), pages 2204–2213. Association for Computational Linguistics. Ben Krause, Marco Damonte, Mihai Dobre, Daniel Duma, Joachim Fainberg, Federico Fancellu, Em- Tianyu Zhao and Tatsuya Kawahara. 2019. Joint dialog manuel Kahembwe, Jianpeng Cheng, and Bonnie act segmentation and recognition in human conver- Webber. 2017. Edina: Building an open domain sations using attention to dialog context. Computer socialbot with self-dialogues. 1st Proceedings of Speech & Language, 57:108–127. Alexa Prize (Alexa Prize 2017). 9
Towards Automating Medical Scribing: Clinic Visit Dialogue2Note Sentence Alignment and Snippet Summarization

Wen-wai Yim (Augmedix Inc) wenwai.yim@augmedix.com
Meliha Yetisgen (University of Washington) melihay@uw.edu

Abstract

Medical conversations from patient visits are routinely summarized into clinical notes for documentation of clinical care. The automatic creation of clinical note is particularly challenging given that it requires summarization over spoken language and multiple speaker turns; as well, clinical notes include highly technical semi-structured text. In this paper, we describe our corpus creation method and baseline systems for two NLP tasks, clinical dialogue2note sentence alignment and clinical dialogue2note snippet summarization. These two systems, as well as other models created from such a corpus, may be incorporated as parts of an overall end-to-end clinical note generation system.

1 Introduction

As a side effect of widespread electronic medical record adoption spurred by the HITECH Act, clinicians have been burdened with increased documentation demands (Tran et al.). Thus for each visit with a patient, clinicians are required to input order entries and referrals; most importantly, they are charged with the creation of a clinical note. A clinical note summarizes the discussions and plans of a medical visit and ultimately serves as a clinical communication device, as well as a record used for billing and legal purposes. To combat physician burnout, some practices employ medical scribes to assist in documentation tasks. However, hiring such assistants to audit visits and to collaborate with medical staff for electronic medical record documentation completion is costly; thus there is great interest in creating technology to automatically generate clinical notes based on clinic visit conversations.

Not only does the task of clinical note creation from medical conversation dialogue include summarizing information over multiple speakers, often the clinical note document is created with clinician-provided templates; clinical notes are also often injected with structured information, e.g. labs. Finally, parts of clinical notes may be transcribed from dictations; or clinicians may issue commands to adjust changes in the text, e.g. "change the template", "nevermind disregard that."

note: She declines the pneumonia vaccine.
dialogue:
  [QA-1] Doctor: Have you had a pneumonia vaccine?
  [QA-1] Patient: No, I don't think so.
  [QA-2] Doctor: Alright, do you want one?
  [QA-2] Patient: No.
Table 1: Alignment example

In earlier work (Yim et al., 2020), we introduced a new annotation methodology that aligns clinic visit dialogue sentences to clinical note sentences with labels, thus creating sub-document granular snippet alignments between dialogue and clinical note pairs (e.g. Table 1, 2). In this paper, we extend this annotation work on a real corpus and provide the first baselines for clinic visit dialogue2note automatic sentence alignments. Much like machine translation (MT) bitext corpora alignment is instrumental to the progress in MT; we believe that dialogue2note sentence alignment will be a critical driver for AI assisted medical scribing. In the dialogue2note snippet summarization task, we provide our baselines for generating clinical note sentences from transcript snippets. Technology developed from these tasks, as well as other models generated from this annotation, can contribute as part of a larger framework that ingests automatic speech recognition (ASR) output from clinician-patient visits and generates clinical note text end-to-end (Quiroz et al., 2019).

2 Background

Table 2 depicts a full abbreviated clinical note with marked associated dialogue transcript sentences. To understand the challenges of alignment (creation of paired transcript-note input-output) and generation (creation of the note sentence from
note dialogue annotations 0 | Chief Complaint : 0 | Doctor: alright enlarged tonsils. 1 | Evaluation of tonsil hypertrophy .. | ... note[1] → STATEMENT2SCRIBE[0] 2 | HPI : 6 | Doctor: okay so tell me about your throat. note[5] → GROUP .. | ... 7 | Patient: my tonsils they stay pretty big and they have tonsil stone and - [ STATEMENT[6], .. | ... .. | ... STATEMENT[7], 5 | Reports enlarged tonsils, tonsil stones and sore throat. 9 | Patient: um like this once on this side specifically it’s actually swollen- STATEMENT[9,10] ] 6 | Symptoms have been present for several years but have 10 | Patient: and a couple weeks ago it was so swollen that it was like bleeding. note[6] → GROUP worsened over the past several months. 11 | Patient: I wake up in the mornings and I feel like I’m going to be sick. [ QA[18,19], .. | ... QA[20,21], .. | ... 18 | She wakes up in the morning with nausea. STATEMENT[22,23] ] 18 | Doctor: so you had this for a long time? INCOMPLETE 19 | She has frequent tonsil infections, 3-4 infections per year. 19 | Patient: yeah .. | ... note[18] → STATEMENT[11] 20 | Doctor: wait how old are you? .. | ... note[19] → QA[32,33] 21 | Patient: twenty two. 26 | Physical Exam note[29] → INFFERRED-OUTSIDE 22 | Doctor: and you’ve had tonsil infections since high school? .. | ... note[33] → DICTATION[48] 23 | Patient: mhm. 28 | Turbinates : .. | ... note[68] → COMMAND[147] 29 | Normal size and symmetrical bilaterally. 24 | Doctor: sore throats? .. | ... 26 | Patient: yeah. .. | Tonsil : .. | ... 33 | 3+ cryptic 32 | Patient: do you think it happens more than three times in a year? .. | ... 33 | Patient: probably at least three. .. | ... .. | ... 62 | Assessment & Plan : 48 | Doctor: tonsils three plus cryptic . .. | ... .. | ... 68 | [Risk and benefits template for tonsillectomy] .. | ... .. | ... 147 | Doctor: please insert the risks and benefits template for tonsillectomy. Table 2: Example annotations (right) for corresponding clinical note (left) and dialogue (middle). The same colors indicate matched associations. the dialogue snippet), it is important to consider versation, it may appear across multiple turns many several differences in textual mediums: sentences apart with contextually inferred subjects. Semantic variations between spoken dia- Order of appearance between source and tar- logue and written clinical note narrative. get are not consistent. The order of information Spoken language in clinic visits have vastly and organization of data in a clinical note may not different representations than in highly technical match the order of discussion in a clinic visit dia- clinical note reports. Dialogue may include logue. This provides additional challenges in the frequent use of vernacular and verbal expressions, alignment process. Table 2 shows corresponding along with disfluencies, filler words, and false note and dialogue information with the same color. starts. In contrast, clinical note text is known to use Content incongruency. Relationship-building is semi-structured language, e.g. lists, and is known a critical aspect of clinician-patient visits. There- to have a much higher degree of nominalization. fore visit conversations may include discussion un- Moreover, notes frequently contain medical related to patient health, e.g. politics and social terminology, acronyms, and abbreviations, often events. Conversely, not all clinical note content with multiple word senses. necessarily corresponds to a dialogue content. In- Information density and length. 
Whereas clin- formation may come from a clinical note template ical notes are highly dense technical documents, or various parts of the electronic medical record. conversation dialogue are much longer than clini- Clinical note creation from conversation amal- cal notes. In fact, in our data, dialogues were on gamates interweaving subtasks. Elements in a average three times the note length. Key informa- clinic visit conversation (or accompanying speech tion in conversations are regularly interspersed. introduction) are intended to be recorded or acted Dialogue anaphora across multiple turns is per- upon in different ways. For example, some spoken vasive. Anaphora is the phenomenon in which language may be directly copied to the clinical note information can only be understood in conjunction with minor pre-determined edits, such as in a dic- with references to other expressions. Consider in tation, e.g. “three plus cryptic” will be converted the dialogue example : “Patient: I have been having to “3+ cryptic”. However some language is meant swelling and pain in my knee. Doctor: How often to express directives, pertaining to adjustments to does the knee bother you?” It’s understood that the the note, e.g. “please insert the risks and benefits second reference of “knee” pertains to the knee- template for tonsillectomy.” Some information is related swelling and pain. A more complex exam- meant to be interpreted, e.g. “the pe was all nor- ple is shown in Table 2 note line 6. While anaphora mal” would allow a note sentence “CV: normal occurs in all naturally generated language, in con- rhythm” as well as “skin: intact, no lacerations”. 11
Finally, there are different levels of abstractive sum- Clinical Language Generation from Conversa- marization over multiple statements, questions and tion (Finley et al., 2018) produced dictation parts answers as shown in the Table 2 examples. of a report, measuring performance both on gold standard transcripts and raw ASR output using sta- 3 Related Work tistical MT methods. In (Liu et al., 2019), the authors labeled a corpus of 101K simulated con- Clinical Conversation Language Understand- versations and 490 nurse-patient dialogues with ing Language understanding of clinical conversa- artificial short semi-structured summaries. They tion can be traced to a plethora of historical work experimented with different LSTM sequence-to- in conversation analysis regarding clinician-patient sequence methods, various attention mechanisms, interactions (Byrne and Long, 1977; Raimbault pointer generator mechanisms, and topic informa- et al., 1975; Drass, 1982; Cerny, 2007; Wang et al., tion additions. (Enarvi et al., 2020) performed sim- 2018). More recent work has additionally included ilar work with sequence-to-sequence methods on a classification of dialogue utterances into seman- corpus of 800K orthopaedic ASR generated tran- tic categories. Examples include classifying dia- scripts and notes; (Krishna et al., 2020) on a corpus logue sentences into either the target SOAP sec- of 6862 visits of transcripts annotated with clinical tion format or by using abstracted labels consis- note summary sentences. Unlike most of previous tent with conversation analysis (Jeblee et al., 2019; works, our task generates clinical note sentences Schloss and Konam, 2020; Wang et al., 2020). The from labeled transcript snippets, which are at times work of (Lacson et al., 2006) framed identifying overlapping and discontinuous. (Krishna et al., relevant parts of hemodialysis 118 nurse-patient 2020)’s CLUSTER2SENT oracle system does use phone conversations as an extractive summariza- gold standard transcript “clusters”, though differ- tion task. There has also been numerous works ent from our setup, outputs entire sections. While related to identifying topics, entities, attributes, and this strategy presupposes an upstream conversation relations from clinic visit conversation – using var- topic segmentation system1 as well as some extrac- ious schemas (Jeblee et al., 2019; Rajkomar et al., tive summarization, generation based on smaller 2019; Du et al., 2019). Though clinic conversa- text chunks can lead to more controllable and accu- tion language understanding is not explored in this rate natural language generation, critical character- work, our automatic or manual sentence alignments istics in health applications. methods produce the language understanding labels that may to used to (a) model dialogue relevance, 4 Corpus Creation (b) cluster dialogue topics, and (c) classify speak- ing mode, e.g. dictation versus question-answers. Data The data set was constructed from clinical encounter visits from 500 visits and 13 providers. Clinic Visit Dialogue2note Sentence Alignment The data for each visit consisted of a visit audio Creating a corpus of aligned clinic visit conversa- and clinical note. For each visit audio, speaker tion dialogue sentences with corresponding clinical roles (e.g. clinician patient) were segmented and note sentences is instrumental for training language labeled. Automatically generated speech to text for generation systems. 
Early work in this domain in- each audio was manually corrected by annotators. cludes that of (Finley et al., 2018), which uses Table 3 gives the summary statistics of the extracted an automated algorithm based on some heuristics, visit audio. For all specialties, the average number e.g. string matches, and merge conditions, to align of turns and sentences for transcript was 175 ± dictation parts of clinical notes. In (Yim et al., 111 and 341 ± 214, for a total of 87725 turns and 2020), we annotated manual alignments between 170546 sentences. The number of sentences for dialogue sentences and clinical note sentences for clinical note was 47 ± 24, for a total of 23421 the entire visit; however, the dataset was small and sentences. Table 4 shows the number of turns and artificial (66 visits). Here we utilize this approach sentences per different types of speakers. on real data and additionally provide an automatic We also combined our data with external data, sentence alignment baseline system. To our knowl- the mock patient visit (MPV) dataset, from (Yim edge, this is the first work to propose an automated sentence alignment system for entire clinic visit 1 A system that divides conversations into segments accord- dialogue and note pairs. ing to topics 12
et al., 2020) to create a total of 566 visits.2 specialty providers visits duration speakers ENT 1 68 10 ± 4 4±1 HAND 1 43 10 ± 4 3±1 ORTHO 1 27 11 ± 5 4±1 PODIATRY 4 174 7±4 3±1 Figure 1: Annotation match tree PRIMARY 6 188 17 ± 9 4±1 TOTAL 13 500 12 ± 8 4±1 Table 3: Source audio statistics “it lasted about a week.” QA: Questions and answers spoken by any Annotations Each annotation is based on a participant in a clinic visit in natural conversation, clinical note sentence association with multiple e.g. “how long has the runny nose lasted? about a transcript sentences. A note sentence can be week.” associated with zero transcript sentences and an INFERRED-OUTSIDE: Clinical note sentences INFERRED-OUTSIDE label for default template for which information comes from a known tem- values, e.g. “cv: normal”. One may also be plate’s default value rather than the conversation, associated with sets of transcript sentences and e.g.“skin: intact.” a set tag, e.g. DICTATION or QA (described below). Finally, when multiple sets have anaphoric If after applying all possible associations and still references, they may be tied together using a there is information in the note sentence not avail- GROUP label. Given this hierarchy, the annotation able from the transcript, then an INCOMPLETE related to a single note sentence can be represented tag is added. A note sentence is left unmarked if as a tree as shown in Figure 1. no information can be found from the transcript. Table 2 shows label annotations with color coding Set labels for a full abbreviated transcript-note pair. COMMAND: Spoken by the clinician to the scribe To measure interannotaor agreement, we cal- to make a change to the clinical note structure, e.g. culated the triple, path, and span metrics intro- “add skin care macro.” duced in (Yim et al., 2020), briefly described DICTATION: Spoken by the clinician to the scribe again here. The triple, path, and span metrics where the output text is expected to be almost were defined based on instances constructed from verbatim, though with understood changes in the annotation tree representation. Specifically, abbrevations, number expressions, and language for the triple metric, which measures unlabeled formatting commands, e.g. “return in four to five note to dialogue sentence match, instances are days period.” defined by note sentence id and transcript sen- STATEMENT2SCRIBE: Spoken by the clinician tence id per visit, e.g. ‘visitid_01|note_0|3’. The to the scribe where information is communicated second metric, similar to the leaf-ancestor met- informally, e.g. “okay so put down heart and lungs ric used in parsing, takes into account the full were normal” path from one note sentence to one dialogue sen- STATEMENT: Statements spoken by any partici- tence, e.g. ‘visitid_01|note_0|GROUP|QA|3’. The pant in a clinic visit in natural conversation, e.g. span metric, similar to that of PARSEVAL, mea- 2 To normalize for annotation differences between the sures a node-level labeled span of dialogue sen- Mock Patient Visits (MPV) and our corpus, we removed tences, e.g. for the top group node would be INFERRED-DIALOGUE labels, reattached REPEATS to a ‘visitid_01|note_0|GROUP|[10,12,13,14]’ (Samp- higher node, and moved all GROUP labels to the highest node. son and Babarczy, 2003). 
When testing agreement, labels for each annotator are decomposed to these speaker sentences turns instance collections; true positive, false positive, clinician_primary 99421 42480 and false negatives may be counted by the matches patient 56052 36059 other 15073 9186 and mismatches between annotators. F1 score is TOTAL 170546 87725 calculated as usual. The different definitions allow both relaxed (triple) and stricter (path and span) Table 4: Speaker statistics agreement measurements. 13
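The F1 computation over decomposed annotation instances described above can be made concrete with a short sketch. The instance strings and the toy sets below are illustrative; the exact serialization is only shown by example in the text (e.g. 'visitid_01|note_0|3' for the triple metric).

    # Illustrative agreement F1 over instance sets (triple/path/span metrics).
    def agreement_f1(instances_a, instances_b):
        # Treat annotator A as prediction and annotator B as reference:
        # shared instances are true positives, A-only are false positives,
        # B-only are false negatives.
        tp = len(instances_a & instances_b)
        fp = len(instances_a - instances_b)
        fn = len(instances_b - instances_a)
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    ann_a = {"visitid_01|note_0|3", "visitid_01|note_5|7", "visitid_01|note_5|9"}
    ann_b = {"visitid_01|note_0|3", "visitid_01|note_5|7", "visitid_01|note_6|11"}
    print(round(agreement_f1(ann_a, ann_b), 3))   # 0.667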