Unremarkable AI: Fitting Intelligent Decision Support into Critical, Clinical Decision-Making Processes

Qian Yang, HCI Institute, Carnegie Mellon University, yangqian@cmu.edu
Aaron Steinfeld, Robotics Institute, Carnegie Mellon University, steinfeld@cmu.edu
John Zimmerman, HCI Institute, Carnegie Mellon University, johnz@cs.cmu.edu

arXiv:1904.09612v1 [cs.HC] 21 Apr 2019

ABSTRACT
Clinical decision support tools (DST) promise improved healthcare outcomes by offering data-driven insights. While effective in lab settings, almost all DSTs have failed in practice. Empirical research diagnosed poor contextual fit as the cause. This paper describes the design and field evaluation of a radically new form of DST. It automatically generates slides for clinicians’ decision meetings with subtly embedded machine prognostics. This design took inspiration from the notion of Unremarkable Computing, that by augmenting the users’ routines technology/AI can have significant importance for the users yet remain unobtrusive. Our field evaluation suggests clinicians are more likely to encounter and embrace such a DST. Drawing on their responses, we discuss the importance and intricacies of finding the right level of unremarkableness in DST design, and share lessons learned in prototyping critical AI systems as a situated experience.

CCS CONCEPTS
• Human-centered computing → User centered design;

KEYWORDS
Decision Support Systems, Healthcare, User Experience.

ACM Reference Format:
Qian Yang, Aaron Steinfeld, and John Zimmerman. 2019. Unremarkable AI: Fitting Intelligent Decision Support into Critical, Clinical Decision-Making Processes. In CHI Conference on Human Factors in Computing Systems Proceedings (CHI 2019), May 4–9, 2019, Glasgow, Scotland UK. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3290605.3300468

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-5970-2/19/05...$15.00

1 INTRODUCTION
The idea of leveraging machine intelligence in healthcare in the form of decision support tools (DSTs) has fascinated healthcare and AI researchers for decades. These tools often promise insights on patient diagnosis, treatment options, and likely prognosis. With the adoption of electronic medical records and the explosive technical advances in machine learning (ML) in recent years, now seems a perfect time for DSTs to impact healthcare practice.

Interestingly, almost all these tools have failed when migrating from research labs to clinical practice in the past 30 years [5, 7, 8]. In a review of deployed DSTs, healthcare researchers ranked the lack of HCI considerations as the most likely reason for failure [11, 22]. This includes a lack of consideration for clinicians’ workflow and the collaborative nature of clinical work. The interaction design of most clinical decision support tools instead assumes that individual clinicians will recognize when they need help, walk up and use a system that is separate from the electronic health record, and that they want and will trust the system’s output.

We are collaborating with biomedical researchers on the design of a DST supporting the decision to implant an artificial heart. The artificial heart, VAD (ventricular assist device), is an implantable electro-mechanical device used to partially replace heart function. For many end-stage heart failure patients who are not eligible for or able to receive a heart transplant, VADs offer the only chance to extend their lives. Unfortunately, many patients who received VADs die shortly after the implant [2]. In this light, a DST that can predict the likely trajectory a patient will take post-implant should help identify the patients who are most likely to benefit from the therapy.

We draw insight from a field study investigating the VAD decision processes, searching for opportunities where ML might help [25]. The findings revealed that clinicians are unlikely to encounter or to actively engage with a DST for help at the time and place of decision making. For most cases, they did not find the implant decision challenging; thus, they had no desire for computational support. In addition, the extremely hierarchical healthcare culture stratified senior physicians who make implant decisions and the
mid-level clinicians who use computers. Almost no VAD decision-making took place in front of a computer.

Embracing the rich context of the implant decision, we designed a radically new DST that automatically generates slides for the required decision meeting. The design embeds prognostic decision supports into the corner of decision meeting slides. We wanted decision makers to encounter the computational advice at a relevant time and place across the decision process, and we wanted this support to only slow them down for the few cases where the DST adds value to the decision. This design draws inspiration from Tolmie et al.’s notion of Unremarkable Computing, that technology needs to have the right level of remarkableness to valuably situate itself in people’s emerging routines and become the glue of their everyday lives [21].

This paper presents this DST’s interaction design as well as a field evaluation at three VAD implant hospitals. We also spoke with physicians working on clinical decisions outside of VAD implant, probing whether this design might generalize to other critical, clinical decisions. Our findings suggest that clinicians are more likely to encounter and embrace a DST that binds “unremarkable” decision supports with their current work routine. Drawing on clinicians’ responses, we discuss the importance and intricacies of finding the right level of unremarkableness in a DST design. We discuss lessons learned and unexpected challenges in evaluating critical AI systems as a situated experience.

This paper makes two contributions. First, we offer one concrete solution to the long-standing challenge of effectively situating DSTs in clinical practice. Second, we offer a rare description of clinicians’ responses to a DST situated in their workflow. This surfaced intriguing insights valuable for future investigations of critical AI systems.

2 RELATED WORK

Clinical Decision Support Tools in Practice
Clinical decision support tools (DSTs) are computational systems that support one of three tasks: diagnosing patients, selecting treatments, or making prognostic predictions of the likely course of a disease or outcome of a treatment [24]. This project focuses on clinician-facing, prognostic DSTs.

A significant strand of recent HCI work focused on critical issues in this area, including AI interpretability and fairness, data visualization, accuracy of risk communication, and more [17, 18, 20]. The significance of this body of work has led some to describe it as “the rise of design science in clinical DST research” [1]. These studies typically investigated DST in lab settings, using prototypes that are dedicated to a single clinical decision. Clinicians came out of their day-to-day workflow, used these systems for a pre-identified task, then provided feedback on the system design.

Despite success in labs, the vast majority of clinician-facing DSTs failed when moving to clinical practice. Clinicians rarely use them [4, 5, 23], and therefore they have shown no significant improvement on healthcare outcomes. Healthcare researchers identified a lack of HCI consideration, rather than poor technical performance, as the main cause of these failures [15, 19]. These HCI considerations include workflow integration, integration with social context, and clinicians’ lack of motivation to use a DST.

Relatively few research projects have focused on how to integrate DST output into clinical contexts. Within HCI there are initiatives to engage with real clinical users and contexts, yet the lack of meaningful access remains a major barrier. Researchers have shared that they were not allowed to evaluate incomplete designs, that evaluations took months or even years, that iterative design or repeated evaluation was not possible, and finally, that design evaluations had to be conducted by healthcare professionals rather than by HCI professionals [3, 9, 14].

VAD Decision-Making and Its Context
Due to many of the aforementioned barriers, investigations of the VAD implant decision making and field assessment of DST designs are rare. An exception is a field study conducted at three VAD programs [25]. Researchers made a number of observations that informed this work:

First, clinicians perceived no need for computational support; they considered most patient cases as textbook cases that follow a standard, systematic process of therapy escalation and a staged unfolding of decision considerations.

Second, clinicians made implant decisions during daily rounding of patient wards, during hallway conversations, and in multidisciplinary VAD decision meetings. Decisions were rarely discussed or made in front of a computer.

Finally, the clinical workplace culture was strongly hierarchical yet highly collaborative. Cardiologists and surgeons, who function at the top of the hierarchy, decided who gets classified as a difficult case and who gets discussed during the required multidisciplinary meeting which the whole VAD team attends. This cultural context poses a two-fold challenge for DST use. First, decision makers (physicians) and computer users (the mid-level clinicians, including nurse practitioners, social workers and VAD coordinators) rarely overlap at any point of the decision-making process. Second, physicians have great trust in their colleagues’ suggestions, much more so than in computational support.

3 DESIGN PROCESS AND RATIONALE
We set out to design a new form of DST for VAD patient selection to explore how to overcome the real-world adoption barriers that many prognostic DSTs face. Drawing upon prior work, we had two design goals:
1 - Embedding DST in current workflow: Clinicians, especially cardiologists and surgeons, need to naturally encounter the DST within their current decision-making workflow, because they are unlikely to recognize when they might need help and then walk up to a computer for help;

2 - Slowing down decision-making only when necessary: The DST outputs need to be easily ignored in most patient cases that are textbook. However, it should also be present enough to slow the decision-making down when there is a meaningful disagreement between the clinicians’ view and the DST’s view of the situation;

These orientations are very different from the convention of DST design in which decision supports are always available, waiting for clinicians to walk up and use at any point across the decision-making process. Instead, we wanted to tailor the DST for particular moments in the process, such that clinicians do not have to take pause and invent sequences of action anew. We wanted the DST to naturally augment the actions of decision making, rather than pulling the user away from doing their routine work.

Making Clinical DST Unremarkable
Tolmie et al. [21] introduced the notion of unremarkable computing when discussing how ubiquitous computing should arrive and create its place in people’s homes. They argued that technology can augment people’s actions in ways that have a wealth of significance but seem unremarkable, because its interactions are “so highly situated, so fitting, so natural”. They argued that home technology should not only be more intelligent, it should also be more subservient to people’s daily routines. In doing so, the technology becomes part of the routines, part of the very glue of their everyday life.

We draw connections between this ambition and our aforementioned design goals. We also draw connections between this notion of routine and VAD decision making. While these are daunting life-and-death decisions, the implant decisions are part of a work routine for clinicians. To fit a DST into their practice, we need to make it subservient to the day-to-day decision-making workflow they engage in.

We wanted to operationalize this idea of unremarkable technology in the context of critical, clinical decision making. This is a difficult goal because it requires the right level of “unremarkableness” such that the DST does not constrain clinicians’ decision making flow except when it needs to.

Design Process
To situate a DST into the current VAD decision-making routine, we first needed to identify a time and place where clinicians should naturally and impactfully encounter it. We chose the multidisciplinary patient evaluation meetings for a number of reasons. First, the meeting is a rare social touch point where most clinicians involved in the decision are present, and they are actively forming a collective decision about patient treatment. Second, it is one of the few decision points where a computer is present and being used. Third, decision meetings are common across hospital sites. VAD centers in the US are legally required to take a multidisciplinary approach to patient care, therefore regularly scheduled meetings are common. Globally, these meetings are also recommended [16]. Fourth, multidisciplinary meetings have become an increasingly common best practice in organ transplantation [13]. Designing DST for decision meetings therefore could potentially generalize beyond VADs to include a number of other clinical decisions.

Next, we considered how to fit the DST comfortably within the meetings. Drawing lessons from prior work [12, 25], we wanted to embed the DST into Electronic Medical Records (EMR) to minimize the effort needed from clinicians to type in patient information. We also wanted to augment clinicians’ paperwork to provide them additional motivation for adoption. We therefore integrated the DST output into a meeting slide generator, a system that automatically extracts patient information from EMR and populates slides for the decision meeting, which could be projected or printed.

We sketched what the DST predictions output might look like. We iterated on the design based on feedback of two collaborating clinicians (an attending cardiologist and a nurse practitioner). The final design was a small line chart that showed a patient’s predicted chance of survival (Figure 1). It also showed the most likely causes of death, such as right ventricular failure or renal failure.

We placed this chart in the top-right corner of the slide summarizing an individual patient’s current state. The subtlety was a deliberate choice toward achieving the right level of unremarkableness. In the most common case, when the DST agreed with the clinicians’ assessment, the visual display of the agreement could help clinicians gain trust in the system without slowing them down. In the rare case that the DST prediction conflicted with the clinicians’ assessment, the DST could slow the decision down. Everyone attending the meeting would see the disagreement. We speculated this would apply social pressure on the senior physicians to rationalize and articulate their decision making. We speculated it could also encourage the medical students, residents and other mid-level clinicians to participate in the discussion when they disagreed with the senior clinician’s decision. It could allow them to disagree by pointing to the conflict with the DST and not claiming that they personally knew more than the senior physician.

We worked out the detailed contents of the slide with the two collaborating clinicians. We also referenced the meeting printouts and workup checklists currently in use.
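To make the slide generator idea more concrete, the following is a minimal sketch of how one slide might be modeled: patient information pulled from the EMR fills the slide, and the prognostic output (a survival curve plus likely causes of death) occupies one small corner region. This is an illustration under stated assumptions, not the system built in this work. The class names, field names, the `build_slide` helper, and all values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Prognosis:
    # Hypothetical container for the DST output described in the text.
    survival_by_month: Dict[int, float]  # months after implant -> predicted survival probability
    likely_causes_of_death: List[str]    # e.g. "right ventricular failure"


@dataclass
class PatientSlide:
    # Hypothetical slide model: patient summary plus one small corner region.
    patient_summary: Dict[str, str]  # fields pulled from the EMR, so nobody retypes them
    top_right: Prognosis             # the subtle prognostic chart in the corner


def build_slide(emr_record: Dict[str, str], prognosis: Prognosis) -> PatientSlide:
    """Populate one decision-meeting slide from an EMR record (illustrative only)."""
    # Copy the record so later changes to the EMR data do not alter the slide.
    return PatientSlide(patient_summary=dict(emr_record), top_right=prognosis)


# Entirely synthetic example values, invented for this sketch.
slide = build_slide(
    {"current_state": "end-stage heart failure"},
    Prognosis({0: 1.0, 12: 0.8}, ["right ventricular failure", "renal failure"]),
)
print(slide.top_right.likely_causes_of_death[0])
```

The point of the sketch is the layout decision: the prognosis is a fixed, small part of a slide the team already looks at, rather than a separate tool that someone must open.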
Figure 1: The decision meeting slide design. We designed a DST that automatically generates decision-meeting slides for clinicians with subtly embedded machine prognostics at the top right corner.

We wanted to finalize the design by populating with real patient data. However, a variety of policies and legal regulations would not allow this. As a work-around, we asked our clinical collaborators to help us populate the slides with synthetic patient cases. Interestingly, they found it very challenging to generate a prototypical patient case including dozens of vital signs and test results. They instead selected elements across several of their former cases, removing identifiable demographic information and molding parts of the medical condition to disguise the identity.

In our final design (Figure 1), the DST outputs are in the top right corner of the slide, next to a summarized patient history visualization. Patient test results are categorized and put in the center. The patient demographics and links to social and financial evaluations are on the left.

4 DESIGN ASSESSMENT
We had several questions we wanted to answer with our assessment, including: (1) Would clinicians naturally encounter the DST within their current workflow? (2) Would clinicians accept computational decision support in the public context of the meeting? (3) Does placing the prediction in the corner present the right amount of unremarkability? Specifically, does the DST get ignored when its predictions align with the clinicians’ judgment, and would it slow decisions down when its output conflicts with clinicians?

Assessment in VAD Implant Centers
We gained access to three hospitals that regularly perform VAD implantation, all within the US. Two were sites from our formative field study and one was new. The facilities varied geographically and in scale. The smallest we studied performs about 40 VAD implants a year; the largest performs over 100.

We wanted to assess our design within the context of an actual implant decision meeting in order to observe whether it impacted discussion. Unfortunately, this proved to be impractical. None of the sites would allow us to present slides showing information for the patients they were currently implanting. All felt this could impact the life and death decision. The clinicians doing the VAD implants were quite busy. They would only agree to interact with a single design. They did not have the time for us to make revisions and then revisit. Finally, one of the sites had a specific policy preventing us from observing the decision meeting. They would only participate in one-on-one interviews.

In reaction to these restrictions, we re-designed the assessment process with the goal of making the most use of our participant pool within one round of assessment. We carried out all of the following procedures in hospital C. In hospital B, we carried out all except (3) presenting at a decision meeting. In hospital A, we carried out all procedures except (4) interviewing all physicians and surgeons.

(1) At each site, we first interviewed the mid-levels to understand their practice around the decision meeting, and to probe the DST design’s fit in their respective hospitals. When necessary, we adjusted the designs to fit the specific hospital’s routine practice;

(2) Our research collaborator at each site recommended one attending physician to be our confederate. We conducted interviews with them, discussing the DST design, and confirming there was no glaring mismatch between the design and the practice at their respective sites;

(3) The confederate physician presented the patient case with the DST on display in the decision meeting. We observed clinicians’ responses and discussions;

(4) Finally, we interviewed the rest of the VAD team to further individually discuss the DST design.

In total, we interviewed nine attending cardiologists or surgeons and eight mid-level clinicians. Each interview lasted for at least one hour. The DST design was presented in two hospitals’ multidisciplinary decision meetings. Field notes
were recorded using pen and paper. Interviews were audio-recorded and transcribed. We analyzed our data using affinity diagrams [10] and by performing thematic analysis.

Assessing Generalizability of the DST Design
We chose to situate the DST within slides used for decision meetings partially because these meetings are best practices in other critical medical domains as well. To gain some insights as to whether this design might generalize, we chose to probe a small set of clinicians from other medical domains who participate in these meetings.

To recruit these participants, we asked participants from the VAD study to help us identify other clinical domains and decisions that have interdisciplinary decision meetings. We then interviewed 6 physicians whose practices include decision meetings for pediatric surgery, pediatric critical care, adult cardio-thoracic surgery, internal medicine emergency care, orthopedic surgery, and obstetrics/gynecology. We audio-recorded, transcribed and analyzed these interviews using the same methods as we used for our VAD participants.

5 DESIGN ASSESSMENT FINDINGS
We first offer an overview of observations from the individual sites, describing the different cultures, facilities, and practices. We then report findings across the three sites related to the aforementioned assessment goals: the likelihood of encountering DST during decision-making, the acceptance of DST, the right level of remarkableness, and finally, generalizability to other kinds of medical decisions.

Overview
Hospital A was the least technologically advanced. They recently transitioned from paper-based to electronic clinical records. Phone signals generally did not penetrate the building, built in the late 1800s. Many common web services, such as Google search, were blocked on their internal network. The weekly meeting took place in a long, grandiose, turn-of-the-century board room. This contained one long, 40-seat table above which hung four large chandeliers. At one end of the table (the “head” of the table) there was a portable, low-resolution projector. They sat according to an unspoken seat chart based on clinical role hierarchy. Physicians sat near the head end, and a small group of these physicians would present the individual cases. Nurses sat near the middle. Social workers and others sat farthest from the projector. Only participants sitting near the head of the table participated in the discussion. The content on the projector screen was also too small to read since most sat far from the screen. They all interacted with bulky printouts of EMR data for each patient discussed.

Hospital B had in-house statisticians dedicated to outcome analysis and patient risk modeling. The physicians and this site were also actively involved in VAD risk modeling research. Interestingly, when it came to using a risk model to inform their own implant decisions, they described themselves as “very minimalist despite all these interests in ML.” Cardiologists and surgeons led implant decision making both within and outside of the implant meetings. Meeting participants did not vote on how to proceed. Hospital B did not provide us authorization to observe its decision meeting.

Hospital C was more technology-friendly. The meeting room had a large projector, which most participants could read. In addition, participants had access to a printout of the presented materials. One program manager and two mid-level clinicians arrived more than 40 minutes before the meeting to set up the computer, projector, and remote conference connections. As the presenting physicians spoke, a seasoned nurse practitioner operated the computer, pulling out and zooming into relevant patient information from EMR. This nurse practitioner had been performing this role for more than 5 years. Physicians and mid-levels used laptops to search for relevant information in the EMR or online and to add items to their digital to-do lists. Many more people engaged in discussing the patients. Following the discussion of each patient, all clinicians present voted on the next step.

Hospital C had previously experimented with bringing computational predictions into their meetings. Cardiologists chose a model that had been nationally validated through five randomized clinical trials. They had a nurse practitioner input all of the data for each patient discussed and show the DST prediction in the decision meeting. One year later, they stopped this practice because two recent journal articles reported that the models used were “horribly mis-calibrated”. “That was a lot of work to type in all that sh-t and generate that number, and that’s not that helpful.” Their EMR held four other implant outcome prediction models, which predicted things such as the chance of depression. However, the clinicians never used these models, stating that each required manual entry of all of a patient’s data.

Likelihood of Encountering DST in Workflow
Our observations suggested that most clinicians involved in the VAD implant decision would likely encounter the DST output if it was included as part of an individual patient’s information presented at the decision meeting. All three facilities hosted a weekly implant decision meeting. Clinicians of all ranks and roles attended, ranging from seasoned surgeons to residents, to nurse practitioners to social workers to palliative care coordinators. Although the weight that the meetings carried for influencing an implant decision appeared to vary across the three sites, the occurrence
of the meetings was one of the few events that happened everywhere.

These meetings offered one of the extremely few situations where senior clinicians actively discussed decisions in proximity of a computer. Meetings in all three hospitals had a shared computer projecting patient information. Two hospitals projected dedicated meeting materials. The other projected patient profiles from the EMR. Clinicians described the other key decision points as “just talk on the fly” with no EMR access or paper records in hand. The other decision points most often only included attending physicians and surgeons. “Everything is happening live.” Mid-level clinicians, who spend more time with each individual patient, did not participate in the decisions made outside of the meeting.

Acceptance of DST in Decision Meetings
None of our interview participants expressed any resistance to including DST output within the context of the decision meeting. One site (Hospital C) had already made the effort to manually include DST data into their meeting but had abandoned this practice due to their loss of confidence in its quality. Seasoned physicians and surgeons voiced their appreciation for what a prognostic DST might bring, stating that it would “give its perspective” and offer a chance for an “occasional recalibration.” Clinicians also shared that making an objective decision could sometimes be hard. The decision to not implant was usually a death sentence for a patient. “When I really like this patient, really want to help him or her, it sometimes helps to get a more factual view.”

Seasoned physicians shared that their dream DST should play a role similar to mid-level clinicians. They should provide additional context for the seasoned physicians’ decision. The DST could provide additional context and a different perspective to the senior physicians. They recognized the value a DST might bring from its statistical consideration across many cases. “The value is you are looking at thousands of cases, I’m looking at 100 and overweighting the last three I saw.” They also shared that input from mid-levels was not always “taken really into account”.

Mid-levels agreed they only inform and support the discussions. They did not make decisions.

    My role in selecting patients for VAD... hmm. I don’t select patients. But I do talk about it... We are there to help discuss patients. (Nurse practitioner, B2)

Mid-level clinicians enthusiastically welcomed the idea of a decision meeting slide generator. They envisioned a number of possible benefits. They shared that the slide generator would automate work that is not currently billable. At hospitals A and B, meeting slides were prepared by staff who had little to no medical training. Physicians could get frustrated with the result, characterizing the unfiltered materials as being prepared by “amateurs.” These staff members could not personalize patient presentations because they could not risk skipping information that might prove to be critical. Mid-levels felt they could benefit from the automation and seasoned physicians felt they would benefit from the removal of the copious, irrelevant data being pulled out of the EMR.

Mid-level clinicians viewed the slides as a potentially important vehicle for communicating their opinions to physicians. In all three hospitals, senior physicians set the agenda for decision meetings. They decided which patients to present, and during the meeting, they called out the information that they felt was important enough to discuss. This hierarchical culture was well captured by the design of a custom patient review tool at hospital C. Two VAD coordinators customized a patient review dashboard within EMR in order to help themselves better track medical tests and share results within the team. Although cardiologists and surgeons rarely used the tool, they controlled which pieces of information could be placed on the dashboard and which elements would not be included when the patient case was classified as urgent.

Mid-levels often doubted that their voice was heard or that their expertise was considered. They were hesitant to directly disagree with a physician. They described the situation as more complicated than just the power dynamics. They shared that the cardiologists were incentivized to implant more patients and to implant sicker patients. They found themselves often advocating for patient mortality (let the patient die). Mid-levels felt their opinions focused on post-implant quality of life. Unlike the physicians, mid-levels worked intimately “with all the problems that can come from a patient that maybe shouldn’t have been implanted.” They noted there was no right or wrong answer between length of life and quality of life. They shared it was often hard to argue with great confidence that letting patients die was better than offering them a small chance to live. In such situations, mid-levels frequently cited “you never know what will happen” as a reason not to pursue further discussion with attending physicians. Some shared that over time, they had slowly removed themselves from the decision making.

    There is risk stratification for each patient, but I don’t know... It’s like, we talk about it, but I don’t know if it’s really taken really into account. (Nurse practitioner, B2)

Mid-levels consider the ability to organize the contents of meeting slides as one way to increase their influence. Meeting slides provide additional, visual presence they could use in support of the facts they felt were important. This would make it less like they were only sharing an opinion with the physicians. The meeting slides could be facts in a space where only the seasoned physicians’ opinions carried any weight. They felt the formality the meeting slides carried
Unremarkable AI CHI 2019, May 4–9, 2019, Glasgow, Scotland Uk

was unparalleled to any other artifact they had access to. A prognostic DST that indicates post-surgery quality of life could potentially amplify their voices.

There is not a way to present (my reasoning) formally. It’s just me saying: ‘This, this and this’. [...] I think it’s good to have something visual for anybody to see. It’s like, OK. LOOK. Let’s slow down a bit here. (Nurse practitioner)

Intricacies of Making DST Unremarkable

Both seasoned physicians and mid-levels expressed appreciation for DSTs that could slow them down “only when necessary”. They liked this aspect of our design. However, we could not easily conclude whether our specific design had achieved this goal. Instead, clinicians’ discussions and questions, which we will soon describe, depicted many unexpected intricacies in this notion of the “right” level of unremarkableness.

Challenges of Engaging Synthetic Patient Cases via Data. Clinicians shared that they could not draw on their experience of making critical clinical decisions when seeing only patient data on paper. This presented the biggest barrier to assessing how clinicians might respond to a conflicting DST prediction.

Patient history data alone did not give clinicians enough confidence to make an implant decision. Physicians described the meeting data as merely a surrogate for the actual patient. The data did not allow them to see patients “as a whole.” They stressed that to understand a patient clinically, they needed to “look at the patient, talk to the patient, take care of the patient.” Social workers shared that they had not met with this patient nor talked to their family. In our field evaluation, presentations of the synthetic patient cases were always followed by a long, awkward silence.

A very sick but highly motivated patient can do better than their illness would otherwise be left them, compared to a less sick, less motivated patient. These things are hard to capture. The eyeball tests. (Surgeon, B6)

Clinicians also had wildly different readings of the same DST prognostics. We presented the same two synthetic cases with the same implant survival predictions to all participants. Interestingly, they generated wildly different reactions and interpretations of the cases. Some viewed the survival estimate as implying that an implant would not work. “Gee... VAD is futile here.” Others viewed the DST output as implying the patient should be immediately implanted, before things got worse. “We still have a chance.” A few clinicians believed that all VAD implant candidates would have a similar prognosis as the synthetic case we presented: “This chart is meaningless. Every VAD candidate’s projection would look like this.”

That the data was based upon synthetic patient cases made any real discussion about the patient even more difficult. Instead, clinicians started to focus on the DST prognostics. They probed on where the model comes from. It took a long time for us to explain the data source and the ML mechanism to clinicians with no ML experience and without a deep understanding of statistics. It took even longer to explain it to clinicians with statistical depth and ML experience. They fixated on the fact that the ML systems’ performance was not the focus of our assessment. The synthetic patient data often turned this into an assessment of the DST’s quality in the minds of many meeting participants.

Is the Model Validated by Clinical Trials? Clinicians commonly expressed a need to know more about the model’s source and credibility. When they learned that the model presented had not been rigorously validated through clinical trials and published in prestigious clinical journals, they suggested we were wasting their time. Physicians and surgeons considered discussing an unvalidated model unethical; as misleading as “looking at a crystal ball”. Others tended to judge DST quality based on the journal it was published in.

Physicians also desired a model that had been validated with data from their own hospital. “It’s better to be homegrown.” Models should be published in a good journal and then validated in a national scale study across several implant centers. Some suggested including links to the peer reviewed clinical trial within the DST output on the slide. It “lends a lot of weight to a clinical model”.

Are the Predictions Based on Clinicians’ Best Efforts? Physicians highlighted that the predictive models, regardless of how well they measure medical uncertainties, would never replace human, clinical decision-making. They viewed their own decision making as focused on managing and reducing uncertainties. “If we think that we will be able to tell everybody what to do based on a model, we ignore the fact that we also have tools and mechanisms for dealing with the uncertainty that is inherent when putting VADs in patients.” (Cardiologist)

Many clinicians’ questions, as well as their discussion around the DSTs, revealed a tension between what they saw as the DST’s static view of patient conditions and the clinicians’ desire and ability to also focus on future actions and interventions. They wanted to know which modifiable factors most influenced the DST predictions. They wanted to be able to offer treatments that could improve these factors, thus increasing the likelihood of a positive surgical outcome at some time in the future.

These predictions are (what will happen) despite our best efforts, right? (VAD manager, C8)

Having an understanding of what’s driving the risk [features that most influence the prediction] is very important for us to understand what is modifiable at that patient. [...] Is it age or something we cannot change?
Otherwise there is a lot of potential here. (Hospital C decision meeting)

Clinicians did not seem to actively make the subtle but critical distinction between features that were important to predicting an outcome and features that are causal to that outcome. For example, an observation that people are carrying umbrellas can be used to predict that it will rain. However, taking people’s umbrellas away will not prevent rain. ML systems make predictions based on covariance of features. They do not assess the causality of those features. When prompted, clinicians claimed that this distinction is “absolutely important”. However, in our conversations, we did not observe them distinguishing ML predictions from general statistics. They seemed to strongly believe DSTs should be able to distinguish causality from prediction and that they should present only causal features. “This is the whole point of statistical processes. A DST model should address that, right?”

There was a sense that if the DST predictions were not based on causal factors, then the predictions should not be presented at all. Clinicians described differentiating correlation (prediction) versus causality as a central part of their decision making. For example, many patients being evaluated for left-ventricular VAD also have right-ventricular heart failure. An important decision cardiologists must make is whether the heart failure on the right was caused by the left heart failure or if it is independent. Will fixing only the left side also fix the right? Currently, clinicians speculate by probing patients with medication. They try different left heart medications and observe how the right side responds. Clinicians wanted help: “If you can help us understand [...] which factors seem to be most dominant, or most closely associated with certain outcomes, then that helps.” They wanted to know the causal links and features for individual cases.

Are Data-Driven Prognostics Facts OR Predictions? Clinicians frequently asked us to clarify whether DST prognostics are predictions that carry agency and subjectivity, or facts rooted in historic data. We sensed they wanted to limit discussions to facts, including how heart failure has played out for the patient they were treating and the statistics from previous, similar cases. We observed resistance from some clinicians toward the idea of showing predictions. Our collaborating physicians, who created the synthetic cases and helped us select contents for the slides, suggested that the DST output should be “one statistical representation of 100 patients who are similar to him” rather than a prediction for this individual patient.

I think if you continue to call it “VAD projections” 65%, people are going to poke holes at it. They are gonna try to prove you wrong. This [DST projection] is just what the historical outcomes were. But this guy is different, this guy has his own things that make him special. (Collaborating cardiologist, hospital A)

Are the Predictions Individual Medicine OR Population Medicine? Most clinicians shared that they thought of DST output as an “average”. They seemed to find the notion of personalized predictions difficult to grasp. Some voiced strong concerns that using DST was the same as applying “populational statistics” to individual patient decision making. They felt this was unethical. Others proposed that “instead of having one model that we apply to the entire population, we would have a group of models. Those models predict for that group of patients.” (Surgeon, B4)

What Does “Now” Mean in DST Predictions? The DST visualized the patient outcome predictions, including life expectancy, estimated time until right heart failure, and likely cause of death. For example, Figure 1 shows that the patient’s post-implant life expectancy is 21 days if a VAD were implanted now, under the condition shown on the slides. Clinicians were confused by this notion of “now” because it was extremely unlikely that they would implant a patient on the same day as the decision meeting. “Is that 21 days from today? If we are gonna lose the patient in 21 days [21 days following after implant], can we just wait?”

DSTs Do Not Account For the X Factors. Clinicians said that the DST would only ever be one factor in their decision because of “X factors”: the many factors beyond a patient’s condition that impact the implant decision. One X factor they spoke of was O/E ratio (observed-to-expected mortality ratio). The O/E ratio is a rating that measures the surgeon and care teams’ performance. Surgeons cared about keeping a high rating. They described the implant decision for high-risk patients as “taking on new O/E ratio debts.” This seemed to strongly influence whether they take on another high-risk patient. It seemed to depend strongly on how many patients had recently had poor outcomes.

It’s not that we don’t help that [VAD candidate] patient, but if we take this shot and do poorly, then we cannot take on the next 10 patients like him. Because now we got too much of a cluster of high-risk patients who’ve done poorly, then we have to do some lower risk ones before we can go back up [in O/E ratings]. Insurance companies and Medicare and all that... they will mark you. They may not pay. It all plays into the complex factor for deciding who, especially sicker patients, we would take a shot. (Surgeon, B6)

Some surgeons described that, for some cardiac surgeries that have officially defined models used to rate surgeons and care teams, their decision meetings had become centered around risk models. This is not yet the case for VAD implants.
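The “O/E ratio debt” that surgeons described can be made concrete with a small worked example. The sketch below assumes the conventional definition of the ratio (observed deaths divided by the sum of risk-model mortality predictions), under which a ratio below 1 means fewer deaths than the risk model expected, so a good rating corresponds to a low ratio. The `Case` type, the `oe_ratio` function, and every risk value are hypothetical and chosen only for illustration; they are not the rating scheme used at the study sites.

```python
from dataclasses import dataclass


@dataclass
class Case:
    predicted_mortality: float  # hypothetical risk-model probability of death (0 to 1)
    died: bool                  # observed outcome


def oe_ratio(cases):
    """Observed deaths divided by the sum of predicted mortality risks."""
    observed = sum(1 for c in cases if c.died)
    expected = sum(c.predicted_mortality for c in cases)
    if expected == 0:
        raise ValueError("expected mortality is zero; O/E ratio is undefined")
    return observed / expected


# Hypothetical caseload: ten lower-risk implants, one of which ended in death.
caseload = [Case(0.10, False) for _ in range(9)] + [Case(0.10, True)]
baseline = oe_ratio(caseload)  # about 1.0: outcomes match the model's expectation

# The team takes on one high-risk patient (50% predicted mortality) who dies.
with_high_risk = caseload + [Case(0.50, True)]
after_poor_outcome = oe_ratio(with_high_risk)  # about 1.33: worse than expected

# Ten further lower-risk patients who all survive bring the ratio back down.
recovered = oe_ratio(with_high_risk + [Case(0.05, False) for _ in range(10)])

print(f"baseline={baseline:.2f} after={after_poor_outcome:.2f} recovered={recovered:.2f}")
```

With these made-up numbers, a single poor outcome in a high-risk patient moves the ratio noticeably, and it takes a run of lower-risk cases with good outcomes to undo the change. This mirrors the surgeon's remark about having to “do some lower risk ones” before taking another shot.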
Generalizability Beyond VAD

Our interviews with clinicians outside of VAD centers showed that multidisciplinary decision meetings take place across many clinical domains for some of their most aggressive interventions. They are also referred to as internal medicine panel meetings, tumor boards, or floor meetings (referring to meetings between critical and general care physicians). These meetings happen widely because, for patients who are very sick and are being considered for a last-option surgical intervention, the illness has usually involved multiple organs. Treating them requires physicians from multiple clinical domains. Multidisciplinary meetings therefore occurred naturally.

Esophageal cancer, COPD, diabetes, cystic fibrosis, LITERALLY everything in psychiatry, gastric bypass, end stage renal disease, hernia repair, syndromes like Down and Turner, any disease that requires management with meds with nasty side effects, and even emergency room situations to expedite processes. Any of the above diseases the approach has to be multidisciplinary almost by definition because they affect multiple systems and usually but not always the last option is a surgical intervention. (Pediatric surgeon)

6 DISCUSSION: DESIGNING AND EVALUATING DST AS A SITUATED EXPERIENCE

Clinical DSTs, despite compelling evidence of their effectiveness in labs, have mostly failed when moving out of labs and into healthcare practice [15, 19]. A lack of contextual integration in the design of these systems played a critical role in these repeated failures. Prior work suggests that the current interaction convention, in which clinicians recognize their own need for a DST’s help and then walk up and use a system separate from the EMR, is not likely to work [25].

There is a real need to design DSTs not only as a functional utility but as an integrated experience. Their effectiveness should be measured not only by prediction accuracy, but by effectiveness when situated within their social and physical context, such as workplace culture and social structures. This presents exciting new opportunities and challenges to HCI and UX research.

Our design makes three dependent proposals about making a DST a situated VAD decision making experience. First, we propose that the decision meeting presents a good time and place. Second, assuming the meeting is the right time and place, we propose that situating the DST output into the meeting slides would offer an effective form. Third, assuming that having the DST as part of the slides is a good form, we propose that the DST plays a fairly unremarkable role in clinician decision making by appearing in one corner. We claim it needs to be easily passed over when it agrees with current decision making and that it must only be present enough to slow decision making down when its predictions are in conflict with a seasoned physician’s suggested course of action. All three proposals aimed to naturally augment the current activities of decision making, rather than pulling clinicians away from doing their routine.

Below, we discuss the design implications of these proposals. We then share challenges encountered and lessons learned in evaluating the DST as a situated experience.

Designing DST to Augment Clinical Routine

Time and Place. Findings of this work suggested that DSTs may more effectively fit into clinical practice if their interactions are tailored for a specific time and place within the current decision-making workflow. Taking lessons from prior HCI work, we should not only make AI more intelligent, but make it highly situated in people’s routines. In doing so, AI can become part of the decision-making routines, part of the very glue of clinicians’ everyday work.

Our assessment findings largely suggest that decision meetings are a routine activity that is promising for DST integration, for several reasons:

(1) The meeting is part of an existing clinical decision-making routine. Clinicians therefore would naturally encounter the DST at the meeting;
(2) The meeting is a socially aggregated decision point. The DST could therefore leverage mid-level clinicians to advocate for its information and value to the decision makers;
(3) The meeting offers a moment of deliberation in their otherwise fast-moving decision-making workflow. The meetings offer clinicians time to collectively digest the implications of the prognostics;
(4) Finally, the meeting is a best practice promoted globally in VAD patient care, and across several clinical domains. Therefore this DST design could potentially make its place across diverse practices in different hospital sites and domains.

Decision meetings represent only one way of integrating DSTs into clinical practice. Similar opportunities may lie in other times and places in existing clinical decision-making routines that are socially aggregated, deliberative, and shared across hospitals. Future research should advance this work by systematically searching for such opportunities.

Interaction Form. Besides situating the DST in decision-making routine, we also motivated mid-level clinicians’ use by preparing patient information for the decision meetings for them. Our field study suggested this was a useful tactic. DSTs supporting various clinical decisions can potentially automate tedious information retrieval tasks for clinicians to offer additional motivations for adoption.
Designing a Right Level of Unremarkableness

The walk-up-and-use convention of current DSTs assumes clinicians will know when they need help. Our design challenges this convention by proposing the notion of Unremarkable AI. Our unremarkable DST is designed to be situated naturally in an existing decision-making routine and only noticed when it might add value to the decision. DST’s interaction should have a right level of unremarkableness, yet the information it provides should significantly impact care.

Our field assessment illustrated some positive indicators that making DSTs unremarkable helped reduce the resistance clinicians commonly show towards clinical DSTs [6, 19, 25]. For example, we did not observe clinicians feeling threatened or feeling they might be replaced by the technology. Clinicians appreciated that DSTs could inform their discussions, “though the discussion is unlikely to center around the DST.”

While our DST was visually unremarkable, its very existence seems to be, to an extent, transforming clinical decision making. It introduced predictions into a culture rooted in facts and statistical significance. Moreover, when predictive risk models were used officially to measure patient risk and clinician skills, clinicians’ decision making became centered around these models. DSTs substantialized their performance pressure in decision making.

These observations forced us to take a step back and ask: What is the preferred role for DST to play in clinical practice? Where does a right level of unremarkableness lie? More research is needed to find the right balance between DST augmenting decision-making in natural and intuitive ways and transforming the nature of clinical decision-making. Understanding these tradeoffs should be a critical research question in DST design and research.

Experience Prototyping DST In-Situ

Restricted access to the clinical environment is known to impose fundamental challenges to iterative UX design and evaluation. Our experience of conducting the field assessment echoed this. Upon reflection, we identified several tactics effective at reducing the risks of our one-shot design evaluation:

(1) Designing a generalizable DST: The workflows and social contexts in clinical practices are complex and highly divergent across hospitals. Therefore, generalizability is a necessity for many DST designs. This work took a step further than hospital-site generalizability, designing a DST that can work for a class of structurally similar decisions (data-intensive, last-option surgical interventions). Such a DST’s design and evaluation can be more productive than those dedicated to one specific clinical decision or to specific DST models;
(2) Designing the evaluation methods for describing and unpacking the complex, subtle, and multi-faceted nature of experience, rather than explicitly measuring it;
(3) Using prototypes rather than functioning DST models. This allowed us to probe various possible DST outputs and to easily adjust our prototype to incorporate participant feedback.

The Impossibility of Experience Prototyping Critical DSTs

Nonetheless, we encountered additional, seemingly inevitable challenges of assessing DST’s situated user experience. For example, whether a DST design has indeed achieved a right level of unremarkableness was impossible to assess without real patient data and fully functioning ML systems.

Clinicians need more than just synthetic patient cases to connect with their own decision making. We speculate that clinicians need to see one of their own patients’ data to really assess the DST information design and to see what an actual prediction would look like. This means early DST prototypes will need actual patient data to assess their interactions in context and their impact on care. This is currently impossible in critical clinical cases due to ethics, policies and hospital regulations.

Clinicians were unable to engage in a group discussion without a fully functioning ML system. Clinicians described using an unvalidated DST as unethical and misleading. They suggested that a DST should be validated via randomized clinical trials on both retrospective patients and prospective patients, both at a national level and on their own hospital’s patient population. This gives rise to a chicken-and-egg problem in our design assessment: Clinicians could not effectively assess the DST design without a working DST that has been validated on prospective patients; and validating a DST on prospective patient data requires a DST design that has been proven effective.

We suspect these challenges are likely to occur not only in evaluating DSTs for artificial heart implant, but in assessing DSTs for many other critical, high-consequence decisions as well. As data-driven DSTs increasingly move out of research labs and into critical decision making in the real world, we encourage DST designers and researchers to join in making these challenges explicit and investigating new design assessment methods and tools to address them.

7 ACKNOWLEDGEMENT

This work was supported by grants from NIH, National Heart, Lung, and Blood Institute (NHLBI) # 1R01HL122639-01A1. The first author was also supported by the Center for Machine Learning and Health (CMLH) Fellowships in Digital Health. We thank the participants in this work for their dedication, time and valuable inputs.
REFERENCES

[1] David Arnott and Graham Pervan. 2014. A critical analysis of decision support systems research revisited: the rise of design science. Journal of Information Technology 29, 4 (01 Dec 2014), 269–293. https://doi.org/10.1057/jit.2014.16
[2] Raymond L Benza, Dave P Miller, Robyn J Barst, David B Badesch, Adaani E Frost, and Michael D McGoon. 2012. An evaluation of long-term survival from time of diagnosis in pulmonary arterial hypertension from the REVEAL Registry. CHEST Journal 142, 2 (2012), 448–456.
[3] David Coyle and Gavin Doherty. 2009. Clinical Evaluations and Collaborative Design: Developing New Technologies for Mental Healthcare Interventions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’09). ACM, New York, NY, USA, 2051–2060. https://doi.org/10.1145/1518701.1519013
[4] Srikant Devaraj, Sushil K Sharma, Dyan J Fausto, Sara Viernes, and Hadi Kharrazi. 2014. Barriers and Facilitators to Clinical Decision Support Systems Adoption: A Systematic Review. Journal of Business Administration Research 3, 2 (2014), p36.
[5] Glyn Elwyn, Isabelle Scholl, Caroline Tietbohl, Mala Mann, Adrian GK Edwards, Catharine Clay, France Légaré, Trudy van der Weijden, Carmen L Lewis, Richard M Wexler, et al. 2013. “Many miles to go...”: a systematic review of the implementation of patient decision support interventions into routine clinical practice. BMC medical informatics and decision making 13, Suppl 2 (2013), S14.
[6] Karine Gravel, France Légaré, and Ian D Graham. 2006. Barriers and facilitators to implementing shared decision-making in clinical practice: a systematic review of health professionals’ perceptions. Implement Sci 1, 1 (2006), 16.
[7] Monique WM Jaspers, Marian Smeulers, Hester Vermeulen, and Linda W Peute. 2011. Effects of clinical decision-support systems on practitioner performance and patient outcomes: a synthesis of high-quality systematic review findings. Journal of the American Medical Informatics Association 18, 3 (2011), 327–334.
[8] Kensaku Kawamoto, Caitlin A Houlihan, E Andrew Balas, and David F Lobach. 2005. Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success. BMJ 330, 7494 (2005), 765.
[9] Leah Kulp and Aleksandra Sarcevic. 2018. Design In The “Medical” Wild: Challenges Of Technology Deployment. In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems (CHI EA ’18). ACM, New York, NY, USA, Article LBW040, 6 pages. https://doi.org/10.1145/3170427.3188571
[10] Bill Moggridge. 2007. Designing interactions. Vol. 14.
[11] Mark A Musen, Blackford Middleton, and Robert A Greenes. 2014. Clinical decision-support systems. In Biomedical informatics. Springer, 643–674.
[12] Annette M O’Connor, John E Wennberg, France Legare, Hilary A Llewellyn-Thomas, Benjamin W Moulton, Karen R Sepucha, Andrea G Sodano, and Jaime S King. 2007. Toward the ‘tipping point’: decision aids and informed patient choice. Health Affairs 26, 3 (2007), 716–725.
[13] Brindha Pillay, Addie C Wootten, Helen Crowe, Niall Corcoran, Ben Tran, Patrick Bowden, Jane Crowe, and Anthony J Costello. 2016. The impact of multidisciplinary team meetings on patient assessment, management and outcomes in oncology settings: a systematic review of the literature. Cancer treatment reviews 42 (2016), 56–72.
[14] Kate Sellen, Dominic Furniss, Yunan Chen, Svetlena Taneva, Aisling Ann O’Kane, and Ann Blandford. 2014. Workshop Abstract: HCI Research in Healthcare: Using Theory from Evidence to Practice. In CHI ’14 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’14). ACM, New York, NY, USA, 87–90. https://doi.org/10.1145/2559206.2559240
[15] Dean F Sittig, Adam Wright, Jerome A Osheroff, Blackford Middleton, Jonathan M Teich, Joan S Ash, Emily Campbell, and David W Bates. 2008. Grand challenges in clinical decision support. Journal of Biomedical Informatics 41 (2008), 387–392.
[16] Mark S. Slaughter, Francis D. Pagani, Joseph G. Rogers, Leslie W. Miller, Benjamin Sun, Stuart D. Russell, Randall C. Starling, Leway Chen, Andrew J. Boyle, Suzanne Chillcott, Robert M. Adamson, Margaret S. Blood, Margarita T. Camacho, Katherine A. Idrissi, Michael Petty, Michael Sobieski, Susan Wright, Timothy J. Myers, and David J. Farrar. 2010. Clinical management of continuous-flow left ventricular assist devices in advanced heart failure. The Journal of Heart and Lung Transplantation 29, 4, Supplement (2010), S1–S39. https://doi.org/10.1016/j.healun.2010.01.011
[17] Nicole Sultanum, Michael Brudno, Daniel Wigdor, and Fanny Chevalier. 2018. More Text Please! Understanding and Supporting the Use of Visualization for Clinical Text Overview. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18). ACM, New York, NY, USA, Article 422, 13 pages. https://doi.org/10.1145/3173574.3173996
[18] Alan R Tait, Terri Voepel-Lewis, Brian J Zikmund-Fisher, and Angela Fagerlin. 2010. The effect of format on parents’ understanding of the risks and benefits of clinical research: a comparison between text, tables, and graphics. Journal of health communication 15, 5 (2010), 487–501.
[19] Svetlena Taneva, Waxberg Sara, Goss Julian, Rossos Peter, Nicholas Emily, and Cafazzo Joseph. 2014. The Meaning of Design in Healthcare: Industry, Academia, Visual Design, Clinician, Patient and Hf Consultant Perspectives. In Proceedings of the Extended Abstracts of the 32nd Annual ACM Conference on Human Factors in Computing Systems (CHI EA ’14). ACM, New York, NY, USA, 1099–1104. https://doi.org/10.1145/2559206.2579407
[20] Danielle Timmermans, Bert Molewijk, Anne Stiggelbout, and Job Kievit. 2004. Different formats for communicating surgical risks to patients and the effect on choice of treatment. Patient education and counseling 54, 3 (2004), 255–263.
[21] Peter Tolmie, James Pycock, Tim Diggins, Allan MacLean, and Alain Karsenty. 2002. Unremarkable Computing. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’02). ACM, New York, NY, USA, 399–406. https://doi.org/10.1145/503376.503448
[22] Robert L Wears and Marc Berg. 2005. Computer technology and clinical work: still waiting for Godot. JAMA 293, 10 (2005), 1261–1263.
[23] Jeremy C Wyatt and Douglas G Altman. 1995. Commentary: Prognostic models: clinically useful or quickly forgotten? BMJ 311, 7019 (1995), 1539–1541.
[24] Qian Yang, John Zimmerman, and Aaron Steinfeld. 2015. Review of Medical Decision Support Tools: Emerging Opportunity for Interaction Design. In IASDR 2015 Interplay Proceedings.
[25] Qian Yang, John Zimmerman, Aaron Steinfeld, Lisa Carey, and James F. Antaki. 2016. Investigating the Heart Pump Implant Decision Process: Opportunities for Decision Support Tools to Help. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16). ACM, New York, NY, USA, 4477–4488. https://doi.org/10.1145/2858036.2858373