Interactive Task and Concept Learning from Natural Language Instructions and GUI Demonstrations
Toby Jia-Jun Li¹, Marissa Radensky², Justin Jia¹, Kirielle Singarajah¹, Tom M. Mitchell¹, Brad A. Myers¹
¹ Carnegie Mellon University, Pittsburgh, PA   ² Amherst College, Amherst, MA

arXiv:1909.00031v2 [cs.HC] 7 Jan 2020

* To be presented at the AAAI-20 Workshop on Intelligent Process Automation (IPA-20). An earlier version of this paper was published and presented at ACM UIST 2019.

Abstract

Natural language programming is a promising approach to enable end users to instruct new tasks for intelligent agents. However, our formative study found that end users would often use unclear, ambiguous or vague concepts when naturally instructing tasks in natural language, especially when specifying conditionals. Existing systems have limited support for letting the user teach agents new concepts or explaining unclear concepts. In this paper, we describe a new multi-modal domain-independent approach that combines natural language programming and programming-by-demonstration to allow users to first naturally describe tasks and associated conditions at a high level, and then collaborate with the agent to recursively resolve any ambiguities or vagueness through conversations and demonstrations. Users can also define new procedures and concepts by demonstrating and referring to contents within GUIs of existing mobile apps. We demonstrate this approach in PUMICE, an end-user programmable agent that implements this approach. A lab study with 10 users showed its usability.

Introduction

The goal of end user development (EUD) is to empower users with little or no programming expertise to program (Paterno and Wulf 2017). Among many EUD applications, a particularly useful one would be task automation, through which users program intelligent agents to perform tasks on their behalf (Maes 1994). To support such EUD activities, a major challenge is to help non-programmers to specify conditional structures in programs. Many common tasks involve conditional structures, yet they are difficult for non-programmers to correctly specify using existing EUD techniques due to the great distance between how end users think about the conditional structures, and how they are represented in programming languages (Pane, Myers, and others 2001).

According to Green and Petre's cognitive dimensions of notations (Green and Petre 1996), the closer the programming world is to the problem world, the easier the problem-solving ought to be. This closeness of mapping is usually low in conventional and EUD programming languages, as they require users to think about their tasks very differently from how they would think about them in familiar contexts (Pane, Myers, and others 2001), making programming particularly difficult for end users who are not familiar with programming languages and "computational thinking" (Wing 2006). To address this issue, the concept of natural programming (Myers et al. 2017; Myers, Pane, and Ko 2004) has been proposed to create techniques and tools that match more closely the ways users think.

Natural language programming is a promising technique for bridging the gap between user mental models of tasks and programming languages (Mihalcea, Liu, and Lieberman 2006). It should have a low learning barrier for end users, under the assumption that the majority of end users can already communicate procedures and structures for familiar tasks through natural language conversations (Lieberman and Liu 2006; Pane, Myers, and others 2001). Speech is also a natural input modality for users to describe desired program behaviors (Oviatt 1999b). However, existing natural language programming systems are not adequate for supporting end user task automation in domain-general tasks. Some prior systems (e.g., (Price et al. 2000)) directly translate user instructions in natural language into conventional programming languages like Java. This approach requires users to use unambiguous language with fixed structures similar to those in conventional programming languages. Therefore, it does not match the user's existing mental model of tasks, imposing significant learning barriers and high cognitive demands on end users.

Other natural language programming approaches (e.g., (Azaria, Krishnamurthy, and Mitchell 2016; Fast et al. 2018; Kate, Wong, and Mooney 2005; Srivastava, Labutov, and Mitchell 2017)) restricted the problem space to specific task domains, so that they could constrain the space and the complexity of target program statements in order to enable the understanding of flexible user utterances. Such restrictions are due to the limited capabilities of existing natural language understanding techniques – they do not yet support robust understanding of utterances across diverse domains without extensive training data and structured prior knowledge within each domain.
Another difficult problem in natural language programming is supporting the instruction of concepts. In our study (details below in the Formative Study section), we found that end users often refer to ambiguous or vague concepts (e.g., cold weather, heavy traffic) when naturally instructing a task. Moreover, even if a concept may seem clear to a human, an agent may still not understand it due to the limitations in its natural language understanding techniques and pre-defined ontology.

In this paper, we address the research challenge of enabling end users to augment domain-independent task automation scripts with conditional structures and new concepts through a combination of natural language programming and programming by demonstration (PBD). To support programming for tasks in diverse domains, we leverage the graphical user interfaces (GUIs) of existing third-party mobile apps as a medium, where procedural actions are represented as sequences of GUI operations, and declarative concepts can be represented through references to GUI contents. This approach supports EUD for a wide range of tasks, provided that these tasks can be performed with one or more existing third-party mobile apps.

We took a user-centered design approach, first studying how end users naturally describe tasks with conditionals in natural language in the context of mobile apps, and what types of tasks they are interested in automating. Based on insights from this study, we designed and implemented an end-user-programmable conversational agent named PUMICE¹ (Li et al. 2019) that allows end users to program tasks with flexible conditional structures and new concepts across diverse domains through spoken natural language instructions and demonstrations.

¹ PUMICE is a type of volcanic rock. It is also an acronym for Programming in a User-friendly Multimodal Interface through Conversations and Examples.

PUMICE extends our previous SUGILITE (Li, Azaria, and Myers 2017) system. A key novel aspect of PUMICE's design is that it allows users to first describe the desired program behaviors and conditional structures naturally in natural language at a high level, and then collaborate with an intelligent agent through multi-turn conversations to explain and to define any ambiguities, concepts and procedures in the initial description as needed in a top-down fashion. Users can explain new concepts by referring to either previously defined concepts, or to the contents of the GUIs of third-party mobile apps. Users can also define new procedures by demonstrating using third-party apps (Li, Azaria, and Myers 2017). Such an approach facilitates effective program reuse in automation authoring, and provides support for a wide range of application domains, which are two major challenges in prior EUD systems. The results from the motivating study suggest that this paradigm is not only feasible, but also natural for end users, which was supported by our summative lab usability study.

We build upon recent advances in natural language processing (NLP) to allow PUMICE's semantic parser to learn from users' flexible verbal expressions when describing desired program behaviors. Through PUMICE's mixed-initiative conversations with users, an underlying persistent knowledge graph is dynamically updated with new procedural (i.e., actions) and declarative (i.e., concepts and facts) knowledge introduced by users, allowing the semantic parser to improve its understanding of user utterances over time. This structure also allows for effective reuse of user-defined procedures and concepts at a fine granularity, reducing user effort in EUD.

PUMICE presents a multi-modal interface, through which users interact with the system using a combination of demonstrations, pointing, and spoken commands. Users may use any modality that they choose, so they can leverage their prior experience to minimize necessary training (MacLellan et al. 2018; Laird et al. 2017). This interface also provides users with guidance through a mix of visual aids and verbal directions through various stages in the process to help users overcome common challenges and pitfalls identified in the formative study, such as the omission of else statements, the difficulty in finding correct GUI objects for defining new concepts, and the confusion in specifying proper data descriptions for target GUI objects. A summative lab usability study with 10 participants showed that users with little or no prior programming expertise could use PUMICE to program automation scripts for 4 tasks derived from real-world scenarios. Participants also found PUMICE easy and natural to use.

This paper presents the following three primary contributions:

1. A formative study showing the characteristics of end users' natural language instructions for tasks with conditional structures in the context of mobile apps.

2. A multi-modal conversational approach for the EUD of task automation motivated by the aforementioned formative study, with the following major advantages:
(a) The top-down conversational structure allows users to naturally start with describing the task and its conditionals at a high level, and then recursively clarify ambiguities, explain unknown concepts and define new procedures through conversations.
(b) The agent learns new procedural and declarative knowledge through explicit instructions from users, and stores them in a persistent knowledge graph, facilitating effective reusability and generalizability of learned knowledge.
(c) The agent learns concepts and procedures in various task domains while having a low learning barrier through its multi-modal approach that supports references and demonstrations using the contents of third-party apps' GUIs.

3. The PUMICE system: an implementation of this approach, along with a user study evaluating its usability.

Background and Related Work

This research builds upon prior work from many different sub-disciplines across human-computer interaction (HCI), software engineering (SE), and natural language processing (NLP). In this section, we focus on related work on three topics: (1) natural language programming; (2) programming by demonstration; and (3) the multi-modal approach that combines natural language inputs with demonstrations.
Natural Language Programming

PUMICE uses natural language as the primary modality for users to program task automation. The idea of using natural language inputs for programming has been explored for decades (Ballard and Biermann 1979; Biermann 1983). In the NLP and AI communities, this idea is also known as learning by instruction (Azaria, Krishnamurthy, and Mitchell 2016; Lieberman and Maulsby 1996).

The foremost challenge in supporting natural language programming is dealing with the inherent ambiguities and vagueness in natural language (Vadas and Curran 2005). To address this challenge, one prior approach was to constrain the structures and expressions in the user's language to formulations similar to those of conventional programming languages (e.g., (Ballard and Biermann 1979; Price et al. 2000)), so that user inputs can be directly translated into programming statements. This approach is not adequate for EUD, as it has a high learning barrier for users without programming expertise.

Another approach for handling ambiguities and vagueness in natural language inputs is to seek user clarification through conversations. For example, Iris (Fast et al. 2018) asks follow-up questions and presents possible options through conversations when the initial user input is incomplete or unclear. This approach lowers the learning barrier for end users, as it does not require them to clearly define everything up front. It also allows users to form complex commands by combining multiple natural language instructions in conversational turns under the guidance of the system. PUMICE also adopts the use of multi-turn conversations as a key strategy in handling ambiguities and vagueness in user inputs. However, a key difference between PUMICE and other conversational instructable agents is that PUMICE is domain-independent. All conversational instructable agents need to map the user's inputs onto existing concepts, procedures and system functionalities supported by the agent, and to have natural language understanding mechanisms and training data in each task domain. Because of this constraint, existing agents limit their supported tasks to one or a few pre-defined domains, such as data science (Fast et al. 2018), email processing (Azaria, Krishnamurthy, and Mitchell 2016; Srivastava, Labutov, and Mitchell 2017), or database queries (Kate, Wong, and Mooney 2005).

PUMICE supports learning concepts and procedures from existing third-party mobile apps regardless of the task domains. End users can create new concepts with PUMICE by referencing relevant information shown in app GUIs, and define new procedures by demonstrating with existing apps. This approach allows PUMICE to support a wide range of tasks from diverse domains as long as the corresponding mobile apps are available. It also has a low learning barrier because end users are already familiar with the capabilities of mobile apps and how to use them. In comparison, with prior instructable agents, it is often unclear what concepts, procedures and functionalities already exist to be used as "building blocks" for developing new ones.

Programming by Demonstration

PUMICE uses the programming by demonstration (PBD) technique to enable end users to define concepts by referring to the contents of GUIs of third-party mobile apps, and teach new procedures through demonstrations with those apps. PBD is a natural way of supporting EUD with a low learning barrier (Cypher and Halbert 1993; Lieberman 2001). Many domain-specific PBD tools have been developed in the past in various domains, such as text editing (e.g., (Lau et al. 2001)), photo editing (e.g., (Grabler et al. 2009)), web scraping (e.g., (Chasins, Mueller, and Bodik 2018)), smart home control (e.g., (Li et al. 2017)) and robot control (e.g., (Argall et al. 2009)).

PUMICE supports domain-independent PBD by using the GUIs of third-party apps for task automation and data extraction. Similar approaches have also been used in prior systems. For example, SUGILITE (Li, Azaria, and Myers 2017), KITE (Li and Riva 2018) and APPINITE (Li et al. 2018a) use mobile app GUIs, CoScripter (Leshed et al. 2008), d.mix (Hartmann et al. 2007), Vegemite (Lin et al. 2009) and PLOW (Allen et al. 2007) use web interfaces, and HILC (Intharah, Turmukhambetov, and Brostow 2019) and Sikuli (Yeh, Chang, and Miller 2009) use desktop GUIs. Compared to those, PUMICE is the only one that can learn concepts as generalized knowledge, and the only one that supports creating conditionals from natural language instructions. Sikuli (Yeh, Chang, and Miller 2009) allows users to create conditionals in a scripting language, which is not suitable for end users without programming expertise.

The Multi-Modal Approach

A central challenge in PBD is generalization. A PBD agent should go beyond literal record-and-replay macros, and be able to perform similar tasks in new contexts (Cypher and Halbert 1993; Lieberman 2001). This challenge is also a part of the program synthesis problem. An effective way of addressing this challenge is through multi-modal interaction (Oviatt 1999b). Demonstrations can clearly communicate what the user does, but not why the user does this and how the user wants to do this in different contexts. On the other hand, natural language instructions can often reflect the user's underlying intent (why) and preferences (how), but they are usually ambiguous or unclear. This is where grounding natural language instructions with concrete GUI demonstrations can help.

This mutual disambiguation approach (Oviatt 1999a) in multi-modal interaction has been proposed and used in many previous systems. This approach leverages repetition in a different modality for mediation (Mankoff, Abowd, and Hudson 2000). Particularly for PBD generalization, SUGILITE (Li, Azaria, and Myers 2017) and PLOW (Allen et al. 2007) use natural language inputs to identify parameterization in demonstrations, APPINITE (Li et al. 2018a) uses natural language explanations of intents to resolve the "data description" (Cypher and Halbert 1993) for demonstrated actions, and MARS (Chen, Martins, and Feng 2019) similarly used multi-layer specifications, including textual descriptions of user intents, to disambiguate program synthesis results from user-provided input-output examples in the data science domain.
PUMICE builds upon this prior work, and extends the multi-modal approach to support learning concepts involved in demonstrated tasks. The learned concepts can also be generalized to new task domains, as described in later sections. The prior multi-modal PBD systems also use demonstration as the main modality. In comparison, PUMICE uses natural language conversation as the main modality, and uses demonstration for grounding unknown concepts, values, and procedures after they have been broken down and explained in conversations.

Formative Study

We took a user-centered approach (Myers et al. 2016) for designing a natural end-user development system (Myers et al. 2017; Radensky, Li, and Myers 2018). We first studied how end users naturally communicate tasks with declarative concepts and control structures in natural language for various tasks in the mobile app context through a formative study on Amazon Mechanical Turk with 58 participants (41 of whom were non-programmers; 38 men, 19 women, 1 non-binary person).

Each participant was presented with a graphical description of an everyday task for a conversational agent to complete in the context of mobile apps. All tasks had distinct conditions, so that each task should be performed differently under different conditions, such as playing different genres of music based on the time of the day. Each participant was assigned to one of 9 tasks. To avoid biasing the language used in the responses, we used the Natural Programming Elicitation method (Myers et al. 2016) by showing graphical representations of the tasks with limited text in the prompts. Participants were asked how they would verbally instruct the agent to perform the tasks, so that the system could understand the differences among the conditions and what to do in each condition. Each participant was first trained using an example scenario and the corresponding example verbal instructions.

To study whether having mobile app GUIs can affect users' verbal instructions, we randomly assigned participants into one of two groups. For the experimental group, participants instructed agents to perform the tasks while looking at relevant app GUIs. Each participant was presented with a mobile app screenshot with arrows pointing to the screen component that contained the information pertinent to the task condition. Participants in the control group were not shown app GUIs. At the end of each study session, we also asked the participants to come up with another task scenario of their own where an agent should perform differently in different conditions.

The participants' responses were analyzed by two independent coders using open coding (Strauss and Corbin 1990). The inter-rater agreement (Cohen 1960) was κ = 0.87, suggesting good agreement. 19% of responses were excluded from the analysis for quality control due to the lack of effort in the responses, question misunderstandings, and blank responses.

We report the most relevant findings which motivated the design of PUMICE next.

App GUI Grounding Reduces Unclear Concept Usage

We analyzed whether each user's verbal instruction for the task provided a clear definition of the conditions in the task. In the control group (instructing without seeing app screenshots), 33% of the participants used ambiguous, unclear or vague concepts in the instructions, such as "If it is daytime, play upbeat music...", which is ambiguous as to when the user considers it to be "daytime." This is despite the fact that the example instructions they saw had clearly defined conditions.

Interestingly, for the experimental group, where each participant was provided an app screenshot displaying specific information relevant to the task's condition, fewer participants (9%) used ambiguous or vague concepts (this difference is statistically significant with p < 0.05), while the rest clearly defined the condition (e.g., "If the current time is before 7 pm..."). These results suggest that end users naturally use ambiguous and vague concepts when verbally instructing task logic, but showing users relevant mobile app GUIs with concrete instances of the values can help them ground the concepts, leading to fewer ambiguities and less vagueness in their descriptions. The implication is that a potentially effective approach to avoiding unclear utterances for agents is to guide users to explain them in the context of app GUIs.

Unmet User Expectation of Common Sense Reasoning

We observed that participants often expected and assumed the agent to have the capability of understanding and reasoning with common sense knowledge when instructing tasks. For example, one user said, "if the day is a weekend". The agent would therefore need to understand the concept of "weekend" (i.e., how to know today's day of the week, and what days count as "weekend") to resolve this condition. Similarly, when a user talked about "sunset time", he expected the agent to know what it meant, and how to find out its value.

However, the capability for common sense knowledge and reasoning is very limited in current agents, especially across many diverse domains, due to the spotty coverage and unreliable inference of existing common sense knowledge systems. Managing user expectations and communicating the agent's capability is also a long-standing unsolved challenge in building interactive intelligent systems (Lieberman et al. 2004). A feasible workaround is to enable the agent to ask users to ground new concepts to existing contents in apps when they come up, and to build up knowledge of concepts over time through its interaction with users.

Frequent Omission of Else Statements

In the study, despite all provided example responses containing else statements, 18% of the 39 descriptions from users omitted an else statement when it was expected.
"Play upbeat music until 8pm every day," for instance, may imply that the user desires an alternative genre of music to be played at other times. Furthermore, 33% omitted an else statement when a person would be expected to infer one, such as: "If a public transportation access point is more than half a mile away, then request an Uber," which implies using public transportation otherwise. This might be a result of the user's expectation of common sense reasoning capabilities. The user omits what they expect the agent can infer, to avoid prolixity, similar to patterns in human-human conversations (Grice et al. 1975).

These findings suggest that end users will often omit appropriate else statements in their natural language instructions for conditionals. Therefore, the agent should proactively ask users about alternative situations in conditionals when appropriate.

PUMICE

Motivated by the formative study results, we designed the PUMICE agent that supports understanding ambiguous natural language instructions for task automation by allowing users to recursively define any new, ambiguous or vague concepts in a multi-level top-down process.

Example Scenario

This section shows an example scenario to illustrate how PUMICE works. Suppose a user starts teaching the agent a new task automation rule by saying, "If it's hot, order a cup of Iced Cappuccino." We also assume that the agent has no prior knowledge about the relevant task domains (weather and coffee ordering). Due to the lack of domain knowledge, the agent does not understand "it's hot" and "order a cup of Iced Cappuccino". However, the agent can recognize the conditional structure in the utterance (the parse for Utterance 0 in Figure 1) and can identify that "it's hot" should represent a Boolean expression while "order a cup of Iced Cappuccino" represents the action to perform if the condition is true.

PUMICE's semantic parser can mark unknown parts in user utterances using typed resolve() functions, as marked in the yellow highlights in the parse for Utterance 0 in Figure 1. The PUMICE agent then proceeds to ask the user to further explain these concepts. It asks, "How do I tell whether it's hot?" since it has already figured out that "it's hot" should be a function that returns a Boolean value. The user answers "It is hot when the temperature is above 85 degrees Fahrenheit.", as shown in Utterance 2 in Figure 1. PUMICE understands the comparison (as shown in the parse for Utterance 2 in Figure 1), but does not know the concept of "temperature", only knowing that it should be a numeric value comparable to 85 degrees Fahrenheit. Hence it asks, "How do I find out the value for temperature?", to which the user responds, "Let me demonstrate for you."

Here the user can demonstrate the procedure of finding the current temperature by opening the weather app on the phone, and pointing at the current reading of the weather. To assist the user, PUMICE uses a visualization overlay to highlight any GUI objects on the screen that fit into the comparison (i.e., those that display a value comparable to 85 degrees Fahrenheit). The user can choose from these highlighted objects (see Figure 2 for an example). Through this demonstration, PUMICE learns a reusable procedure query_Temperature() for getting the current value for the new concept temperature, and stores it in a persistent knowledge graph so that it can be used in other tasks. PUMICE confirms with the user every time it learns a new concept or a new rule, so that the user is aware of the current state of the system, and can correct any errors (see the Error Recovery and Backtracking section for details).

For the next phase, PUMICE has already determined that "order a cup of Iced Cappuccino" should be an action triggered when the condition "it's hot" is true, but does not know how to perform this action (also known as intent fulfillment in chatbots (Li and Riva 2018)). To learn how to perform this action, it asks, "How do I order a cup of Iced Cappuccino?", to which the user responds, "I can demonstrate." The user then proceeds to demonstrate the procedure of ordering a cup of Iced Cappuccino using the existing app for Starbucks (a coffee chain). From the user demonstration, PUMICE can figure out that "Iced Cappuccino" is a task parameter, and can learn the generalized procedure order_Starbucks() for ordering any item in the Starbucks app, as well as a list of available items to order in the Starbucks app by looking through the Starbucks app's menus, using the underlying SUGILITE framework (Li, Azaria, and Myers 2017) for processing the task recording.

Finally, PUMICE asks the user about the else condition by saying, "What should I do if it's not hot?" Suppose the user says "Order a cup of Hot Latte"; then the user will not need to demonstrate again, because PUMICE can recognize "Hot Latte" as an available parameter for the known order_Starbucks() procedure.
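To make the structure that the agent extracts from this dialog concrete, the following Python sketch models the parse of Utterance 0 as a conditional whose two halves are typed placeholders, and the refinement that Utterance 2 contributes. This is our illustration only; the paper does not publish PUMICE's actual code or DSL syntax, and all class and field names here are ours.

    from dataclasses import dataclass

    @dataclass
    class Resolve:
        """Typed placeholder for a span the agent cannot ground yet."""
        expected_type: str   # "Boolean", "Value", or "Procedure"
        utterance: str       # the user's words that this placeholder covers

    @dataclass
    class Comparison:
        op: str              # "greater_than", "less_than", or "equal_to"
        left: object
        right: object

    @dataclass
    class Conditional:
        condition: object    # must eventually resolve to a Boolean expression
        if_true: object      # must eventually resolve to a Procedure
        if_false: object = None  # filled in after the agent asks about "else"

    # Utterance 0, "If it's hot, order a cup of Iced Cappuccino": the
    # conditional structure is recognized, but both parts are unresolved.
    rule = Conditional(
        condition=Resolve("Boolean", "it's hot"),
        if_true=Resolve("Procedure", "order a cup of Iced Cappuccino"),
    )

    # Utterance 2, "It is hot when the temperature is above 85 degrees
    # Fahrenheit": the Boolean placeholder becomes a comparison whose
    # left side is still an unresolved value concept.
    rule.condition = Comparison(
        op="greater_than",
        left=Resolve("Value", "temperature"),  # later bound to query_Temperature()
        right=85.0,                            # degrees Fahrenheit
    )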
Figure 1: Example structure of how PUMICE learns the concepts and procedures in the command "If it's hot, order a cup of Iced Cappuccino." The numbers indicate the order of utterances. The screenshot on the right shows the conversational interface of PUMICE. In this interactive parsing process, the agent learns how to query the current temperature, how to order any kind of drink from Starbucks, and the generalized concept of "hot" as "a temperature (of something) is greater than another temperature".

Design Features

In this section, we discuss several of PUMICE's key design features in its user interactions, and how they were motivated by the results of the formative study.

Support for Concept Learning

In the formative study, we identified two main challenges with regard to concept learning. First, users often naturally use intrinsically unclear or ambiguous concepts when instructing intelligent agents (e.g., "register for easy courses", where the definition of "easy" depends on the context and the user preference). Second, users expect agents to understand common-sense concepts that the agents may not know. To address these challenges, we designed and implemented the support for concept learning in PUMICE. PUMICE can detect and learn three kinds of unknown components in user utterances: procedures, Boolean concepts, and value concepts. Because PUMICE's support for procedure learning is unchanged from the underlying SUGILITE mechanisms (Li et al. 2018a; Li, Azaria, and Myers 2017), in this section, we focus on discussing how PUMICE learns Boolean concepts and value concepts.

When encountering an unknown or unclear concept in the utterance parsing result, PUMICE first determines the concept type based on the context. If the concept is used as a condition (e.g., "if it is hot"), then it should be of Boolean type. Similarly, if a concept is used where a value is expected (e.g., "if the current temperature is above 70°F" or "set the AC to the current temperature"), then it will be marked as a value concept. Both kinds of concepts are represented as typed resolve() functions in the parsing result (shown in Figure 1), indicating that they need to be further resolved down the line. This process is flexible. For example, if the user clearly defines a condition without introducing unknown or unclear concepts, then PUMICE will not need to ask follow-up questions for concept resolution.

PUMICE recursively executes each resolve() function in the parsing result in a depth-first fashion. After a concept is fully resolved (i.e., all concepts used for defining it have been resolved), it is added to a persistent knowledge graph (details in the System Implementation section), and a link to it replaces the resolve() function. From the user's perspective, when a resolve() function is executed, the agent asks a question to prompt the user to further explain the concept. When resolving a Boolean concept, PUMICE asks, "How do I know whether [concept name]?" For resolving a value concept, PUMICE asks, "How do I find out the value of [concept name]?"
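Continuing the toy representation from the sketch in the Example Scenario section, the depth-first resolution loop described above might look as follows. Again, this is our sketch of the control flow, not PUMICE's implementation; ask_user stands in for the whole conversational (or demonstrational) exchange and is assumed to return an already-parsed replacement expression.

    knowledge_graph = {}  # concept name -> resolved expression, kept across tasks

    def resolve_all(expr, ask_user):
        """Recursively replace each Resolve placeholder, depth-first."""
        if isinstance(expr, Resolve):
            if expr.expected_type == "Boolean":
                prompt = f"How do I know whether {expr.utterance}?"
            elif expr.expected_type == "Value":
                prompt = f"How do I find out the value of {expr.utterance}?"
            else:  # "Procedure": in PUMICE, usually answered by a demonstration
                prompt = f"How do I {expr.utterance}?"
            explanation = ask_user(prompt)
            # The explanation may itself introduce new unknown concepts,
            # so recurse into it before substituting it in.
            resolved = resolve_all(explanation, ask_user)
            knowledge_graph[expr.utterance] = resolved  # persist for reuse
            return resolved
        if isinstance(expr, Comparison):
            expr.left = resolve_all(expr.left, ask_user)
            expr.right = resolve_all(expr.right, ask_user)
        elif isinstance(expr, Conditional):
            expr.condition = resolve_all(expr.condition, ask_user)
            expr.if_true = resolve_all(expr.if_true, ask_user)
            expr.if_false = resolve_all(expr.if_false, ask_user)
        return expr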
To explain a new Boolean concept, the user may verbally refer to another Boolean concept (e.g., "traffic is heavy" means "commute takes a long time") or may describe a Boolean expression (e.g., "the commute time is longer than 30 minutes"). When describing the Boolean expression, the user can use flexible words (e.g., colder, further, more expensive) to describe the relation (i.e., greater than, less than, and equal to). As explained previously, if any new Boolean or value concepts are used in the explanation, PUMICE will recursively resolve them. The user can also use more than one unknown value concept, such as "if the price of an Uber is greater than the price of a Lyft" (Uber and Lyft are both popular ridesharing apps).

Similar to Boolean concepts, the user can refer to another value concept when explaining a value concept. When a value concept is concrete and available in a mobile app, the user can also demonstrate how to query the value through app GUIs. The formative study suggested that this multi-modal approach is effective and feasible for end users. After users indicate that they want to demonstrate, PUMICE switches to the home screen of the phone, and prompts the user to demonstrate how to find out the value of the concept.

To help the user with value concept demonstrations, PUMICE highlights possible items on the current app GUI if the type of the target concept can be inferred from the type of the constant value, or from the type of the value concept, to which it is being compared (see Figure 2). For example, in the aforementioned "commute time" example, PUMICE knows that "commute time" should be a duration, because it is comparable to the constant value "30 minutes". Once the user finds the target value in an app, they can long-press on the target value to select it and indicate it as the target value. PUMICE uses an interaction proxy overlay (Zhang et al. 2017) for recording, so that it can record all values visible on the screen, not limited to the selectable or clickable ones. PUMICE can extract these values from the GUI using the screen scraping mechanism in the underlying SUGILITE framework (Li, Azaria, and Myers 2017). Once the target value is selected, PUMICE stores the procedure of navigating to the screen where the target value is displayed and finding the target value on the screen into its persistent knowledge graph as a value query, so that this query can be used whenever the underlying value is needed. After the value concept demonstration, PUMICE switches back to the conversational interface and continues to resolve the next concept if needed.
Figure 2: The user teaches the value concept "commute time" by demonstrating querying the value in Google Maps. The red overlays highlight all durations it was able to identify on the Google Maps GUI.

Concept Generalization and Reuse

Once concepts are learned, another major challenge is to generalize them so that they can be reused correctly in different contexts and task domains. This is a key design goal of PUMICE. It should be able to learn concepts at a fine granularity, and reuse parts of existing concepts as much as possible to avoid asking users to make redundant demonstrations. In our previous work on generalization for PBD, we focused on generalizing procedures, specifically learning parameters (Li, Azaria, and Myers 2017) and intents for underlying operations (Li et al. 2018a). We have already deployed these existing generalization mechanisms in PUMICE, but in addition, we also explored the generalization of Boolean concepts and value concepts.

When generalizing Boolean concepts, PUMICE assumes that the Boolean operation stays the same, but the arguments may differ. For example, for the concept "hot" in Figure 1, it should still mean that a temperature (of something) is greater than another temperature. But the two in comparison can be different constants, or come from different value queries. For example, suppose after the interactions in Figure 1, the user instructs a new rule "if the oven is hot, start the cook timer." PUMICE can recognize that "hot" is a concept that has been instructed before in a different context, so it asks "I already know how to tell whether it is hot when determining whether to order a cup of Iced Cappuccino. Is it the same here when determining whether to start the cook timer?" After responding "No", the user can instruct how to find out the temperature of the oven, and the new threshold value for "hot", either by instructing a new value concept or by using a constant value.

The generalization mechanism for value concepts works similarly. PUMICE supports value concepts that share the same name but have different query implementations for different task contexts. For example, following the "if the oven is hot, start the cook timer" example, suppose the user defines "hot" for this new context as "The temperature is above 400 degrees." PUMICE realizes that there is already a value concept named "temperature", so it will ask "I already know how to find out the value for temperature using the Weather app. Should I use that for determining whether the oven is hot?", to which the user can say "No" and then demonstrate querying the temperature of the oven using the corresponding app (assuming the user has a smart oven with an in-app display of its temperature).

This mechanism allows concepts like "hot" to be reused at three different levels: (1) exactly the same (e.g., the temperature of the weather is greater than 85°F); (2) different threshold (e.g., the temperature of the weather is greater than x); and (3) different value query (e.g., the temperature of something else is greater than x).
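Using the same toy representation as in the earlier sketches, the three reuse levels can be pictured declaratively. This is our example only: the ValueQuery record is a stand-in for the demonstrated value queries described above, and the app names simply follow the running examples.

    from dataclasses import dataclass

    @dataclass
    class ValueQuery:
        """A demonstrated procedure for reading a value off an app's GUI."""
        concept: str
        app: str

    # The original definition of "hot" from Figure 1:
    hot_weather = Comparison("greater_than",
                             ValueQuery("temperature", app="Weather"), 85.0)

    # Level 1 - exactly the same: reuse hot_weather unchanged.

    # Level 2 - different threshold: same value query, new constant x
    # to be elicited from the user.
    hot_weather_x = Comparison("greater_than",
                               ValueQuery("temperature", app="Weather"),
                               Resolve("Value", "new threshold x"))

    # Level 3 - different value query: same operation over another source,
    # e.g., the smart oven's in-app temperature display.
    hot_oven = Comparison("greater_than",
                          ValueQuery("temperature", app="SmartOven"), 400.0)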
Error Recovery and Backtracking

Like all other interactive EUD systems, it is crucial for PUMICE to properly handle errors, and to backtrack from errors in speech recognition, semantic parsing, generalizations, and inferences of intent (Li et al. 2018b). We iteratively tested early prototypes of PUMICE with users through early usability testing, and developed the following mechanisms to support error recovery and backtracking in PUMICE.

To mitigate semantic parsing errors, we implemented a mixed-initiative mechanism where PUMICE can ask users about components within the parsed expression if the parsing result is considered incorrect by the user. Because parsing result candidates are all typed expressions in PUMICE's internal functional domain-specific language (DSL) as a conditional, Boolean, value, or procedure, PUMICE can identify components in a parsing result that it is less confident about by comparing the top candidate with the alternatives and their confidence scores, and ask the user about them. For example, suppose the user defines a Boolean concept "good restaurant" with the utterance "the rating is better than 2". The parser is uncertain about the comparison operator in this Boolean expression, since "better" can mean either "greater than" or "less than" depending on the context. It will ask the user "I understand you are trying to compare the value concept 'rating' and the value '2', should 'rating' be greater than, or less than '2'?" The same technique can also be used to disambiguate other parts of the parsing results, such as the argument of resolve() functions (e.g., determining whether the unknown procedure should be "order a cup of Iced Cappuccino" or "order a cup" for Utterance 0 in Figure 1).

PUMICE also provides an "undo" function to allow the user to backtrack to a previous conversational state in case of incorrect speech recognition, incorrect generalization, or when the user wants to modify a previous input. Users can either say that they want to go back to the previous state, or click on an "undo" option in PUMICE's menu (activated from the option icon on the top right corner of the screen shown in Figure 1).
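A minimal sketch of this confidence-based questioning follows, assuming the parser returns a ranked list of candidate parses. The flat dict representation of a parse and the 0.5 pruning threshold are ours, purely for illustration; PUMICE's actual candidates are typed DSL expressions.

    def disambiguate(candidates, ask_user):
        """Ask the user only about components on which strong rival
        parses disagree with the top candidate."""
        top_parse, top_score = candidates[0]
        resolved = dict(top_parse)
        for slot, value in top_parse.items():
            rivals = {parse.get(slot) for parse, score in candidates[1:]
                      if score > 0.5 * top_score} - {value, None}
            if rivals:  # low confidence in this component: ask, don't guess
                resolved[slot] = ask_user(slot, [value, *sorted(rivals)])
        return resolved

    # "the rating is better than 2" parses ambiguously on the operator:
    candidates = [
        ({"op": "greater_than", "left": "rating", "right": 2}, 0.55),
        ({"op": "less_than",    "left": "rating", "right": 2}, 0.45),
    ]
    # disambiguate(candidates, ask_user) would trigger a question like
    # "should 'rating' be greater than, or less than '2'?"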
System Implementation

We implemented the PUMICE agent as an Android app. The app was developed and tested on a Google Pixel 2 XL phone running Android 8.0. PUMICE does not require root access to the phone, and should run on any phone running Android 6.0 or higher. PUMICE is open-sourced on GitHub².

² https://github.com/tobyli/Sugilite_development

Semantic Parsing

We built the semantic parser for PUMICE using the SEMPRE framework (Berant et al. 2013). The parser runs on a remote Linux server, and communicates with the PUMICE client through an HTTP RESTful interface. It uses the Floating Parser architecture, which is a grammar-based approach that provides more flexibility without requiring hand-engineering of lexicalized rules like synchronous CFG or CCG based semantic parsers (Pasupat and Liang 2015). This approach also provides more interpretable results and requires less training data than neural network approaches (e.g., (Yin et al. 2018; Yin and Neubig 2017)). The parser parses user utterances into expressions in a simple functional DSL we created for PUMICE.

A key feature we added to PUMICE's parser is allowing typed resolve() functions in the parsing results to indicate unknown or unclear concepts and procedures. This feature adds interactivity to the traditional semantic parsing process. When a resolve() function is called at runtime, the front-end PUMICE agent asks the user to verbally explain, or to demonstrate how to fulfill, this resolve() function. If an unknown concept or procedure is resolved through verbal explanation, the parser can parse the new explanation into an expression of its original type in the target DSL (e.g., an explanation for a Boolean concept is parsed into a Boolean expression), and replace the original resolve() function with the new expression. The parser also adds relevant utterances for existing concepts and procedures, and visible text labels from demonstrations on third-party app GUIs, to its set of lexicons, so that it can understand user references to that existing knowledge and those in-app contents. PUMICE's parser was trained on rich features that associate lexical and syntactic patterns (e.g., unigrams, bigrams, skipgrams, part-of-speech tags, named entity tags) of user utterances with semantics and structures of the target DSL, over a small training set (n = 905) that was mostly collected and enriched from the formative study.
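The interactive part of this parsing loop can be sketched with a toy lexicon-backed parser, reusing the Resolve and ValueQuery records from the earlier sketches. This is only our illustration of the typed-placeholder fallback and the lexicon growth described above; the real parser is a trained SEMPRE Floating Parser, not a lookup table.

    class ToyParser:
        def __init__(self):
            # utterance pattern -> expression of a known type
            self.lexicon = {}

        def parse(self, utterance, expected_type):
            """Fall back to a typed placeholder when nothing matches, so
            the dialog manager knows to ask a follow-up question."""
            return self.lexicon.get(utterance,
                                    Resolve(expected_type, utterance))

        def learn(self, utterance, expression):
            """After a concept is resolved (verbally or by demonstration),
            its trigger utterance joins the lexicon, so future references
            parse directly instead of producing a placeholder."""
            self.lexicon[utterance] = expression

    parser = ToyParser()
    parser.learn("the temperature", ValueQuery("temperature", app="Weather"))
    parser.parse("the temperature", "Value")   # -> the stored ValueQuery
    parser.parse("the commute time", "Value")  # -> Resolve("Value", ...)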
Demonstration Recording and Replaying

PUMICE uses our open-sourced SUGILITE (Li, Azaria, and Myers 2017) framework to support its demonstration recording and replaying. SUGILITE provides action recording and replaying capabilities on third-party Android apps using the Android Accessibility API. SUGILITE also provides support for parameterization of sequences of actions (e.g., identifying "Iced Cappuccino" as a parameter and "Hot Latte" as an alternative value in the example in Figure 1), and support for handling minor GUI changes in apps. Through SUGILITE, PUMICE operates well on most native Android apps, but may have problems working with web apps and apps with special graphic engines (e.g., games). It currently does not support recording gestural and sensory inputs (e.g., rotating the phone) either.

Knowledge Representations

PUMICE maintains two kinds of knowledge representations: a continuously refreshing UI snapshot graph representing third-party app contexts for demonstration, and a persistent knowledge base for storing learned procedures and concepts.

The purpose of the UI snapshot graph is to support understanding the user's references to app GUI contents in their verbal instructions. The UI snapshot graph mechanism used in PUMICE was extended from APPINITE (Li et al. 2018a). For every state of an app's GUI, a UI snapshot graph is constructed to represent all visible and invisible GUI objects on the screen, including their types, positions, accessibility labels, text labels, various properties, and spatial relations among them. We used a lightweight semantic parser from the Stanford CoreNLP (Manning et al. 2014) to extract types of structured data (e.g., temperature, time, date, phone number) and named entities (e.g., city names, people's names). When handling the user's references to app GUI contents, PUMICE parses the original utterances into queries over the current UI snapshot graph (example in Figure 3). This approach allows PUMICE to generate flexible queries for value concepts and procedures that accurately reflect user intents, and which can be reliably executed in future contexts.

Figure 3: An example showing how PUMICE parses the user's demonstrated action and verbal reference to an app's GUI content into a SET_VALUE statement with a query over the UI snapshot graph when resolving a new value concept "current temperature".

The persistent knowledge base stores all procedures, concepts, and facts PUMICE has learned from the user. Procedures are stored as SUGILITE (Li, Azaria, and Myers 2017) scripts, with the corresponding trigger utterances, parameters, and possible alternatives for each parameter. Each Boolean concept is represented as a set of trigger utterances, Boolean expressions with references to the value concepts or constants involved, and contexts (i.e., the apps and actions used) for each Boolean expression. Similarly, the structure for each stored value concept includes its triggering utterances, demonstrated value queries for extracting target values from app GUIs, and contexts for each value query.
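In the spirit of Figure 3, a UI snapshot graph query can be pictured as property filtering over the recorded GUI objects. This is a deliberately simplified sketch: our dict-based nodes and example property names stand in for the much richer graph of types, positions, labels, and spatial relations described above.

    # Two GUI objects from a weather app screen, as recorded at
    # demonstration time. "below" encodes one spatial relation.
    ui_snapshot = [
        {"id": 1, "class": "TextView", "text": "Palo Alto",
         "data_type": "city"},
        {"id": 2, "class": "TextView", "text": "72°F",
         "data_type": "temperature", "below": 1},
    ]

    def query(graph, **constraints):
        """Return the nodes whose properties match all constraints."""
        return [node for node in graph
                if all(node.get(k) == v for k, v in constraints.items())]

    # "current temperature" is stored as a re-executable query over future
    # snapshots, not as the literal string "72°F" seen during demonstration:
    current_temperature = query(ui_snapshot, data_type="temperature")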
User Study

We conducted a lab study to evaluate the usability of PUMICE. In each session, a user completed 4 tasks. For each task, the user instructed PUMICE to create a new task automation, with the required conditionals and new concepts. We used a task-based method to specifically test the usability of PUMICE's design, since the motivation for the design derives from the formative study results. We did not use a control condition, as we could not find other tools that can feasibly support users with little programming expertise in completing the target tasks.

Participants

We recruited 10 participants (5 women, 5 men, ages 19 to 35) for our study. Each study session lasted 40 to 60 minutes. We compensated each participant $15 for their time. 6 participants were students at two local universities, and the other 4 worked in different technical, administrative or managerial jobs. All participants were experienced smartphone users who had been using smartphones for at least 3 years. 8 out of 10 participants had some prior experience interacting with conversational agents like Siri, Alexa and Google Assistant.

We asked the participants to report their programming experience on a five-point scale from "never programmed" to "experienced programmer". Among our participants, 1 had never programmed, 5 had only used end-user programming tools (e.g., Excel functions, Office macros), 1 was a novice programmer with experience equivalent to 1-2 college level programming classes, 1 was a programmer with 1-2 years of experience, and 2 were programmers with more than 3 years of experience. In our analysis, we will label the first two groups "non-programmers" and the last three groups "programmers".

Procedure

At the beginning of each session, the participant received a 5-minute tutorial on how to use PUMICE. In the tutorial, the experimenter demonstrated an example of teaching PUMICE to check the bus schedule when "it is late", where "late" was defined as "current time is after 8pm" through a conversation with PUMICE. The experimenter then showed how to demonstrate to PUMICE finding out the current time using the Clock app.

Following the tutorial, the participant was provided a Google Pixel 2 phone with PUMICE and relevant third-party apps installed. The experimenter showed the participant the available apps, and made sure that the participant understood the functionality of each third-party app.
We did this because the underlying assumption of the study (and the design of PUMICE) is that users are familiar with the third-party apps, so we are testing whether they can successfully use PUMICE, not the apps. Then, the participant received 4 tasks in random order. We asked participants to keep trying until they were able to correctly execute the automation and were happy with the resulting actions of the agent. We also checked the scripts at the end of each study session to evaluate their correctness.

After completing the tasks, the participant filled out a post-survey to report the perceived usefulness, ease of use and naturalness of interactions with PUMICE. We ended each session with a short informal interview with the participant on their experiences with PUMICE.

Tasks

We assigned 4 tasks to each participant. These tasks were designed by combining common themes observed in users' proposed scenarios from the formative study. We ensured that these tasks (1) covered key PUMICE features (i.e., concept learning, value query demonstration, procedure demonstration, concept generalization, procedure generalization and "else" condition handling); (2) involved only app interfaces that most users are familiar with; and (3) used conditions that we could control, so that we could test the correctness of the scripts (we controlled the temperature, the traffic condition, and the room price by manipulating the GPS location of the phone).

In order to minimize biasing users' utterances, we used the Natural Programming Elicitation method (Myers et al. 2016). Task descriptions were provided in the form of graphics, with minimal text descriptions that could not be directly used in user instructions (see Figure 4 for an example).

Figure 4: The graphical prompt used for Task 1 – A possible user command can be "Order Iced coffee when it's hot outside, otherwise order hot coffee when the weather is cold."

Task 1. In this task, the user instructs PUMICE to order iced coffee when the weather is hot, and order hot coffee otherwise (Figure 4). We pre-taught PUMICE the concept of "hot" in the task domain of turning on the air conditioner. So the user needs to utilize the concept generalization feature to generalize the existing concept "hot" to the new domain of coffee ordering. The user also needs to demonstrate ordering iced coffee using the Starbucks app, and to provide "order hot coffee" as the alternative for the "else" operation. The user does not need to demonstrate again for ordering hot coffee, as it can be automatically generalized from the previous demonstration of ordering iced coffee.

Task 2. In this task, the user instructs PUMICE to set an alarm for 7:00am if the traffic is heavy on their commuting route. We pre-stored "home" and "work" locations in Google Maps. The user needs to define "heavy traffic" as prompted by PUMICE by demonstrating how to find out the estimated commute time, and explaining that "heavy traffic" means that the commute takes more than 30 minutes. The user then needs to demonstrate setting a 7:00am alarm using the built-in Clock app.

Task 3. In this task, the user instructs PUMICE to choose between making a hotel reservation and requesting an Uber to go home, depending on whether the hotel price is cheap. The user should verbally define "cheap" as "room price is below $100", and demonstrate how to find out the hotel price using the Marriott (a hotel chain) app. The user also needs to demonstrate making the hotel reservation using the Marriott app, specify "request an Uber" as the action for the "else" condition, and demonstrate how to request an Uber using the Uber app.
Task 4. In this task, the user instructs PUMICE to order a pepperoni pizza if there is enough money left in the food budget. The user needs to define the concept of "enough budget", demonstrate finding out the balance of the budget using the Spending Tracker app, and demonstrate ordering a pepperoni pizza using the Papa Johns (a pizza chain) app.

Results

All participants were able to complete all 4 tasks. The total time for the tasks ranged from 19.4 minutes to 25 minutes for the 10 participants. Figure 5 shows the overall average task completion time for each task, as well as the comparison between the non-programmers and the programmers. The average total time-on-task for programmers (22.12 minutes, SD=2.40) was slightly shorter than that for non-programmers (23.06 minutes, SD=1.57), but the difference was not statistically significant.

Figure 5: The average task completion times for each task. The error bars show one standard deviation in each direction.

Most of the issues encountered by participants were actually from the Google Cloud speech recognition system used in PUMICE. It would sometimes misrecognize the user's voice input, or cut off the user early. These errors were handled by the participants using the "undo" feature in PUMICE. Some participants also had parsing errors. PUMICE's current semantic parser has limited capabilities in understanding references of pronouns (e.g., for an utterance "it takes longer than 30 minutes to get to work", the parser would recognize the subject of the comparison as "it" instead of "the time it takes to get to work"). Those errors were also handled by participants through undoing and rephrasing. One participant ran into the "confusion of Boolean operator" problem in Task 2 when she used the phrase "commute [time is] worse than 30 minutes", which the parser initially recognized incorrectly as "commute is less than 30 minutes." She was able to correct this using the mixed-initiative mechanism, as described in the Error Recovery and Backtracking section.

Overall, no participant had major problems with the multi-modal interaction approach and the top-down recursive concept resolution conversational structure, which was encouraging. However, all participants had received a tutorial with an example task demonstrated. We also emphasized in the tutorial that they should try to use concepts that can be found in mobile apps in their explanations of new concepts. These factors might have contributed to the successes of our participants.

In a post survey, we asked participants to rate statements about the usability, naturalness and usefulness of PUMICE on a 7-point Likert scale from "strongly disagree" to "strongly agree". PUMICE scored on average 6.2 on "I feel PUMICE is easy to use", 6.1 on "I find my interactions with PUMICE natural", and 6.9 on "I think PUMICE is a useful tool for automating tasks on smartphones," indicating that our participants were generally satisfied with their experience using PUMICE.
Discussion

In the informal interview after completing the tasks, participants praised PUMICE for its naturalness and low learning barriers. Non-programmers were particularly impressed by the multi-modal interface. For example, P7 (who was a non-programmer) said: "Teaching PUMICE feels similar to explaining tasks to another person... [PUMICE's] support for demonstration is very easy to use since I'm already familiar with how to use those apps." Participants also considered PUMICE's top-down interactive concept resolution approach very useful, as it does not require them to define everything clearly upfront.

Participants were excited about the usefulness of PUMICE. P6 said, "I have an Alexa assistant at home, but I only use them for tasks like playing music and setting alarms. I tried some more complicated commands before, but they all failed. If it had the capability of PUMICE, I would definitely use it to teach Alexa more tasks." They also proposed many usage scenarios based on their own needs in the interview, such as paying off a credit card balance early when it has reached a certain amount, automatically closing background apps when the available phone memory is low, monitoring promotions for gifts saved in the wish list when approaching anniversaries, and setting reminders for events in mobile games.

Several concerns were also raised by our participants. P4 commented that PUMICE should "just know" how to find out weather conditions without requiring her to teach it since "all other bots know how to do it", indicating the need for a hybrid approach that combines EUD with pre-programmed common functionalities. P5 said that teaching the agent could be too time-consuming unless the task was very repetitive, since he could just "do it with 5 taps." Several users also expressed privacy concerns after learning that PUMICE can see all screen contents during demonstrations, while one user, on the other hand, suggested having PUMICE observe him at all times so that it can learn things in the background.

Limitations and Future Work

The current version of PUMICE has no semantic understanding of information involved in tasks, which prevents it from dealing with implicit parameters (e.g., "when it snows" means "the current weather condition is snowing") and un-