SkillBot: Identifying Risky Content for Children in Alexa Skills
Tu Le (University of Virginia), Danny Yuxing Huang (New York University), Noah Apthorpe (Colgate University), Yuan Tian (University of Virginia)

arXiv:2102.03382v1 [cs.MA] 5 Feb 2021

Abstract

Many households include children who use voice personal assistants (VPAs) such as Amazon Alexa. Children benefit from the rich functionalities of VPAs and third-party apps but are also exposed to new risks in the VPA ecosystem (e.g., inappropriate content or information collection). To study the risks VPAs pose to children, we build a Natural Language Processing (NLP)-based system to automatically interact with VPA apps and analyze the resulting conversations to identify content risky to children. We identify 28 child-directed apps with risky content and maintain a growing dataset of 31,966 non-overlapping app behaviors collected from 3,434 Alexa apps. Our findings suggest that although voice apps designed for children are subject to more policy requirements and intensive vetting, children are still vulnerable to risky content. We then conduct a user study showing that parents are more concerned about VPA apps with inappropriate content than those that ask for personal information, but many parents are not aware that risky apps of either type exist. Finally, we identify a new threat to users of VPA apps: confounding utterances, or voice commands shared by multiple apps that may cause a user to invoke or interact with a different app than intended. We identify 4,487 confounding utterances, including 581 shared by child-directed and non-child-directed apps.

1 Introduction

The rapid development of Internet of Things (IoT) technology has aligned with the growing popularity of voice personal assistant (VPA) services such as Amazon Alexa and Google Home. In addition to the first-party features provided by these products, VPA service providers have also developed platforms that allow third-party developers to build and publish their own voice apps, hereafter referred to as "skills".

Risks to Children from VPAs. Researchers have found that 91% of children between ages 4 and 11 in the U.S. have access to VPAs, 26% of children are exposed to a VPA between 2 and 4 hours a week, and 20% talk to VPA devices more than 5 hours a week [16]. The lack of robust authentication on commercial VPAs makes it challenging to regulate children's use of skills [53], especially as anyone in the physical vicinity of a VPA can interact with the device. As a result, children may have access to risky skills that deliver inappropriate content (e.g., expletives) or collect personal information through voice interactions.

There is no systematic testing tool that vets VPA skills to identify those that contain risky content for children. Legal efforts and industry solutions have tried to protect children using VPAs, but their effectiveness is unclear. The 1998 Children's Online Privacy Protection Act (COPPA) regulates the information collected online from children under 13 [10], but widespread COPPA violations have been shown in the mobile application market [45], and compliance in the VPA space is far from guaranteed. Additionally, parental control modes provided by VPAs (e.g., Amazon FreeTime and Google Family App) often place a burden on parents during setup and receive complaints from parents due to their limitations [1, 9, 15].

Research Questions. Protecting children in the era of voice devices therefore raises several pressing questions:
• RQ0. Can we automate the analysis of VPA skills to identify content risky for children without requiring manual human voice interactions?
• RQ1. Are VPA skills targeted to children that claim to follow additional content requirements — hereafter referred to as "kid skills" — actually safe for child users?
• RQ2. What are parents' attitudes toward and awareness of the risks posed by VPAs to children?
• RQ3. How likely is it for children to be exposed to risky skills through confounding utterances, i.e., voice commands shared by multiple skills that could cause a child to accidentally invoke or interact with a different skill than intended?
In this paper, we design, implement, and perform a systematic automated analysis of the Amazon Alexa VPA skill ecosystem and conduct a user study to answer these research questions.

Challenges to Automated Skill Analysis. In comparison to mobile applications and other traditional software, neither the executable files nor the source code of VPA skills is available to researchers for analysis. Instead, the skills' natural language processing modules and key function logic are hosted in the cloud as a black box, so decompilation and traditional static or dynamic analysis methods cannot be applied to VPA skills. VPA skill voice interactions are built following a template defined by the third-party developer, which is also unavailable to researchers. To automatically detect risky content, we need to generate testing inputs that trigger this content through sequential interactions. A further challenge is that risky content does not always occur during a user's first interaction with a skill; human users often need back-and-forth conversations with skills to discover risky content. Automating this process requires a tool that can generate valid voice inputs and dynamic follow-up responses that cause the skill to reveal risky content. This differs from existing chatbot development techniques [28]: the goal is not to generate inputs that sound natural to a human, but to generate inputs that explore the space of skill behaviors as thoroughly as possible.

Automated Identification of Risky Content. This paper presents our systematic approach to analyzing VPA skills based on automated interactions. We apply this approach to 3,434 Alexa skills targeted toward children in order to measure the prevalence of kid skills that contain risky content. More specifically, we build a natural-language-based system called "SkillBot" that interacts with VPA skills and analyzes the results for risky content, including inappropriate language and personal information collection (Section 5). SkillBot generates valid skill inputs, analyzes skill responses, and systematically generates follow-up inputs. Through multiple rounds of interaction, we can determine whether skills contain risky content. The design of SkillBot answers RQ0, and our SkillBot analysis of 3,434 kid skills allows us to answer RQ1. We identify 8 kid skills with inappropriate content and 20 kid skills that ask for personal information (Section 6).

Online User Study and Insights from Parents. We next wanted to verify our SkillBot results by seeing whether parents also viewed the identified skills as risky, and to better understand the real-world contexts of children's interactions with VPAs (RQ2). We conduct a user study of 232 U.S. Alexa users who have children under 13 years old. We present these parents with examples of interactions with risky and non-risky skills identified by SkillBot and ask them to report their reactions to these skills, their experiences with risky/unwanted content on their own VPAs, and their use of VPA parental control features. We find that parents are uncomfortable with the inappropriate content in our identified skills: 54.1% cannot imagine that such interactions are possible on Alexa, and 58.4% believe that Alexa should block such interactions. Many parents do not think that these skills are designed for families/kids, although they are actually published in Amazon's "Kids" category. We also find that 23.7% of parents do not know about Alexa's parental control feature, and of those who know about the feature, only 29.4% use it. These data highlight the risks to children posed by VPA skills with inappropriate content. While SkillBot demonstrates that such skills exist, parents are predominantly unaware of this fact and typically neglect basic precautions such as activating parental controls.

Confounding Utterances. Our analysis also reveals a novel threat that is particularly problematic for child users: confounding utterances. Confounding utterances are voice inputs that are shared by more than one skill that may be present on a VPA device. When a user interacts with a VPA via a confounding utterance, the utterance might trigger a reaction from any of these skills. If a kid skill shares a confounding utterance with a skill inappropriate for kids, a child user might inadvertently begin interacting with the inappropriate skill. For example, a child may use an utterance to invoke a kid skill X, but another skill Y, which is in a non-kid category and shares the same utterance, could be triggered instead. As the Echo does not offer visual cues about which skill is actually invoked, the user may not realize that Y is running instead of X. Furthermore, skills in non-kid categories typically face more relaxed requirements than kid skills do, so the child user could be exposed to risky content from skill Y. SkillBot reveals 4,487 confounding utterances, 581 of which are shared between a kid skill and a skill that is not in the "Kids" category (Section 8). Of these 581 utterances, 27% prioritize invoking a non-kid skill over a kid skill. This indicates that children are at real risk of accidentally invoking non-kid skills and that an adversary could exploit overlapping utterances to get child users to invoke non-kid skills (RQ3).

Contributions. We make the following contributions:
• Automated System for Skill Analysis: We present a system, SkillBot, that automatically interacts with Alexa skills and collects their content at scale. Our system can be run longitudinally to identify new conversations and new conversation branches in previously analyzed skills. We plan to publicly release our system to help future research.
• Identification of Risks to Children: We analyze 31,966 conversations collected from 3,434 Alexa kid skills to detect potentially risky skills directed at children. We find 8 skills that contain content inappropriate for children and 20 skills that ask for personal information through voice interaction.
• User Study of Parents' Awareness and Experiences: We conduct a user study demonstrating that a majority of parents express concern about the content of the risky kid skills identified by SkillBot, tempered by disbelief that these skills are actually available for Alexa VPAs. This lack of risk awareness is compounded by findings that many parents do not use VPA parental controls and allow their children to use VPA versions that do not have parental controls enabled by default.
• Confounding Utterances: We identify confounding utterances as a novel threat to VPA users. Our SkillBot analysis reveals 4,487 confounding utterances shared between two or more skills, and we highlight those that place child users at risk by invoking a non-kid skill instead of an expected kid skill.

2 Background

Voice Personal Assistant. A VPA is a software agent that interprets users' speech to perform certain tasks or answer users' questions via synthesized voice. Most VPAs, such as Amazon Alexa and Google Home, follow a cloud-based system design: when the user speaks a request to the VPA device (e.g., an Amazon Echo), the request is sent to the VPA service provider's cloud server, which processes it and invokes the corresponding skill. Third-party skills can be hosted on external web services instead of the VPA service provider's cloud server.
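The cloud-based design can be pictured as a small dispatcher: the device forwards the transcribed speech to a provider-side router, which matches an invocation phrase and hands the request to the skill's own handler. This is only a toy illustration of the architecture described above; the skill name, registry, and handler are invented for this sketch, not part of Alexa's actual API.

```python
# Toy illustration of the cloud-based VPA design: the device forwards the
# transcribed request to a "cloud" router, which matches an invocation
# phrase and dispatches to the skill's handler ("external web service").
# The skill and handler here are hypothetical.

def ted_talks_handler(request):
    # Stands in for a third-party web service hosting the skill's logic.
    return "Welcome to Ted Talks. What topic would you like?"

SKILL_REGISTRY = {  # invocation phrase -> third-party skill handler
    "open ted talks": ted_talks_handler,
}

def cloud_router(utterance):
    """Provider-side step: map a spoken request onto a registered skill."""
    phrase = utterance.lower().removeprefix("alexa, ").strip()
    handler = SKILL_REGISTRY.get(phrase)
    if handler is None:
        return "Sorry, I don't understand."
    return handler(phrase)

print(cloud_router("Alexa, open Ted Talks"))
```

The point of the sketch is the indirection: the device never runs skill code, so nothing is installed locally and the routing decision is entirely the provider's.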
Building and Publishing Skills. To provide a broader range of features, Amazon allows third parties to develop skills for Alexa via the Alexa Skills Kit (ASK) [5]. Using ASK, developers can build custom Alexa skills that use their own web services to communicate with Alexa [14]. More than 50,000 skills are currently publicly available on the Alexa Skills Store [6], covering a wide variety of features such as reading news, playing games, controlling smart homes, checking credit card balances, and telling jokes.

Enabling and Invoking Skills. Unlike mobile apps, Alexa skills are hosted on Amazon's cloud servers, so users do not have to download any binary file or run any installation process. To use a skill, users only need to enable it in their Amazon account. There are two ways to enable or disable a skill. The first is via the skill's info page, which contains an enable/disable button; users can access this page via the Alexa Skills Store on the Amazon website or via the Alexa companion app. The other way is via voice command. Note that, for usability, Amazon also allows invoking skills directly through voice without enabling them first.

Users can invoke a skill by saying one of its invocation phrases [18]. Invocation phrases come in two types: with intent and without intent. For example, one can say "Alexa, open Ted Talks" to invoke the Ted Talks skill, or "Alexa, open Daily Horoscopes for Capricorn" to tell the Daily Horoscopes skill to give information about Capricorn. Since a sentence can be paraphrased in different ways, multiple variants of an invocation phrase can perform the same task. Alexa also allows some flexibility in invoking skills through its name-free interaction feature [19]: the user can speak a request that does not necessarily include a skill name, and Alexa processes the request and selects a top candidate skill that fulfills it. If the chosen skill is not yet enabled by the user, it may be auto-enabled.

Every skill has an Amazon webpage, which includes at most three sample utterances, i.e., voice commands with which users can verbally interact with the skill. In addition, the webpage may include an "Additional Instructions" section with additional voice commands for interaction, although these additional commands are optional.

3 Alexa Parental Control, Permission Control, and their Limitations

We first introduce the current schemes for protecting child users on Alexa, namely parental control and permission control, and then show their limitations.

Alexa Parental Control. Amazon FreeTime is a parental control feature that allows parents to manage what content their children can access on their Amazon devices. FreeTime on Alexa provides a Parent Dashboard user interface for parents to set daily time limits, monitor activities, and manage allowed content. If FreeTime is enabled, users can by default only use skills in the Kids category; to use other skills, parents need to manually add them to a whitelist. FreeTime Unlimited is a subscription for children under 13 that offers thousands of pieces of kid-friendly content, including a list of kid skills available on compatible Echo devices. Parents can purchase this subscription via their Amazon account and use it across all compatible Amazon devices.

Children can potentially access an Amazon Echo device located in a shared space and invoke such "risky" skills in the absence of child-protection features, for the following reasons. FreeTime is turned off by default on the regular version of the Amazon Echo, and previous studies in medicine [31], psychology [39], and behavioral economics [35] have shown that people often opt for default settings. Although parents can turn on FreeTime on a regular Amazon Echo, the feature places a usage burden on users. For example, users sometimes cannot remove or disable certain skills added by FreeTime (an issue since 2017 [1, 9]), and some users find it hard to access the list of skills available via FreeTime Unlimited [13, 15]. In particular, skills that parents would love to use may not be appropriate for kids and thus are not allowed in FreeTime mode by default. As a consequence, users may mistake not being able to use a skill in FreeTime mode for a bug in the skill itself, which leads to complaints being sent to the skill developer [4]. If parents want to use these skills in FreeTime mode, they have to manually add them to the whitelist in the Parent Dashboard interface, and they have to remember to enable or disable FreeTime at the appropriate times, which affects the user experience.

Alexa Permission Control. Alexa skills might need personal information from users to give accurate responses or to process transactions. To get any personal information, a skill should request the corresponding permission from the user: when the user first enables the skill, Alexa asks the user to go to the Alexa companion app to grant the requested permission. However, this permission control mechanism only protects personal information in the user's Amazon Alexa account. If a skill does not specify permission requests but directly asks for such personal information through voice interaction, it can easily bypass the permission control.

4 Threat Model

In this paper, we consider two main types of threats: (1) risky skills, i.e., skills that contain inappropriate content or ask for a user's personal information through voice interaction, and (2) confounding utterances, i.e., utterances that are shared among two or more different skills.
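The second threat is, at its core, a set-overlap property: an utterance is confounding whenever two or more skills claim it. The sketch below shows the idea; the skill names and utterance lists are hypothetical examples, not data from our measurements.

```python
# Illustrative sketch of detecting confounding utterances: utterances
# claimed by two or more skills. All skill names and utterances below
# are invented examples.

def normalize(utterance):
    """Lowercase and collapse whitespace so identical phrasings match."""
    return " ".join(utterance.lower().split())

def confounding_utterances(skills):
    """Return each utterance shared by >= 2 skills, with the sharing skills.

    `skills` maps a skill name to its list of accepted utterances.
    """
    claims = {}
    for name, utterances in skills.items():
        for u in utterances:
            claims.setdefault(normalize(u), set()).add(name)
    return {u: names for u, names in claims.items() if len(names) >= 2}

skills = {
    "Fun Facts for Kids (kid)": ["give me a fun fact", "tell me a joke"],
    "Edgy Jokes (non-kid)": ["Tell me a joke"],
}
shared = confounding_utterances(skills)
```

Here "tell me a joke" is flagged because both the kid and the non-kid skill accept it; which skill Alexa actually prioritizes for such an utterance is exactly the question Section 8 measures.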
and (2) confounding utterances (i.e., utterances that are shared among two or more different skills). Risky Skills. We investigate the risky content that harm the children. We define “risky" skills as skills that contain two kinds of content: (1) inappropriate content for children or (2) asking for personal information through voice interaction. An Figure 1: Automated Skill Interaction Pipeline Overview example is the “My burns” skill in Amazon’s Kids category that says “You’re so ugly you’d scare the crap out of the toilet. I’m on a roll”. These threats may come from either 5.1 Automated Interaction System Design an adversary who intentionally develops malicious skills or a benign/inexperienced developer who is not aware of the risks. Our goal for SkillBot is to interact effectively and efficiently Confounding Utterances. We identify a new risk which with the skills and uncover the risky content for children in we call “confounding utterances”. We define confounding skill’s behaviors thoroughly and at scale. utterances as utterances that are shared among two or more Overview. Our system consists of four main components: different skills. Effectively, a confounding utterance used by Skill Information Extractor, Web Driver, Chatbot, and Conver- the user could trigger an unexpected skill for the user. sation Dataset (see the workflow in Figure 1). Skill Informa- Confounding utterances are different from previous re- tion Extractor handles exploring, downloading, and parsing search on voice squatting attacks, which exploited the speech information of skills available in the Alexa skills store. Web recognition misinterpretations made by voice personal as- Driver handles connections to Alexa and requests from/to sistants [32, 33, 54, 55]. They showed that voice command the skills. Chatbot discovers interactions with the skills and misinterpretation problem due to spoken errors could yield records the conversations into Conversation Dataset. 
unwanted skill interactions, and an adversary can route the users to malicious Alexa skills by giving the skill invocation Skill Information Extractor. Amazon provides an online names that are pronounced similar to the legitimate one. repository of skills via Alexa Skills Store [6]. Each skill is In contrast, this paper considers a new risk that even if there an individual product, which has its own product info page is no such voice command misinterpretation, Alexa may still and an Amazon Standard Identification Number (ASIN) that invoke the skill that the user does not want because multiple can be used to search for the skill in Amazon’s catalogue [22]. skills can have completely same utterances. We want to find The URL to a skill’s info page can be constructed from its out, given a confounding utterance that is shared between ASIN. Our skill information extractor includes a web scraper multiple skills, which skill Alexa prioritizes to enable/invoke. to systematically access the Alexa website and download the Users have no control over what skills are actually opened skills’ info page in HTML based on their ASINs (i.e., skill either upon an intentional voice command or an unintentional IDs). It then reads the HTML files and constructs json dictio- one (e.g., Alexa being triggered by background conversations). nary structure using BeautifulSoup library [8]. For each skill, In other words, a confounding utterance may invoke a ran- we extract any information available on its info page such dom skill which is not the user’s intention. With name-free as ASIN (i.e., skill’s ID), icon, sample utterances, invocation interaction feature [19], users can invoke a skill without its name, description, reviews, permission list, and category (e.g., invocation name. Thus, an unexpected skill can be mistakenly kids, education, smart home, etc.). invoked by users. Furthermore, there is no downloading or Web Driver. 
We leverage Amazon’s Alexa developer con- installation process on the customers’ devices which makes sole [2] to allow programmatically interacting with skills us- it easy for the these skills to bypass user awareness. For in- ing text inputs. We build a web driver module using Selenium stance, a child may have one skill in mind but accidentally framework [17], which is a popular web browser automation invoke a different skill that has a similar invocation name (or framework for testing web applications, to automate send- similar utterances). An adversary can exploit confounding ing requests to Alexa and interacting with the skill info page utterances to get kids to use malicious skills. to check the status of the skill (i.e., enabled, disabled, not available). We also implement a module that handles skill 5 Automated Interaction with Skills enabling/disabling requests. This module uses private APIs derived from inspecting XMLHttpRequest within network To study the impacts that risky skills might have on chil- activities of Alexa webpages. dren, we propose SkillBot, which systematically interacts with the skills to discover risky content and confounding ut- Chatbot. We build an NLP-based module to interact with terances. In this section, we first show how we design SkillBot the skills and explore as much content of the skills as possible. for interacting with the skills and collecting their responses The module includes several techniques to explore sample thoroughly and at scale. We then evaluate SkillBot for its utterances suggested by the skill developers, create additional reliability, coverage, and performance. utterances based on the skill’s info, classify utterances, detect 4
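The Skill Information Extractor's parsing step — info-page HTML in, JSON-serializable dictionary out — can be sketched with BeautifulSoup as below. The page layout (element ids, list structure) and the ASIN are invented for the example; the real Alexa Skills Store markup differs, so this is an illustration of the approach, not our actual scraper.

```python
# Sketch of the Skill Information Extractor's parsing step, against a
# hypothetical info-page layout (the real store markup differs).
import json
from bs4 import BeautifulSoup

HTML = """
<html><body>
  <h1 id="title">Fun Facts for Kids</h1>
  <ul id="sample-utterances">
    <li>Alexa, open Fun Facts for Kids</li>
    <li>give me a fun fact</li>
  </ul>
  <span id="category">Kids</span>
</body></html>
"""

def parse_skill_page(html, asin):
    """Turn one downloaded skill info page into a JSON-serializable dict."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "asin": asin,  # placeholder ID for illustration
        "name": soup.find(id="title").get_text(strip=True),
        "sample_utterances": [li.get_text(strip=True)
                              for li in soup.select("#sample-utterances li")],
        "category": soup.find(id="category").get_text(strip=True),
    }

record = parse_skill_page(HTML, asin="B000000000")
print(json.dumps(record, indent=2))
```

One such dictionary per skill is what the downstream modules consume: the sample utterances seed the Chatbot, and the category field distinguishes kid from non-kid skills.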
Exploring and Classifying Utterances: Amazon allows developers to list up to three sample utterances in the sample utterances section of their skill's information page. Our system first extracts these sample utterances. Some developers also put additional instructions into their skill's description, so our system further processes the description to generate more utterances. In particular, we consider sentences that start with an invocation word (i.e., "Alexa, ...") to be utterances. We also notice that phrases inside quotes can be utterances; an example is "You can say 'give me a fun fact' to ask the skill for a fun fact". Once a list of collected utterances is constructed, our system classifies them into opening and in-skill utterances. Opening utterances are used to invoke/open a skill; these often include the skill's name and start with opening words such as open, launch, and start [18]. In-skill utterances are used within the skill's session (when the skill is already invoked); examples include "tell me a joke", "help", and "more info".

Detecting Questions in Skill Responses: To extend a conversation, our system first classifies responses collected from the skill into three main categories: Yes/No questions, WH questions, and non-question statements. For this classification task, we employ spaCy [30] and StanfordCoreNLP [38, 43], which are popular NLP tools. In particular, we first tokenize the skill's response into sentences and each sentence into words. We then annotate each sentence with part-of-speech (POS) tags, using both TreeBank POS tags [48] and Universal POS tags [20]. With the POS tags, we can identify the role of each word in the sentence, such as auxiliary, subject, or object.

A Yes/No question usually starts with an auxiliary verb, following the subject-auxiliary inversion rule. Yes/No questions generally take the form [auxiliary + subject + (main verb) + (object/adjective/adverb)?]. Some examples are "Is she nice?", "Do you play video games?", and "Do you swim today?". The auxiliary verb may also be a negative contraction, as in "Don't you know it?" or "Isn't she nice?".

A WH question contains WH words such as what, why, or how. We first identify these WH words based on their POS tags: "WDT", "WP", "WP$", and "WRB". Next, we check for WH question grammar structure. Regular WH questions usually take the form [WH-word + auxiliary + subject + (main verb) + (object)?]; examples are "What is your name?" and "What did you say?". We also consider pied-piping WH questions such as "To whom did you send it?". We exclude cases where WH words are used in a non-question statement, such as "What you think is great", "That is what I did", and "What goes around comes around".

Generating Follow-up Utterances: Given a skill response, there are three ways to follow up.

(1) Yes/No questions: This type of question asks for confirmation from the user, expecting either a "yes" or a "no" answer. Our system sends "yes" or "no" as a follow-up utterance to continue the conversation.

(2) WH questions: For WH questions, we further employ the question classification method presented in [49] to determine the theme of an open-ended question. There are six general categories of question theme: Abbreviation, Entity, Description, Human, Location, and Numeric [11]. 'Abbreviation' includes questions that ask about a short form of an expression (e.g., "What is the abbreviation for California?"). 'Entity' includes questions about objects that are not human (e.g., "What is your favorite color?"). 'Description' includes questions about explanations of concepts (e.g., "What does a defibrillator do?"). 'Human' includes questions about an individual or a group of people. 'Location' includes questions about places such as cities, countries, and states. 'Numeric' includes questions asking for numerical values such as count, weight, or size. Each category can have subcategories; for example, 'Human' has 'name' and 'title', and 'Location' has 'city', 'country', 'state', etc. We create a dictionary of answers to these subcategories (e.g., "age": {1, 2, 3, ...}, "states": {Oregon, Arizona, ...}) to continue the conversation with the skill. For questions asking about knowledge, such as those in 'Abbreviation' or 'Description' whose subcategories are too general, our system also sends "I don't know. Please tell me." to prompt for responses from the skill.

(3) Non-question statements: These include two types: directive statements and informative statements. Some directive statements ask the user to provide an answer to a question, which is essentially similar to a WH question; an example is "Please tell us your birthday". For these cases, our system parses the sentence to determine what is being asked and handles it like a WH question (discussed above). Other directive statements suggest words or phrases for the user to say to continue the conversation, such as "Please say 'continue' to get a fun fact" and "Say '1' to get info about a book, '2' to get info about a movie". For these cases, our system extracts the suggested words/phrases and uses them to continue the conversation. Informative statements provide users with information such as a joke, a fact, or daily news; these usually give no directive on what else the user can say. Thus, our system sends an in-skill utterance such as "Tell me another one" or "Tell me more" as a follow-up to explore more content from the skill.

Conversation Dataset. Our conversation dataset is a set of JSON files, each representing one skill. Each file contains a list of conversations with the skill collected by the chatbot module. Each conversation is stored as a list in which even indexes hold the utterances sent by our system and odd indexes hold the corresponding responses from the skill.
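The classify-then-follow-up loop above can be approximated with simple surface rules, as sketched below. This keyword heuristic only mimics the behavior of our POS-tagging pipeline (spaCy/StanfordCoreNLP) on easy cases; the word lists and thresholds are simplifications for illustration, not our implementation.

```python
# A deliberately simplified, rule-based approximation of the chatbot's
# response classifier and follow-up generator. The real system uses
# POS tagging; this keyword sketch only mimics it on easy cases.
AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did", "can",
               "could", "will", "would", "should", "don't", "isn't",
               "doesn't", "didn't"}
WH_WORDS = {"what", "when", "where", "who", "whom", "whose", "which",
            "why", "how"}

def classify_response(sentence):
    """Label a skill response as 'yes_no', 'wh', or 'statement'."""
    words = sentence.lower().rstrip("?!.").split()
    if not sentence.strip().endswith("?") or not words:
        return "statement"
    # WH check first; also catches pied-piping like "To whom did you...?"
    if words[0] in WH_WORDS or (len(words) > 1 and words[1] in WH_WORDS):
        return "wh"
    if words[0] in AUXILIARIES:   # subject-auxiliary inversion heuristic
        return "yes_no"
    return "statement"

def follow_up(sentence):
    """Pick a follow-up utterance to keep the conversation going."""
    kind = classify_response(sentence)
    if kind == "yes_no":
        return "yes"
    if kind == "wh":
        return "I don't know. Please tell me."
    return "Tell me more"
```

Note that "What you think is great" is correctly left as a statement here only because it lacks a question mark; handling such cases on punctuation-free speech transcripts is exactly why the full system relies on grammatical structure rather than keywords.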
5.2 Exploring Conversation Trees

For each skill, SkillBot runs multiple rounds to explore different paths within the conversation tree. Each node in this tree is a unique response from Alexa. There is an edge between nodes i and j if there exists an interaction where Alexa says i, the user (i.e., SkillBot) says something, and then Alexa says j. We call the progression from i to j a path in the tree. Furthermore, multiple paths of interactions could exist for a skill. For instance, node i could have two edges: one with j and another one with k. Effectively, two paths lead from i. In one path, the user says something after hearing i, and Alexa responds with j. In another path, the user says something else after hearing i, and Alexa responds with k.

To illustrate how we construct a conversation tree for a typical skill, we show a hypothetical example in Figure 2. First, the user would launch a skill by saying "Open Skill X" or "Launch Skill X". This initial utterance can be found in the "Sample Utterances" section of the skill's information page on Amazon.com; alternatively, it could also be displayed in the "Additional Instructions" section on the skill's page. Per Figure 2, let us assume that either "Open Skill X" or "Launch Skill X" triggers the same response from Alexa, "Welcome to Skill X. Say 'Continue'," which is denoted by Node 1 in Figure 2. The user would say "Continue" and trigger another response (denoted as Node 2) from Alexa, "Great. Would you like to do A?" The user could either respond with "Yes", which would trigger the response in Node 3, or "No", which would trigger Node 4.

SkillBot explores multiple paths of the conversation tree by interacting with a skill multiple times, each time picking a different response. Per the example in Figure 2, the first time SkillBot runs on this skill (i.e., the first run), it could follow a path along Nodes 1, 2, and 3. Once at Node 3, the skill in this example does not provide the user with the option to return to the state in Node 2, so to explore a different path, SkillBot would have to start over. In the second run, SkillBot could follow a path along Nodes 1, 2, 4, and 5. SkillBot responds with "No" after Node 2 because it remembers answering "Yes" in the previous run. In the third run, SkillBot could follow Nodes 1, 2, 4, and 6.

Each run of SkillBot terminates when exploring down a particular path is unlikely to trigger new responses from Alexa; in this case, SkillBot starts over with the same skill and explores a different path. We list four conditions under which SkillBot terminates a particular run: (i) Alexa's response is not new; in other words, SkillBot has seen the same response in a previous run of the skill and/or in a different skill. SkillBot's goal is to maximize interactions with unique responses from Alexa, rather than previously seen ones, in an attempt to discover risky content. (ii) Alexa's response is empty. (iii) Alexa's response is a dynamic audio clip (e.g., music or a podcast, which does not rely on Alexa's automated voice). Due to limitations of the Alexa simulator, SkillBot is unable to extract and parse dynamic audio clips; as such, SkillBot terminates a path if it encounters a dynamic audio clip because it does not know how to react. (iv) Alexa's response is an error message, such as "Sorry, I don't understand."

  User:  "Open Skill X" or "Launch Skill X."
  Alexa: "Welcome to Skill X. Say 'Continue'."              (Node 1)
  User:  "Continue."
  Alexa: "Great. Would you like to do A?"                   (Node 2)
  User:  "Yes."  Alexa: "Let's do A."                       (Node 3)
  User:  "No."   Alexa: "OK. Say 'C' to do C, or say 'D' to do D."  (Node 4)
  User:  "C."    Alexa: "Let's do C."                       (Node 5)
  User:  "D."    Alexa: "Let's do D."                       (Node 6)

Figure 2: A conversation tree that represents how we interact with a typical skill.

5.3 Evaluation

In this section, we present our validation to ensure that interacting with skills via SkillBot (presented in Section 5) is representative of users' interactions with skills via a physical Echo device. We further validate the performance of SkillBot.

Interaction Reliability. We randomly selected 100 skills for validation. We used an Echo Dot device to interact with the skills and compared the results with our system. Note that since a skill can have dynamic content, which makes its responses differ across invocations, we first check the collected skill responses. If they do not match, we further check the skill invocation in Alexa's activity log to see if the same skill is invoked. We find that our system and the Echo Dot share similar interactions for 99 skills. Among these 99 skills, two skills responded with audio playbacks, which are not supported by the Alexa developer console [3] employed in our system (see detailed justifications in Section 9). However, their invocations were shown in the activity log, which matched the invocations when using the Echo Dot. We cannot verify the remaining skill, as Alexa cannot recognize its sample utterances. This might be an issue with the skill's web service.

Skill's Responses Classification. As described in Section 5.1, to extend the conversation with a skill, our system classifies responses from the skill into three groups: Yes/No question, WH question, and non-question statement. To evaluate the performance, we randomly sampled 300 unique skill responses from our conversation collection and manually labeled them to create a ground truth. In the ground truth, we had 52 Yes/No questions, 50 open-ended questions, and 198 non-question statements. We then used our system to label these responses and verified the labels against our ground truth. Our classifier predicted 56 Yes/No questions, 50 open-ended questions, and 194 non-question statements, which is over 95% accuracy. The performance detail for each class is shown in Table 1 (see Table 6 in Appendix E for the confusion matrix of our 3-class classifier).
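The three-way response classification described above can be approximated with simple lexical heuristics. The sketch below is our own illustration under that assumption; the paper does not show SkillBot's actual classifier, so the cue lists and function name are hypothetical.

```python
import re

# Hypothetical heuristic stand-in for the three-way response classifier
# described in Sections 5.1 and 5.3 (not SkillBot's actual code).

YESNO_LEADS = ("do ", "does ", "did ", "is ", "are ", "was ", "were ",
               "can ", "could ", "would ", "will ", "should ", "have ",
               "has ", "may ", "shall ")
WH_WORDS = ("what", "who", "whom", "whose", "where", "when", "why",
            "which", "how")

def classify_response(text: str) -> str:
    """Label an Alexa response as 'yes_no', 'open_ended', or 'statement'."""
    # Look at the last sentence, since skills usually end with the prompt.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    last = sentences[-1].lower() if sentences else ""
    if not last.endswith("?"):
        return "statement"
    body = last.rstrip("?").strip()
    first_word = body.split()[0] if body.split() else ""
    if first_word in WH_WORDS:
        return "open_ended"
    if any(body.startswith(lead) for lead in YESNO_LEADS):
        return "yes_no"
    # Questions without a clear auxiliary-verb cue default to open-ended.
    return "open_ended"
```

In practice, classifying only the final sentence matters because skills often prepend narration ("Great. Would you like to do A?") before the actual prompt.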
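The per-run termination logic of Section 5.2 can be sketched as a loop over one conversation path. This is an illustrative reconstruction: `pick_utterance`, `get_response`, and the concrete stop-condition tests (e.g., matching "Sorry") are our own stand-ins for the simulator interface; only the four stop conditions themselves follow the paper.

```python
# Illustrative sketch of SkillBot's run-termination logic (Section 5.2).
# `pick_utterance` and `get_response` are hypothetical callables standing
# in for the chatbot module and the Alexa simulator, respectively.

def explore_one_run(skill, seen_responses, pick_utterance, get_response):
    """Follow one path through a skill's conversation tree, stopping
    under the four conditions listed in Section 5.2."""
    path = []
    utterance = pick_utterance(path)          # e.g., "Open Skill X"
    while utterance is not None:
        response = get_response(skill, utterance)
        if response in seen_responses:        # (i) response is not new
            break
        if not response:                      # (ii) response is empty
            break
        if response == "<dynamic audio>":     # (iii) unparseable audio clip
            break
        if response.startswith("Sorry"):      # (iv) error message (crude test)
            break
        seen_responses.add(response)
        path.append((utterance, response))
        utterance = pick_utterance(path)      # next reply, or None to stop
    return path
```

Because `seen_responses` is shared across runs (and, per the paper, across skills), each subsequent run is steered toward branches that still yield unexplored responses.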
Table 1: Skill Response Classification Performance

               Accuracy  Precision  Recall  F1 Score
Yes/No         98%       0.91       0.98    0.94
Open-ended     98%       0.94       0.94    0.94
Non-question   96%       0.98       0.96    0.97

Coverage. We measure the coverage of SkillBot by analyzing the conversation trees for every skill. Our analysis includes four criteria: (i) the number of unique responses from Alexa, i.e., the number of nodes in a tree; (ii) the maximum depth (or height) of a tree; (iii) the maximum number of branches in a tree, i.e., how many options SkillBot explored; and (iv) the number of initial utterances, which counts the number of distinct ways to start interacting with Alexa. We show the results in Figure 3.

Figure 3: Coverage of SkillBot in terms of four criteria: number of unique responses from Alexa; maximum depth in an Interaction Tree; maximum number of branches for any node in an Interaction Tree; and number of initial utterances.

Per the 2nd chart in Figure 3, we highlight that SkillBot is able to reach a depth of at least 10 on 2.7% of the skills. Such a depth allows SkillBot to trigger and explore a wide variety of Alexa's responses from which to discover risky content. In fact, out of the 28 risky kid skills, 2 skills were identified at depth 11, 1 skill at depth 5, 4 skills at depth 4, 6 at depth 3, 8 at depth 2, and 7 at depth 1 (more details in Section 6).

Per the 4th chart in Figure 3, we highlight that SkillBot is able to initiate conversations with skills using more than 3 different utterances. Normally, a skill's information page on Amazon.com lists at most three sample utterances. In addition to using these sample utterances, SkillBot also discovers and extracts utterances in the "Additional Instructions" section on the skill's page. As a result, SkillBot interacted with 20.3% of skills using more than 3 utterances. These extra initial utterances allow SkillBot to trigger more responses from Alexa. As we will explain in Section 6, 3 out of the 28 risky kid skills were discovered by SkillBot from the additional utterances (i.e., those not among the 3 sample utterances).

Time Performance. It took about 21 seconds on average to collect one conversation. SkillBot interacted with 4,507 skills and collected 39,322 conversations within 46 hours using five parallel processes on an Ubuntu 20.04 machine with an Intel Core i7-9700K CPU.

6 Kid Skill Analysis

To investigate the risks of skills made for kids (RQ1), we employed SkillBot to collect and analyze 31,966 conversations from a sample of 3,434 Alexa kid skills. In this section, we describe our dataset of kid skills and present our findings of risky kid skills.

6.1 Dataset

Our system first explored and downloaded information about skills from their info pages available in Alexa's U.S. skills store. Note that our system filtered out error pages (e.g., 404 not found) after three retries, as well as non-English skills. As a result, we collected 43,740 Alexa skills from 23 different skill categories (e.g., business & finance, social, kids, etc.). Our system then parsed data about the skills, such as ASIN (i.e., skill ID), icon, sample utterances, invocation name, description, reviews, permission list, and category, from the downloaded skill info pages.

For our analysis, we investigate all skills in Amazon's Kids category (3,439 kid skills). We ran SkillBot to interact with each skill and record the conversations. To speed up the task, we ran five processes of SkillBot simultaneously. Note that SkillBot can be run over time to revisit each skill and cumulatively collect new conversations as well as new branches of previously collected conversations for that skill. As a result, our sample had 31,966 conversations from 3,434 kid skills after removing five skills that resulted in errors or crashed Alexa.

6.2 Risky Kid Skill Findings

We performed content analysis on the conversations collected from 3,434 kid skills to identify risky kid skills that have inappropriate content or ask for personal information.

Skills with Inappropriate Content for Children. Our goal was to analyze the skills' contents to identify risky skills that provide inappropriate content to children. To identify such content, we combined WebPurify and Microsoft Azure's Content Moderator, two popular content moderation services that provide inappropriate-content filtering for websites and applications with a focus on child protection [7, 21]. We implemented a content moderation module for SkillBot in Python 3, leveraging the WebPurify API and the Azure Moderation API, to flag skills that have inappropriate content for children. As a result, our content moderation module flagged 33 potentially risky skills that have expletives in their content.

However, a human review process is necessary to verify the output, because whether or not a flagged skill actually has inappropriate content for children depends on context. For example, some of the expletives (such as "facial" and "sex") are likely considered appropriate in some conversational contexts. For the human review process, four researchers on our team (who come from 3 countries, including the USA, all of whom are English speakers, and whose ages range from 22 to 35) independently reviewed each of the flagged skills and voted on whether the skill's content is inappropriate for children. Skills that received three or four votes were counted towards the final list. Using this approach, we identified 8 kid skills with actual inappropriate content. Out of these 8 kid skills, SkillBot identified the inappropriate content of one skill at depth 11, one skill at depth 5, two at depth 4, one at depth 2, and three at depth 1.

We performed a false negative analysis by sampling 100 skills out of the skills that were not flagged as having inappropriate content and manually checking them. As a result, we found 0 false negatives.

Skills Collecting Personal Information. Our goal was to detect whether the skills asked users for personal information. To the best of our knowledge, available tools only focus on detecting personal information present in text, which is a different goal. For this analysis, we employed a keyword-based search approach to identify skill responses that asked for personal information. We constructed a list of personal information keywords based on the U.S. Department of Defense Privacy Office [12] and searched for these keywords in the skill responses. In particular, our list includes: name, age, address, phone number, social security number, passport number, driver's license number, taxpayer ID number, patient ID number, financial account number, credit card number, date of birth, and zipcode. A naive keyword search that simply looks for these keywords in the text would not be sufficient, because text containing these keywords does not always ask for such information. Thus, we combined keyword search with the question detection and answer generation techniques used in our Chatbot module (presented in Section 5.1) to detect whether a skill asked the user to provide personal information.

22 risky skills were flagged as asking users for personal information. To verify the result, we manually checked these 22 skills and 100 random skills that were not flagged. As a result, we found 2 false positives and 0 false negatives. Thus, 20 kid skills asked for personal information such as name, age, and birthday. Out of these 20 skills, SkillBot identified content that asks for sensitive information of one skill at depth 11, two skills at depth 4, six skills at depth 3, seven at depth 2, and four at depth 1. Also, SkillBot identified such content via non-sample utterances for three of the skills (i.e., utterances not listed among the three samples, but rather listed in the "Additional Instructions" section of the skill's page on Amazon.com).

We further analyzed the permission requests made by the skills. None of the identified 20 risky kid skills requested any permission from the user.

7 Awareness & Opinions of Risky Kid Skills

To evaluate how the risky kid skills we identified actually impact kid users (RQ2 and RQ3), we conducted a user study of 232 U.S. parents who use Amazon Alexa and have children under 13. Our goal was to qualitatively understand parents' expectations and attitudes about these risky skills, parents' awareness of parental control features, and how risky skills might affect children. Our study protocol was approved by our Institutional Review Board (IRB), and the full text of our survey instrument is provided in Appendix A. In this section, we describe our recruitment strategy, survey design, response filtering, and results.

7.1 Recruitment

We recruited participants on Prolific (https://www.prolific.co/), a crowd-sourcing website for online research. Participants were required to be adults 18 years or older who are fluent in English, live in the U.S. with their kids under 13, and have at least one Amazon Echo device in their home. We combined Prolific's pre-screening filters and a screening survey to obtain this niche sample of participants for our main survey. Our screening survey consisted of two questions to determine: (1) if the participant has kids aged 1 - 13 and (2) if the participant has Amazon Echo device(s) in their household. 1,500 participants took our screening survey, and 258 of them qualified for our main survey. The screening survey took less than 1 minute to complete, and our main survey took an average of 6.5 minutes (5.2 minutes in the median case). Participants were compensated $0.10 for completing the screening survey and $2 for completing the main survey. To improve response quality, we limited both the screening and main surveys to Prolific workers with at least a 99% approval rate.

7.2 Screening Survey

The screening survey consisted of two multiple-choice questions: "Who lives in your household?" and "Which electronic devices do you have in your household?". This allowed us to identify participants with kids aged 1 - 13 and Amazon Echo device(s) in their household who were eligible to take the main survey.

7.3 Main Survey

The main survey consisted of the following four sections.
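Returning briefly to Section 6.2, the keyword-plus-question-detection idea for flagging personal-information requests can be sketched as follows. The keyword list follows the paper; the request-cue test is a simplified stand-in for SkillBot's actual question detection module, and the cue words are our own assumption.

```python
# Illustrative sketch of the PII-request detection from Section 6.2.
# The keyword list follows the paper; the request test is a naive
# stand-in for the real NLP-based question detection.

PII_KEYWORDS = [
    "name", "age", "address", "phone number", "social security number",
    "passport number", "driver's license number", "taxpayer id number",
    "patient id number", "financial account number", "credit card number",
    "date of birth", "birthday", "zipcode",
]

REQUEST_CUES = ("what", "tell me", "give me", "may i have", "enter")

def asks_for_pii(response: str) -> bool:
    """Flag a skill response only if it appears to *ask* for a PII
    keyword, not merely mention one."""
    text = response.lower()
    if not any(kw in text for kw in PII_KEYWORDS):
        return False
    # Naive request test: the response is a question or uses a request cue.
    return "?" in text or any(cue in text for cue in REQUEST_CUES)
```

The two-stage structure mirrors the paper's observation that a bare keyword match is insufficient: a response that mentions "name" without asking for it should not be flagged.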
Parents' Perceptions of VPA Skills. This section investigated parents' opinions of and experiences with risky skills. Participants were presented with two conversation samples collected by SkillBot from each of the following categories (six samples total). Conversation samples were randomly selected from each category for each participant and were presented in random order.
• Expletive. Conversation samples from 8 skills identified in our analysis that contain inappropriate language content for children.
• Sensitive. Conversation samples from 20 skills identified in our analysis that ask the user to provide personal information, such as name, age, and birthday.
• Non-Risky. Conversation samples from 100 skills that did not contain inappropriate content for children or ask for personal information.
The full list of skills in the Expletive and Sensitive categories is provided in Appendix D. Each participant was asked the following set of questions after viewing each conversation sample:
• Do you think the conversation is possible on Alexa?
• Do you think Alexa should allow this type of conversation?
• Do you think this particular skill or conversation is designed for families and kids?
• How comfortable are you if this conversation is between your children and Alexa?
• If you answered "Somewhat uncomfortable" or "Extremely uncomfortable" to the previous question, what skills or conversations have you experienced with your Alexa that made you similarly uncomfortable?

Amazon Echo Usage. We asked which device model(s) of Amazon Echo our participants have in their household (e.g., Echo Dot, Echo Dot Kids Edition, Echo Show). We also asked whether their kids used Amazon Echo at home.

Awareness of Parental Control Feature. We asked the participants if they think Amazon Echo supports parental control (yes/no/don't know). Participants who answered "yes" were further asked to identify the feature's name (free-text response) and whether they used the feature (yes/no/don't know).

Demographic Information. At the end of the survey, we asked demographic questions about gender, age, and comfort level with computing technology. Our sample consisted of 128 male (55.2%), 103 female (44.4%), and 1 participant who preferred not to answer (0.4%). The majority (79.7%) were between 25 and 44 years old. Most participants in our sample are technically savvy (68.5%). See Table 5 in Appendix C for detailed demographic information.

7.4 Response Filtering

We received 237 responses for our main survey. We filtered out responses from participants who incorrectly answered either of two attention check questions ("What is the company that makes Alexa?" and "How many buttons are there on an Amazon Echo?"). We also excluded participants who gave meaningless responses (e.g., entering only whitespace into all free-text answer boxes). This resulted in 232 valid responses for analysis.

7.5 User Study Results

We find that most parents allow their kids to use types of Amazon Echo other than the Kids Edition. Such types of Echo do not have parental control enabled by default. We also find that many parents do not know about the parental control feature. For those who know about the feature, only a few of them use it. Thus, kids potentially have access to risky skills. Our results further show that parents are not aware of the risky skills that are available in the Kids category on Amazon. When presented with examples of risky kid skills that have expletives and those that ask for personal information, parents express concerns, especially for expletive ones. Some parents reported previous experiences of using such risky skills.

Parents' Perceptions of Kid Skills. Table 2 shows the distribution of responses to the following questions across the Expletive, Sensitive, and Non-Risky skill sets:
• Do you think the conversation is possible on Alexa?
• Do you think Alexa should allow this type of conversation?
• Do you think this particular skill or conversation is designed for families and kids?
A majority of parents thought that the interactions with the expletive skills were not possible and should not be allowed by Alexa. Only 45.9% of the respondents thought these interactions were possible, and only 41.6% of the respondents thought such skills should be allowed. Furthermore, most parents (57.1%) felt that the expletive skills were not designed for families and kids.

The parents' responses with regard to the expletive skills are significantly different from their responses to the sensitive and non-risky skills on these questions. For each of these three questions, we conduct Chi-square tests on the pairs of responses across the skill sets: Non-Risky vs. Expletive, Non-Risky vs. Sensitive, and Expletive vs. Sensitive. The responses from the Expletive set are significantly different from the responses from the other two sets for all three questions (p < 0.05). The responses to the "Alexa should allow" question are also significantly different for the Non-Risky set versus the Sensitive set (p < 0.05). In contrast, the responses for the "Possible on Alexa" and "Designed for families and kids" questions display no significant difference between the Sensitive and Non-Risky sets. This is alarming, as the sensitive skills ask for personal information through conversations with users, thereby bypassing Amazon's built-in permission control model for skills. As many skills are hosted by third parties, sensitive information about children could be leaked to someone other than Amazon.

Designed for Family and Children. Table 3 shows the distribution of responses for the question: "Do you think this particular skill or conversation is designed for families and kids?" with a breakdown across the different types of skills (Non-risky, Expletive, and Sensitive). These results show that the majority of parents (72.6%) did not think that skills with expletives were designed for families/kids. This indicates that the respondents were not aware of the skills with expletives that were actually developed for kids and published in Amazon's "Kid" category. In addition, about half of the parents (44.2%) did not think the sensitive skills were designed for families/kids, although these skills are actually in the "Kid" category on Amazon as well.

Parents' Comfort Level. We used a five-point Likert scale to measure parents' comfort levels if the presented conversations were between their children and Alexa. Figure 4 shows the participants' comfort levels for each skill category. These results indicate that parents were more uncomfortable with the Expletive skill conversations than with the Sensitive skill conversations. In particular, 42.7% of the respondents expressed discomfort ("Extremely uncomfortable" and "Somewhat uncomfortable") with the Expletive skills, compared to only 12.1% with the Sensitive skills and 5.6% with the Non-risky skills.

Figure 4: Participants' levels of comfort if conversations of a particular type happen between the participants' children and Alexa.

Chi-square tests show that parents' comfort with the Expletive conversations is significantly different from their comfort with the Sensitive conversations and with the Non-risky conversations (p < 0.05). Some participants expressed their concerns about skills in the Expletive set in free-text responses, including "It doesn't seem appropriate to tell jokes like this to children (P148)", "Under no circumstances should anyone have a coversation [sic] with children about orgasms. This would be grounds for legal action (P163)", "I do not believe Alexa should be used in such a crass manner or to teach my child how to be crass (P210)", "Poop and poopy jokes don't happen in my household (P216)", and "It is too sexual (P123)". Beyond the skills shown in the survey, one respondent also recalled hearing similar skills such as "Roastmaster (P121)." Another respondent remembered something similar but was unable to provide the name of the skill: "We have asked Alexa to tell us a joke in front of our young son and Alexa has told a few jokes that were borderline inappropriate (P140)."

We do not find any significant difference between parents' comfort with the Sensitive conversations versus the Non-risky conversations. However, the Sensitive conversations involved skills asking for different types of personal information. Out of the 20 skills in the Sensitive set, 15 skills asked for the user's name, 3 asked for the user's age, and 2 asked for the user's birthday. We show the distribution of the participants' comfort levels according to each type of personal information in Figure 5. This indicates that parents expressed more discomfort ("Extremely uncomfortable" and "Somewhat uncomfortable") for skills that ask for the user's birthday (15.2% of respondents), compared with skills that ask for the user's name (11.8%) or age (11.5%). Some participants expressed their concerns about these skills in free-text responses, including "I don't like a skill or Alexa asking for PII (P115)", "I haven't had a similar experience but I think it is inappropriate for Alexa to be asking for the name of a child (P209)", "I don't know why it needs a name (P228)", and "I would not want Alexa to collect my children's imformation [sic] (P003)".

Figure 5: Participants' levels of comfort for each type of personal information, if the conversations happen between the participants' children and Alexa.

Amazon Echo Usage. Our results also show that most households with kids use Echo devices other than the Echo Kids Edition. The Echo Dot was the most popular type of Echo device (46.4%) in our participants' households. Only 27 participants (6.8%) bought an Echo Dot Kids Edition, which has parental control mode enabled by default. This shows that if kids use Echo, they likely have access to the types of Echo devices that do not have parental control mode enabled by default. Furthermore, the majority of participants (91.8%) reported that their kids do use Amazon Echo at home. Figure 6 shows the types of Echo that the participants own in their household, with a breakdown of answers to the question "Do your kids use Amazon Echo at home?". Most parents allow their kids to use Amazon Echo at home even without an Echo Dot Kids Edition. This indicates that many kids have access to risky skills, as these skills can be used by default on Echo devices other than the Kids Edition.

Awareness of Parental Control Feature. We analyzed the responses to the question: "Does Amazon Echo support parental control?" In total, 76.3% said "yes", 0.4% said "no", and 23.3% were unsure. For participants who had an Echo Kids Edition, almost all (92.6%) said "yes", 7.4% said "no", and none were unsure.
In contrast, for participants without Echo