SkillBot: Identifying Risky Content for Children in Alexa Skills
Tu Le (University of Virginia), Danny Yuxing Huang (New York University), Noah Apthorpe (Colgate University), Yuan Tian (University of Virginia)

arXiv:2102.03382v1 [cs.MA] 5 Feb 2021

Abstract

Many households include children who use voice personal assistants (VPAs) such as Amazon Alexa. Children benefit from the rich functionalities of VPAs and third-party apps but are also exposed to new risks in the VPA ecosystem (e.g., inappropriate content or information collection). To study the risks VPAs pose to children, we build a Natural Language Processing (NLP)-based system to automatically interact with VPA apps and analyze the resulting conversations to identify content risky to children. We identify 28 child-directed apps with risky content and maintain a growing dataset of 31,966 non-overlapping app behaviors collected from 3,434 Alexa apps. Our findings suggest that although voice apps designed for children are subject to more policy requirements and intensive vetting, children are still vulnerable to risky content. We then conduct a user study showing that parents are more concerned about VPA apps with inappropriate content than those that ask for personal information, but many parents are not aware that risky apps of either type exist. Finally, we identify a new threat to users of VPA apps: confounding utterances, or voice commands shared by multiple apps that may cause a user to invoke or interact with a different app than intended. We identify 4,487 confounding utterances, including 581 shared by child-directed and non-child-directed apps.

1 Introduction

The rapid development of Internet of Things (IoT) technology has aligned with the growing popularity of voice personal assistant (VPA) services such as Amazon Alexa and Google Home. In addition to the first-party features provided by these products, VPA service providers have also developed platforms that allow third-party developers to build and publish their own voice apps, hereafter referred to as "skills".

Risks to Children from VPAs. Researchers have found that 91% of children between ages 4 and 11 in the U.S. have access to VPAs, 26% of children are exposed to a VPA between 2 and 4 hours a week, and 20% talk to VPA devices more than 5 hours a week [16]. The lack of robust authentication on commercial VPAs makes it challenging to regulate children's use of skills [53], especially as anyone in the physical vicinity of a VPA can interact with the device. As a result, children may have access to risky skills that deliver inappropriate content (e.g., expletives) or collect personal information through voice interactions.

There is no systematic testing tool that vets VPA skills to identify those that contain risky content for children. Legal efforts and industry solutions have tried to protect children using VPAs, but their effectiveness is unclear. The 1998 Children's Online Privacy Protection Act (COPPA) regulates the information collected online from children under 13 [10], but widespread COPPA violations have been shown in the mobile application market [45], and compliance in the VPA space is far from guaranteed. Additionally, parental control modes provided by VPAs (e.g., Amazon FreeTime and Google Family App) often place a burden on parents during setup and receive complaints from parents due to their limitations [1, 9, 15].

Research Questions. Protecting children in the era of voice devices therefore raises several pressing questions:
• RQ0. Can we automate the analysis of VPA skills to identify content risky for children without requiring manual human voice interactions?
• RQ1. Are VPA skills targeted to children that claim to follow additional content requirements — hereafter referred to as "kid skills" — actually safe for child users?
• RQ2. What are parents' attitudes toward and awareness of the risks posed by VPAs to children?
• RQ3. How likely is it for children to be exposed to risky skills through confounding utterances, i.e., voice commands shared by multiple skills that could cause a child to accidentally invoke or interact with a different skill than intended?
In this paper, we design, implement, and perform a systematic automated analysis of the Amazon Alexa VPA skill ecosystem and conduct a user study to answer these research questions.

Challenges to Automated Skill Analysis. In comparison to mobile applications and other traditional software, neither the executable files nor the source code of VPA skills is available to researchers for analysis. Instead, the skills' natural language processing modules and key function logic are hosted in the cloud as a black box, so decompilation and traditional static or dynamic analysis methods cannot be applied to VPA skills. VPA skill voice interactions are built following a template defined by the third-party developer, which is also unavailable to researchers. To automatically detect risky content, we need to generate testing inputs that trigger this content through sequential interactions. A further challenge is that risky content does not always occur during a user's first interaction with a skill; human users often need back-and-forth conversations with skills to discover risky content. Automating this process requires a tool that can generate valid voice inputs and dynamic follow-up responses that cause the skill to reveal risky content. This differs from existing chatbot development techniques [28]: the goal is not to generate inputs that sound natural to a human, but to generate inputs that explore the space of skill behaviors as thoroughly as possible.

Automated Identification of Risky Content. This paper presents our systematic approach to analyzing VPA skills based on automated interactions. We apply this approach to 3,434 Alexa skills targeted toward children in order to measure the prevalence of kid skills that contain risky content. More specifically, we build a natural-language-based system called "SkillBot" that interacts with VPA skills and analyzes the results for risky content, including inappropriate language and personal information collection (Section 5). SkillBot generates valid skill inputs, analyzes skill responses, and systematically generates follow-up inputs. Through multiple rounds of interaction, we can determine whether skills contain risky content. The design of SkillBot answers RQ0, and our SkillBot analysis of 3,434 kid skills allows us to answer RQ1. We identify 8 kid skills with inappropriate content and 20 kid skills that ask for personal information (Section 6).

Online User Study and Insights from Parents. We next wanted to verify our SkillBot results by seeing whether parents also viewed the identified skills as risky, and to better understand the real-world contexts of children's interactions with VPAs (RQ2). We conduct a user study of 232 U.S. Alexa users who have children under 13 years old. We present these parents with examples of interactions with risky and non-risky skills identified by SkillBot and ask them to report their reactions to these skills, their experiences with risky/unwanted content on their own VPAs, and their use of VPA parental control features. We find that parents are uncomfortable with the inappropriate content in our identified skills: 54.1% cannot imagine that such interactions are possible on Alexa, and 58.4% believe that Alexa should block such interactions. Many parents do not think that these skills are designed for families/kids, although they are actually published in Amazon's "Kids" category. We also find that 23.7% of parents do not know about Alexa's parental control feature, and of those who know about the feature, only 29.4% use it. These data highlight the risks to children posed by VPA skills with inappropriate content. While SkillBot demonstrates that such skills exist, parents are predominantly unaware of this fact and typically neglect basic precautions such as activating parental controls.

Confounding Utterances. Our analysis also reveals a novel threat that is particularly problematic for child users: confounding utterances. Confounding utterances are voice inputs that are shared by more than one skill that may be present on a VPA device. When a user interacts with a VPA via a confounding utterance, the utterance might trigger a reaction from any of these skills. If a kid skill shares a confounding utterance with a skill inappropriate for kids, a child user might inadvertently begin interacting with the inappropriate skill. For example, a child may use an utterance to invoke a kid skill X, but another skill Y, which is in a non-kid category and shares the same utterance, could be triggered instead. As the Echo does not offer visual cues about which skill is actually invoked, the user may not realize that Y is running instead of X. Furthermore, skills in non-kid categories typically face more relaxed requirements than kid skills do, so the child user could be exposed to risky content from skill Y. SkillBot reveals 4,487 confounding utterances, 581 of which are shared between a kid skill and a skill that is not in the "Kids" category (Section 8). Of these 581 utterances, 27% prioritize invoking a non-kid skill over a kid skill. This indicates that children are at real risk of accidentally invoking non-kid skills and that an adversary could exploit overlapping utterances to get child users to invoke non-kid skills (RQ3).

Contributions. We make the following contributions:
• Automated System for Skill Analysis: We present a system, SkillBot, that automatically interacts with Alexa skills and collects their content at scale. Our system can be run longitudinally to identify new conversations and new conversation branches in previously analyzed skills. We plan to publicly release our system to help future research.
• Identification of Risks to Children: We analyze 31,966 conversations collected from 3,434 Alexa kid skills to detect potentially risky skills directed at children. We find 8 skills that contain content inappropriate for children and 20 skills that ask for personal information through voice interaction.
• User Study of Parents' Awareness and Experiences: We conduct a user study demonstrating that a majority of parents express concern about the content of the risky kid skills identified by SkillBot, tempered by disbelief that these skills are actually available for Alexa VPAs. This lack of risk awareness is compounded by findings that many parents do not use VPA parental controls and allow their children to use VPA versions that do not have parental controls enabled by default.
• Confounding Utterances: We identify confounding utterances as a novel threat to VPA users. Our SkillBot analysis reveals 4,487 confounding utterances shared between two or more skills, and we highlight those that place child users at risk by invoking a non-kid skill instead of an expected kid skill.

2 Background

Voice Personal Assistant. A VPA is a software agent that interprets users' speech to perform certain tasks or answer users' questions via synthesized voice. Most VPAs, such as Amazon Alexa and Google Home, follow a cloud-based system design: when the user speaks a request to the VPA device (e.g., an Amazon Echo), the request is sent to the VPA service provider's cloud server, which processes it and invokes the corresponding skill. Third-party skills can be hosted on external web services instead of the VPA service provider's cloud server.
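The cloud-based design can be pictured as a small dispatcher: the device forwards the transcribed speech to a provider-side router, which matches an invocation phrase and hands the request to the skill's own handler. This is only a toy illustration of the architecture described above; the skill name, registry, and handler are invented for this sketch, not part of Alexa's actual API.

```python
# Toy illustration of the cloud-based VPA design: the device forwards the
# transcribed request to a "cloud" router, which matches an invocation
# phrase and dispatches to the skill's handler ("external web service").
# The skill and handler here are hypothetical.

def ted_talks_handler(request):
    # Stands in for a third-party web service hosting the skill's logic.
    return "Welcome to Ted Talks. What topic would you like?"

SKILL_REGISTRY = {  # invocation phrase -> third-party skill handler
    "open ted talks": ted_talks_handler,
}

def cloud_router(utterance):
    """Provider-side step: map a spoken request onto a registered skill."""
    phrase = utterance.lower().removeprefix("alexa, ").strip()
    handler = SKILL_REGISTRY.get(phrase)
    if handler is None:
        return "Sorry, I don't understand."
    return handler(phrase)

print(cloud_router("Alexa, open Ted Talks"))
```

The point of the sketch is the indirection: the device never runs skill code, so nothing is installed locally and the routing decision is entirely the provider's.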
Building and Publishing Skills. To provide a broader range of features, Amazon allows third parties to develop skills for Alexa via the Alexa Skills Kit (ASK) [5]. Using ASK, developers can build custom Alexa skills that use their own web services to communicate with Alexa [14]. More than 50,000 skills are currently publicly available on the Alexa Skills Store [6], covering a wide variety of features such as reading news, playing games, controlling smart homes, checking credit card balances, and telling jokes.

Enabling and Invoking Skills. Unlike mobile apps, Alexa skills are hosted on Amazon's cloud servers, so users do not have to download any binary file or run any installation process. To use a skill, users only need to enable it in their Amazon account. There are two ways to enable or disable a skill. The first is via the skill's info page, which contains an enable/disable button; users can access this page via the Alexa Skills Store on the Amazon website or via the Alexa companion app. The other way is via voice command. Note that, for usability, Amazon also allows invoking skills directly through voice without enabling them first.

Users can invoke a skill by saying one of its invocation phrases [18]. Invocation phrases come in two types: with intent and without intent. For example, one can say "Alexa, open Ted Talks" to invoke the Ted Talks skill, or "Alexa, open Daily Horoscopes for Capricorn" to tell the Daily Horoscopes skill to give information about Capricorn. Since a sentence can be paraphrased in different ways, multiple variants of an invocation phrase can perform the same task. Alexa also allows some flexibility in invoking skills through its name-free interaction feature [19]: the user can speak a request that does not necessarily include a skill name, and Alexa processes the request and selects a top candidate skill that fulfills it. If the chosen skill is not yet enabled by the user, it may be auto-enabled.

Every skill has an Amazon webpage, which includes at most three sample utterances, i.e., voice commands with which users can verbally interact with the skill. In addition, the webpage may include an "Additional Instructions" section with additional voice commands for interaction, although these additional commands are optional.

3 Alexa Parental Control, Permission Control, and their Limitations

We first introduce the current schemes for protecting child users on Alexa, namely parental control and permission control, and then show their limitations.

Alexa Parental Control. Amazon FreeTime is a parental control feature that allows parents to manage what content their children can access on their Amazon devices. FreeTime on Alexa provides a Parent Dashboard user interface for parents to set daily time limits, monitor activities, and manage allowed content. If FreeTime is enabled, users can by default only use skills in the Kids category; to use other skills, parents need to manually add them to a whitelist. FreeTime Unlimited is a subscription for children under 13 that offers thousands of pieces of kid-friendly content, including a list of kid skills available on compatible Echo devices. Parents can purchase this subscription via their Amazon account and use it across all compatible Amazon devices.

Children can potentially access an Amazon Echo device located in a shared space and invoke such "risky" skills in the absence of child-protection features, for the following reasons. FreeTime is turned off by default on the regular version of the Amazon Echo, and previous studies in medicine [31], psychology [39], and behavioral economics [35] have shown that people often opt for default settings. Although parents can turn on FreeTime on a regular Amazon Echo, the feature places a usage burden on users. For example, users sometimes cannot remove or disable certain skills added by FreeTime (an issue since 2017 [1, 9]), and some users find it hard to access the list of skills available via FreeTime Unlimited [13, 15]. In particular, skills that parents would love to use may not be appropriate for kids and thus are not allowed in FreeTime mode by default. As a consequence, users may mistake not being able to use a skill in FreeTime mode for a bug in the skill itself, which leads to complaints being sent to the skill developer [4]. If parents want to use these skills in FreeTime mode, they have to manually add them to the whitelist in the Parent Dashboard interface, and they have to remember to enable or disable FreeTime at the appropriate times, which affects the user experience.

Alexa Permission Control. Alexa skills might need personal information from users to give accurate responses or to process transactions. To get any personal information, a skill should request the corresponding permission from the user: when the user first enables the skill, Alexa asks the user to go to the Alexa companion app to grant the requested permission. However, this permission control mechanism only protects personal information in the user's Amazon Alexa account. If a skill does not specify permission requests but directly asks for such personal information through voice interaction, it can easily bypass the permission control.

4 Threat Model

In this paper, we consider two main types of threats: (1) risky skills, i.e., skills that contain inappropriate content or ask for a user's personal information through voice interaction, and (2) confounding utterances, i.e., utterances that are shared among two or more different skills.
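The second threat is, at its core, a set-overlap property: an utterance is confounding whenever two or more skills claim it. The sketch below shows the idea; the skill names and utterance lists are hypothetical examples, not data from our measurements.

```python
# Illustrative sketch of detecting confounding utterances: utterances
# claimed by two or more skills. All skill names and utterances below
# are invented examples.

def normalize(utterance):
    """Lowercase and collapse whitespace so identical phrasings match."""
    return " ".join(utterance.lower().split())

def confounding_utterances(skills):
    """Return each utterance shared by >= 2 skills, with the sharing skills.

    `skills` maps a skill name to its list of accepted utterances.
    """
    claims = {}
    for name, utterances in skills.items():
        for u in utterances:
            claims.setdefault(normalize(u), set()).add(name)
    return {u: names for u, names in claims.items() if len(names) >= 2}

skills = {
    "Fun Facts for Kids (kid)": ["give me a fun fact", "tell me a joke"],
    "Edgy Jokes (non-kid)": ["Tell me a joke"],
}
shared = confounding_utterances(skills)
```

Here "tell me a joke" is flagged because both the kid and the non-kid skill accept it; which skill Alexa actually prioritizes for such an utterance is exactly the question Section 8 measures.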
and (2) confounding utterances (i.e., utterances that are shared among two or more different skills). Risky Skills. We investigate the risky content that harm the children. We define “risky" skills as skills that contain two kinds of content: (1) inappropriate content for children or (2) asking for personal information through voice interaction. An Figure 1: Automated Skill Interaction Pipeline Overview example is the “My burns” skill in Amazon’s Kids category that says “You’re so ugly you’d scare the crap out of the toilet. I’m on a roll”. These threats may come from either 5.1 Automated Interaction System Design an adversary who intentionally develops malicious skills or a benign/inexperienced developer who is not aware of the risks. Our goal for SkillBot is to interact effectively and efficiently Confounding Utterances. We identify a new risk which with the skills and uncover the risky content for children in we call “confounding utterances”. We define confounding skill’s behaviors thoroughly and at scale. utterances as utterances that are shared among two or more Overview. Our system consists of four main components: different skills. Effectively, a confounding utterance used by Skill Information Extractor, Web Driver, Chatbot, and Conver- the user could trigger an unexpected skill for the user. sation Dataset (see the workflow in Figure 1). Skill Informa- Confounding utterances are different from previous re- tion Extractor handles exploring, downloading, and parsing search on voice squatting attacks, which exploited the speech information of skills available in the Alexa skills store. Web recognition misinterpretations made by voice personal as- Driver handles connections to Alexa and requests from/to sistants [32, 33, 54, 55]. They showed that voice command the skills. Chatbot discovers interactions with the skills and misinterpretation problem due to spoken errors could yield records the conversations into Conversation Dataset. 
unwanted skill interactions, and an adversary can route the users to malicious Alexa skills by giving the skill invocation Skill Information Extractor. Amazon provides an online names that are pronounced similar to the legitimate one. repository of skills via Alexa Skills Store [6]. Each skill is In contrast, this paper considers a new risk that even if there an individual product, which has its own product info page is no such voice command misinterpretation, Alexa may still and an Amazon Standard Identification Number (ASIN) that invoke the skill that the user does not want because multiple can be used to search for the skill in Amazon’s catalogue [22]. skills can have completely same utterances. We want to find The URL to a skill’s info page can be constructed from its out, given a confounding utterance that is shared between ASIN. Our skill information extractor includes a web scraper multiple skills, which skill Alexa prioritizes to enable/invoke. to systematically access the Alexa website and download the Users have no control over what skills are actually opened skills’ info page in HTML based on their ASINs (i.e., skill either upon an intentional voice command or an unintentional IDs). It then reads the HTML files and constructs json dictio- one (e.g., Alexa being triggered by background conversations). nary structure using BeautifulSoup library [8]. For each skill, In other words, a confounding utterance may invoke a ran- we extract any information available on its info page such dom skill which is not the user’s intention. With name-free as ASIN (i.e., skill’s ID), icon, sample utterances, invocation interaction feature [19], users can invoke a skill without its name, description, reviews, permission list, and category (e.g., invocation name. Thus, an unexpected skill can be mistakenly kids, education, smart home, etc.). invoked by users. Furthermore, there is no downloading or Web Driver. 
We leverage Amazon’s Alexa developer con- installation process on the customers’ devices which makes sole [2] to allow programmatically interacting with skills us- it easy for the these skills to bypass user awareness. For in- ing text inputs. We build a web driver module using Selenium stance, a child may have one skill in mind but accidentally framework [17], which is a popular web browser automation invoke a different skill that has a similar invocation name (or framework for testing web applications, to automate send- similar utterances). An adversary can exploit confounding ing requests to Alexa and interacting with the skill info page utterances to get kids to use malicious skills. to check the status of the skill (i.e., enabled, disabled, not available). We also implement a module that handles skill 5 Automated Interaction with Skills enabling/disabling requests. This module uses private APIs derived from inspecting XMLHttpRequest within network To study the impacts that risky skills might have on chil- activities of Alexa webpages. dren, we propose SkillBot, which systematically interacts with the skills to discover risky content and confounding ut- Chatbot. We build an NLP-based module to interact with terances. In this section, we first show how we design SkillBot the skills and explore as much content of the skills as possible. for interacting with the skills and collecting their responses The module includes several techniques to explore sample thoroughly and at scale. We then evaluate SkillBot for its utterances suggested by the skill developers, create additional reliability, coverage, and performance. utterances based on the skill’s info, classify utterances, detect 4
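The Skill Information Extractor's parsing step — info-page HTML in, JSON-serializable dictionary out — can be sketched with BeautifulSoup as below. The page layout (element ids, list structure) and the ASIN are invented for the example; the real Alexa Skills Store markup differs, so this is an illustration of the approach, not our actual scraper.

```python
# Sketch of the Skill Information Extractor's parsing step, against a
# hypothetical info-page layout (the real store markup differs).
import json
from bs4 import BeautifulSoup

HTML = """
<html><body>
  <h1 id="title">Fun Facts for Kids</h1>
  <ul id="sample-utterances">
    <li>Alexa, open Fun Facts for Kids</li>
    <li>give me a fun fact</li>
  </ul>
  <span id="category">Kids</span>
</body></html>
"""

def parse_skill_page(html, asin):
    """Turn one downloaded skill info page into a JSON-serializable dict."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "asin": asin,  # placeholder ID for illustration
        "name": soup.find(id="title").get_text(strip=True),
        "sample_utterances": [li.get_text(strip=True)
                              for li in soup.select("#sample-utterances li")],
        "category": soup.find(id="category").get_text(strip=True),
    }

record = parse_skill_page(HTML, asin="B000000000")
print(json.dumps(record, indent=2))
```

One such dictionary per skill is what the downstream modules consume: the sample utterances seed the Chatbot, and the category field distinguishes kid from non-kid skills.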
Exploring and Classifying Utterances: Amazon allows developers to list up to three sample utterances in the sample utterances section of their skill's information page. Our system first extracts these sample utterances. Some developers also put additional instructions into their skill's description, so our system further processes the description to generate more utterances. In particular, we consider sentences that start with an invocation word (i.e., "Alexa, ...") to be utterances. We also notice that phrases inside quotes can be utterances; an example is "You can say 'give me a fun fact' to ask the skill for a fun fact". Once a list of collected utterances is constructed, our system classifies them into opening and in-skill utterances. Opening utterances are used to invoke/open a skill; these often include the skill's name and start with opening words such as open, launch, and start [18]. In-skill utterances are used within the skill's session (when the skill is already invoked); examples include "tell me a joke", "help", and "more info".

Detecting Questions in Skill Responses: To extend a conversation, our system first classifies responses collected from the skill into three main categories: Yes/No questions, WH questions, and non-question statements. For this classification task, we employ spaCy [30] and StanfordCoreNLP [38, 43], which are popular NLP tools. In particular, we first tokenize the skill's response into sentences and each sentence into words. We then annotate each sentence with part-of-speech (POS) tags, using both TreeBank POS tags [48] and Universal POS tags [20]. With the POS tags, we can identify the role of each word in the sentence, such as auxiliary, subject, or object.

A Yes/No question usually starts with an auxiliary verb, following the subject-auxiliary inversion rule. Yes/No questions generally take the form [auxiliary + subject + (main verb) + (object/adjective/adverb)?]. Some examples are "Is she nice?", "Do you play video games?", and "Do you swim today?". The auxiliary verb may also be a negative contraction, as in "Don't you know it?" or "Isn't she nice?".

A WH question contains WH words such as what, why, or how. We first identify these WH words based on their POS tags: "WDT", "WP", "WP$", and "WRB". Next, we check for WH question grammar structure. Regular WH questions usually take the form [WH-word + auxiliary + subject + (main verb) + (object)?]; examples are "What is your name?" and "What did you say?". We also consider pied-piping WH questions such as "To whom did you send it?". We exclude cases where WH words are used in a non-question statement, such as "What you think is great", "That is what I did", and "What goes around comes around".

Generating Follow-up Utterances: Given a skill response, there are three ways to follow up.

(1) Yes/No questions: This type of question asks for confirmation from the user, expecting either a "yes" or a "no" answer. Our system sends "yes" or "no" as a follow-up utterance to continue the conversation.

(2) WH questions: For WH questions, we further employ the question classification method presented in [49] to determine the theme of an open-ended question. There are six general categories of question theme: Abbreviation, Entity, Description, Human, Location, and Numeric [11]. 'Abbreviation' includes questions that ask about a short form of an expression (e.g., "What is the abbreviation for California?"). 'Entity' includes questions about objects that are not human (e.g., "What is your favorite color?"). 'Description' includes questions about explanations of concepts (e.g., "What does a defibrillator do?"). 'Human' includes questions about an individual or a group of people. 'Location' includes questions about places such as cities, countries, and states. 'Numeric' includes questions asking for numerical values such as count, weight, or size. Each category can have subcategories; for example, 'Human' has 'name' and 'title', and 'Location' has 'city', 'country', 'state', etc. We create a dictionary of answers to these subcategories (e.g., "age": {1, 2, 3, ...}, "states": {Oregon, Arizona, ...}) to continue the conversation with the skill. For questions asking about knowledge, such as those in 'Abbreviation' or 'Description' whose subcategories are too general, our system also sends "I don't know. Please tell me." to prompt for responses from the skill.

(3) Non-question statements: These include two types: directive statements and informative statements. Some directive statements ask the user to provide an answer to a question, which is essentially similar to a WH question; an example is "Please tell us your birthday". For these cases, our system parses the sentence to determine what is being asked and handles it like a WH question (discussed above). Other directive statements suggest words or phrases for the user to say to continue the conversation, such as "Please say 'continue' to get a fun fact" and "Say '1' to get info about a book, '2' to get info about a movie". For these cases, our system extracts the suggested words/phrases and uses them to continue the conversation. Informative statements provide users with information such as a joke, a fact, or daily news; these usually give no directive on what else the user can say. Thus, our system sends an in-skill utterance such as "Tell me another one" or "Tell me more" as a follow-up to explore more content from the skill.

Conversation Dataset. Our conversation dataset is a set of JSON files, each representing one skill. Each file contains a list of conversations with the skill collected by the chatbot module. Each conversation is stored as a list in which even indexes hold the utterances sent by our system and odd indexes hold the corresponding responses from the skill.
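The classify-then-follow-up loop above can be approximated with simple surface rules, as sketched below. This keyword heuristic only mimics the behavior of our POS-tagging pipeline (spaCy/StanfordCoreNLP) on easy cases; the word lists and thresholds are simplifications for illustration, not our implementation.

```python
# A deliberately simplified, rule-based approximation of the chatbot's
# response classifier and follow-up generator. The real system uses
# POS tagging; this keyword sketch only mimics it on easy cases.
AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did", "can",
               "could", "will", "would", "should", "don't", "isn't",
               "doesn't", "didn't"}
WH_WORDS = {"what", "when", "where", "who", "whom", "whose", "which",
            "why", "how"}

def classify_response(sentence):
    """Label a skill response as 'yes_no', 'wh', or 'statement'."""
    words = sentence.lower().rstrip("?!.").split()
    if not sentence.strip().endswith("?") or not words:
        return "statement"
    # WH check first; also catches pied-piping like "To whom did you...?"
    if words[0] in WH_WORDS or (len(words) > 1 and words[1] in WH_WORDS):
        return "wh"
    if words[0] in AUXILIARIES:   # subject-auxiliary inversion heuristic
        return "yes_no"
    return "statement"

def follow_up(sentence):
    """Pick a follow-up utterance to keep the conversation going."""
    kind = classify_response(sentence)
    if kind == "yes_no":
        return "yes"
    if kind == "wh":
        return "I don't know. Please tell me."
    return "Tell me more"
```

Note that "What you think is great" is correctly left as a statement here only because it lacks a question mark; handling such cases on punctuation-free speech transcripts is exactly why the full system relies on grammatical structure rather than keywords.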
5.2 Exploring Conversation Trees

For each skill, SkillBot runs multiple rounds to explore different paths within the conversation tree. Each node in this tree is a unique response from Alexa. There is an edge between nodes i and j if there exists an interaction where Alexa says i, the user (i.e., SkillBot) says something, and then Alexa says j. We call the progression from i to j a path in the tree. Furthermore, multiple paths of interactions could exist for a skill. For instance, node i could have two edges: one with j and another one with k. Effectively, two paths lead from i. In one path, the user says something after hearing i, and Alexa responds with j. In another path, the user says something else after hearing i, and Alexa responds with k.

To illustrate how we construct a conversation tree for a typical skill, we show a hypothetical example in Figure 2. First, the user would launch a skill by saying "Open Skill X" or "Launch Skill X". This initial utterance can be found in the "Sample Utterances" section of the skill's information page on Amazon.com; alternatively, it could also be displayed in the "Additional Instructions" section on the skill's page. Per Figure 2, let us assume that either "Open Skill X" or "Launch Skill X" triggers the same response from Alexa, "Welcome to Skill X. Say 'Continue'," which is denoted by Node 1 in Figure 2. The user would say "Continue" and trigger another response (denoted as Node 2) from Alexa, "Great. Would you like to do A?" The user could either respond with "Yes", which would trigger the response in Node 3, or "No", which would trigger Node 4.

SkillBot explores multiple paths of the conversation tree by interacting with a skill multiple times, each time picking a different response. Per the example in Figure 2, the first time SkillBot runs on this skill (i.e., the first run), it could follow a path along Nodes 1, 2, and 3. Once at Node 3, the skill in this example does not provide the user with the option to return to the state in Node 2, so to explore a different path, SkillBot would have to start over. In the second run, SkillBot could follow a path along Nodes 1, 2, 4, and 5. SkillBot responds with "No" after Node 2 because it remembers answering "Yes" in the previous run. In the third run, SkillBot could follow Nodes 1, 2, 4, and 6.

Each run of SkillBot terminates when exploring down a particular path is unlikely to trigger new responses from Alexa; in this case, SkillBot starts over with the same skill and explores a different path. We list four conditions under which SkillBot terminates a particular run: (i) Alexa's response is not new; in other words, SkillBot has seen the same response in a previous run of the skill and/or in a different skill. SkillBot's goal is to maximize interactions with unique responses from Alexa, rather than previously seen ones, in an attempt to discover risky content. (ii) Alexa's response is empty. (iii) Alexa's response is a dynamic audio clip (e.g., music or a podcast, which does not rely on Alexa's automated voice). Due to limitations of the Alexa simulator, SkillBot is unable to extract and parse dynamic audio clips; as such, SkillBot terminates a path if it encounters a dynamic audio clip because it does not know how to react. (iv) Alexa's response is an error message, such as "Sorry, I don't understand."

  User:  "Open Skill X" or "Launch Skill X."
  Alexa: "Welcome to Skill X. Say 'Continue'."              (Node 1)
  User:  "Continue."
  Alexa: "Great. Would you like to do A?"                   (Node 2)
  User:  "Yes."  Alexa: "Let's do A."                       (Node 3)
  User:  "No."   Alexa: "OK. Say 'C' to do C, or say 'D' to do D."  (Node 4)
  User:  "C."    Alexa: "Let's do C."                       (Node 5)
  User:  "D."    Alexa: "Let's do D."                       (Node 6)

Figure 2: A conversation tree that represents how we interact with a typical skill.

5.3 Evaluation

In this section, we present our validation to ensure that interacting with skills via SkillBot (presented in Section 5) is representative of users' interactions with skills via a physical Echo device. We further validate the performance of SkillBot.

Interaction Reliability. We randomly selected 100 skills for validation. We used an Echo Dot device to interact with the skills and compared the results with our system. Note that since a skill can have dynamic content, which makes its responses differ across invocations, we first check the collected skill responses. If they do not match, we further check the skill invocation in Alexa's activity log to see if the same skill is invoked. We find that our system and the Echo Dot share similar interactions for 99 skills. Among these 99 skills, two skills responded with audio playbacks, which are not supported by the Alexa developer console [3] employed in our system (see detailed justifications in Section 9). However, their invocations were shown in the activity log, which matched the invocations when using the Echo Dot. We cannot verify the remaining skill, as Alexa cannot recognize its sample utterances. This might be an issue with the skill's web service.

Skill's Responses Classification. As described in Section 5.1, to extend the conversation with a skill, our system classifies responses from the skill into three groups: Yes/No question, WH question, and non-question statement. To evaluate the performance, we randomly sampled 300 unique skill responses from our conversation collection and manually labeled them to create a ground truth. In the ground truth, we had 52 Yes/No questions, 50 open-ended questions, and 198 non-question statements. We then used our system to label these responses and verified the labels against our ground truth. Our classifier predicted 56 Yes/No questions, 50 open-ended questions, and 194 non-question statements, which is over 95% accuracy. The performance detail for each class is shown in Table 1 (see Table 6 in Appendix E for the confusion matrix of our 3-class classifier).
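The three-way response classification described above can be approximated with simple lexical heuristics. The sketch below is our own illustration under that assumption; the paper does not show SkillBot's actual classifier, so the cue lists and function name are hypothetical.

```python
import re

# Hypothetical heuristic stand-in for the three-way response classifier
# described in Sections 5.1 and 5.3 (not SkillBot's actual code).

YESNO_LEADS = ("do ", "does ", "did ", "is ", "are ", "was ", "were ",
               "can ", "could ", "would ", "will ", "should ", "have ",
               "has ", "may ", "shall ")
WH_WORDS = ("what", "who", "whom", "whose", "where", "when", "why",
            "which", "how")

def classify_response(text: str) -> str:
    """Label an Alexa response as 'yes_no', 'open_ended', or 'statement'."""
    # Look at the last sentence, since skills usually end with the prompt.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    last = sentences[-1].lower() if sentences else ""
    if not last.endswith("?"):
        return "statement"
    body = last.rstrip("?").strip()
    first_word = body.split()[0] if body.split() else ""
    if first_word in WH_WORDS:
        return "open_ended"
    if any(body.startswith(lead) for lead in YESNO_LEADS):
        return "yes_no"
    # Questions without a clear auxiliary-verb cue default to open-ended.
    return "open_ended"
```

In practice, classifying only the final sentence matters because skills often prepend narration ("Great. Would you like to do A?") before the actual prompt.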
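The per-run termination logic of Section 5.2 can be sketched as a loop over one conversation path. This is an illustrative reconstruction: `pick_utterance`, `get_response`, and the concrete stop-condition tests (e.g., matching "Sorry") are our own stand-ins for the simulator interface; only the four stop conditions themselves follow the paper.

```python
# Illustrative sketch of SkillBot's run-termination logic (Section 5.2).
# `pick_utterance` and `get_response` are hypothetical callables standing
# in for the chatbot module and the Alexa simulator, respectively.

def explore_one_run(skill, seen_responses, pick_utterance, get_response):
    """Follow one path through a skill's conversation tree, stopping
    under the four conditions listed in Section 5.2."""
    path = []
    utterance = pick_utterance(path)          # e.g., "Open Skill X"
    while utterance is not None:
        response = get_response(skill, utterance)
        if response in seen_responses:        # (i) response is not new
            break
        if not response:                      # (ii) response is empty
            break
        if response == "<dynamic audio>":     # (iii) unparseable audio clip
            break
        if response.startswith("Sorry"):      # (iv) error message (crude test)
            break
        seen_responses.add(response)
        path.append((utterance, response))
        utterance = pick_utterance(path)      # next reply, or None to stop
    return path
```

Because `seen_responses` is shared across runs (and, per the paper, across skills), each subsequent run is steered toward branches that still yield unexplored responses.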
Table 1: Skill Response Classification Performance

               Accuracy  Precision  Recall  F1 Score
Yes/No         98%       0.91       0.98    0.94
Open-ended     98%       0.94       0.94    0.94
Non-question   96%       0.98       0.96    0.97

Coverage. We measure the coverage of SkillBot by analyzing the conversation trees for every skill. Our analysis includes four criteria: (i) the number of unique responses from Alexa, i.e., the number of nodes in a tree; (ii) the maximum depth (or height) of a tree; (iii) the maximum number of branches in a tree, i.e., how many options SkillBot explored; and (iv) the number of initial utterances, which counts the number of distinct ways to start interacting with Alexa. We show the results in Figure 3.

Figure 3: Coverage of SkillBot in terms of four criteria: number of unique responses from Alexa; maximum depth in an Interaction Tree; maximum number of branches for any node in an Interaction Tree; and number of initial utterances.

Per the 2nd chart in Figure 3, we highlight that SkillBot is able to reach a depth of at least 10 on 2.7% of the skills. Such a depth allows SkillBot to trigger and explore a wide variety of Alexa's responses from which to discover risky content. In fact, out of the 28 risky kid skills, 2 skills were identified at depth 11, 1 skill at depth 5, 4 skills at depth 4, 6 at depth 3, 8 at depth 2, and 7 at depth 1 (more details in Section 6).

Per the 4th chart in Figure 3, we highlight that SkillBot is able to initiate conversations with skills using more than 3 different utterances. Normally, a skill's information page on Amazon.com lists at most three sample utterances. In addition to using these sample utterances, SkillBot also discovers and extracts utterances in the "Additional Instructions" section on the skill's page. As a result, SkillBot interacted with 20.3% of skills using more than 3 utterances. These extra initial utterances allow SkillBot to trigger more responses from Alexa. As we will explain in Section 6, 3 out of the 28 risky kid skills were discovered by SkillBot from the additional utterances (i.e., those not among the 3 sample utterances).

Time Performance. It took about 21 seconds on average to collect one conversation. SkillBot interacted with 4,507 skills and collected 39,322 conversations within 46 hours using five parallel processes on an Ubuntu 20.04 machine with an Intel Core i7-9700K CPU.

6 Kid Skill Analysis

To investigate the risks of skills made for kids (RQ1), we employed SkillBot to collect and analyze 31,966 conversations from a sample of 3,434 Alexa kid skills. In this section, we describe our dataset of kid skills and present our findings of risky kid skills.

6.1 Dataset

Our system first explored and downloaded information about skills from their info pages available in Alexa's U.S. skills store. Note that our system filtered out error pages (e.g., 404 not found) after three retries, as well as non-English skills. As a result, we collected 43,740 Alexa skills from 23 different skill categories (e.g., business & finance, social, kids, etc.). Our system then parsed data about the skills, such as ASIN (i.e., skill ID), icon, sample utterances, invocation name, description, reviews, permission list, and category, from the downloaded skill info pages.

For our analysis, we investigate all skills in Amazon's Kids category (3,439 kid skills). We ran SkillBot to interact with each skill and record the conversations. To speed up the task, we ran five processes of SkillBot simultaneously. Note that SkillBot can be run over time to revisit each skill and cumulatively collect new conversations as well as new branches of previously collected conversations for that skill. As a result, our sample had 31,966 conversations from 3,434 kid skills after removing five skills that resulted in errors or crashed Alexa.

6.2 Risky Kid Skill Findings

We performed content analysis on the conversations collected from 3,434 kid skills to identify risky kid skills that have inappropriate content or ask for personal information.

Skills with Inappropriate Content for Children. Our goal was to analyze the skills' contents to identify risky skills that provide inappropriate content to children. To identify such content, we combined WebPurify and Microsoft Azure's Content Moderator, two popular content moderation services that provide inappropriate-content filtering for websites and applications with a focus on child protection [7, 21]. We implemented a content moderation module for SkillBot in Python 3, leveraging the WebPurify API and the Azure Moderation API, to flag skills that have inappropriate content for children. As a result, our content moderation module flagged 33 potentially risky skills that have expletives in their content.

However, a human review process is necessary to verify the output, because whether or not a flagged skill actually has inappropriate content for children depends on context. For example, some of the expletives (such as "facial" and "sex") are likely considered appropriate in some conversational contexts. For the human review process, four researchers on our team (who come from 3 countries, including the USA, all of whom are English speakers, and whose ages range from 22 to 35) independently reviewed each of the flagged skills and voted on whether the skill's content is inappropriate for children. Skills that received three or four votes were counted towards the final list. Using this approach, we identified 8 kid skills with actual inappropriate content. Out of these 8 kid skills, SkillBot identified the inappropriate content of one skill at depth 11, one skill at depth 5, two at depth 4, one at depth 2, and three at depth 1.

We performed a false negative analysis by sampling 100 skills out of the skills that were not flagged as having inappropriate content and manually checking them. As a result, we found 0 false negatives.

Skills Collecting Personal Information. Our goal was to detect whether the skills asked users for personal information. To the best of our knowledge, available tools only focus on detecting personal information present in text, which is a different goal. For this analysis, we employed a keyword-based search approach to identify skill responses that asked for personal information. We constructed a list of personal information keywords based on the U.S. Department of Defense Privacy Office [12] and searched for these keywords in the skill responses. In particular, our list includes: name, age, address, phone number, social security number, passport number, driver's license number, taxpayer ID number, patient ID number, financial account number, credit card number, date of birth, and zipcode. A naive keyword search that simply looks for these keywords in the text would not be sufficient, because text containing these keywords does not always ask for such information. Thus, we combined keyword search with the question detection and answer generation techniques used in our Chatbot module (presented in Section 5.1) to detect whether a skill asked the user to provide personal information.

22 risky skills were flagged as asking users for personal information. To verify the result, we manually checked these 22 skills and 100 random skills that were not flagged. As a result, we found 2 false positives and 0 false negatives. Thus, 20 kid skills asked for personal information such as name, age, and birthday. Out of these 20 skills, SkillBot identified content that asks for sensitive information of one skill at depth 11, two skills at depth 4, six skills at depth 3, seven at depth 2, and four at depth 1. Also, SkillBot identified such content via non-sample utterances for three of the skills (i.e., utterances not listed among the three samples, but rather listed in the "Additional Instructions" section of the skill's page on Amazon.com).

We further analyzed the permission requests made by the skills. None of the identified 20 risky kid skills requested any permission from the user.

7 Awareness & Opinions of Risky Kid Skills

To evaluate how the risky kid skills we identified actually impact kid users (RQ2 and RQ3), we conducted a user study of 232 U.S. parents who use Amazon Alexa and have children under 13. Our goal was to qualitatively understand parents' expectations and attitudes about these risky skills, parents' awareness of parental control features, and how risky skills might affect children. Our study protocol was approved by our Institutional Review Board (IRB), and the full text of our survey instrument is provided in Appendix A. In this section, we describe our recruitment strategy, survey design, response filtering, and results.

7.1 Recruitment

We recruited participants on Prolific (https://www.prolific.co/), a crowd-sourcing website for online research. Participants were required to be adults 18 years or older who are fluent in English, live in the U.S. with their kids under 13, and have at least one Amazon Echo device in their home. We combined Prolific's pre-screening filters and a screening survey to obtain this niche sample of participants for our main survey. Our screening survey consisted of two questions to determine: (1) if the participant has kids aged 1 - 13 and (2) if the participant has Amazon Echo device(s) in their household. 1,500 participants took our screening survey, and 258 of them qualified for our main survey. The screening survey took less than 1 minute to complete, and our main survey took an average of 6.5 minutes (5.2 minutes in the median case). Participants were compensated $0.10 for completing the screening survey and $2 for completing the main survey. To improve response quality, we limited both the screening and main surveys to Prolific workers with at least a 99% approval rate.

7.2 Screening Survey

The screening survey consisted of two multiple-choice questions: "Who lives in your household?" and "Which electronic devices do you have in your household?". This allowed us to identify participants with kids aged 1 - 13 and Amazon Echo device(s) in their household who were eligible to take the main survey.

7.3 Main Survey

The main survey consisted of the following four sections.
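Returning briefly to Section 6.2, the keyword-plus-question-detection idea for flagging personal-information requests can be sketched as follows. The keyword list follows the paper; the request-cue test is a simplified stand-in for SkillBot's actual question detection module, and the cue words are our own assumption.

```python
# Illustrative sketch of the PII-request detection from Section 6.2.
# The keyword list follows the paper; the request test is a naive
# stand-in for the real NLP-based question detection.

PII_KEYWORDS = [
    "name", "age", "address", "phone number", "social security number",
    "passport number", "driver's license number", "taxpayer id number",
    "patient id number", "financial account number", "credit card number",
    "date of birth", "birthday", "zipcode",
]

REQUEST_CUES = ("what", "tell me", "give me", "may i have", "enter")

def asks_for_pii(response: str) -> bool:
    """Flag a skill response only if it appears to *ask* for a PII
    keyword, not merely mention one."""
    text = response.lower()
    if not any(kw in text for kw in PII_KEYWORDS):
        return False
    # Naive request test: the response is a question or uses a request cue.
    return "?" in text or any(cue in text for cue in REQUEST_CUES)
```

The two-stage structure mirrors the paper's observation that a bare keyword match is insufficient: a response that mentions "name" without asking for it should not be flagged.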
Parents' Perceptions of VPA Skills. This section investigated parents' opinions of and experiences with risky skills. Participants were presented with two conversation samples collected by SkillBot from each of the following categories (six samples total). Conversation samples were randomly selected from each category for each participant and were presented in random order.
• Expletive. Conversation samples from 8 skills identified in our analysis that contain inappropriate language content for children.
• Sensitive. Conversation samples from 20 skills identified in our analysis that ask the user to provide personal information, such as name, age, and birthday.
• Non-Risky. Conversation samples from 100 skills that did not contain inappropriate content for children or ask for personal information.
The full list of skills in the Expletive and Sensitive categories is provided in Appendix D. Each participant was asked the following set of questions after viewing each conversation sample:
• Do you think the conversation is possible on Alexa?
• Do you think Alexa should allow this type of conversation?
• Do you think this particular skill or conversation is designed for families and kids?
• How comfortable are you if this conversation is between your children and Alexa?
• If you answered "Somewhat uncomfortable" or "Extremely uncomfortable" to the previous question, what skills or conversations have you experienced with your Alexa that made you similarly uncomfortable?

Amazon Echo Usage. We asked which device model(s) of Amazon Echo our participants have in their household (e.g., Echo Dot, Echo Dot Kids Edition, Echo Show). We also asked whether their kids used Amazon Echo at home.

Awareness of Parental Control Feature. We asked the participants if they think Amazon Echo supports parental control (yes/no/don't know). Participants who answered "yes" were further asked to identify the feature's name (free-text response) and whether they used the feature (yes/no/don't know).

Demographic Information. At the end of the survey, we asked demographic questions about gender, age, and comfort level with computing technology. Our sample consisted of 128 male (55.2%), 103 female (44.4%), and 1 participant who preferred not to answer (0.4%). The majority (79.7%) were between 25 and 44 years old. Most participants in our sample are technically savvy (68.5%). See Table 5 in Appendix C for detailed demographic information.

7.4 Response Filtering

We received 237 responses for our main survey. We filtered out responses from participants who incorrectly answered either of two attention check questions ("What is the company that makes Alexa?" and "How many buttons are there on an Amazon Echo?"). We also excluded participants who gave meaningless responses (e.g., entering only whitespace into all free-text answer boxes). This resulted in 232 valid responses for analysis.

7.5 User Study Results

We find that most parents allow their kids to use types of Amazon Echo other than the Kids Edition. Such types of Echo do not have parental control enabled by default. We also find that many parents do not know about the parental control feature. For those who know about the feature, only a few of them use it. Thus, kids potentially have access to risky skills. Our results further show that parents are not aware of the risky skills that are available in the Kids category on Amazon. When presented with examples of risky kid skills that have expletives and those that ask for personal information, parents express concerns, especially for expletive ones. Some parents reported previous experiences of using such risky skills.

Parents' Perceptions of Kid Skills. Table 2 shows the distribution of responses to the following questions across the Expletive, Sensitive, and Non-Risky skill sets:
• Do you think the conversation is possible on Alexa?
• Do you think Alexa should allow this type of conversation?
• Do you think this particular skill or conversation is designed for families and kids?
A majority of parents thought that the interactions with the expletive skills were not possible and should not be allowed by Alexa. Only 45.9% of the respondents thought these interactions were possible, and only 41.6% of the respondents thought such skills should be allowed. Furthermore, most parents (57.1%) felt that the expletive skills were not designed for families and kids.

The parents' responses with regard to the expletive skills are significantly different from their responses to the sensitive and non-risky skills on these questions. For each of these three questions, we conduct Chi-square tests on the pairs of responses across the skill sets: Non-Risky vs. Expletive, Non-Risky vs. Sensitive, and Expletive vs. Sensitive. The responses from the Expletive set are significantly different from the responses from the other two sets for all three questions (p < 0.05). The responses to the "Alexa should allow" question are also significantly different for the Non-Risky set versus the Sensitive set (p < 0.05). In contrast, the responses for the "Possible on Alexa" and "Designed for families and kids" questions display no significant difference between the Sensitive and Non-Risky sets. This is alarming, as the sensitive skills ask for personal information through conversations with users, thereby bypassing Amazon's built-in permission control model for skills. As many skills are hosted by third parties, sensitive information about children could be leaked to someone other than Amazon.

Designed for Family and Children. Table 3 shows the distribution of responses for the question: "Do you think this particular skill or conversation is designed for families and kids?" with a breakdown across the different types of skills (Non-risky, Expletive, and Sensitive). These results show that the majority of parents (72.6%) did not think that skills with expletives were designed for families/kids. This indicates that the respondents were not aware of the skills with expletives that were actually developed for kids and published in Amazon's "Kid" category. In addition, about half of the parents (44.2%) did not think the sensitive skills were designed for families/kids, although these skills are actually in the "Kid" category on Amazon as well.

Parents' Comfort Level. We used a five-point Likert scale to measure parents' comfort levels if the presented conversations were between their children and Alexa. Figure 4 shows the participants' comfort levels for each skill category. These results indicate that parents were more uncomfortable with the Expletive skill conversations than with the Sensitive skill conversations. In particular, 42.7% of the respondents expressed discomfort ("Extremely uncomfortable" and "Somewhat uncomfortable") with the Expletive skills, compared to only 12.1% with the Sensitive skills and 5.6% with the Non-risky skills.

Figure 4: Participants' levels of comfort if conversations of a particular type happen between the participants' children and Alexa.

Chi-square tests show that parents' comfort with the Expletive conversations is significantly different from their comfort with the Sensitive conversations and with the Non-risky conversations (p < 0.05). Some participants expressed their concerns about skills in the Expletive set in free-text responses, including "It doesn't seem appropriate to tell jokes like this to children (P148)", "Under no circumstances should anyone have a coversation [sic] with children about orgasms. This would be grounds for legal action (P163)", "I do not believe Alexa should be used in such a crass manner or to teach my child how to be crass (P210)", "Poop and poopy jokes don't happen in my household (P216)", and "It is too sexual (P123)". Beyond the skills shown in the survey, one respondent also recalled hearing similar skills such as "Roastmaster (P121)." Another respondent remembered something similar but was unable to provide the name of the skill: "We have asked Alexa to tell us a joke in front of our young son and Alexa has told a few jokes that were borderline inappropriate (P140)."

We do not find any significant difference between parents' comfort with the Sensitive conversations versus the Non-risky conversations. However, the Sensitive conversations involved skills asking for different types of personal information. Out of the 20 skills in the Sensitive set, 15 skills asked for the user's name, 3 asked for the user's age, and 2 asked for the user's birthday. We show the distribution of the participants' comfort levels according to each type of personal information in Figure 5. This indicates that parents expressed more discomfort ("Extremely uncomfortable" and "Somewhat uncomfortable") for skills that ask for the user's birthday (15.2% of respondents), compared with skills that ask for the user's name (11.8%) or age (11.5%). Some participants expressed their concerns about these skills in free-text responses, including "I don't like a skill or Alexa asking for PII (P115)", "I haven't had a similar experience but I think it is inappropriate for Alexa to be asking for the name of a child (P209)", "I don't know why it needs a name (P228)", and "I would not want Alexa to collect my children's imformation [sic] (P003)".

Figure 5: Participants' levels of comfort for each type of personal information, if the conversations happen between the participants' children and Alexa.

Amazon Echo Usage. Our results also show that most households with kids use Echo devices other than the Echo Kids Edition. The Echo Dot was the most popular type of Echo device (46.4%) in our participants' households. Only 27 participants (6.8%) bought an Echo Dot Kids Edition, which has parental control mode enabled by default. This shows that if kids use Echo, they likely have access to the types of Echo devices that do not have parental control mode enabled by default. Furthermore, the majority of participants (91.8%) reported that their kids do use Amazon Echo at home. Figure 6 shows the types of Echo that the participants own in their household, with a breakdown of answers to the question "Do your kids use Amazon Echo at home?". Most parents allow their kids to use Amazon Echo at home even without an Echo Dot Kids Edition. This indicates that many kids have access to risky skills, as these skills can be used by default on Echo devices other than the Kids Edition.

Awareness of Parental Control Feature. We analyzed the responses to the question: "Does Amazon Echo support parental control?" In total, 76.3% said "yes", 0.4% said "no", and 23.3% were unsure. For participants who had an Echo Kids Edition, almost all (92.6%) said "yes", 7.4% said "no", and none were unsure.
In contrast, for participants without Echo