Beyond Fair Pay: Ethical Implications of NLP Crowdsourcing
Boaz Shmueli (1,2,3,*), Jan Fell (2), Soumya Ray (2), and Lun-Wei Ku (3)

1 Social Networks and Human-Centered Computing, TIGP, Academia Sinica
2 Institute of Service Science, National Tsing Hua University
3 Institute of Information Science, Academia Sinica
* Corresponding author: shmueli@iis.sinica.edu.tw

arXiv:2104.10097v1 [cs.CL] 20 Apr 2021

Abstract

The use of crowdworkers in NLP research is growing rapidly, in tandem with the exponential increase in research production in machine learning and AI. Ethical discussion regarding the use of crowdworkers within the NLP research community is typically confined in scope to issues related to labor conditions such as fair pay. We draw attention to the lack of ethical considerations related to the various tasks performed by workers, including labeling, evaluation, and production. We find that the Final Rule, the common ethical framework used by researchers, did not anticipate the use of online crowdsourcing platforms for data collection, resulting in gaps between the spirit and practice of human-subjects ethics in NLP research. We enumerate common scenarios where crowdworkers performing NLP tasks are at risk of harm. We thus recommend that researchers evaluate these risks by considering the three ethical principles set up by the Belmont Report. We also clarify some common misconceptions regarding the Institutional Review Board (IRB) application. We hope this paper will serve to reopen the discussion within our community regarding the ethical use of crowdworkers.

1 Revisiting the Ethics of Crowdsourcing

The information age brought with it the internet, big data, smartphones, AI, and, along with these, a plethora of complex ethical challenges. As a result, there is growing concern and discussion on ethics within the research community at large, including the NLP community. This is manifested in new ethics-focused workshops, ethics conference panels, and relevant updates to peer review forms.

While ethics in NLP has multiple aspects, most recent attention focuses on pressing issues related to the societal impact of NLP. These include discrimination, exclusion, over-generalization, bias, and fairness (Hovy and Spruit, 2016; Leidner and Plachouras, 2017). Other works are concerned with the ethical implications of NLP shared tasks (Parra Escartín et al., 2017), and with introducing ethics into the NLP curriculum (Bender et al., 2020).

A substantial amount of NLP research now takes advantage of crowdworkers — workers on crowdsourcing platforms such as Amazon Mechanical Turk (also known as AMT or MTurk), Figure Eight [1], Appen, Upwork, Prolific, Hybrid, Tencent Questionnaire, and Baidu Zhongbao — as well as internal crowdsourcing platforms at companies such as Microsoft and Apple. Workers are recruited to label, evaluate, and produce data. In the pre-internet era, such tasks (e.g., part-of-speech (POS) tagging) were done by hiring expert annotators or linguistics students. However, these have now been mostly replaced by crowdworkers due to lower costs, convenience, speed, and scalability.

[1] Previously CrowdFlower; acquired by Appen in 2019.

Overall, the general consensus in the literature is that as long as the pay to the crowdworkers is "fair" (minimum hourly wage or above), there are no further ethical concerns, and there is no need for approval by an Institutional Review Board [2] (with some exceptions). For example, Hovy and Spruit (2016) mention that "[w]ork on existing corpora is unlikely to raise any flags that would require an IRB approval", with a footnote that there are "a few exceptions". Fort et al. (2011) mention that only "[a] small number of universities have insisted on institutional review board approval for MTurk experiments". As another example, NLP students are being taught that "paid labeling does not require IRB approval" since "[i]t's not an experiment with human subjects" (Carnegie-Mellon University, Language Technologies Institute, 2020). Indeed, NLP papers that involve crowdsourced work rarely mention a review by an ethics board.

[2] Institutional Review Boards (IRBs) are university-level, multi-stakeholder committees that review the methods proposed for research involving human subjects to ensure that they conform to ethical principles. IRBs are also known by various other names, such as Research Ethics Boards (REBs) and Research Ethics Committees (RECs). Non-academic organizations may employ similar committees.
In this work, we wish to revisit the ethical issues of crowdsourcing in the NLP context, highlight several issues of concern, and suggest ways forward. From our survey of top NLP conferences, we find that crowdsourcing tasks are growing rapidly within the community. We therefore establish a common understanding of research ethics and how it relates to crowdsourcing. We demonstrate that the existing ethical framework is often inadequate and does not seek to protect crowdsourced workers. We then dispel common misunderstandings regarding the IRB process that NLP researchers might harbor. Finally, we outline how to apply ethical values as guidelines to minimize potential harms, and conclude with recommendations.

2 The Rise and Rise of NLP Crowdsourcing

To get a sense of the extent and growth of crowdsourced tasks within the NLP research community, we analyzed the proceedings of three top NLP conferences: ACL, EMNLP, and NAACL (also known as NAACL-HLT) [3]. We scanned the annual proceedings of these conferences in the six years from 2015 to 2020, looking for papers that mention direct employment of crowdsourced workers. Altogether, 6776 papers were accepted for publication in these 16 conferences [4]. In total, we identified 703 papers that use crowdworkers as part of their research [5]. The results are summarized in Table 1.

[3] ACL is the Annual Meeting of the Association for Computational Linguistics, EMNLP is the Conference on Empirical Methods in Natural Language Processing, and NAACL is the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
[4] NAACL was not held in 2017 and 2020; it is skipped every three years.
[5] MTurk is used in 80% of tasks.

Year  | ACL  | EMNLP | NAACL | All  | Papers Using Crowdsourcing | Payment Mentioned | IRB Mentioned
2015  | 318  | 312   | 186   | 816  | 59 (7%)                    | 4 (7%)            | 0
2016  | 328  | 264   | 182   | 774  | 82 (11%)                   | 15 (18%)          | 0
2017  | 302  | 323   | —     | 625  | 57 (9%)                    | 12 (21%)          | 3
2018  | 381  | 549   | 332   | 1262 | 136 (11%)                  | 17 (13%)          | 1
2019  | 660  | 683   | 423   | 1766 | 189 (11%)                  | 32 (17%)          | 5
2020  | 779  | 754   | —     | 1533 | 180 (12%)                  | 42 (23%)          | 5
Total | 2768 | 2885  | 1123  | 6776 | 703 (10%)                  | 122 (17%)         | 14

Table 1: Papers using crowdsourced tasks in top NLP conferences, 2015–2020. The columns show, from left to right: conference year; number of accepted papers at ACL, EMNLP, and NAACL; total number of accepted papers; number of accepted papers using crowdsourced tasks (and their percentage of all accepted papers); number of these papers that mention payment (and percentage); and number of these papers that mention IRB review or exemption.

Remuneration. For each paper that uses crowdsourced labor, we checked whether the authors discuss payment or labor-related issues. Out of the 703 papers, 122 (17%) discuss payment, either by detailing the amount paid per task, the workers' hourly wages, or by declaring that wages were paid ethically. While in some cases researchers emphasize fair payment (e.g., Nangia et al. (2020) ensured "a pay rate of at least $15/hour"), many other papers are more concerned with the cost of dataset acquisition, and thus mention the cost per task or the total dataset cost, but not the hourly compensation.

IRB Review. Finally, we also checked whether authors mention a review by an IRB (or equivalent body) for their research. We found very few papers that mention an IRB approval or exemption — a total of 14 papers, which make up only 2% of the works that use crowdsourcing.

Growth. We see that research papers using crowdsourced tasks have made up a relatively constant 11–12% of all research papers in the last three years. As research production grows exponentially, we expect a corresponding increase in the number of crowdsourced tasks [6].

[6] ACL, EMNLP, and NAACL are only three of the many publication venues in the NLP community.
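The survey procedure above is described only at a high level. As an illustration, a first-pass screen of proceedings text for crowdsourcing-related papers could be scripted as sketched below. This is a minimal sketch under stated assumptions: the keyword list, the one-text-file-per-paper layout, and the directory name acl_anthology_text are not details from the paper, and every match would still need manual verification.

```python
import re
from collections import Counter
from pathlib import Path

# Illustrative keyword filter; the paper does not specify its exact
# screening criteria, so this list is an assumption.
CROWD_TERMS = re.compile(
    r"\b(mechanical turk|mturk|crowdwork\w*|crowdsourc\w*|"
    r"figure eight|crowdflower)\b",
    re.IGNORECASE,
)

def candidate_papers_by_year(proceedings_dir: str) -> Counter:
    """Count papers per year whose plain text mentions crowdsourcing terms.

    Assumes one .txt file per paper stored under <proceedings_dir>/<year>/.
    Hits are only candidates; each paper must be checked by hand to confirm
    that crowdworkers were actually employed.
    """
    counts = Counter()
    for txt_file in Path(proceedings_dir).glob("*/*.txt"):
        if CROWD_TERMS.search(txt_file.read_text(errors="ignore")):
            counts[txt_file.parent.name] += 1
    return counts

if __name__ == "__main__":
    print(candidate_papers_by_year("acl_anthology_text"))
```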
2.1 Categorizing Crowdsourced Tasks

To understand the many nuanced ways in which researchers presently use crowdworkers in NLP tasks, we examined the tasks performed in each of the papers that use crowdsourcing. We found that NLP crowdsourced tasks generally fall into one of three categories, which we designate as labeling, evaluation, and production. These categories account for 34%, 43%, and 23% of crowdsourcing tasks respectively. The results are summarized in Table 2, which also lists common action verbs that researchers use to describe the work performed by crowdworkers.

Task Category | Pct. | Action Verbs
Labeling      | 34%  | annotate, label, identify, indicate, check
Evaluation    | 43%  | rate, decide, score, judge, rank, choose
Production    | 23%  | write, modify, produce, explain, paraphrase

Table 2: Tasks in the papers that use crowdsourcing, broken down by category. We also list typical action verbs used by the authors to describe the task.

Labeling entails the processing of existing data by the crowdworker and then the selection or composition of a label or labels for that data [7]. Labeling tasks augment the data with human-supplied labels. The augmented data are often used for training machine learning models. We further divide labeling tasks into two: objective and subjective labeling. In objective labeling, the desired label is factual and does not depend on the worker. Classical examples are the tagging of sentences with named entities, part-of-speech (POS) labeling, and text transcription. In contrast, subjective labeling comprises tasks where labels may depend on the worker's personality, cultural background, opinion, or affective state. Examples include emotion labeling, detecting sarcasm in a tweet, or deciding whether a text constitutes hate speech.

[7] We refrain from using the terms "annotation" and "annotators" as these terms are overloaded and often used for non-annotation work.

In evaluation tasks, the worker is presented with data — for example a sentence, tweet, paragraph, or dialogue — and then requested to evaluate and score the data, mostly text, according to predefined criteria such as fluency, coherence, originality, or structure. These tasks are often used by researchers to evaluate natural language generation (NLG) models. As with subjective labeling, the scores given by the workers may reflect the interpretation, values, or beliefs of the worker.

Finally, in production tasks, workers are asked to produce their own data, rather than label or evaluate existing data. In the NLP context, this often amounts to text elicitation or text generation. Examples include captioning a photo or video clip, writing a story given a sequence of images, or composing questions and answers. In this category we also include text translation. The produced data is often used for model training or evaluation.

While the majority of the studies use only one type of task — labeling, evaluation, or production — we found that in 10% of the papers that use crowdsourcing, researchers used two or more types of tasks in the same study. The combination of production and evaluation is particularly common: researchers often ask workers to generate data, which in turn is used to train a model; they then use workers to evaluate the model's performance.

2.2 Surveys and Gamification

Although not common, some papers also collect personal information from workers. For example, Yang et al. (2015) and Ding and Pan (2016) conduct personality surveys among their crowdworkers. Pérez-Rosas and Mihalcea (2015) collect demographic data from the workers which included "their gender, age, country of origin, and education level". Finally, we also found a few papers that add elements of gaming to their crowdsourced tasks, e.g., Niculae and Danescu-Niculescu-Mizil (2016) and Urbanek et al. (2019).

3 The Rules and Institutions of Research Ethics

Given the increasing use of crowdsourced NLP tasks, how can researchers ensure ethical concerns are reasonably addressed? Should a researcher make a judgement call and decide which tasks pose a risk of harm to the worker, and which are benign? To answer such questions, we will first explore the existing ethical framework used by researchers in the biomedical, social, and behavioral sciences.

3.1 The Genesis of Modern Research Ethics

The roots of contemporary research ethics originate in the 19th century, when researchers made unparalleled discoveries, but also engaged in hazardous, and frequently deadly, experimentation without great concern for the human subjects involved, as long as these trials advanced the boundaries of scientific knowledge (Ivy, 1948).
The dominant ethics paradigm at the time was largely devoid of now-common principles surrounding therapeutic benefits, scientific validity, full knowledge, or subject consent (Lederer, 1995). Examples include researchers infecting intellectually disabled orphans with gonorrhea, or puncturing a healthy and unaware woman with the nodules of a leper patient to observe the clinical course of these diseases (Shamoo and Resnik, 2009).

Such incidents were common before the 1940s, and academia and public discourse were generally ignorant of them and of research ethics in general (Rothman, 1987). The revelation of the Nazi concentration camp experiments at the end of World War II was a watershed moment (Gillespie, 1989) and led to an early precursor of contemporary research ethics, namely the Nuremberg Code of 1947 (Jonsen, 1998). Not long after, the fallout of the Tuskegee Syphilis Study in the US prompted the formalization of research ethics at American universities (Caplan, 1992). In the study, which took place between 1932 and 1972, a total of 399 socio-economically disadvantaged African-American males with latent syphilis infections were recruited through the false promise of "free healthcare". Yet, these subjects were actually left without therapy even as effective treatment became available, with the objective to observe the clinical course of the disease (Brandt, 1978).

3.2 The Belmont Principles and IRBs

From thereon, beginning with biomedical research, the notion of research ethics has been institutionalized in the US at the university level through institutional review boards (IRBs), as well as national legislation (Israel, 2015). Gradually, research ethics have become a concern also in the social and behavioral sciences, as demonstrated by the 1978 Belmont Report created by the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research (Jonsen, 2005). This report became the basis of the 1991 Federal Policy for the Protection of Human Subjects in the United States, more commonly known as the Common Rule (Owen, 2006) and superseded by the Final Rule (2018). Chiefly, the Final Rule aims to ensure that the following three basic principles listed in the Belmont Report (National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, 1978) are met:

1. Respect for persons, which includes "the requirement to acknowledge autonomy and the requirement to protect those with diminished autonomy";
2. Beneficence, which mandates considering whether the benefits resulting from the research can outweigh the risks; and
3. Justice, requiring that the burden — and benefits — of the research are equally shared among potential subjects.

The Final Rule is codified in Title 45, Code of Federal Regulations, Part 46 and applies to all government-funded research in the United States (Israel, 2015). Virtually all universities in the United States apply this regulation to human subjects research projects irrespective of funding source (Klitzman, 2015). Specifically, the Final Rule requires that most research involving human subjects receives approval from an IRB. The IRB is a special university-level committee that reviews research proposals to verify that they comply with ethical standards. While it is difficult to assess the effectiveness of IRBs, and the process of ethical review is sometimes criticized as overly bureaucratic, which may hamper low-risk social science (Resnik, 2018; Schrag, 2010), glaring ethical lapses have been rare after 1974 (Klitzman, 2015).

3.3 Are IRBs Universal?

The three ethical principles outlined by the Belmont Report — Respect for persons, Beneficence, and Justice — are also the stated principles guiding the actions of more recently formed ethics boards around the world, as well as the underlying principles of relevant policies of intergovernmental organizations, including the Universal Declaration on Bioethics and Human Rights (UNESCO, 2006) and the International Ethical Guidelines for Biomedical Research Involving Human Subjects (Council for International Organizations of Medical Sciences, 2002). Consequently, many countries worldwide have modeled national regulations after the Final Rule or its predecessor, the Common Rule (Capron, 2008). Both regulations have also influenced editorial policies of academic journals (e.g., the Committee on Publication Ethics, 2020). The Final Rule can thus be considered a cross-disciplinary de facto standard for research ethics in Western/US-influenced academic settings (Gontcharov, 2018), including India, Japan, Korea, and Taiwan.
Though we find that a global agreement on human-subjects ethics is emerging, countries still vary in the extent to which relevant policies are accepted, framed, implemented, or enforced.

4 Do NLP Tasks Constitute Research Involving Human Subjects?

These formal rules and institutions have been established to protect the rights and interests of human subjects involved as participants in scientific research. Thus, we must determine whether the rules and institutions of research ethics are even applicable to crowdsourced studies. At the core of this determination are two fundamental questions:

1. Are crowdsourcing tasks research?
2. Are crowdworkers human subjects?

In the following, we address these two questions.

4.1 Are Crowdsourcing Tasks Research?

The Final Rule defines research as follows:

    Research means a systematic investigation, including research development, testing, and evaluation, designed to develop or contribute to generalizable knowledge. Activities that meet this definition constitute research for purposes of this policy, whether or not they are conducted or supported under a program that is considered research for other purposes. (45 CFR 46.102(l), Final Rule, 2018)

From this definition it is evident that rather than a concrete research behavior on the part of the NLP researcher, it is the purpose of the research behavior that classifies said behavior as research under the Final Rule. In other words, all categories of crowdsourced tasks summarized in Section 2, i.e., labeling, evaluation, and production, may be considered part of research so long as the intended outcome is to create generalizable knowledge. Typically, this encompasses academic settings where research behavior takes place (course assignments by students being a prominent exception), but does not include research conducted in industry settings (Meyer, 2020; Jackman and Kanerva, 2015).

4.2 Are Crowdworkers Human Subjects?

The Final Rule defines human subjects as follows:

    Human subject means a living individual about whom an investigator (whether professional or student) conducting research:
    (i) Obtains information or biospecimens through intervention or interaction with the individual, and, uses, studies, or analyzes the information or biospecimens; or
    (ii) Obtains, uses, studies, analyzes, or generates identifiable private information or identifiable biospecimens.
    (45 CFR 46.102(e)(1), Final Rule, 2018)

Clearly, if the researcher obtains identifiable private information (IPI) as part of the crowdsourced task — e.g., name, date of birth, email address, national identity number, or any other information that identifies the worker — then (ii) holds and the worker is considered a human subject.

Even if the researcher does not obtain any IPI, the worker may still be considered a human subject under (i) in certain cases. This is because NLP researchers interact with crowdworkers through MTurk or analogous platforms when they publish the task, and obtain information through this interaction. It is also evident that academic NLP researchers make use of this information as they conduct a given study, and hence it is "used, studied, or analyzed". If the information is about the crowdworker, then they are considered human subjects and (i) is met. However, the Final Rule does not expand on what constitutes information about the individual. According to the University of Washington, Office of Research (2020), for example, about whom means that the "data or information relates to the person. Asking what [crowdworkers] think about something, how they do something, or similar questions usually pertain to the individuals. This is in contrast to questions about factual information not related to the person."

Whether the information obtained in an NLP task is about the worker can initially seem like an easy-to-answer question. For example, Benton et al. (2017) write:

    [R]esearch that requires the annotation of corpora for training models involves human annotators. But since the research does not study the actions of those annotators, the research does not involve human subjects. By contrast, if the goal of the research was to study how humans annotate data, such as to learn about how humans interpret language, then the research may constitute human subjects research.
However, we believe that this is not so clear-cut. First, one might argue that although in a POS labeling task we do not obtain information about the worker, other labeling tasks might be harder to classify. For example, when researchers ask a worker to compose a story given a sequence of photos, do they obtain information about the worker? And if so, what kind of information? Similar questions can be asked about tasks related to emotion classification (which might reveal a worker's personality or mood), composing questions and answers (which point to areas of interest and cultural background), or identifying hate speech (which can indicate political orientation).

Second, platforms might automatically provide information that can be considered by some to be about the individual. Even in the most benign and "objective" tasks such as POS tagging, MTurk supplies researchers with information on the amount of time taken to complete each task. This information is sometimes collected and used by NLP researchers (e.g., Sen et al., 2020).

In summary, we have shown that in an academic context, NLP crowdsourcing tasks are research, but that the categorization of crowdworkers as human subjects can, in some cases, be a gray area that is open to interpretation. The Final Rule was designed to address ethical issues in medical research, and later in the behavioral sciences; the lawmakers and experts involved did not anticipate its use in new domains such as crowdsourcing. Therefore, its application to online data collection, and crowdsourcing in particular, can be ambiguous and unsatisfactory [8]. Thus, while in some cases the protections and procedures mandated under the Final Rule apply, in others they might not. As a consequence, some NLP crowdsourcing tasks may not require an IRB application, and this may happen even if crowdworkers are at risk.

[8] Some universities employ the Precautionary Principle and require all crowdsourcing-enabled research to go through an IRB application.

5 Dispelling IRB Misconceptions

As we now see, if workers constitute human subjects, an IRB application is required. To clarify any other misconceptions that might be present within the community, we list key points related to the IRB process and dispel misconceptions around them. While not exhaustive, the list can serve both researchers and reviewers.

5.1 Researchers Cannot Exempt Themselves from an IRB Application

The Final Rule includes provisions for IRB exemptions, and we expect the vast majority of crowdsourced NLP tasks to fall into that category. However, it is crucial to understand that granting a research project an IRB exemption is not the prerogative of the researcher; it is only the IRB that hands out exemptions following an initial review:

    [T]he determination of exempt status (and the type of review that applies) rests with the IRB or with an administration official named by your institution. The determination does not rest with the investigator. Therefore, all projects must be submitted to the IRB for initial review. (American Association for Public Opinion Research, 2020)

5.2 Worker IDs Constitute IPI

Researchers often obtain and store worker IDs — unique and persistent identifiers assigned to each worker by the crowdsourcing platform — even when workers are "anonymous". MTurk, for example, assigns each worker a fixed 14-character string, which is provided to the researcher with completed tasks. A worker ID is part of the worker's account, and is therefore linked to their personal details, including full name, email address, and bank account number. As a consequence, the worker ID constitutes IPI (identifiable private information). If the worker ID is obtained by the researcher, the research mandates an initial IRB review.

To avoid obtaining this IPI, researchers can create and store pseudonymized worker IDs, provided that these IDs cannot be mapped back to the original worker IDs.
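One possible implementation of the pseudonymization suggested above — a sketch only, not a procedure prescribed by the Final Rule or by MTurk — replaces each worker ID in a downloaded batch-results CSV with a random pseudonym and keeps the mapping only in memory, so the released data cannot be linked back to the platform IDs. The column name "WorkerId" is assumed from the usual MTurk batch-file layout and should be adjusted if it differs.

```python
import csv
import secrets

def pseudonymize_worker_ids(input_csv: str, output_csv: str) -> None:
    """Replace worker IDs with random pseudonyms before storing or sharing data.

    The ID-to-pseudonym mapping is held only in memory and never written to
    disk, so the pseudonyms cannot be mapped back to the original worker IDs.
    """
    mapping = {}  # original worker ID -> pseudonym (in memory only)
    with open(input_csv, newline="") as fin, open(output_csv, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            original = row["WorkerId"]  # column name assumed from MTurk batch CSVs
            if original not in mapping:
                mapping[original] = "W" + secrets.token_hex(8)
            row["WorkerId"] = mapping[original]
            writer.writerow(row)
```

A keyed hash (e.g., HMAC) of the worker ID would also work, but only if the key is destroyed after processing; otherwise the original IDs remain recoverable by anyone holding the key.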
5.3 Obtaining Anonymous Data Does Not Automatically Absolve from IRB Review

Even if IPI is not obtained, and only anonymous and non-identifiable data is collected, we have shown in Section 4.2 that crowdsourced NLP tasks involve interaction between researcher and participant through which data about the worker may be collected, and thus often require an initial review by an IRB.

5.4 Payment to Crowdworkers Does Not Exempt Researchers from IRB Review

Remuneration of human subjects does not change their status to independent contractors beyond the scope of research ethics. In fact, compensation of human subjects for the time and inconvenience involved in participating is a standard practice "especially for research that poses little or no direct benefit for the subject" and at the same time "should not constitute undue inducement to participate", as the University of Virginia, Human Research Protection Program (2020) points out.

5.5 Non-Published Research Also Requires IRB Review

Some researchers believe that research that will not be published is not subject to an IRB review. For example, Carnegie-Mellon University, Language Technologies Institute (2020) teaches students that "Paid labeling does not require IRB approval... [b]ut sometimes you want to discuss results in papers, so consider IRB approval" [9].

[9] Moreover, whether the labeling is paid or unpaid is irrelevant; see Section 5.4.

The definition of research in the Final Rule is not contingent on subsequent publication of the results. Given the uncertainties of the peer-review process, whether research eventually finds its way into a publication can often be ascertained only ex post. An important exception not requiring IRB approval is student work as part of course assignments. However, subsequent use of research data collected originally as part of a student assignment is not mentioned in the Final Rule. Consequently, universities handle this situation differently. For example, the University of Michigan allows retroactive IRB approval:

    Class assignments may become subject to this policy... if the faculty member or the students change their plans... application to the IRB for permission to use the data is required. (University of Michigan, Research Ethics and Compliance, 2021)

while Winthrop University does not:

    IRB approval cannot be granted retroactively, and data may need to be recollected for the project. (University of Winthrop, The Office of Grants and Sponsored Research Development, 2021)

6 Risks and Harms for Crowdworkers

Previous work on the ethics of crowdsourcing focused on labor conditions, such as fair pay, and on privacy issues (Fort et al., 2011; Gray and Suri, 2019). However, even when payment is adequate and the privacy of the workers is preserved, there are additional ethical considerations that are often not taken into account, and might put the worker at risk. We propose using the three ethical principles outlined by the Belmont Report — Respect for Persons, Beneficence, and Justice — to guide the actions of researchers. We outline some of the specific risks and harms that might befall NLP task crowdworkers in light of these principles. While the list is not comprehensive, it can serve as a starting point to be used by researchers when planning their crowdsourced task, as well as by reviewers examining the ethical implications of a manuscript or research proposal.

6.1 Inducing Psychological Harms

NLP researchers are increasingly cognizant that texts can potentially harm readers, as evidenced by the trigger warnings they add to their own papers (e.g., Sap et al., 2020; Nangia et al., 2020; Sharma et al., 2020; Han and Tsvetkov, 2020). Moreover, researchers have long known that annotation work may be psychologically harmful. The Linguistic Data Consortium (LDC), for example, arranged stress-relieving activities for annotators of broadcast news data, following reports of "negative psychological impact" such as intense irritation, overwhelmed feelings, and task-related nightmares (Strassel et al., 2000). Although literature on the emotional toll on crowdworkers is still scant (Huws, 2015), there is growing literature on the psychological cost of work done by commercial content moderators (Steiger et al., 2021). Crowdworkers deserve similar consideration: while NLP tasks can be as benign as the POS tagging of a children's poem, they may also involve exposure to disturbing textual or visual content.
In 45 CFR 46.110 the Final Rule allows expedited IRB review for research posing no more than minimal risk to the human subjects involved. According to 45 CFR 46.102(j), minimal risk "means that the probability and magnitude of harm or discomfort anticipated in the research are not greater in and of themselves than those ordinarily encountered in daily life or during the performance of routine [...] psychological examinations or tests." Exposing crowdworkers to sensitive content may exceed the threshold of minimal risk if the data that requires labeling or evaluation might be psychologically harmful.

The amount of harm (if any) depends on the sensitivity of the worker to the specific type of content. The risk is generally higher in data labeling or text evaluation tasks, where workers might be repeatedly asked to categorize offensive tweets, transcribe violent texts, or evaluate hateful text that may expose them to emotional stimuli or, depending on the content and worker, traumatize them. In some cases, sexual material can be highly offensive or shocking and cause an emotional disturbance. Although less likely, hazards can also occur in production tasks, since workers are elicited to produce texts based on given input. For example, when workers are asked to compose a story based on images, certain images might trigger a harmful response in the worker.

6.2 Exposing Sensitive Information of Workers

A crowdworker might inadvertently or subconsciously expose sensitive information about themselves to the researcher. This is more pronounced in text production, where the responses produced by such tasks reveal as much about the individual workers as they do about the produced text. However, workers also reveal information about themselves when evaluating or labeling text, especially when subjective labeling is in place. Moreover, even seemingly trivial data — for example, the elapsed time taken by the worker to label, evaluate, or produce text — may contain valuable information about the worker (and this information is automatically captured by MTurk). Table 3 shows the risk level for the different task categories.

Task Category       | Risk of Exposure of Workers' Sensitive Information
Objective Labeling  | Low
Subjective Labeling | Medium
Evaluation          | Medium
Production          | High

Table 3: Potential risk level by task category.

Moreover, researchers can obtain sensitive information about workers because the crowdsourcing platforms allow screening of workers using built-in qualification attributes, including age, financial situation, physical fitness, gender, employment status, purchasing habits, political affiliation, handedness, marital status, and education level (Amazon Mechanical Turk, 2020). Researchers may also obtain other types of sensitive information by creating their own, arbitrary qualification tests as a quality control measure (Daniel et al., 2018).

In summary, the information obtained by researchers may reveal as much about the individual crowdworkers as it does about the data being labeled, evaluated, or produced.

6.3 Unwittingly Including Vulnerable Populations

Research on crowdsourcing platforms such as MTurk is inherently prone to the inclusion of vulnerable populations. In 45 CFR 46.111(b) the Final Rule (2018) non-exhaustively lists "children, prisoners, individuals with impaired decision-making capacity, or economically or educationally disadvantaged persons" as vulnerable groups. The Final Rule requires that additional safeguards are implemented to protect these vulnerable populations and makes IRB approval contingent on these safeguards.

A great proportion of MTurk crowdworkers are located in developing countries, such as India or Bangladesh, which makes MTurk an attractive proposition to those offering a task (Gray and Suri, 2019), but increases the risk of including economically disadvantaged persons. Furthermore, it is difficult to ensure that MTurk crowdworkers are above the age of majority, or to determine whether they fall into any other of the defined vulnerable populations (Mason and Suri, 2011).

Moreover, given the power imbalances between researchers in industrialized countries and crowdworkers in developing countries, ethical consideration should occur regardless of whether the jurisdiction in which the crowdworkers are located even has a legal framework of research ethics in place, or whether such a local framework meets the standard of the Belmont Report or Final Rule.
6.4 Breaching Anonymity and Privacy

There is a perception among researchers that crowdworkers are anonymous and thus the issue of privacy is not a concern. This is not the case. Lease et al. (2013), for example, discuss a vulnerability that can expose the identity of an Amazon Mechanical Turk worker using their worker ID — a string of 14 letters and digits — because the same worker ID is also used as the identifier of the crowdworker's account on other Amazon assets and properties. As a result, a Google search for the worker ID can lead to personal information such as product reviews written by the crowdworker on Amazon.com, which in turn can disclose the worker's identity. Researchers might be unaware of these issues when they make worker IDs publicly available in papers or in datasets. For example, Gao et al. (2015) rank their crowdworkers using MTurk worker IDs in one of the figures.

Moreover, breaches of privacy can also occur unintentionally. For example, workers on MTurk are provided with an option to contact the researcher. In this case, their email address will be sent to the researcher, who is inadvertently exposed to further identifiable private information (IPI). We maintain that the anonymity of crowdworkers cannot be automatically assumed or guaranteed, as this is not a premise of the crowdsourcing platform.

6.5 Triggering Addictive Behaviour

Graber and Graber (2013) identify another source of harmful effects, which ties in with the risk of psychological harm and is specific to gamified crowdsourced tasks: a possibility of addiction caused by dopamine release following a reward given during the gamified task. Gamification techniques can be added to data labeling, evaluation, and production. Indeed, some NLP work uses gamification, mostly for data collection (e.g., Kumaran et al., 2014; Ogawa et al., 2020; Öhman et al., 2018).

Moreover, the crowdsourcing platform may add elements of gamification over which the researcher has no control. For example, MTurk recently introduced a "Daily Goals Dashboard", where the worker can set game-like "HITs Goal" and "Reward Goal", as shown in Figure 1.

[Figure 1: The "Daily Goals Dashboard" on MTurk]

7 Ways Forward

The use of crowdworkers is growing within the NLP community, but the ethical framework set in place to guarantee their ethical treatment (whether de jure or de facto) did not anticipate the emergence of crowdsourcing platforms. In most crowdsourced NLP tasks, researchers do not intend to gather information about the worker. However, the crowdsourcing platform often autonomously collects such information. As a result, it is often difficult to determine whether crowdworkers constitute human subjects — which hinges on whether the researcher collects information about the worker. However, a determination that all crowdworkers are human subjects — and thus mandate IRB approval for government-supported institutions — might create a "chilling effect" and disadvantage university researchers compared to industry-affiliated researchers. The effect is exacerbated in institutions where the ethics committee is heavily bureaucratic and does not offer a streamlined, expedited exemption process for low-to-no-risk studies.

Whether their crowdsourced task requires an IRB application or not, we recommend that the ethics-aware researcher carefully examine their study in light of the three principles set up by the Belmont Report: Respect for persons, Beneficence, and Justice. While these principles mandate fair pay, it is important to note that fair pay is just one of their implications. There are other ethical considerations that are often overlooked — in particular, risk assessment of causing psychological harm and of exposing sensitive information. Thus, we recommend increasing awareness of the potential ethical implications of crowdsourced NLP tasks. As NLP researchers are now encouraged to add an "ethical considerations" section to their papers (NAACL, 2020), they should also be encouraged to carefully weigh potential benefits against risks related to the crowdsourced task.

We also propose increasing awareness by disseminating relevant knowledge and information. An educational ethics resource created through a community effort could serve as a beneficial first step. Such a resource can include guidelines, checklists, and case studies that are specific to the ethical challenges of crowdsourced tasks in the context of NLP research. We believe that the creation of such a resource can serve as a springboard for a necessary nuanced conversation regarding the ethical use of crowdworkers in the NLP community.
Acknowledgements

We thank the anonymous reviewers for their valuable comments and suggestions which helped improve the paper. This research was partially supported by the Ministry of Science and Technology in Taiwan under grants MOST 108-2221-E-001-012-MY3 and MOST 109-2221-E-001-015-.

References

Amazon Mechanical Turk. 2020. Amazon Mechanical Turk Pricing. https://web.archive.org/web/20201126061244/https://requester.mturk.com/pricing. Accessed: 2020-11-26.

American Association for Public Opinion Research. 2020. IRB FAQs for Survey Researchers. https://www.aapor.org/Standards-Ethics/Institutional-Review-Boards/IRB-FAQs-for-Survey-Researchers.aspx. Accessed: 2020-11-11.

Emily M. Bender, Dirk Hovy, and Alexandra Schofield. 2020. Integrating ethics into the NLP curriculum. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 6–9, Online. Association for Computational Linguistics.

Adrian Benton, Glen Coppersmith, and Mark Dredze. 2017. Ethical research protocols for social media health research. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 94–102.

Allan M. Brandt. 1978. Racism and research: The case of the Tuskegee syphilis study. The Hastings Center Report, 8(6):21–29.

Arthur L. Caplan. 1992. When evil intrudes. The Hastings Center Report, 22(6):29–32.

Alexander M. Capron. 2008. Legal and regulatory standards of informed consent in research. In Ezekiel J. Emanuel, Christine Grady, Robert A. Crouch, Reidar K. Lie, Franklin G. Miller, and David Wendler, editors, The Oxford Textbook of Clinical Research Ethics, pages 613–632. Oxford University Press, New York, NY.

Carnegie-Mellon University, Language Technologies Institute. 2020. Human subjects research determination worksheet. https://web.archive.org/web/20201111132614/http://demo.clab.cs.cmu.edu/11737fa20/slides/multilingual-21-annotation.pdf. Accessed: 2020-11-11.

Committee on Publication Ethics. 2020. https://publicationethics.org. Accessed: 2020-11-22.

Council for International Organizations of Medical Sciences. 2002. International Ethical Guidelines for Biomedical Research Involving Human Subjects. https://cioms.ch/wp-content/uploads/2016/08/International_Ethical_Guidelines_for_Biomedical_Research_Involving_Human_Subjects.pdf. Accessed: 2021-04-06.

Florian Daniel, Pavel Kucherbaev, Cinzia Cappiello, Boualem Benatallah, and Mohammad Allahbakhsh. 2018. Quality control in crowdsourcing: A survey of quality attributes, assessment techniques, and assurance actions. ACM Computing Surveys, 51(1).

Tao Ding and Shimei Pan. 2016. Personalized emphasis framing for persuasive message generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1432–1441, Austin, Texas. Association for Computational Linguistics.

Final Rule. 2018. Protection of Human Subjects, Part 46. In US Department of Health and Human Services, Code of Federal Regulations (CFR) Title 45.

Karën Fort, Gilles Adda, and K. Bretonnel Cohen. 2011. Amazon Mechanical Turk: Gold mine or coal mine? Computational Linguistics, 37(2):413–420.

Mingkun Gao, Wei Xu, and Chris Callison-Burch. 2015. Cost optimization in crowdsourcing translation: Low cost translations made even cheaper. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 705–713, Denver, Colorado. Association for Computational Linguistics.

Richard Gillespie. 1989. Research on human subjects: An historical overview. Bioethics News, 8(2):s4–s15.

Igor Gontcharov. 2018. Qualitative ethics in a positivist frame: The Canadian experience 1998–2014. In The SAGE Handbook of Qualitative Research Ethics, pages 231–246. SAGE Publications Ltd.

Mark A. Graber and Abraham Graber. 2013. Internet-based crowdsourcing and research ethics: The case for IRB review. Journal of Medical Ethics, 39(2):115–118.

Mary L. Gray and Siddharth Suri. 2019. Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Houghton Mifflin Harcourt, Boston, MA.

Xiaochuang Han and Yulia Tsvetkov. 2020. Fortifying toxic speech detectors against veiled toxicity. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7732–7739, Online. Association for Computational Linguistics.

Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598, Berlin, Germany. Association for Computational Linguistics.

Ursula Huws. 2015. Online labour exchanges or 'crowdsourcing': Implications for occupational safety and health. European Agency for Safety and Health at Work (EU-OSHA).

Mark Israel. 2015. Research Ethics and Integrity for Social Scientists: Beyond Regulatory Compliance, 2nd edition. SAGE Publications, Thousand Oaks, CA.

Andrew Conway Ivy. 1948. The history and ethics of the use of human subjects in medical experiments. Science, 108(2792):1–5.

Molly Jackman and Lauri Kanerva. 2015. Evolving the IRB: Building robust review for industry research. Washington & Lee Law Review Online, 72:445.

Albert R. Jonsen. 1998. The ethics of research with human subjects: A short history. In Albert R. Jonsen, Robert M. Veatch, and LeRoy Walters, editors, Source Book in Bioethics: A Documentary History, pages 3–10. Georgetown University Press, Washington, DC.

Albert R. Jonsen. 2005. On the origins and future of the Belmont report. In James F. Childress, Eric M. Meslin, and Harold T. Shapiro, editors, Belmont Revisited: Ethical Principles for Research with Human Subjects, pages 3–11. Georgetown University Press, Washington, DC.

Robert L. Klitzman. 2015. The Ethics Police?: The Struggle to Make Human Research Safe. Oxford University Press, Oxford.

A. Kumaran, Melissa Densmore, and Shaishav Kumar. 2014. Online gaming for crowd-sourcing phrase-equivalents. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1238–1247.

Matthew Lease, Jessica Hullman, Jeffrey Bigham, Michael Bernstein, Juho Kim, Walter Lasecki, Saeideh Bakhshi, Tanushree Mitra, and Robert Miller. 2013. Mechanical Turk is not anonymous. Available at SSRN 2228728.

Susan E. Lederer. 1995. Subjected to Science: Human Experimentation in America Before the Second World War. Johns Hopkins University Press, Baltimore, MD.

Jochen L. Leidner and Vassilis Plachouras. 2017. Ethical by design: Ethics best practices for natural language processing. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 30–40, Valencia, Spain. Association for Computational Linguistics.

Winter Mason and Siddharth Suri. 2011. Conducting behavioral research on Amazon's Mechanical Turk. Behavior Research Methods, 44(1):1–23.

Michelle N. Meyer. 2020. There oughta be a law: When does(n't) the U.S. Common Rule apply? The Journal of Law, Medicine & Ethics, 48(1_suppl):60–73. PMID: 32342740.

NAACL. 2020. Ethics FAQ. https://web.archive.org/web/20201129202126/https://2021.naacl.org/ethics/faq/. Accessed: 2020-11-29.

Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. 2020. CrowS-pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online. Association for Computational Linguistics.

National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. 1978. The Belmont report: Ethical principles and guidelines for the protection of human subjects of research. The Commission, Bethesda, MD.

Vlad Niculae and Cristian Danescu-Niculescu-Mizil. 2016. Conversational markers of constructive discussions. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568–578, San Diego, California. Association for Computational Linguistics.

Haruna Ogawa, Hitoshi Nishikawa, Takenobu Tokunaga, and Hikaru Yokono. 2020. Gamification platform for collecting task-oriented dialogue data. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 7084–7093, Marseille, France. European Language Resources Association.

Emily Öhman, Kaisla Kajava, Jörg Tiedemann, and Timo Honkela. 2018. Creating a dataset for multilingual fine-grained emotion-detection using gamification-based annotation. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 24–30, Brussels, Belgium. Association for Computational Linguistics.

Michael Owen. 2006. Ethical review of social and behavioral science research. In Elliott C. Kulakowski and Lynne U. Chronister, editors, Research Administration and Management, pages 543–556. Jones and Bartlett Publishers, Sudbury, MA.

Carla Parra Escartín, Wessel Reijers, Teresa Lynn, Joss Moorkens, Andy Way, and Chao-Hong Liu. 2017. Ethical considerations in NLP shared tasks. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 66–73, Valencia, Spain. Association for Computational Linguistics.

Verónica Pérez-Rosas and Rada Mihalcea. 2015. Experiments in open domain deception detection. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1120–1125, Lisbon, Portugal. Association for Computational Linguistics.

David B. Resnik. 2018. The Ethics of Research with Human Subjects: Protecting People, Advancing Science, Promoting Trust, volume 74. Springer, Cham.

David J. Rothman. 1987. Ethics and human experimentation. New England Journal of Medicine, 317(19):1195–1199.

Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. 2020. Social bias frames: Reasoning about social and power implications of language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5477–5490, Online. Association for Computational Linguistics.

Zachary M. Schrag. 2010. Ethical Imperialism: Institutional Review Boards and the Social Sciences, 1965–2009. The Johns Hopkins University Press, Baltimore, MD.

Cansu Sen, Thomas Hartvigsen, Biao Yin, Xiangnan Kong, and Elke Rundensteiner. 2020. Human attention maps for text classification: Do humans and neural networks focus on the same words? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4596–4608, Online. Association for Computational Linguistics.

Adil E. Shamoo and David E. Resnik. 2009. Responsible Conduct of Research, 2nd edition. Oxford University Press, New York, NY.

Ashish Sharma, Adam Miner, David Atkins, and Tim Althoff. 2020. A computational approach to understanding empathy expressed in text-based mental health support. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5263–5276, Online. Association for Computational Linguistics.

Miriah Steiger, Timir J. Bharucha, Sukrit Venkatagiri, Martin J. Riedl, and Matthew Lease. 2021. The psychological well-being of content moderators. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI '21, New York, NY, USA. Association for Computing Machinery.

Stephanie Strassel, David Graff, Nii Martey, and Christopher Cieri. 2000. Quality control in large annotation projects involving multiple judges: The case of the TDT corpora. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC'00), Athens, Greece. European Language Resources Association (ELRA).

UNESCO. 2006. Universal Declaration on Bioethics and Human Rights. https://unesdoc.unesco.org/ark:/48223/pf0000146180. Accessed: 2021-04-06.

University of Michigan, Research Ethics and Compliance. 2021. Class assignments & IRB approval. https://web.archive.org/web/20210325081932/https://research-compliance.umich.edu/human-subjects/human-research-protection-program-hrpp/hrpp-policies/class-assignments-irb-approval. Accessed: 2021-03-25.

University of Virginia, Human Research Protection Program. 2020. Subject compensation. https://web.archive.org/web/20200611193644/https://research.virginia.edu/irb-hsr/subject-compensation. Accessed: 2020-11-19.

University of Washington, Office of Research. 2020. Human subjects research determination worksheet. https://web.archive.org/web/20201111071801/https://www.washington.edu/research/wp/wp-content/uploads/WORKSHEET_Human_Subjects_Research_Determination_v2.30_2020.11.02.pdf. Accessed: 2020-11-11.

University of Winthrop, The Office of Grants and Sponsored Research Development. 2021. Research v/s course assignment. https://web.archive.org/web/20210325082350/https://www.winthrop.edu/uploadedFiles/grants/Research-vs-Course-Assignment.pdf. Accessed: 2021-03-25.

Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktäschel, Douwe Kiela, Arthur Szlam, and Jason Weston. 2019. Learning to speak and act in a fantasy text adventure game. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 673–683, Hong Kong, China. Association for Computational Linguistics.

Chao Yang, Shimei Pan, Jalal Mahmud, Huahai Yang, and Padmini Srinivasan. 2015. Using personal traits for brand preference prediction. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 86–96, Lisbon, Portugal. Association for Computational Linguistics.