The cost of coordination can exceed the benefit of collaboration in performing complex tasks

Vince J. Straub (a), Milena Tsvetkova (a,b), and Taha Yasseri (a,c,d,*)

a Oxford Internet Institute, University of Oxford, Oxford OX1 3JS, UK
b Department of Methodology, London School of Economics and Political Science, London WC2A 2AE, UK
c School of Sociology, University College Dublin, Dublin D04 V1W8, Ireland
d Alan Turing Institute, 96 Euston Rd, London NW1 2DB, UK

arXiv:2009.11038v2 [cs.SI] 2 Jan 2021

*To whom correspondence should be addressed: taha.yasseri@ucd.ie

Abstract

Collective decision-making is ubiquitous when observing the behavior of intelligent agents, including humans. However, there are inconsistencies in our theoretical understanding of whether there is a collective advantage from interacting with group members of varying levels of competence in solving problems of varying complexity. Moreover, most existing experiments have relied on highly stylized tasks, reducing the generality of their results. The present study narrows the gap between experimental control and realistic settings, reporting the results from an analysis of collective problem-solving in the context of a real-world citizen science task environment in which individuals with manipulated differences in task-relevant training collaborated on the Wildcam Gorongosa task, hosted by The Zooniverse. We find that dyads gradually improve in performance but do not experience a collective benefit compared to individuals in most situations; rather, the cost of team coordination to efficiency and speed is consistently larger than the leverage of having a partner, even if that partner is expertly trained. Only in terms of accuracy on the most complex task does having an additional expert significantly improve performance over that of non-experts. Our findings have important theoretical and applied implications for collective problem-solving: to improve efficiency, one could prioritize task-relevant training and trained experts working alone over interaction; to improve accuracy, one could target the expertise of selectively trained individuals.

Keywords: collective intelligence | competence | teamwork | decision-making | citizen science
Introduction

Intelligent agents, be they natural or artificial, often interact with others to tackle complex problems. In collective systems, agents are able to combine their own knowledge about their environment (personal information) with information acquired through interaction (social information) [1]. As a result, "collective minds" are more the rule than the exception [2]. In the animal kingdom, notable examples include flocks of birds, schools of fish, and colonies of eusocial insects [3], such as honey bees choosing a suitable nesting site [4]. Similarly, in the artificial realm, even simple agents such as Internet bots may be part of sophisticated exchanges [5]. Humans are no exception; decisions made by interacting individuals are omnipresent in socioeconomic life, occurring in households, classrooms, and, increasingly, via online networks including crowdsourcing sites, creating new interaction dynamics [6]. Yet, in spite of the prevalence and diversity of examples, the phenomenon of collective human decision-making is fundamentally related to a single concern: given differences in personal information and tasks that vary in complexity, when and how should individuals act together to improve outcomes, and when should they work alone?

Substantively, past work tends to build on one of two main observations: (1) judgment feats achieved by large groups and crowds of distributed individuals through the pooling of personal information, often via simple averaging—referred to as the "wisdom of the crowd" [7, 8]—and (2) achievements by smaller collaborating groups and pairs, traditionally studied within the broader topic of group decision-making [9] but more recently revived under the rubric of "collective intelligence" [10]. Whilst both terms have been redefined over the years and are often employed interchangeably in recent research combining studies of small and large groups [11, 12], they can be considered separate insofar as collective intelligence is primarily concerned with the process outcomes of interaction [13], going beyond the mere statistical aggregation of personal information. In this respect,
established theories from social psychology, economics, and organizational behavior offer competing claims as to whether collaboration benefits or hinders performance [14]. Theories of the division of labor and worker specialization state that group work provides a comparative advantage [15]; however, psychological mechanisms such as conformity bias [16], polarization [17], and social loafing [18] are said to reduce collective performance because individuals may act on irrelevant social information or fall prey to herding [19]. Thus, when generalized to real-world decisions, in which all these discordant factors may occur in concert, theoretical perspectives fail to adequately stipulate if, and under what conditions, "two heads are better than one" or "too many cooks spoil the broth".

Empirically, prior work also offers conflicting advice on the performance advantage of collective decision-making. Studies have reached divergent conclusions on the importance of (i) diversity, in terms of individual team member attributes [20, 21]; (ii) group size and structure [22, 23, 24]; (iii) incentive schemes [25]; (iv) the nature of the task, also referred to as the "task context" (used henceforth) or "task environment" [26]; and, perhaps most contentiously, (v) the role of social influence, that is, whether interacting with others undermines [27, 28] or improves [29, 30] collective performance. With respect to pairs of individuals or "dyads"—the most elementary social unit [31] and the focus of this paper—related experimental studies have discovered that additional biases may be at play when individuals interact. Dyad members can suffer from an egocentric preference for personal information and fail to reap collective benefits by not listening to their expert counterparts [32, 33]. Similarly, pairs may underperform due to people's tendency to give equal weight to others' opinions [34]. A result emerging from the cognitive science literature is that individuals experience a collective benefit only when dyad members have similar levels of expertise, as this ensures members accurately convey the confidence of their guess and the dyad can reliably choose the best strategy [35, 36]. Crucially, such studies typically provide participants with
immediate feedback, alongside treating collective benefit as a static event. However, this fails to take into account that actual human pairs often have to perform tasks under uncertainty, whilst simultaneously benefitting from continued individual and social learning [37].

Notably, in many studies of collective decision-making, the benefits of social influence have so far been primarily limited to numeric judgment tasks [38, 39], e.g., number line estimation. Yet, it remains unclear whether the findings easily translate to discrete choice estimates, also known as classification tasks—such as yes/no decisions or selecting among multiple alternatives. In other words, there is a further need to consider how the effect of collective decision-making not only varies as a function of time [40] but also depends on the task context [26].

The present study is novel in that it narrows the gap between experimental control and realistic settings by examining collective decision-making in the context of an established citizen science task—something that has not hitherto been tried for large-scale image classification platforms. As such, it advances a "solution-oriented" approach [41] that not only aims to extend our understanding of the citizen science task under study, but directly seeks to address a common scenario faced by senior researchers or general managers: assessing whether to rely on a qualified expert or a pair of inexperienced team members for solving various tasks effectively, and, if so, whether to provide additional selective training to one or more members. Based on the observation that critical agent characteristics, particularly the relevance of prior personal knowledge, are not randomly distributed, we directly operationalize differences in the extent of task-relevant training between interacting individuals—in contrast to studies that have relied on proxies for fixed differences in decision experience and ability [39].
Study Design

Focusing on dyadic interaction, this study tests the hypothesis that task complexity [42], as well as the distribution of agent characteristics, specifically variation in the levels of task-relevant training, are central to the development of—and the benefits provided by—collective intelligence. To this end, we rely on a novel two-phase experiment developed to examine collective problem-solving in the context of citizen science, a form of crowdsourcing whereby volunteering amateurs or nonprofessional scientists collaborate with experts on scientific research [43]. In the experiment, participants classified pictures taken by motion-detecting camera traps in Gorongosa National Park in Mozambique as part of a wildlife conservation effort (the trail cameras are designed to automatically take a photo when an animal moves in front of them). We instructed participants, physically present in the lab, to use the Wildcam Gorongosa site (https://www.zooniverse.org/projects/zooniverse/wildcam-gorongosa), hosted by The Zooniverse, the world's largest online platform for citizen science [44]. To increase the generality of the findings, the exact number and sequence of images was not controlled at the individual level, as is the case in many prior studies that have employed abstract, stylized tasks. Rather, to reflect the real-world nature of the tasks used in the experiment, we ensured participants classified images in a manner consistent with the experience of volunteering citizens visiting the Wildcam Gorongosa site.

The Wildcam Gorongosa task

Prior work has demonstrated that object recognition is a complex task that requires context-dependent knowledge and various facets of our visual intelligence [45]; hence, citizen scientists primarily carry out tasks for which human-based analysis often still exceeds that of machine intelligence [46]. In the present case, the Wildcam Gorongosa task is considered suitable for the purposes of analysis for the following two reasons. First, it has high ecological validity: as part of an established
crowdsourcing platform, a type of "human-machine network" characteristic of our hyper-connected era [47], the site has engaged over 40,000 volunteers to date [48]. Given the abundance of situations in everyday life where immediate outcomes are difficult, sometimes even impossible, to establish, the task can also be generalized to other contexts in that no feedback was provided to participants (i.e., they classified images without knowing the outcome of their decisions). Second, identifying various features in camera trap wildlife images is sufficiently difficult [49] that it offers the possibility of collaborative benefits, especially when dyad members have received similar training for identifying particular features.

For each image, participants in the experiment were tasked with (1) detecting the presence of the animal(s), (2) identifying the species type, (3) counting how many animals there are, (4) identifying the behaviors exhibited, specifically, whether the animal(s) is (a) standing, (b) resting, (c) moving, (d) eating, or (e) interacting (multiple behaviors may be selected), and (5) recognizing whether any young are present. The 52 possible species options include a 'Nothing here' button but no 'I don't know' option (see Fig. 1A for example images and Supplementary Material, Figs. S1–S4, for further examples and screenshots of the online platform and instructions). Although we still lack a formal language of task complexity, the extent of general versus specific knowledge required to perform a task, as defined by [42], can be considered a useful indicator of a task's complexity, e.g., detecting an animal in the image ('low complexity') versus identifying the species ('high complexity').
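To make the task structure concrete, the following is a minimal sketch of one way a single classification and its per-task scoring could be represented. All field names, and the exact-match rule used to score the multi-label behavior task, are illustrative assumptions rather than the actual Zooniverse data schema (the behavior task was evaluated differently from the single-label tasks; see Supplementary Material, Sec. S3.2).

```python
from dataclasses import dataclass, field

@dataclass
class Classification:
    """One participant's answer for a single camera-trap image.

    Field names are illustrative, not the actual Zooniverse schema.
    """
    animal_present: bool  # task 1: detection (low complexity)
    species: str          # task 2: one of 52 options, e.g. "impala" or "nothing here"
    count_bin: str        # task 3: "1".."10", "11-50", or "51+"
    behaviors: set[str] = field(default_factory=set)  # task 4: multi-label subset of
    # {"standing", "resting", "moving", "eating", "interacting"}
    young_present: bool = False  # task 5: yes/no

def score(guess: Classification, truth: Classification) -> dict[str, bool]:
    """Per-task correctness against the ground truth (aggregated volunteer answers).

    Tasks 1-3 and 5 are single-label; task 4 is multi-label and is scored here
    as an exact set match, one plausible choice among several.
    """
    return {
        "detection": guess.animal_present == truth.animal_present,
        "species":   guess.species == truth.species,
        "count":     guess.count_bin == truth.count_bin,
        "behavior":  guess.behaviors == truth.behaviors,
        "young":     guess.young_present == truth.young_present,
    }
```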
Experimental Procedure

The experiment was divided into a training and a testing stage (T1 and T2). For T1, participants were randomly assigned to two different training conditions, 'General' and 'Targeted', and asked to individually classify a sequence of approximately 50 images, where the content of the images was varied between conditions.

Figure 1: Experimental design. (A) Examples of the experimental task and illustration of the difference between the set of images seen in the general versus the targeted training condition. Instances of images classified in the testing stage are also provided. (B) Example of the interface participants used to classify images of wildlife taken by trail cameras. (C) Sequential schema of events in the experiment. In the training stage T1, every participant classified images individually. In the testing stage T2, participants were additionally assigned to either an individual or dyad condition. Wildcam Gorongosa imagery is reprinted under a CC BY 4.0 license with permission.

T1 was designed to provide selective training to participants in the Targeted treatment condition. The set of images seen by these participants (the Targeted set) thus consisted of pictures sharing specific features, sampled from a predetermined subset of all the images available to classify on Wildcam Gorongosa. Specifically, the Targeted set was restricted to pictures containing antelope species: bushbuck, duiker, impala,
kudu, nyala, oribi, reedbuck, and waterbuck. These animals were chosen as they look visually similar, share a number of morphological features, and exhibit similar behaviors, thus making them relatively harder to distinguish [49]. For the General condition, pictures shared fewer specific features and were instead sampled from a much broader predetermined subset containing a more diverse set of animals, other than the antelope species above, that are easier to distinguish, including baboons, warthogs, and lions, among many others (see Supplementary Material, Sec. S2.3.1 for details). The analysis presented below confirms the effectiveness of this treatment in producing significant differences between groups.

For T2, participants were further randomly assigned to either a 'Solo' or 'Dyad' condition, with participants in the dyad condition having to jointly classify images; one of the dyad members was randomly assigned to input the decisions on behalf of the pair. When taking into account the level of training, this resulted in two individual testing conditions, 'General Solo' and 'Targeted Solo', and three dyad testing conditions, 'General Dyad', 'Targeted Dyad', and 'Mixed Dyad'. T2 was designed to assess the effect of differences in task-relevant training as well as the effect of working alone versus collectively. The set of images seen by all participants, regardless of testing condition, thus consisted of pictures sampled from the Targeted set (see Supplementary Material, Sec. S2.3.2 for details). Fig. 1C illustrates the overall experimental design.

To measure performance, three metrics were computed for every individual or dyad i: pace, defined as the number of images classified by i per minute (referred to as volume when considering the absolute number); accuracy, defined as the number of correct guesses made by i as a proportion of volume; and efficiency, defined as the number of correct guesses made by i per minute. These metrics were chosen as they have been widely used as measures of performance in decision science [50]. Except for pace, which is an aggregate measure—and thus the same across tasks—accuracy and efficiency were computed separately for the five tasks (see Supplementary Material, Sec. S3.2 for details).
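As a minimal sketch of these definitions (function and variable names are ours, not from the replication code), note that for a given task, efficiency factors into pace times accuracy:

```python
def pace(volume: int, minutes: float) -> float:
    """Images classified per minute (volume = absolute number classified)."""
    return volume / minutes

def accuracy(correct: int, volume: int) -> float:
    """Correct guesses as a proportion of all guesses, for one task."""
    return correct / volume if volume else 0.0

def efficiency(correct: int, minutes: float) -> float:
    """Correct guesses per minute; equals pace(v, m) * accuracy(c, v) for one task."""
    return correct / minutes
```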
Results

Focusing first on the effect of training on performance across tasks of varying complexity (e.g., detecting an animal versus guessing the correct species), we conduct a sliding window analysis to assess how performance changes over time (see Materials and Methods for details). Fig. 2 depicts the results, showing the average change in efficiency across each of the five Wildcam Gorongosa tasks during the training (T1) and testing (T2) stages for participants in the General Solo (Fig. 2A) and Targeted Solo (Fig. 2B) groups. Relative differences in learning rates and performance changes indicate that individual performance is mediated both by the level of training received and by the complexity of the task [51, 52, 26].

Individuals in the General Solo group continuously improve across each task during T1, becoming most efficient during the last third of training as the benefits of general training gradually subside, before experiencing a sharp decline during the first phase of testing (Fig. 2A). Although they are able to recover by the end of T2 with respect to animal detection, count, and behavior—tasks of lower complexity—the consistency of the decline across tasks indicates that individuals with general training are initially less efficient when confronted with new task stimuli. In contrast, individuals in the Targeted Solo group continuously improve upon their T1 performance throughout the course of T2, experiencing greater relative gains in efficiency compared to General Solo for each time window (Fig. 2B), suggesting that targeted training provides consistent, cumulative performance benefits.

The greater relative performance gains achieved by Targeted Solo compared to General Solo in T2 across all tasks clearly demonstrate the effectiveness of providing individuals with task-relevant training, whilst also validating the approach taken herein to analyze performance across tasks without changing the fundamental task context.
Figure 2: Individual training effectiveness. Performance changes in terms of average change in efficiency across each of the five Wildcam Gorongosa tasks during the training (T1) and testing (T2) stages for General Solo (A), individuals who received general training during T1, and Targeted Solo (B), individuals who received selective training. Error bars indicate one standard error of the mean.

In line with prior work assessing classification complexity using similar tasks [49], this trend is most evident for the presence-of-young and species identification tasks, that is, tasks of high complexity (see also Supplementary Material, Fig. S6). Importantly, variation in efficiency across the different tasks opens up the question of whether performance differences are primarily the result of improvements in overall pace or in task-specific accuracy, whilst also making it valid to consider how the nature of the task not only moderates the effectiveness of training but also influences the effect of collaborating with others.

Individual versus collective decision-making outcomes

Turning to the main findings, a sliding window analysis was conducted to assess how performance developed over time for participants working solo and in dyads, taking into account differences in training level. To identify the primary source of potential differences, performance is now measured in terms of overall pace (irrespective of task), as well as accuracy and efficiency for each task.

Fig. 3A-B depicts the results for pace, indicating that all participants improved over the course of T2, classifying images progressively faster. More strikingly, however, the figure shows that dyads suffer a significant decline in their pace during the start of
the testing phase—across training conditions and regardless of dyadic composition—whereas among the two solo groups only General Solo individuals, who are now assigned to a more difficult task in the test phase compared to their training phase, show a decline in pace; they again recover towards the end of T2. Statistical tests of the differences in performance during the last third of the testing stage provide further evidence that participants in the solo conditions significantly outperform participants in the respective dyad conditions (p < 0.05 for pairwise comparisons; see Supplementary Material, Sec. S3.3).

Moving on to differences in task-specific performance, Fig. 3C-F shows changes in accuracy and efficiency for the species identification task, considered the most complex. Individuals and dyads with general training unsurprisingly experienced a 12% and 13% reduction in accuracy, respectively, during the first interval of T2, before gradually improving over time (Fig. 3C); however, General Dyads experienced no interaction benefit. Yet, pairing an individual with general training with an individual with targeted training stems this decline: Mixed Dyads outperform both General Solo (p = 0.036) and General Dyads (p = 0.018), indicating that participants with general training can improve upon their expected individual performance by collaborating with a selectively trained partner who can supply task-specific expertise.

Targeted Solo individuals meanwhile experienced a significantly greater average improvement in accuracy during the last interval of T2 (40%; Fig. 3D) compared to Targeted Dyads (24%; p = 0.003), but not compared to Mixed Dyads (29%; p = 0.141); as a result, they appear to consistently achieve the greatest efficiency (Fig. 3F). In contrast, whilst General Solo achieved greater efficiency compared to General Dyads as a result of increased speed (Fig. 3A; p = 0.018), the slight difference between General Solo and Mixed Dyads (Fig. 3E; p = 0.557) implies that individuals without selective training experience a speed advantage when working alone but derive an accuracy benefit when collaborating with a selectively trained teammate (see also Supplementary Material, Fig. S8, which shows that using the smallest window length does not significantly change any of the reported findings).
Figure 3: Individual and collective outcomes. Performance differences between individuals and dyads over the course of T2 in terms of average change in pace (A-B), alongside average change in accuracy (C-D) and efficiency (E-F) for the species identification task, considered the most complex. General Solo individuals are compared to General Dyads and Mixed Dyads (left panels) and Targeted Solo individuals are compared to Targeted Dyads and Mixed Dyads (right panels). The performance changes of Mixed Dyads are computed by using the initial T1 performance of the dyad member with the same training as the individual in the respective solo condition as the 'natural benchmark' (see Materials and Methods for details). Error bars indicate one standard error of the mean.

Surprisingly, across all other tasks there was no clear indication of significant or consistent differences between individuals and dyads in terms of accuracy, as shown in Fig. 4. When comparing changes in accuracy between General Solo and the
respective dyad conditions for the presence-of-young task, considered to be of medium complexity (Fig. 4G), a pattern similar to the changes in accuracy for the species identification task can nevertheless be observed; notably, individuals and dyads both fail to experience any accuracy improvements. Moreover, when comparing Targeted Solo and the respective dyad conditions (Fig. 4H), Targeted Dyads appear to experience the greatest improvements in accuracy during the first two intervals of T2. However, the slight variations suggest that tasks of low complexity may be too simplistic for practically relevant performance differences to emerge.

Taken together, these results provide mixed support for previous findings, demonstrating the contingent benefits of collective decision-making. The stepwise performance improvements experienced by dyads in terms of pace and accuracy for the species identification task are consistent with prior studies showing that collective performance increases gradually for complex tasks in the absence of feedback [53, 52]. Although the observation that dyads do not experience an overall collective benefit across tasks goes against prior results that indicate an interaction benefit in the context of low-level visual enumeration and numeric estimation problems [54, 55], this finding is nevertheless in line with literature that has found no pair advantage when the task context consists of general knowledge-based tests involving discrete choice decisions [56, 57]. This underscores the extent to which performance depends on the task context. Notably, whilst some previous studies have suggested that individuals outperform dyads due to faster skill acquisition [58], the greater accuracy gains achieved by Mixed Dyads compared to General Dyads and General Solo for the species identification task (Fig. 3C) show that, at least for complex tasks, this may further depend on the level of task-relevant training among dyad members.
Figure 4: Individual and dyadic outcomes across tasks of medium and low complexity. Performance differences between individuals and dyads in terms of average change in accuracy for the remaining tasks across General conditions (left panels) and Targeted conditions (right panels). Error bars indicate one standard error of the mean.

More generally, the above results show that task-relevant training, particularly targeted training, provides significant performance improvements, regardless of whether one works individually or in a pair. Importantly, the lack of substantial changes in accuracy experienced by General Solo and Targeted Solo across the majority of tasks depicted in Fig. 4 suggests that observed differences in efficiency between the individual training groups (as shown in Fig. 2) are principally a reflection of differences in speed.
In other words, for simpler tasks, individuals consistently classify images correctly, regardless of training, yet selective training still improves productivity. This is in line with computational simulations of individual and social learning in the context of visual search tasks, which have shown that, in non-complex environments, all strategies find the optimum; differences occur only in the time needed to converge to this optimum [59].

Discussion

Some prior research has shown that interacting dyads experience an all-round collective benefit [60, 61], whilst other work has shown that pairs do not experience any accuracy gains from collaboration [56]. In the face of these contradictory results, this paper stresses that existing experiments and theories of collective decision-making, such as the "confidence matching" model stating that only pairs similar in skill or social sensitivity experience a collective benefit [35], do not adequately stipulate if, and under what conditions, dyads will outperform individuals. Therefore, the immediate contribution of this study is to demonstrate that in a real-world task context, namely Zooniverse's Wildcam Gorongosa citizen science project, the heads of two experts, understood here as individuals with targeted training, are not always better than one. Rather, one expert is more efficient than two experts or a mixed-ability dyad, indicating that the cost of team coordination to efficiency is consistently larger than the leverage of having a partner—even if that partner is also specially trained for the task at hand. Moreover, we also show that a non-expert, an individual with basic training, works faster than a dyad, even if one of the dyad members is an expert, suggesting that individuals exert less effort when working in a pair. However, when it comes to accuracy in the most complex task, having one expert in a dyad significantly improves performance compared to that of individual non-experts or dyads of non-experts.
Returning to the scenario posited at the outset, in which a general manager must decide whether to rely on a qualified expert or a pair of untrained team members for solving various tasks effectively, the results of the present study suggest that, at least for complex classification tasks, overall performance can be maximized by relying on a pool of expertly trained individuals working alone. Similarly, if the speed at which the problem needs to be completed is key, it is also best to rely on individuals working alone. Yet, if accuracy carries more practical importance than productivity, such as when the problem constitutes a single complex task, and it is not possible to provide specialized training to everyone, then recruiting experts and building mixed-ability teams may be most effective. In other words, providing specialized individual training to all or at least some workers, and relying on group work only when accuracy matters, may be the most effective strategy, whilst relying on a dyad of trained experts will likely represent a waste of resources, as it does not provide any additional performance gains. Taken together, these results both complement prior findings [56, 53] and provide new insights, highlighting that the extent of training received by an individual, the complexity of the task at hand, and the desired performance outcome are all critical factors that need to be accounted for when weighing up the benefits of collective decision-making.

Despite the fact that collective decision-making has been studied extensively [62, 63], prior studies have relied mostly on artificial or comparatively simple tasks (e.g., number line or dot estimation), which neither reflect the nature of human interaction in daily life nor account for the uncertainty and limited task-relevant knowledge that individuals often possess (see Supplementary Material, Fig. S9A, which demonstrates that, in the present case, 86% of participants had close-to-none, basic, or average zoology knowledge and none had any expertise). The Wildcam Gorongosa task studied here, however, is not only an established citizen science challenge engaging thousands of participants but also a task that shares significant features with other forms of online collaboration and information-processing activities
more broadly, given that it requires both perceptual ability and general knowledge. Thus, it can be expected that the present results will generalize to similar environments where collaboration may be substituted or complemented with specialized training to improve outcomes. Moreover, the finding that individuals and dyads with similar training do not perform significantly differently suggests that the pair advantage observed in highly controlled experimental studies employing one-dimensional tasks may be less observable in many multi-dimensional contexts. Yet, given some remaining shortcomings, this study also provides building blocks for future research that can help resolve some of the gaps in our understanding of the relationship between task complexity and the performance benefits experienced by interacting individuals.

An obvious limitation of this study is the fact that only one type of task context was studied. Nevertheless, by focusing on a class of real-world problems and advocating for a "solution-oriented" approach [41], we hope that our approach will inspire further research in computational social science on decision-making using externally valid, domain-general tasks. Moreover, we note that the use of a preexisting citizen science task resulted in a significant proportion of participants reporting an interest in contributing further to citizen science, independent of whether they worked alone or in a dyad (see Supplementary Material, Fig. S10, which shows that 68% of participants reported it as either somewhat or extremely likely that they would contribute to citizen science in the long term), thereby additionally demonstrating the potential of using real-world tasks to engage participants as well as to advance our understanding of collective performance on complex problems. Going forward, we welcome recent calls for the field to take greater inspiration from animal research [64] and for researchers to consider how technologies and machines affect human interaction in online collaborative tasks [65] as additional avenues of research to explore and further advance our collective knowledge of collective intelligence.
Materials and Methods

This study was reviewed and approved by the Oxford Internet Institute's Departmental Research Ethics Committee (DREC) on behalf of the Social Sciences and Humanities Inter-Divisional Research Ethics Committee (IDREC) in accordance with the procedures laid down by the University of Oxford for ethical approval of all research involving human participants. Participants (n = 195) were recruited from the general public via the Nuffield Centre for Experimental Social Science (see Supplementary Material, Sec. S2).

Zooniverse Answers

As all images seen by participants had already been evaluated on the Zooniverse website, Zooniverse estimation data were used to assess performance. These data consist of the aggregated guesses of citizen scientists who used the platform prior to when the experiment was conducted. Whilst volunteer estimates have consistently been found to agree with expert-verified wildlife images [66], the ground truth data could still be missing correct attributes that only experts could identify. However, because performance is measured against the citizen scientist-provided estimates, it is by definition impossible for any participant to have performed better than them; hence, we refer to these data as the 'ground truth' throughout (see Supplementary Material, Sec. S3.1 for details on the content of the ground truth).

Sliding Window Analysis

To conduct a sliding window analysis, T1 and T2 were separately split into three equally spaced and non-overlapping intervals (of length 10 and 15 minutes, respectively). The average change in performance was then estimated separately for each window by computing the difference in performance relative to the performance of the respective individual(s) in the first window of T1, which can be thought of as the natural baseline prior to training. In comparing individuals to dyads, two benchmarks are used: (a) a "non-interacting nominal dyad", defined as the average of the initial performance of both dyad members working individually (the sum is used for volume and efficiency, as these are additive processes), and, in the case of Mixed Dyads, (b) a "comparably trained individual", defined as the average initial performance of the dyad member with the same level of training as the individual in the respective solo condition (multiplied by two when considering volume and efficiency). Thus, by benchmarking change in performance against initial performance, we can more precisely estimate whether individual training, interaction, or both provide performance gains, and how this changes across tasks and over time.
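The following is a minimal sketch of this procedure for the efficiency metric on one task. The stage durations (30 and 45 minutes, implied by three windows of 10 and 15 minutes) and all names are our illustrative assumptions, and the dummy data stand in for the real classification streams:

```python
import numpy as np

def stage_efficiency(times: np.ndarray, correct: np.ndarray,
                     t_start: float, t_end: float, n_windows: int = 3) -> np.ndarray:
    """Efficiency (correct guesses per minute) in each of n equal, non-overlapping windows."""
    edges = np.linspace(t_start, t_end, n_windows + 1)
    return np.array([
        correct[(times >= lo) & (times < hi)].sum() / (hi - lo)
        for lo, hi in zip(edges[:-1], edges[1:])
    ])

# Dummy classification stream for one participant: timestamps in minutes and a
# correctness flag per guess on one task (placeholders for the real data).
rng = np.random.default_rng(0)
times_t1 = np.sort(rng.uniform(0.0, 30.0, size=120))  # T1: three 10-minute windows
times_t2 = np.sort(rng.uniform(0.0, 45.0, size=200))  # T2: three 15-minute windows
correct_t1 = rng.random(times_t1.size) < 0.6
correct_t2 = rng.random(times_t2.size) < 0.7

baseline = stage_efficiency(times_t1, correct_t1, 0.0, 30.0)[0]  # first T1 window
change_t2 = stage_efficiency(times_t2, correct_t2, 0.0, 45.0) - baseline

# For the "non-interacting nominal dyad" benchmark, the baseline would instead be
# the sum of both members' individual baselines (efficiency is additive); for the
# "comparably trained individual" benchmark in Mixed Dyads, twice the baseline of
# the member whose training matches the respective solo condition.
```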
Statistical Tests

All statistical tests were two-tailed, and non-parametric alternatives were planned in case the data strongly violated normality assumptions. For omnibus tests, the significance of mean differences between groups was analyzed via post hoc tests. Effect size values were interpreted in simple and standardized terms, according to [67], when no previously reported values were available for comparison [68]. Details of the statistical tests are in Supplementary Material, Sec. S3.3.
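As a sketch of this testing logic, one could screen normality with a Shapiro-Wilk test and fall back to the Mann-Whitney U test as the non-parametric alternative; the specific tests and the alpha threshold here are our assumptions, not the paper's exact procedure (see Supplementary Material, Sec. S3.3 for the actual details):

```python
from scipy import stats

def compare_groups(a, b, alpha: float = 0.05) -> tuple[str, float]:
    """Two-tailed comparison of two groups' performance changes.

    Uses a t-test when both samples look normal (Shapiro-Wilk), otherwise the
    non-parametric Mann-Whitney U test. Illustrative choices, not the paper's
    exact procedure.
    """
    normal = (stats.shapiro(a).pvalue > alpha) and (stats.shapiro(b).pvalue > alpha)
    if normal:
        return "t-test", stats.ttest_ind(a, b, alternative="two-sided").pvalue
    return "Mann-Whitney U", stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
```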
Acknowledgments

We wish to thank Shaun A. Noordin and Chris Lintott for assistance and support with the Wildcam Gorongosa site, as well as all the contributors to the Wildcam Gorongosa project. We are grateful to Paola Bouley, the Gorongosa Lion Project, and Gorongosa National Park for sharing the images and annotations with us. We thank Kaitlyn Gaynor for useful comments on the manuscript and Bahador Bahrami for insightful discussions.

Funding

This project was partially supported by the Howard Hughes Medical Institute. T.Y. was also supported by the Alan Turing Institute under EPSRC grant no. EP/N510129/1.

Author Contributions

V.J.S. analyzed the data and drafted the manuscript. M.T. designed and performed the experiments. T.Y. designed and performed the experiments, designed the analysis, secured the funding, and supervised the project. All authors contributed to writing the manuscript and gave final approval for publication.

Competing Interests

The authors declare no competing interests.

Data Availability

The Gorongosa Lab educational platform used to run the 'Wildcam Gorongosa task' is hosted by The Zooniverse (www.zooniverse.com). Replication data and code are available at the Open Science Framework: https://osf.io/6rcgx/.

References

[1] Étienne Danchin, Luc-Alain Giraldeau, Thomas J. Valone, and Richard H. Wagner. Public information: From nosy neighbors to cultural evolution. Science, 305(5683):487–491, 2004.

[2] Iain Couzin. Collective minds. Nature, 445(7129):715, 2007.

[3] Matthew M. G. Sosna, Colin R. Twomey, Joseph Bak-Coleman, Winnie Poel, Bryan C. Daniels, Pawel Romanczuk, and Iain D. Couzin. Individual and collective encoding of risk in animal groups. Proceedings of the National Academy of Sciences of the United States of America, 116(41):20556–20561, 2019.
[4] Thomas D. Seeley and P. Kirk Visscher. Group decision making in nest-site selection by honey bees. Apidologie, 35:101–116, 2004.

[5] Milena Tsvetkova, Ruth García-Gavilanes, Luciano Floridi, and Taha Yasseri. Even good bots fight: The case of Wikipedia. PLoS ONE, 12(2):1–14, 2017.

[6] Milena Tsvetkova, Ruth García-Gavilanes, and Taha Yasseri. Dynamics of disagreement: Large-scale temporal network analysis reveals negative interactions in online collaboration. Scientific Reports, 6:1–10, 2016.

[7] Francis Galton. Vox populi. Nature, 75:450–451, 1907.

[8] James Surowiecki. The Wisdom of Crowds: Why the Many Are Smarter than the Few and How Collective Wisdom Shapes Business, Economies, Societies, and Nations. Doubleday Books, New York, NY, 2004.

[9] Norbert L. Kerr and R. Scott Tindale. Group performance and decision making. Annual Review of Psychology, 55(1):623–655, 2004.

[10] Thomas W. Malone, Robert Laubacher, and Chrysanthos Dellarocas. The collective intelligence genome. IEEE Engineering Management Review, 38(3):38, 2010.

[11] Louis Rosenberg, David Baltaxe, and Niccolo Pescetelli. Crowds vs swarms, a comparison of intelligence. In 2016 Swarm/Human Blended Intelligence (SHBI), 2016.

[12] Abdullah Almaatouq, Alejandro Noriega-Campero, Abdulrahman Alotaibi, P. M. Krafft, Mehdi Moussaid, and Alex Pentland. Adaptive social networks promote the wisdom of crowds. Proceedings of the National Academy of Sciences of the United States of America, 117(21), 2020.

[13] Antonietta Grasso and Gregorio Convertino. Collective intelligence in organizations: Tools and studies. Computer Supported Cooperative Work, 21(4-5):357–369, 2012.

[14] Eric van Dijk and Carsten K. W. De Dreu. Experimental games and social decision making. Annual Review of Psychology, 72, 2021.

[15] G. S. Becker and K. M. Murphy. The division of labor, coordination costs, and knowledge. The Quarterly Journal of Economics, 107(4):1137–1160, 1992.

[16] I. L. Janis. Victims of Groupthink: A Psychological Study of Foreign-Policy Decisions and Fiascoes. Houghton Mifflin, 1972.

[17] Helmut Lamm and David G. Myers. Group-induced polarization of attitudes and behavior. Advances in Experimental Social Psychology, 11:145–195, 1978.
[18] Steven J. Karau and Kipling D. Williams. Social loafing: A meta-analytic review and theoretical integration. Journal of Personality and Social Psychology, 65(4):681–706, 1993.

[19] Ramsey M. Raafat, Nick Chater, and Chris Frith. Herding in humans. Trends in Cognitive Sciences, 13(10):420–428, 2009.

[20] Lu Hong and Scott E. Page. Groups of diverse problem solvers can outperform groups of high-ability problem solvers. Proceedings of the National Academy of Sciences of the United States of America, 101(46):16385–16389, 2004.

[21] Stephanie De Oliveira and Richard E. Nisbett. Demographically diverse crowds are typically not much wiser than homogeneous crowds. Proceedings of the National Academy of Sciences of the United States of America, 115(9):2066–2071, 2018.

[22] Mirta Galesic, Daniel Barkoczi, and Konstantinos Katsikopoulos. Smaller crowds outperform larger crowds and individuals in realistic task conditions. Decision, 5(1):1–15, 2018.

[23] Joaquin Navajas, Tamara Niella, Gerry Garbulsky, Bahador Bahrami, and Mariano Sigman. Aggregated knowledge from a small number of debates outperforms the wisdom of large crowds. Nature Human Behaviour, 2(2):126–132, 2018.

[24] Joshua Becker, Abdullah Almaatouq, and Agnes Horvat. Network structures of collective intelligence: The contingent benefits of group discussion. arXiv preprint arXiv:2009.07202, 2020.

[25] Richard P. Mann and Dirk Helbing. Optimal incentives for collective intelligence. Proceedings of the National Academy of Sciences of the United States of America, 114(20):5077–5082, 2017.

[26] Rui Mata, Thorsten Pachur, Bettina von Helversen, Ralph Hertwig, Jörg Rieskamp, and Lael Schooler. Ecological rationality: A framework for understanding and aiding the aging decision maker. Frontiers in Neuroscience, 6:19, 2012.

[27] Jan Lorenz, Heiko Rauhut, Frank Schweitzer, and Dirk Helbing. How social influence can undermine the wisdom of crowd effect. Proceedings of the National Academy of Sciences of the United States of America, 108(22):9020–9025, 2011.

[28] Lev Muchnik, Sinan Aral, and Sean J. Taylor. Social influence bias: A randomized experiment. Science, 341(6146):647–651, 2013.

[29] Joshua Becker, Ethan Porter, and Damon Centola. The wisdom of partisan crowds. Proceedings of the National Academy of Sciences of the United States of America, 116(22):10717–10722, 2019.
[30] Simon Farrell. Social influence benefits the wisdom of individuals in the crowd. Proceedings of the National Academy of Sciences of the United States of America, 108(36), 2011.

[31] Steve W. J. Kozlowski and Bradford Bell. Work groups and teams in organizations: Review update. Handbook of Psychology, 2013.

[32] Ilan Yaniv and Maxim Milyavsky. Using advice from multiple sources to revise and improve judgments. Organizational Behavior and Human Decision Processes, 103(1):104–120, 2007.

[33] Alan Novaes Tump, Max Wolf, Jens Krause, and Ralf H. J. M. Kurvers. Individuals fail to reap the collective benefits of diversity because of over-reliance on personal information. Journal of the Royal Society Interface, 15:20180155, 2018.

[34] Ali Mahmoodi, Dan Bang, Karsten Olsen, Yuanyuan Aimee Zhao, Zhenhao Shi, Kristina Broberg, Shervin Safavi, Shihui Han, Majid Nili Ahmadabadi, Chris D. Frith, Andreas Roepstorff, Geraint Rees, and Bahador Bahrami. Equality bias impairs collective decision-making across cultures. Proceedings of the National Academy of Sciences of the United States of America, 112(12):3835–3840, 2015.

[35] Dan Bang, Laurence Aitchison, Rani Moran, Santiago Herce Castanon, Banafsheh Rafiee, Ali Mahmoodi, Jennifer Y. F. Lau, Peter E. Latham, Bahador Bahrami, and Christopher Summerfield. Confidence matching in group decision-making. Nature Human Behaviour, 1(6):1–8, 2017.

[36] Dan Bang, Riccardo Fusaroli, Kristian Tylén, Karsten Olsen, Peter E. Latham, Jennifer Y. F. Lau, Andreas Roepstorff, Geraint Rees, Chris D. Frith, and Bahador Bahrami. Does interaction matter? Testing whether a confidence heuristic can replace interaction in collective decision-making. Consciousness and Cognition, 26(1):13–23, 2014.

[37] Wataru Toyokawa, Andrew Whalen, and Kevin N. Laland. Social learning strategies regulate the wisdom and madness of interactive crowds. Nature Human Behaviour, 3:183–193, 2019.

[38] Bertrand Jayles, Hye-rin Kim, Ramón Escobedo, Stéphane Cezera, Adrien Blanchet, Tatsuya Kameda, Clément Sire, and Guy Theraulaz. How social information can improve estimation accuracy in human groups. Proceedings of the National Academy of Sciences of the United States of America, 114(47):12620–12625, 2017.

[39] Francesco Sella, Robert Blakey, Dan Bang, Bahador Bahrami, and Roi Cohen Kadosh. Who gains more: Experts or novices? The benefits of interaction under numerical uncertainty. Journal of Experimental Psychology: Human Perception and Performance, 44(8):1228–1239, 2018.

[40] Ethan Bernstein, Jesse Shore, and David Lazer. How intermittent breaks in interaction improve collective intelligence. Proceedings of the National Academy of Sciences of the United States of America, 115(35):8734–8739, 2018.
[41] Duncan J. Watts. Should social science be more solution-oriented? Nature Human Behaviour, 1(0015):1–5, 2017.

[42] Robert E. Wood. Task complexity: Definition of the construct. Organizational Behavior and Human Decision Processes, 37:60–82, 1986.

[43] Chiara Franzoni and Henry Sauermann. Crowd science: The organization of scientific research in open collaborative projects. Research Policy, 43(1):1–20, 2014.

[44] Joe Cox, Eun Young Oh, Brooke Simmons, Chris Lintott, Karen Masters, Anita Greenhill, Gary Graham, and Kate Holmes. Defining and measuring success in online citizen science: A case study of Zooniverse projects. Computing in Science and Engineering, 17(4):28–41, 2015.

[45] James J. DiCarlo, Davide Zoccolan, and Nicole C. Rust. How does the brain solve visual object recognition? Neuron, 73(3):415–434, 2012.

[46] Laura Trouille, Chris J. Lintott, and Lucy F. Fortson. Citizen science frontiers: Efficiency, engagement, and serendipitous discovery with human-machine systems. Proceedings of the National Academy of Sciences of the United States of America, 116(6):1902–1909, 2019.

[47] Milena Tsvetkova, Taha Yasseri, Eric T. Meyer, J. Brian Pickering, Vegard Engen, Paul Walland, Marika Lüders, Asbjørn Følstad, and George Bravos. Understanding human-machine networks: A cross-disciplinary survey. ACM Computing Surveys, 50(1), 2017.

[48] Wildcam Gorongosa. Wildcam Gorongosa, 2020. URL https://www.zooniverse.org/projects/zooniverse/wildcam-gorongosa/about/faq.

[49] Mohammad Sadegh Norouzzadeh, Anh Nguyen, Margaret Kosmala, Alexandra Swanson, Meredith S. Palmer, Craig Packer, and Jeff Clune. Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proceedings of the National Academy of Sciences of the United States of America, 115(25), 2018.

[50] L. T. Maloney. Statistical decision theory and biological vision. In D. Heyer and R. Mausfeld, editors, Perception and the Physical World, pages 145–189. Wiley, New York, NY, 2002.

[51] Tiffany Saffell and Nestor Matthews. Task-specific perceptual learning on speed and direction discrimination. Vision Research, 43(12):1365–1374, 2003.

[52] Andrew Mao, Winter Mason, Siddharth Suri, and Duncan J. Watts. An experimental study of team size and performance on a complex task. PLoS ONE, 11(4), 2016.
[53] Bahador Bahrami, Karsten Olsen, Dan Bang, Andreas Roepstorff, Geraint Rees, and Chris Frith. Together, slowly but surely: The role of social interaction and feedback on the build-up of benefit in collective decision-making. Journal of Experimental Psychology: Human Perception and Performance, 38(1):3–8, 2012.

[54] Asher Koriat. When two heads are better than one and when they can be worse: The amplification hypothesis. Journal of Experimental Psychology: General, 144(5):934–950, 2015.

[55] Basil Wahn, Artur Czeszumski, and Peter König. Performance similarities predict collective benefits in dyadic and triadic joint visual search. PLOS ONE, 13(1):e0191179, 2018.

[56] Jonathon P. Schuldt, Christopher F. Chabris, Anita Williams Woolley, and J. Richard Hackman. Confidence in dyadic decision making: The role of individual differences. Journal of Behavioral Decision Making, 30(2):168–180, 2017.

[57] Matthew D. Blanchard, Simon A. Jackson, and Sabina Kleitman. Collective decision making reduces metacognitive control and increases error rates, particularly for overconfident individuals. Journal of Behavioral Decision Making, 1:1–28, 2020.

[58] Amy E. Crook and Margaret E. Beier. When training with a partner is inferior to training alone: The importance of dyad type and interaction quality. Journal of Experimental Psychology: Applied, 16(4):335–348, 2010.

[59] Daniel Barkoczi and Mirta Galesic. Social learning strategies modify the effect of network structure on group performance. Nature Communications, 7:13109, 2016.

[60] Wayne L. Shebilske, J. Wesley Regian, Winfred Arthur Jr., and Jeffrey A. Jordan. A dyadic protocol for training complex skills. Human Factors, 34(3):369–374, 1992.

[61] Asher Koriat. When are two heads better than one and why? Science, 336(6079):360–362, 2012.

[62] Chip Heath and Rich Gonzalez. Interaction with others increases decision confidence but not decision quality: Evidence against information collection views of interactive decision making. Organizational Behavior and Human Decision Processes, 61(3):305–326, 1995.

[63] Thomas Bose, Andreagiovanni Reina, and James A. R. Marshall. Collective decision-making. Current Opinion in Behavioral Sciences, 16:30–34, 2017.

[64] Lisa O'Bryan, Margaret Beier, and Eduardo Salas. How approaches to animal swarm intelligence can improve the study of collective intelligence in human teams, 2020.
[65] Iyad Rahwan, Jacob W. Crandall, and Jean-François Bonnefon. Intelligent machines as social catalysts. Proceedings of the National Academy of Sciences of the United States of America, 117(14):7555–7557, 2020.

[66] Alexandra Swanson, Margaret Kosmala, Chris Lintott, and Craig Packer. A generalized approach for producing, quantifying, and validating citizen science data from wildlife images. Conservation Biology, 30(3):520–531, 2016.

[67] J. Cohen. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, Hillsdale, NJ, 2nd edition, 1988.

[68] Daniël Lakens. Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 2013.

[69] Qualtrics Development Company. Qualtrics, 2020. URL https://www.qualtrics.com.

[70] David Watson and Luciano Floridi. Crowdsourced science: Sociotechnical epistemology in the e-research paradigm. Synthese, 195(2):741–764, 2018.

[71] Holly Rosser and Andrea Wiggins. Tutorial designs and task types in Zooniverse. Proceedings of the ACM Conference on Computer Supported Cooperative Work (CSCW), pages 177–180, 2018.

[72] R. H. Bolen, J. Robaline, and B. Conneely. WildCam Lab: Inquiry and data analysis using trail camera data from Gorongosa National Park, Mozambique. Proceedings of the Association for Biology Laboratory Education, 39(4), 2018.

[73] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[74] Olga Russakovsky, Li-Jia Li, and Li Fei-Fei. Best of both worlds: Human-machine collaboration for object annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2121–2131, 2015.

[75] Bertie Vidgen and Taha Yasseri. P-values: Misunderstood and misused. Frontiers in Physics, 4:6, 2016.
Supplementary Information for: The cost of coordination can exceed the benefit of collaboration in performing complex tasks

Vince J. Straub (a), Milena Tsvetkova (a,b), and Taha Yasseri (a,c,d,*)

a Oxford Internet Institute, University of Oxford, Oxford OX1 3JS, UK
b Department of Methodology, London School of Economics and Political Science, London WC2A 2AE, UK
c School of Sociology, University College Dublin, Dublin D04 V1W8, Ireland
d Alan Turing Institute, 96 Euston Rd, London NW1 2DB, UK

*To whom correspondence should be addressed: taha.yasseri@ucd.ie
S1 Access to Code and Data

This study was reviewed and approved by the Oxford Internet Institute's Departmental Research Ethics Committee (DREC) on behalf of the Social Sciences and Humanities Inter-Divisional Research Ethics Committee (IDREC) in accordance with the procedures laid down by the University of Oxford for ethical approval of all research involving human participants. Replication data and code are available at the Open Science Framework, accessible via the following URL: https://osf.io/6rcgx/?view_only=7d72c5c914e14f6a9f6c56d313a0c08b.

The Gorongosa Lab educational platform is hosted by The Zooniverse (zooniverse.org), an open web-based platform for large-scale citizen science research projects [44]. All pictures shown are by Wildcam Gorongosa, licensed under CC BY 4.0. The survey responses were recorded using the Qualtrics software [69].

S2 Details of Experimental Setup

S2.1 Wildcam Gorongosa Classification Task

For the experiment, participants were asked to solve citizen science classification tasks using the Wildcam Gorongosa platform, first individually during a training stage, T1, and then potentially in a dyad with another participant during a testing stage, T2 (i.e., depending on whether the participant was assigned to an individual or dyad condition). Citizen science is a form of crowdsourced e-Research whereby volunteering amateurs or nonprofessional scientists collaborate with experts on scientific research [70]. These so-called 'citizen scientists' primarily carry out tasks for which human-based analysis often still exceeds that of machine intelligence [46]. In the context of Zooniverse, these simple intelligence tasks include transcription, annotation, and drawing, among others [71]. In the present case, participant(s) were tasked with recognizing animals in pictures taken by camera traps in Mozambique's
Gorongosa National Park set up to document the recovery of wildlife populations (the trail cameras are designed to automatically take a photo when an animal moves in front of them). The tasks participants had to perform are thus real-world problems that constitute an important part of both wildlife research and conservation—Wildcam Gorongosa has engaged over 40,000 volunteers to date [48] and is actively used in classroom settings in the life and environmental sciences [72]. Object recognition more broadly is considered a complex task that relies on different components and is consequently a subject of intensive research in vision science, artificial intelligence, and human-machine collaboration [73, 74, 45].

For each image, participants in the experiment were tasked with (1) detecting the presence of the animal(s), (2) identifying the species type, (3) counting how many animals there are, (4) identifying the behaviors exhibited, specifically, whether the animal(s) is (a) standing, (b) resting, (c) moving, (d) eating, or (e) interacting (multiple behaviors may be selected), and (5) recognizing whether any young are present. The 52 possible species options include a 'Nothing here' option and four 'group' categories: human, bird (other), reptiles, and rodents. Whilst all images participants saw were classified by citizen scientists as containing animals, the 'Nothing here' button allowed participants to still classify images if they failed to detect any visible animals. The option 'human' is meant to indicate any human activity, such as the presence of vehicles. Some images contained more than one species (see Sect. S3.1); however, for the species identification task participants could only identify one species instead of multiple (i.e., single-label classification). Similarly, the animal count task (binned as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11-50, and 51+ individuals) and the detecting-young task (options are Yes/No) are also instances of single-label classification. Nevertheless, classifying behavior can be considered an instance of a multi-label classification task, as multiple behavioral attributes can be selected; consequently, this task was evaluated differently from the others (see Sect. S3.2). Participants were able to filter potential species by