A TOTAL ERROR FRAMEWORK FOR DIGITAL TRACES OF HUMAN BEHAVIOR ON ONLINE PLATFORMS
Public Opinion Quarterly, Vol. 85, Special Issue, 2021, pp. 399–422

INDIRA SEN*
FABIAN FLÖCK
KATRIN WELLER
BERND WEIß
CLAUDIA WAGNER

INDIRA SEN is a doctoral researcher in the Computational Social Science Department at GESIS–Leibniz Institute for Social Sciences, Cologne, Germany. FABIAN FLÖCK is a team leader in the Computational Social Science Department, GESIS–Leibniz Institute for Social Sciences, Cologne, Germany. KATRIN WELLER is a team leader in the Computational Social Science Department, GESIS–Leibniz Institute for Social Sciences, Cologne, Germany. BERND WEIß is a team leader in the Survey Methodology Department, GESIS–Leibniz Institute for Social Sciences, Mannheim, Germany. CLAUDIA WAGNER is a professor of applied computational social science at RWTH Aachen and department head at the Computational Social Science Department at GESIS–Leibniz Institute for Social Sciences, Cologne, Germany. The authors would like to thank the editors of the POQ Special Issue, especially Frederick Conrad, and the anonymous reviewers for their constructive feedback. The authors also thank Haiko Lietz, Sebastian Stier, Anna-Carolina Haensch, Maria Zens, members of the GESIS Computational Social Science Department, and participants in the Demography Workshop at ICWSM 2019 for helpful discussions and suggestions. The work was supported in part by a grant from the Volkswagen Foundation [92136 to F. F.]. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. *Address correspondence to Indira Sen, GESIS, Computational Social Science Department, 6-8 Unter Sachsenhausen, Cologne 50667, Germany; email: Indira.Sen@gesis.org. doi:10.1093/poq/nfab018. Advance Access publication August 30, 2021. © The Author(s) 2021. Published by Oxford University Press on behalf of American Association for Public Opinion Research. All rights reserved. For permissions, please email: journals.permissions@oup.com.

Abstract

People's activities and opinions recorded as digital traces online, especially on social media and other web-based platforms, offer increasingly informative pictures of the public. They promise to allow inferences about populations beyond the users of the platforms on which the traces are recorded, representing real potential for the social sciences and a complement to survey-based research. But the use of digital traces brings its own complexities and new error sources to the research enterprise. Recently, researchers have begun to discuss the errors that can occur when digital traces are used to learn about humans and social phenomena. This article synthesizes this discussion and proposes a systematic way to categorize potential errors, inspired by the Total Survey Error (TSE) framework developed for survey
methodology. We introduce a conceptual framework to diagnose, understand, and document errors that may occur in studies based on such digital traces. While there are clear parallels to the well-known error sources in the TSE framework, the new "Total Error Framework for Digital Traces of Human Behavior on Online Platforms" (TED-On) identifies several types of error that are specific to the use of digital traces. By providing a standard vocabulary to describe these errors, the proposed framework is intended to advance communication and research about using digital traces in scientific social research.

Introduction

For decades, the empirical social sciences have relied on surveying individuals, utilizing samples mostly taken from well-defined populations, as one of their main data sources. An accompanying development has been the continual improvement of methods and statistical tools to collect and analyze survey data (Groves 2011). Survey methodology (e.g., Joye et al. 2016) has distilled the various errors that occur in measuring the behavior, attitudes, and opinions of a sample population, as well as in generalizing to larger populations, into the Total Survey Error (TSE) framework. The TSE framework (see figure 1) provides a conceptual structure for identifying and describing the errors that can affect survey estimates (Groves et al. 2011; Weisberg 2009; Biemer 2010; Groves and Lyberg 2010). The tenets of the TSE are stable and give survey designers guidance for balancing the cost and efficacy of a potential survey and, not least, a common vocabulary for identifying errors in their research design, from sampling to inference.

[Figure 1. Total survey error components linked to steps in the measurement and representational inference process (Groves 2011).]

Recently, however, surveys have come to face various challenges, including declining participation rates, while alternative modes of data collection have grown simultaneously (Groves 2011). This includes data that have not been collected in a scientifically designed process but are captured as digital traces of users' behavior online. Since data from social media and web platforms are of particular interest to social scientists (e.g., Watts 2007; Lazer et al. 2009; Schober et al. 2016; Salganik 2017), they are the focus of the framework proposed here. Besides their use for studying user behavior on online platforms per se, digital trace data promise, under certain critical assumptions, to generalize to broad target populations similar to surveys, but at lower cost and with larger samples (Salganik 2017). They may also capture near-real-time reactions to current events (e.g., natural disasters), which surveys can only ask about in retrospect. However, digital traces come with various challenges, such as bias due to self-selection, platform affordances, data recording and sharing
practices, heterogeneity, size, and so on, raising epistemological concerns (Ruths and Pfeffer 2014; Tufekci 2014; Schober et al. 2016; Olteanu et al. 2019). Another major hurdle is created by uncertainty about exactly how the users of a platform differ from members of a population to which researchers wish to generalize, a difference that can itself change over time. While not all of these issues can be mitigated, they can be documented and examined for each particular study that leverages digital traces, to understand issues of reliability and validity (e.g., Lazer 2015). Only by developing a thorough understanding of the limitations of a study can we make it comparable with others. Moreover, assessing the epistemic limitations of digital trace data studies can often help
illuminate ethical concerns in the use of these data (e.g., Mittelstadt et al. 2016; Jacobs and Wallach 2019; Olteanu et al. 2019).

Our contributions: Based on the TSE perspective, we propose a framework that encompasses the known error sources potentially involved when using digital trace data: the Total Error Framework for Digital Traces of Human Behavior on Online Platforms (TED-On). This allows researchers to characterize and analyze the errors that occur when using data from online platforms to make inferences about a theoretical construct (see figure 2) in a larger target population beyond the platforms' users.1 By connecting errors in digital trace-based studies and the TSE framework, we establish a common vocabulary for social scientists and computational social scientists to help them document, communicate, and compare their research. The TED-On, moreover, aims to foster critical reflection on study designs based on this shared vocabulary and, consequently, better documentation standards for describing design decisions. Doing so helps lay the foundation for accurate estimates from web and social media data. In our framework, we map errors to their counterparts in the TSE framework and, unlike previous approaches that leverage the TSE perspective (Japec et al. 2015; Hsieh and Murphy 2017; Amaya, Biemer, and Kinyon 2020), describe new types of errors that arise specifically from the idiosyncrasies of digital traces online and associated methods. Further, we adopt the clear distinction between measurement and representation errors for our framework (cf. figure 2), as proposed by Groves et al. (2011). Through running examples (and a case study in Supplementary Material, Section 3) that involve different online platforms, including Twitter and Wikipedia, we demonstrate how errors at every step can, in principle, be discovered and characterized when working with web and social media data. This comprises measurement from the heterogeneous and unstructured sources common to digital trace data, and particularly the challenge of generalizing beyond online platforms.

Background: Research with Digital Traces on the Web

In this section, we view observational digital trace studies through the lens of survey methodology, in particular the "Total Error" perspective. Building on the work of many scholars, Groves et al. (2011) introduce two main conceptual principles to better organize errors: (i) linking errors to the inferential steps in a survey lifecycle and (ii) the twin inferential process that differentiates measurement and representation as sources of errors.

1. The framework can also help document errors for studies that aim to make inferences within a particular platform, for example, to understand the prevalence of hate speech on Reddit.
[Figure 2. Potential measurement and representation errors in a digital trace-based study lifecycle. Errors are classified according to their sources: errors of measurement (due to how the construct is measured) and errors of representation (due to generalizing from the platform population to the target population). Errors follow from the design decisions made by the researcher (see steps in the center strand). "Trace" refers to both user-generated content and interactions with content (e.g., liking, viewing), as well as interactions between users (e.g., following), while "user" stands for the representation of a target audience member on a platform, for example, a social media account.]

While measurement errors arise when survey responses depart from the true values for the construct measured by particular questions, representation errors arise when the responding sample systematically differs from the target population (cf. figure 1). Our work overlaps with recent extensions of the TSE perspective to Big Data (Hsieh and Murphy 2017; Amaya, Biemer, and Kinyon
2020) but identifies an additional set of errors that are specific to digital traces created by users of online platforms and to the particular methods applied for their processing and analysis. A detailed explanation of the TSE framework appears in Supplementary Material, Section 1.

The TED-On framework is centered around an abstraction of the research process for inferring theoretical constructs or social phenomena from digital traces, outlining the potential errors that can occur when these processes are executed. As there is no unified convention for how to conduct such research—potentially involving a wide range of data and methods—the actual workflows likely differ in practice, for example, depending on disciplinary backgrounds.

First, as in a survey, a researcher needs to define the theoretical construct and a conceptual link to the ideal measurement that will quantify it (cf. "construct definition" in figure 2). In survey-based research, however, the outcome of this step is a measurement instrument that can be tailored to the research question before data are generated, including a stimulus (question). In contrast, non-designed but "found" (or "organic") digital traces are non-reactive; that is, they exist independently of any research design. The researcher is neither defining nor administering a stimulus but is observing the traces of platform users in the field, the origins of which are unknown and may—or may not—be related to the envisioned construct. Similarly, this means that when pre-existing digital traces are gathered, individuals are usually unaware that their online activity is subject to scientific investigation. This has the advantage that researchers can directly observe how people behave or express their attitudes, rather than relying on retrospective self-reports in surveys, which are prone to recall problems, social desirability bias, and other misreporting.

Since there is no explicit stimulus from a survey administrator, there are also no explicit responses or respondents in the creation of digital traces. Therefore, we introduce traces and users in the TED-On. Traces can be of two types: (i) user-generated content (Johnson, Safadi, and Faraj 2015), containing rich—but often noisy—information in the form of textual and visual content (Schober et al. 2016), for example, posts, photos, and biographical details in users' profiles; and (ii) records of online activity from users who do not post content but interact with existing posts, other content, and users, for example, by liking, viewing, downloading, or following.

Traces are generated by users, who ideally represent humans in the target population. Users are typically identified via user accounts on a digital platform but might also be captured by IP addresses (e.g., in the case of search engine users) or other ephemeral IDs. In practice, users are selected who can reasonably be believed to represent a human actor, a group of human actors, or an institution. Users representing organizations or automated
programs are at risk of being mistaken for proxies of individual human actors and might have to be removed (cf. Section "Data Preprocessing").

Second, researchers select one or more platforms as a source of digital traces. The selection is driven by how adequately the platform's userbase represents the target population, but it is also influenced by how well the target construct can be measured with the available platform data. The platform's affordances (Malik and Pfeffer 2016) play a central role in how users leave traces, a new source of error for which there is no analogous survey error. The chosen platform(s) also serve(s) as a sampling frame for both (i) traces and (ii) users simultaneously. Since the sampling frame of users is likely influenced by self-selected use of the platform, digital trace data are inherently nonprobabilistic and therefore challenging when generalizing to an off-platform population (Schober et al. 2016).

For data collection, researchers can choose to sample traces and/or users (say, based on their location). Further, they may have access to the entire userbase and all traces on a platform. Sampling from these users and traces is done primarily to narrow the data down to a subset that is most relevant to the construct of interest, and secondarily because it can be logistically challenging to work with the full set of data. Sampling digital traces or users differs from sampling survey respondents because not all traces or users are relevant to the research question, and so they are usually not randomly sampled. Instead, digital traces and users are typically selected via queries, restricting sampled units to those with certain attributes. To capture this distinction, we refer to the extraction of traces and users as a selection, rather than a sample; the selection is conducted by querying the available traces.

The traces and users that compose a selection are preprocessed, routinely with automated methods due to the scale of the data. Preprocessing often involves inferring user attributes (demographics, bot-vs.-human, etc.) and excluding users and traces based on attributes, thus creating new sources of error specific to digital traces. Depending on the research goals, data points would then be aggregated, modeled, and analyzed to produce a final estimate, as is done for survey estimates.

Ideally, the research pipeline as outlined in figure 2 would be followed sequentially, but in practice researchers might revisit certain steps. For instance, researchers are often inspired by preliminary research findings and then iteratively refine either their research question and construct or the data collection process (Howison, Wiggins, and Crowston 2011). This differs from the process in survey-based research, which has less flexibility for repeating the data collection process. Finally, the errors in digital traces, like errors in surveys, have both bias and variance components. In a survey, the questions, the interviewer, and the respondents can be sources of both. Similarly, for digital traces, the researcher's design choices can impact variance and bias: queries (e.g., only via certain popular keywords for a topic),
preprocessing (e.g., inferring demographics only from profile pictures), and analysis techniques (e.g., ignoring outliers) can all impact variance and bias. Platforms and their design further impact bias regarding whom they attract to use the system and how they shape user behavior.

Ethical challenges: While research based on digital traces of humans2 must be conducted ethically, ethical decision making might also limit the research design. Users have typically not consented to be part of a specific study and may be unaware that researchers are using their data (Fiesler and Proferes 2018), but greater awareness of research activities may also lead users to change their behavior. Moreover, research designs may be potentially harmful to certain user groups and not others. For example, automatic classification of user groups with machine learning (ML) methods can lead to a reinforcement of social inequalities, for example, for racial minorities (see "user augmentation error" below). A third challenge to balancing ethics with research efficacy arises when researchers are restricted to only those platform data that are publicly available, potentially reducing the representativeness of the selected traces and users.

A Total Error Framework for Digital Traces of Human Behavior on Online Platforms (TED-On)

We now map the different stages of research using digital traces to those of survey research. We account for the differences between the two and accordingly adapt the error framework to describe and document the different kinds of errors in each step (figure 2).

CONSTRUCT DEFINITION

Given that digital traces from any particular platform are not deliberately produced for the inferential study, researchers must establish a link between the data that are observable on the platform and the theoretical construct of interest. Constructs are abstract "elements of information" (Groves et al. 2011) that survey scientists attempt to measure by recording responses through the survey instrument and, finally, by analyzing responses. The first step of transforming a construct into a measurement involves defining the construct; this requires thinking about competing and related constructs, ideally rooted in theory. A vague definition of the construct or a mismatch between the construct and what can realistically be measured can undermine validity.

2. For a general discussion of ethical considerations when working with web and social media data, see Zimmer and Kinder-Kurlanda (2017) and Olteanu et al. (2019). For practical guidance on designing research that is ethically informed, see Franzke et al. (2020).
Next, researchers have to think about how to operationalize the construct. This entails deliberating about whether a potential measurement sufficiently captures the construct and whether it does not also—or instead—capture other constructs. Because the data largely depend on what the platform captures, what is available to the public, and/or what can be accessed (e.g., via Application Programming Interfaces, APIs), operationalizing the construct may require substantially rethinking it as the available data are explored. An alternative is what Salganik (2017) calls the "ready-made" approach, in which researchers develop constructs based on specific data known to be available from a platform.

Examples of validity: Consider measuring the construct of presidential approval with tweets. In a questionnaire one can directly measure this construct by asking, for example, "Do you approve or disapprove of the way Donald Trump is handling his job as president?"3 Although tweets about the president containing positive words could be considered approval (e.g., O'Connor et al. 2010; Pasek et al. 2020), they could also be sarcastic or directed at a target besides the president. Finally, it can be difficult to ascertain whether a tweet is commenting on how the president is handling their job (a valid measure) or their private life (not necessarily valid).

Consider, as our second example, measuring the construct of "Influenza-Like Illness" (ILI) from Wikipedia usage (McIver and Brownstein 2014) or Google search queries (Preis and Moat 2014). The construct can be straightforwardly defined, and the measurement for cases of ILI traditionally consists of reports from local medical institutions, recorded in the United States by a central agency like the Centers for Disease Control and Prevention. To develop the appropriate measurement, we have to ask what the act of accessing ILI-related Wikipedia pages or searching for ILI-related information on Google implies. Do we assume that these individuals suffer from influenza or related symptoms, have affected peers, or are interested in learning about the disease, potentially inspired by media coverage? Even if the majority of Google or Wikipedia users believe themselves to be infected, feeling sick does not necessarily imply that a user has contracted an ILI.

PLATFORM SELECTION

In selecting a platform, the researcher needs to ensure there is a link between digital traces on the platform and the theoretical construct of interest.

3. This survey question has remained largely unchanged since its inception, aside from updating the question to ask about the current president: https://news.gallup.com/poll/160715/gallup-daily-tracking-questions-methodology.aspx.
They also, however, need to account for the impact of the platform and its community on what traces are observable and the likely divergence between the target and platform populations. Below, we discuss the errors that may occur due to the chosen platform(s).

Platform affordances error: Just as the design, content, and modes of a survey may introduce measurement error,4 so the behavior of users and their traces can be affected by (i) platform-specific sociocultural norms as well as (ii) the platform's design and technical constraints (Wu and Taneja 2020), leading to a type of measurement error we call platform affordances error. For example, Facebook recommends "people you may know," thereby impacting the friendship links that Facebook users create (Malik and Pfeffer 2016), while Twitter's 280-character limit on tweets influences users' writing style (Gligoric, Anderson, and West 2018). "Trending topic" features can shift users' attention, and feedback buttons may deter polarizing utterances. Also, perceived or explicit norms such as community-created guidelines or terms of service can influence what and how users post, for example, politically conservative users being less open about their opinions on a platform they regard as unwelcoming of conservative statements, or contributors self-censoring to avoid being banned.5 Similarly, perceived privacy risks can influence user behavior. A major challenge for digital trace-based studies is, therefore, disentangling what Ruths and Pfeffer (2014) call "platform-driven behavior" from behavior that would occur independently of the platform design and norms.

Evolving norms or technical settings may also affect the validity of longitudinal studies (Bruns and Weller 2016), since these changes may cause "system drifts" as well as "behavioral drifts" (Salganik 2017), contributing to reduced reliability of measurement over time (Lazer 2015). Ideally, researchers will thoroughly investigate how particular platform affordances may affect the measurement they are planning.

Examples of platform affordances error: Because of character limits, users may have to be terse when tweeting, for example about the president, or may have to post multiple, threaded tweets to express their opinion on this topic.

4. For example, questions can be asked about topics for which some responses are more socially desirable than others, and data collection modes can promote or inhibit accurate reporting of sensitive behaviors.
5. Emerging ideologically extreme platforms like Gab and Parler have the potential to promote "migration" of entire user groups away from mainstream platforms, polarizing the platform landscape (Ribeiro et al. 2020).
Similarly, users may be more likely to post about a topic (hashtag) that Twitter's "trends" feature indicates is popular, while refraining from expressing opinions that they expect to be unpopular on the platform. In the case of ILI, the way individuals arrive at and interact with the articles in Wikipedia is shaped by the site's interface design, the inter-article link structure, and the way search results connect with this structure. Users frequently arrive at Wikipedia articles from a Google search (McMahon, Johnson, and Hecht 2017; Dimitrov et al. 2019), implying that Google's ranking of Wikipedia's ILI-related articles, given particular queries, largely determines what digital traces we observe.

Platform coverage error: In general, the userbase of a platform is not aligned with any target population (unless the platform itself is being studied). To the extent this mismatch leads to incorrect estimates about the population, there is platform coverage error, much like undercoverage can be related to coverage error in the TSE framework (e.g., Eckman and Kreuter 2017). Further, different online platforms exhibit varying inclusion probabilities, as they attract specific audiences because of topical or technological idiosyncrasies (Smith and Anderson 2018). Twitter's demographics, for example, differ from those of the general population (Mislove et al. 2011), as do Reddit's (Duggan and Smith 2013). Population discrepancies might also arise from differences in internet penetration or social media adoption rates across socio-demographic or geographical groups, independent of the particular platform (Wang et al. 2019). The change of a platform's user composition over time may also reduce the reliability of a study's results (Salganik 2017).6 Finally, some users may not produce any traces relevant to the research question, making them analogous to survey non-respondents; in practice, users who do not post content are indistinguishable from members of the population who are not users, potentially inflating platform coverage error.

Examples of platform coverage error: Assume the researcher has full access to all relevant data on Twitter. Then, any collection of tweets is necessarily restricted to users who have self-selected to register and express their opinions on this specific social media platform. These respondents likely do not represent any particular population (e.g., US voters), which potentially introduces coverage error that may produce misleading estimates. In addition, non-humans (e.g., bots, businesses) act as users but do not represent individual members of any human population.

Similarly, the users of Wikipedia and search engines more likely represent male and younger individuals than the "average citizen" (Rainie and Tancer 2007). Relying on information from Wikipedia (instead of, say, a local media outlet) might lead to a selection of users who differ from the population of interest, likely increasing potential platform coverage error.

6. One member of the target population may correspond to two users, depending on the platform's membership criteria.
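To make platform coverage error concrete, the following minimal Python sketch compares a platform's demographic composition with census benchmarks and reports how strongly each group is over- or underrepresented. All shares below are invented for illustration and are not estimates for any real platform or population.

```python
# Hypothetical example: quantify platform coverage skew by comparing the
# (invented) demographic composition of a platform's userbase with (invented)
# census benchmarks. A ratio of 1.0 would mean proportional coverage.

census_share = {"18-29": 0.20, "30-49": 0.33, "50-64": 0.25, "65+": 0.22}
platform_share = {"18-29": 0.38, "30-49": 0.40, "50-64": 0.16, "65+": 0.06}

def coverage_ratios(platform, census):
    """Return platform share / census share for each group."""
    return {group: platform[group] / census[group] for group in census}

for group, ratio in coverage_ratios(platform_share, census_share).items():
    status = "over" if ratio > 1 else "under"
    print(f"{group}: coverage ratio {ratio:.2f} ({status}represented)")
```

Ratios like these only describe the skew; whether and how it can be corrected is taken up in the "Analysis and Inference" section under adjustment error.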
DATA COLLECTION

After choosing a platform, the next step in a digital trace study is collecting data. This is commonly done through APIs, web scraping, or collaborations with platform or data providers. Then, even if all recorded traces are available from the platform, researchers often select a subset of traces, users, or both,7 to maximize the relevance of the traces to the construct and of the users to the target population. This is usually done, for traces, by querying explicit features such as keywords or hashtags (O'Connor et al. 2010; Diaz et al. 2016; Stier et al. 2018) and, for users, by collecting their location, demographic characteristics, or inclusion in lists (Chandrasekharan et al. 2017). Each strategy comes with different pitfalls (see Supplementary Material, Section 2).

While for some platforms, like GitHub or Stack Exchange, the entire history of traces is available and the researcher can select from this set freely, others like Facebook Ads or Google Trends only share aggregated traces, and platforms such as Twitter allow researchers access to only a subset of their data. Both platforms and researchers may also decide to limit what data can be collected in order to protect user privacy.

Trace selection error: Typically, researchers query the available traces to select those that are broadly relevant to the construct of interest. To the extent these queries fail to capture all relevant posts or include irrelevant posts, they create a type of measurement error we call trace selection error.8

Example of trace selection error: Assume that we aim to capture all queries about the US president entered into a search engine by its users. If searches mentioning the keyword "Trump" are collected, searches that refer to "Melania Trump" or "playing a trump card" could be included, introducing noise. Likewise, relevant queries might be excluded that simply refer to "the president," "Donald," or acronyms such as POTUS. Excluding eligible or including ineligible traces in this way might harm measurement quality.

7. If a researcher selects traces, they may have to collect supplementary information about the user leaving the trace, for example, information from the user's profile and other posts authored by the same user (Schober et al. 2016).
8. Trace selection error is loosely related to "measurement error" in the TSE—the difference between the true value and the actual response to a survey question—in that traces might be included that do not carry information about the construct, or informative traces might be excluded.
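The keyword-query example above can be made tangible with a small sketch. Assuming a hand-labeled set of queries were available (the terms, texts, and relevance labels below are invented for illustration), the precision and recall of the selection query quantify the two sides of trace selection error: included-but-irrelevant and relevant-but-excluded traces.

```python
# Hypothetical illustration of trace selection error: a keyword query over a
# tiny invented set of search queries with "ground truth" relevance labels.
import re

QUERY_TERMS = [r"\btrump\b", r"\bpotus\b"]          # the selection query
PATTERN = re.compile("|".join(QUERY_TERMS), re.I)

# (text, is_about_the_president): labels a researcher rarely has in practice
traces = [
    ("trump approval rating today", True),
    ("is potus speaking tonight", True),
    ("how to play a trump card in bridge", False),    # included, irrelevant
    ("melania trump fashion", False),                  # included, irrelevant
    ("what did the president say about taxes", True),  # relevant, excluded
    ("donald's speech transcript", True),              # relevant, excluded
]

selected = [(text, relevant) for text, relevant in traces if PATTERN.search(text)]
true_positives = sum(relevant for _, relevant in selected)
precision = true_positives / len(selected)
recall = true_positives / sum(relevant for _, relevant in traces)
print(f"precision={precision:.2f} recall={recall:.2f}")  # both 0.50 here
```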
User selection error: While traces are selected according to their likely ability to measure the construct of interest, their exclusion usually also removes the users that produced them, if no other traces of these users remain among the selected traces. One consequence is that users with specific attributes might be unintentionally excluded (e.g., teenagers may be excluded because they use different terms to refer to the president). The error due to the gap between the type of users selected and the type of users comprising the platform's userbase is user selection error, related to coverage error in the TSE framework.

This error can also occur if researchers intentionally exclude users based on their features (not their traces), for instance, by removing user profiles deemed irrelevant for inferences to the target population based on their indicated location or age. This user selection error is especially complex because groups of users differ in how they voluntarily self-report such attributes; for example, certain demographic groups may be less prone to reveal their location or gender in their user profiles or posts (Pavalanathan and Eisenstein 2015). Such self-reports can additionally be unreliable because their true values vary over time (Salganik 2017).

Aggravating trace and user selection errors, the subset of posts to which some platforms restrict researchers may not represent all traces created by the platform population. Such an arbitrary restriction can be closely linked to sampling error in the TSE framework, since a nonprobability sample is drawn from the platform. For example, a popular choice for obtaining Twitter data is the 1 percent "spritzer" API, yet this free 1 percent sample differs significantly from the commercial 10 percent "gardenhose" API (Morstatter et al. 2013). One reason users might be misrepresented in a selection is that highly active users may be more likely to be selected than others, simply because they produce more traces (Schober et al. 2016).

Examples of user selection error: Certain groups of users (e.g., Spanish-speaking residents of the United States) may be underrepresented if they are selected based on keywords used mainly by English-speaking Americans to refer to political topics. Second, when measuring ILI from aggregate views or searches, there are no explicit user accounts of Wikipedia viewers or Google searchers.9 Different subgroups of the target population (e.g., medically trained individuals) might turn to different articles for their information needs about ILI that are not included in the selected traces (Wikipedia pages or search engine queries). As these traces are excluded, so are these members of the target population.

9. It is not necessary to have an account on Wikipedia or Google (Gmail) to use either.
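The first example above (keyword-based selection underrepresenting Spanish-speaking users) can be sketched as follows. All userbase shares and match rates are invented for illustration; the point is only that group-specific match rates make the composition of the selection diverge from the composition of the platform.

```python
# Hypothetical illustration of user selection error: an English-language query
# retains English-dominant users at a higher rate, so the selected users no
# longer mirror the platform's userbase. All numbers are invented.
platform_userbase = {"english_dominant": 80_000, "spanish_dominant": 20_000}
# assumed probability that a user posts at least one trace matching the query
match_rate = {"english_dominant": 0.30, "spanish_dominant": 0.05}

selected = {g: int(n * match_rate[g]) for g, n in platform_userbase.items()}
total_users = sum(platform_userbase.values())
total_selected = sum(selected.values())

for g in platform_userbase:
    platform_pct = platform_userbase[g] / total_users
    selected_pct = selected[g] / total_selected
    print(f"{g}: {platform_pct:.0%} of platform vs. {selected_pct:.0%} of selection")
```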
DATA PREPROCESSING

Data preprocessing refers to the process of reducing noise by removing ineligible traces and users from the original dataset, as well as augmenting the data with auxiliary information (e.g., inferring demographic features or linking with external sources).

Trace preprocessing: Trace preprocessing can add additional information to traces or discard mistakenly collected traces. Although the goal of trace preprocessing is to improve measurement, it can inadvertently introduce a type of measurement error.

Trace augmentation error: Traces can be augmented in several ways: for instance, sentiment detection, named entity recognition, or stance detection in the textual content of a post; the annotation of "like" actions with receiver and sender information; or the annotation of image material. Due to the large number of data elements in most digital trace datasets, augmentation is mainly automated. Trace augmentation error may be introduced in this step due to false positives and negatives created by the automated method. For example, supervised machine learning (ML) methods might be based on a training set that includes erroneous labels, or they might pick up on spurious signals in the training data, reducing generalizability. The annotation of natural language texts has been particularly popular in digital trace studies; and, despite a large body of research, many challenges remain (Puschmann and Powell 2018), including a lack of algorithmic interpretability and a lack of transferability to domains for which the methods were not originally developed (Sen, Flöck, and Wagner 2020). Study designers need to carefully assess the context for which an annotation method was built before applying it.

Example of trace augmentation error: Researchers often augment political tweets with sentiment measures in studies of public opinion (e.g., O'Connor et al. 2010; Barberá 2016). However, the users of different social media platforms might use different terminology (e.g., "Web-born" slang and abbreviations) than that included in popular sentiment lexicons, and may even use different words on different platforms or in different contexts, leading to misclassification on a certain platform or for a subcommunity on that platform (Hamilton et al. 2016).
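A toy lexicon makes the vocabulary-mismatch problem visible. The lexicon and posts below are invented for illustration and are far smaller than any real sentiment lexicon; the point is that posts written in platform slang fall outside the lexicon's vocabulary and are silently scored as neutral.

```python
# Hypothetical illustration of trace augmentation error: a toy sentiment
# lexicon misses platform-specific slang, so slang-heavy posts are scored as
# neutral even when their stance would be clear to a human reader.
LEXICON = {"great": 1, "good": 1, "love": 1, "bad": -1, "terrible": -1, "hate": -1}

def lexicon_score(text):
    """Sum of word polarities; words missing from the lexicon contribute 0."""
    return sum(LEXICON.get(token, 0) for token in text.lower().split())

posts = [
    "the president did a great job tonight",    # scored +1, as intended
    "this speech is fire, absolute W",           # positive slang, scored 0
    "ratioed again, what an L for the potus",    # negative slang, scored 0
]
for post in posts:
    print(lexicon_score(post), "|", post)
```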
Trace reduction error: Finally, certain traces may be ineligible for a variety of reasons, such as spam or hashtag hijacking, or because they are irrelevant to the task at hand. The error incurred due to the removal of potentially eligible traces—and the non-removal of ineligible traces—is trace reduction error.

Example of trace reduction error: Researchers might decide to use a classifier that removes spam but has a high false positive rate, inadvertently removing non-spam tweets. They might likewise discard tweets without textual content, thereby ignoring relevant statements made through hyperlinks or embedded pictures/videos.

USER PREPROCESSING

In this step, the digital representations of humans are also augmented with auxiliary information, or they are removed if they do not represent members of the target population.

User augmentation error: Later in the research process, researchers may want to reweight digital traces by socio-demographic attributes and/or activity levels of individuals, as is done in surveys to mitigate representation errors (see the "Analysis and Inference" section). However, since such attributes are rarely available from the platform, it is common to infer demographic attributes for individual users, generally using ML methods (Sap et al. 2014; Zhang et al. 2016; Wang et al. 2019). Naturally, such attribute inference can be error-prone (McCormick et al. 2017) and especially problematic if the accuracy of the inference differs systematically between demographic strata (Buolamwini and Gebru 2018). Platforms may also offer aggregate information they have inferred about their users; this information is prone to the same kind of errors, unbeknownst to the researcher (Zagheni, Weber, and Gummadi 2017). The overall error incurred due to user augmentation methods is user augmentation error.

Example of user augmentation error: Twitter users' gender, ethnicity, or location may be inferred to understand how presidential approval differs across demographic groups. Yet automated gender inference methods based on images have higher error rates for African Americans than for Whites (Buolamwini and Gebru 2018); therefore, gender inferred through such means may inaccurately estimate approval rates among African American males versus females compared to their White counterparts, which raises serious ethical concerns.
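The consequence of group-dependent inference accuracy can be simulated directly. The sketch below is a minimal, entirely synthetic simulation: the two groups, their true approval levels, and their misclassification rates are all invented. It only demonstrates the mechanism by which unequal attribute-inference accuracy distorts subgroup estimates.

```python
# Hypothetical simulation of user augmentation error: when an inferred group
# label is wrong more often for one stratum than another, subgroup estimates
# computed on the *inferred* labels drift away from the true values.
import random

random.seed(0)
N = 100_000
TRUE_APPROVAL = {"A": 0.60, "B": 0.40}   # invented true approval per group
MISCLASS_RATE = {"A": 0.05, "B": 0.25}   # group B is misclassified far more often

counts = {"A": [0, 0], "B": [0, 0]}      # [approving, total] per inferred group
for _ in range(N):
    group = random.choice(["A", "B"])
    approves = random.random() < TRUE_APPROVAL[group]
    # attribute inference flips the label with a group-specific probability
    if random.random() < MISCLASS_RATE[group]:
        inferred = "B" if group == "A" else "A"
    else:
        inferred = group
    counts[inferred][0] += approves
    counts[inferred][1] += 1

for g in ("A", "B"):
    estimate = counts[g][0] / counts[g][1]
    print(f"group {g}: true approval {TRUE_APPROVAL[g]:.2f}, estimated {estimate:.2f}")
```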
User reduction error: Preprocessing steps usually include removing spammers and non-human users (mostly bots or organizations). Users are also filtered based on criteria such as tenure on the platform, location, or the presence of a profile picture. These steps are comparable to removing "ineligible units" in surveys, as these users do not represent members of the target population. The performance of particular methods for detecting and removing users may well depend on the characteristics of the data (cf. Echeverría et al. 2018; Wu et al. 2018). Therefore, it is important that, to the extent possible, researchers analyze the accuracy of the user reduction method to estimate the user reduction error.

Examples of user reduction error: While estimating presidential approval from tweets, researchers are usually uninterested in posts by bots or organizations. One can detect such accounts (Alzahrani et al. 2018; Wang et al. 2019) but may not detect all of these ineligible users, or may erroneously remove eligible users that show bot- or organization-like behavior.

ANALYSIS AND INFERENCE

After preprocessing the dataset, the researcher can estimate the prevalence of the construct in the selected traces and, by extension, in the target population.

Trace measurement error: In this step, the construct is concretely measured, for example, with sentiment scores aggregated over traces, users, days, and so on. It is therefore distinct from trace augmentation, where traces are annotated with auxiliary information.10 The measurement can be taken using many different techniques, from simple counting and heuristics to complex ML models. While simpler methods may be less powerful, they are often more interpretable and less computationally intensive. On top of errors incurred in previous stages, the model used might only detect certain aspects of the construct or be affected by spurious artifacts of the data being analyzed. For example, an ML model spotting sexist attitudes may work well for detecting obviously hostile attacks based on gender identity but fail to capture "benevolent sexism," though they are both dimensions of the construct of sexism (Jha and Mamidi 2017). Thus, even if the technique is developed for the specific data collected, it can still suffer from low validity. Any error that distorts how the construct is estimated due to modeling choices is denoted as trace measurement error.

10. In practice, these steps can be highly intertwined, but it is still useful to distinguish them conceptually.
Even if validity is high for the given study, a replication of the measurement model on a different dataset—a future selection of data from the same platform or a completely different dataset—can fail, that is, have low reliability (Conrad et al. 2019; Sen, Flöck, and Wagner 2020).

Amaya, Biemer, and Kinyon (2020) as well as West and colleagues (West, Sakshaug, and Aurelien 2016; West, Sakshaug, and Kim 2017) introduce the concept of analytic error, which West, Sakshaug, and Kim (2017, p. 489) characterize as an "important but understudied aspect of the total survey error (TSE) paradigm." Following Amaya, Biemer, and Kinyon (2020, p. 101), analytic error broadly covers all errors made by researchers when analyzing the data and interpreting the results, especially when applying automated analytic methods. A consequence of analytic error can be false conclusions and incorrectly specified models (Baker 2017). Thus, error incurred due to the choice of modeling techniques is trace measurement error; error due to the choice of survey weights is adjustment error, described below.

Examples of trace measurement error: Assume the construct is defined as presidential approval and traces are augmented with sentiment scores. The researcher obtains traces such as tweets, counts the positive and negative words each contains, and may then define a final link function that combines all traces into a single aggregate to measure "approval." They may aggregate the tweets by counting the normalized positive words per day (Barberá 2016), computing the ratio of positive to negative words per tweet, or dividing the difference between positive and negative words by the total words per day (Pasek et al. 2020). The first calculation, counting positive words per day, may underestimate the negative sentiment of a particular day, while pooling positive and negative words over a day's tweets might overestimate the effect of tweets containing multiple positive (or negative) words. See Conrad et al. (2019) for an illustration of how different ways of calculating daily sentiment can produce dramatically different results.

In some cases, aggregated traces cannot be matched to the users who produced them, making it difficult to identify whether multiple traces refer to the same user (e.g., multiple views of ILI-related Wikipedia pages viewed by the same or different persons). As a result, power users' Wikipedia page visits may be counted much more often than the average readers' visits.
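The sensitivity of the final estimate to the aggregation rule can be demonstrated on a single invented day of data. The three rules below are simplified stand-ins for the aggregation choices just described (they are not the exact formulas used in the cited studies), and the word counts are invented; the point is only that the same traces yield different daily readings under different rules.

```python
# Hypothetical illustration of trace measurement error: one day of toy tweets,
# each represented as (positive_words, negative_words, total_words), scored
# under three plausible aggregation rules.
day_tweets = [(4, 0, 10), (0, 1, 8), (0, 1, 12)]

# Rule 1: positive words per tweet, averaged over the day
rule1 = sum(p for p, n, t in day_tweets) / len(day_tweets)

# Rule 2: mean per-tweet share of positive among sentiment-bearing words
rule2 = sum(p / (p + n) for p, n, t in day_tweets) / len(day_tweets)

# Rule 3: (positive - negative) words pooled over the day, divided by all words
positives = sum(p for p, n, t in day_tweets)
negatives = sum(n for p, n, t in day_tweets)
rule3 = (positives - negatives) / sum(t for p, n, t in day_tweets)

print(f"rule 1: {rule1:.2f}  rule 2: {rule2:.2f}  rule 3: {rule3:.2f}")
# One enthusiastic tweet keeps rules 1 and 3 positive, while rule 2 reports
# that two of the three tweets that day were negative.
```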
Adjustment error: In order to reduce platform coverage error (possibly aggravated by user selection or preprocessing), researchers might adapt techniques designed to reduce bias when estimating population parameters from opt-in (nonprobability) web surveys, particularly adjustment using model-based poststratification (e.g., Goel, Obeng, and Rothschild 2017). Because selections of digital traces are inherently non-probabilistic, reweighting via calibration or poststratification has been explored in studies using digital traces (Zagheni and Weber 2015; Barberá 2016; Pasek et al. 2018; Pasek et al. 2020; Wang et al. 2019). However, reweighting techniques for digital traces have not yet been studied to the extent they have for non-probability survey samples (e.g., Kohler 2019; Kohler, Kreuter, and Stuart 2019; Cornesse and Blom 2020; Cornesse et al. 2020). Broadly, there are two approaches to reweighting in digital trace-based studies. The first approach reweights the data according to a known population, for example census proportions (Zagheni and Weber 2015; Yildiz et al. 2017; Wang et al. 2019). The second approach reweights general population surveys according to the demographic distribution of the online platform itself, found through usage surveys (Pasek et al. 2018, 2020). The lack of available user demographics, reweighting on the basis of attributes that are not actually associated with the platform coverage error, or choosing an adjustment method that is not well suited to the data can all cause adjustment error, in line with Groves et al. (2011).11 Thus, researchers using digital traces may consider developing and/or applying adjustment techniques that support generalization beyond the platform from which the traces are collected.

Examples of adjustment error: When comparing presidential approval on Twitter with survey data, Pasek et al. (2020) reweight the survey estimates with Twitter usage demographics but fail to find alignment between the two measures.12 In this case, the researchers assume that the demographics of Twitter users are the same as for the subset of users tweeting about the president, an assumption that might lead to adjustment error. Previous research indeed indicates that users talking about politics on Twitter tend to have different characteristics than random users (Cohen and Ruths 2013) and tend to be younger and more likely to be white men (Bekafigo and McBride 2013).

11. Corrections solely based on socio-demographic attributes are unlikely to be a panacea for reducing coverage or non-response issues (Schnell, Noack, and Torregroza 2017).
12. Cf. Smith and Anderson (2018) for Twitter's demographic composition.
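A minimal sketch of the first reweighting approach (adjusting platform-derived estimates toward known census proportions) is given below. The demographic cells, their shares, and the within-cell approval rates are all invented for illustration; in a real study, the cell memberships would themselves come from error-prone user augmentation, and the adjustment only helps if within-cell approval on the platform resembles within-cell approval in the target population.

```python
# Hypothetical poststratification sketch: reweight cell-level platform
# estimates from the platform's composition to census proportions.
cells = {
    #              (platform share, census share, approval estimated on platform)
    "male, 18-29":   (0.30,          0.10,         0.35),
    "male, 30+":     (0.35,          0.38,         0.45),
    "female, 18-29": (0.15,          0.10,         0.40),
    "female, 30+":   (0.20,          0.42,         0.55),
}

unadjusted = sum(p_share * approval for p_share, _, approval in cells.values())
poststratified = sum(c_share * approval for _, c_share, approval in cells.values())

print(f"unadjusted platform estimate: {unadjusted:.3f}")
print(f"poststratified to census:     {poststratified:.3f}")
# The two estimates differ because the platform over-represents younger users,
# whose (invented) approval rates differ from those of older groups.
```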
Related Work

Evidence that errors can arise in using online digital traces to study social phenomena has been mounting (Boyd and Crawford 2012; Gayo-Avello 2012; Morstatter et al. 2013; Pavalanathan and Eisenstein 2015; Malik and Pfeffer 2016; Olteanu et al. 2019). Several error discovery strategies for digital-trace-based studies have been developed in data and computer science (Tufekci 2014; Olteanu et al. 2019). Concrete efforts to identify and document errors include "Datasheets for Datasets" (Gebru et al. 2018) and "Model Cards" (Mitchell et al. 2019) for discussing the specifications of particular datasets and ML models, respectively. These advances are compatible with the TED-On.

Related to our approach, Schober et al. (2016) reflect on how survey and social media data differ in several ways, including the possibility that a topic may happen to be covered in a social media corpus much as in a representative survey sample, even if the corpus does not have good population coverage. Hsieh and Murphy (2017) explicitly build on the TSE framework in their Total Twitter Error framework, which describes three errors that can be mapped to survey errors specifically for Twitter. Japec et al. (2015) caution that potential error frameworks will have to account for errors specific to Big Data and outline a way to extend the TSE framework to such data.

Closest to our work is that of Amaya, Biemer, and Kinyon (2020), who apply the Total Error Framework (TEF) to surveys and Big Data. The TEF offers a broader framework than the one presented here, while the TED-On specifically addresses novel errors encountered when working with online digital trace data.

Conclusion

The use of digital traces of human behavior, attitudes, and opinion, especially web and social media data, is gaining traction in the social sciences. In a growing number of scenarios, online digital traces are considered valuable complements to surveys. To make research on online traces more transparent and easier to evaluate, it is important to describe potential error sources systematically. We have therefore introduced a framework for distinguishing the various kinds of errors that may be part of making inferences from these observational data, related to specific platforms and their affordances, and due to the researchers' design choices. Drawing on the TSE framework, our proposed TED-On framework is intended to (i) develop a shared vocabulary to facilitate dialogue among scientists from heterogeneous disciplines, and (ii) aid in pinpointing, documenting, and ultimately avoiding errors in research based on online digital traces. It is our hope that the TED-On might also lay the foundation for better designs by calling attention to the distinct sources of error and where they can occur in the process of using traces for social and behavioral research.

Supplementary Material

Supplementary material may be found in the online version of this article: https://doi.org/10.1093/poq/nfab018.
References

Alzahrani, Sultan, Chinmay Gore, Amin Salehi, and Hasan Davulcu. 2018. "Finding Organizational Accounts Based on Structural and Behavioral Factors on Twitter." In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation, edited by Halil Bisgin, Robert Thomson, Ayaz Hyder, and Christopher Dancy, 164–75. Cham: Springer.

Amaya, Ashley, Paul P. Biemer, and David Kinyon. 2020. "Total Error in a Big Data World: Adapting the TSE Framework to Big Data." Journal of Survey Statistics and Methodology 8(1):89–119.

Baker, Reg. 2017. "Big Data: A Survey Research Perspective." In Total Survey Error in Practice, edited by P. P. Biemer et al., 47–69. Hoboken, NJ: John Wiley and Sons, Inc.

Barberá, Pablo. 2016. "Less Is More? How Demographic Sample Weights Can Improve Public Opinion Estimates Based on Twitter Data." Working Paper, NYU.

Bekafigo, Marija Anna, and Allan McBride. 2013. "Who Tweets about Politics? Political Participation of Twitter Users during the 2011 Gubernatorial Elections." Social Science Computer Review 31(5):625–43.

Biemer, Paul P. 2010. "Total Survey Error: Design, Implementation, and Evaluation." Public Opinion Quarterly 74(5):817–48.

Boyd, Danah, and Kate Crawford. 2012. "Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon." Information, Communication and Society 15(5):662–79.

Bruns, Axel, and Katrin Weller. 2016. "Twitter as a First Draft of the Present: And the Challenges of Preserving It for the Future." In Proceedings of the 8th ACM Conference on Web Science, 183–89, Hannover, Germany.

Buolamwini, Joy, and Timnit Gebru. 2018. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91, New York, NY.

Chandrasekharan, Eshwar, Umashanthi Pavalanathan, Anirudh Srinivasan, Adam Glynn, Jacob Eisenstein, and Eric Gilbert. 2017. "You Can't Stay Here: The Efficacy of Reddit's 2015 Ban Examined through Hate Speech." Proceedings of the ACM on Human-Computer Interaction 1(CSCW):1–22.

Cohen, Raviv, and Derek Ruths. 2013. "Classifying Political Orientation on Twitter: It's Not Easy!" Proceedings of the International AAAI Conference on Web and Social Media 7(1). Retrieved from https://ojs.aaai.org/index.php/ICWSM/article/view/14434.

Conrad, Frederick G., Johann A. Gagnon-Bartsch, Robyn A. Ferg, Michael F. Schober, Josh Pasek, and Elizabeth Hou. 2019. "Social Media as an Alternative to Surveys of Opinions about the Economy." Social Science Computer Review 0894439319875692.

Cornesse, Carina, and Annelies G. Blom. 2020. "Response Quality in Nonprobability and Probability-Based Online Panels." Sociological Methods and Research 0049124120914940. doi: 10.1177/0049124120914940.

Cornesse, Carina, Annelies G. Blom, David Dutwin, Jon A. Krosnick, Edith D. De Leeuw, Stéphane Legleye, Josh Pasek, Darren Pennay, Benjamin Phillips, Joseph W. Sakshaug, Bella Struminskaya, and Alexander Wenz. 2020. "A Review of Conceptual Approaches and Empirical Evidence on Probability and Nonprobability Sample Survey Research." Journal of Survey Statistics and Methodology 8(1):4–36.

Diaz, Fernando, Michael Gamon, Jake M. Hofman, Emre Kıcıman, and David Rothschild. 2016. "Online and Social Media Data as an Imperfect Continuous Panel Survey." PloS One 11(1):e0145406.
Duggan, Maeve, and Aaron Smith. 2013. "6% of Online Adults Are Reddit Users." Pew Internet and American Life Project 3:1–10.
Echeverría, Juan, Emiliano De Cristofaro, Nicolas Kourtellis, Ilias Leontiadis, Gianluca Stringhini, and Shi Zhou. 2018. "Lobo: Evaluation of Generalization Deficiencies in Twitter Bot Classifiers." In Proceedings of the 34th Annual Computer Security Applications Conference, 137–46, San Juan, PR, USA.

Eckman, Stephanie, and Frauke Kreuter. 2017. "The Undercoverage-Nonresponse Trade-Off." In Total Survey Error in Practice, edited by Paul P. Biemer, Edith D. de Leeuw, Stephanie Eckman, Brad Edwards, Frauke Kreuter, Lars E. Lyberg, N. Clyde Tucker, and Brady T. West, 95–113. Hoboken, NJ: Wiley.

Fiesler, Casey, and Nicholas Proferes. 2018. "'Participant' Perceptions of Twitter Research Ethics." Social Media + Society 4(1):2056305118763366.

Franzke, Aline Shakti, Anja Bechmann, Michael Zimmer, and C. Ess. 2020. "Internet Research: Ethical Guidelines 3.0." Association of Internet Researchers.

Gayo-Avello, Daniel. 2012. "'I Wanted to Predict Elections with Twitter and All I Got Was This Lousy Paper'—A Balanced Survey on Election Prediction Using Twitter Data." arXiv preprint arXiv:1204.6441.

Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. "Datasheets for Datasets." arXiv preprint arXiv:1803.09010.

Gligoric, Kristina, Ashton Anderson, and Robert West. 2018. "How Constraints Affect Content: The Case of Twitter's Switch from 140 to 280 Characters." Proceedings of the International AAAI Conference on Web and Social Media 12(1). Retrieved from https://ojs.aaai.org/index.php/ICWSM/article/view/15079.

Goel, Sharad, Adam Obeng, and David Rothschild. 2017. Online, Opt-In Surveys: Fast and Cheap, but Are They Accurate? Working Paper. Stanford, CA: Stanford University.

Groves, Robert M. 2011. "Three Eras of Survey Research." Public Opinion Quarterly 75(5):861–71.

Groves, Robert M., and Lars Lyberg. 2010. "Total Survey Error: Past, Present, and Future." Public Opinion Quarterly 74(5):849–79.

Groves, Robert M., Floyd J. Fowler Jr., Mick P. Couper, James M. Lepkowski, Eleanor Singer, and Roger Tourangeau. 2011. Survey Methodology, vol. 561. Hoboken, NJ: John Wiley and Sons.

Hamilton, William L., Kevin Clark, Jure Leskovec, and Dan Jurafsky. 2016. "Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora." In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 595, Austin, TX.

Howison, James, Andrea Wiggins, and Kevin Crowston. 2011. "Validity Issues in the Use of Social Network Analysis with Digital Trace Data." Journal of the Association for Information Systems 12(12):2.

Hsieh, Yuli Patrick, and Joe Murphy. 2017. "Total Twitter Error." In Total Survey Error in Practice, 23–46. Hoboken, NJ: John Wiley.

Jacobs, Abigail Z., and Hanna Wallach. 2021. "Measurement and Fairness." In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 375–85.

Japec, Lilli, Frauke Kreuter, Marcus Berg, Paul Biemer, Paul Decker, Cliff Lampe, Julia Lane, Cathy O'Neil, and Abe Usher. 2015. "Big Data in Survey Research: AAPOR Task Force Report." Public Opinion Quarterly 79(4):839–80.

Jha, Akshita, and Radhika Mamidi. 2017. "When Does a Compliment Become Sexist?
Analysis and Classification of Ambivalent Sexism Using Twitter Data." In Proceedings of the Second Workshop on NLP and Computational Social Science, 7–16, Vancouver, Canada.

Johnson, Steven L., Hani Safadi, and Samer Faraj. 2015. "The Emergence of Online Community Leadership." Information Systems Research 26(1):165–87.