A TOTAL ERROR FRAMEWORK FOR DIGITAL TRACES OF HUMAN BEHAVIOR ON ONLINE PLATFORMS
Public Opinion Quarterly, Vol. 85, Special Issue, 2021, pp. 399–422

INDIRA SEN*
FABIAN FLÖCK
KATRIN WELLER
BERND WEIß
CLAUDIA WAGNER

INDIRA SEN is a doctoral researcher in the Computational Social Science Department at GESIS–Leibniz Institute for Social Sciences, Cologne, Germany. FABIAN FLÖCK is a team leader in the Computational Social Science Department, GESIS–Leibniz Institute for Social Sciences, Cologne, Germany. KATRIN WELLER is a team leader in the Computational Social Science Department, GESIS–Leibniz Institute for Social Sciences, Cologne, Germany. BERND WEIß is a team leader in the Survey Methodology Department, GESIS–Leibniz Institute for Social Sciences, Mannheim, Germany. CLAUDIA WAGNER is a professor of applied computational social science at RWTH Aachen and department head at the Computational Social Science Department at GESIS–Leibniz Institute for Social Sciences, Cologne, Germany. The authors would like to thank the editors of the POQ Special Issue, especially Frederick Conrad, and the anonymous reviewers for their constructive feedback. The authors also thank Haiko Lietz, Sebastian Stier, Anna-Carolina Haensch, Maria Zens, members of the GESIS Computational Social Science Department, and participants in the Demography Workshop at ICWSM 2019 for helpful discussions and suggestions. The work was supported in part by a grant from the Volkswagen Foundation [92136 to F. F.]. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. *Address correspondence to Indira Sen, GESIS, Computational Social Science Department, 6-8 Unter Sachsenhausen, Cologne 50667, Germany; email: Indira.Sen@gesis.org. doi:10.1093/poq/nfab018. Advance Access publication August 30, 2021. © The Author(s) 2021. Published by Oxford University Press on behalf of American Association for Public Opinion Research. All rights reserved. For permissions, please email: journals.permissions@oup.com.

Abstract

People's activities and opinions recorded as digital traces online, especially on social media and other web-based platforms, offer increasingly informative pictures of the public. They promise to allow inferences about populations beyond the users of the platforms on which the traces are recorded, representing real potential for the social sciences and a complement to survey-based research. But the use of digital traces brings its own complexities and new error sources to the research enterprise. Recently, researchers have begun to discuss the errors that can occur when digital traces are used to learn about humans and social phenomena. This article synthesizes this discussion and proposes a systematic way to categorize potential errors, inspired by the Total Survey Error (TSE) framework developed for survey
methodology. We introduce a conceptual framework to diagnose, understand, and document errors that may occur in studies based on such digital traces. While there are clear parallels to the well-known error sources in the TSE framework, the new "Total Error Framework for Digital Traces of Human Behavior on Online Platforms" (TED-On) identifies several types of error that are specific to the use of digital traces. By providing a standard vocabulary to describe these errors, the proposed framework is intended to advance communication and research about using digital traces in scientific social research.

Introduction

For decades, the empirical social sciences have relied on surveying individuals, utilizing samples mostly taken from well-defined populations, as one of their main data sources. An accompanying development has been the continual improvement of methods and statistical tools to collect and analyze survey data (Groves 2011). Survey methodology (e.g., Joye et al. 2016) has distilled the various errors that occur in measuring the behavior, attitudes, and opinions of a sample population, as well as in generalizing to larger populations, into the Total Survey Error (TSE) framework. The TSE framework (see figure 1) provides a conceptual structure for identifying and describing the errors that can affect survey estimates (Groves et al. 2011; Weisberg 2009; Biemer 2010; Groves and Lyberg 2010). The tenets of the TSE are stable and give survey designers guidance for balancing the cost and efficacy of a potential survey and, not least, a common vocabulary for identifying errors in their research design, from sampling to inference.

[Figure 1. Total survey error components linked to steps in the measurement and representational inference process (Groves 2011).]

Recently, however, surveys have come to face various challenges, including declining participation rates, while alternative modes of data collection have grown simultaneously (Groves 2011). This includes data that have not been collected in a scientifically designed process but are captured as digital traces of users' behavior online. Since data from social media and web platforms are of particular interest to social scientists (e.g., Watts 2007; Lazer et al. 2009; Schober et al. 2016; Salganik 2017), they are the focus of the framework proposed here. Besides their use for studying user behavior on online platforms per se, digital trace data promise, under certain critical assumptions, to generalize to broad target populations similar to surveys, but at lower cost and with larger samples (Salganik 2017). They may also capture near-real-time reactions to current events (e.g., natural disasters), which surveys can only ask about in retrospect. However, digital traces come with various challenges, such as bias due to self-selection, platform affordances, data recording and sharing
practices, heterogeneity, size, and so on, raising epistemological concerns (Ruths and Pfeffer 2014; Tufekci 2014; Schober et al. 2016; Olteanu et al. 2019). Another major hurdle is created by uncertainty about exactly how the users of a platform differ from members of a population to which researchers wish to generalize, a difference that can itself change over time. While not all of these issues can be mitigated, they can be documented and examined for each particular study that leverages digital traces, to understand issues of reliability and validity (e.g., Lazer 2015). Only by developing a thorough understanding of the limitations of a study can we make it comparable with others. Moreover, assessing the epistemic limitations of digital trace data studies can often help
illuminate ethical concerns in the use of these data (e.g., Mittelstadt et al. 2016; Jacobs and Wallach 2019; Olteanu et al. 2019).

Our contributions: Based on the TSE perspective, we propose a framework that encompasses the known error sources potentially involved when using digital trace data: the Total Error Framework for Digital Traces of Human Behavior on Online Platforms (TED-On). This allows researchers to characterize and analyze the errors that occur when using data from online platforms to make inferences about a theoretical construct (see figure 2) in a larger target population beyond the platforms' users.1 By connecting errors in digital trace-based studies and the TSE framework, we establish a common vocabulary for social scientists and computational social scientists to help them document, communicate, and compare their research. The TED-On, moreover, aims to foster critical reflection on study designs based on this shared vocabulary and, consequently, better documentation standards for describing design decisions. Doing so helps lay the foundation for accurate estimates from web and social media data. In our framework, we map errors to their counterparts in the TSE framework and, unlike previous approaches that leverage the TSE perspective (Japec et al. 2015; Hsieh and Murphy 2017; Amaya, Biemer, and Kinyon 2020), describe new types of errors that arise specifically from the idiosyncrasies of digital traces online and associated methods. Further, we adopt the clear distinction between measurement and representation errors for our framework (cf. figure 2), as proposed by Groves et al. (2011). Through running examples (and a case study in Supplementary Material, Section 3) that involve different online platforms, including Twitter and Wikipedia, we demonstrate how errors at every step can, in principle, be discovered and characterized when working with web and social media data. This comprises measurement from the heterogeneous and unstructured sources common to digital trace data, and particularly the challenge of generalizing beyond online platforms.

Background: Research with Digital Traces on the Web

In this section, we view observational digital trace studies through the lens of survey methodology, in particular the "Total Error" perspective. Building on the work of many scholars, Groves et al. (2011) introduce two main conceptual principles to better organize errors: (i) linking errors to the inferential steps in a survey lifecycle and (ii) the twin inferential process that differentiates measurement and representation as sources of errors.

1. The framework can also help document errors for studies that aim to make inferences within a particular platform, for example, to understand the prevalence of hate speech on Reddit.
[Figure 2. Potential measurement and representation errors in a digital trace-based study lifecycle. Errors are classified according to their sources: errors of measurement (due to how the construct is measured) and errors of representation (due to generalizing from the platform population to the target population). Errors follow from the design decisions made by the researcher (see steps in the center strand). "Trace" refers to both user-generated content and interactions with content (e.g., liking, viewing), as well as interactions between users (e.g., following), while "user" stands for the representation of a target audience member on a platform, for example, a social media account.]

While measurement errors arise when survey responses depart from the true values for the construct measured by particular questions, representation errors arise when the responding sample systematically differs from the target population (cf. figure 1). Our work overlaps with recent extensions of the TSE perspective to Big Data (Hsieh and Murphy 2017; Amaya, Biemer, and Kinyon
2020) but identifies an additional set of errors that are specific to digital traces created by users of online platforms and to the particular methods applied for their processing and analysis. A detailed explanation of the TSE framework appears in Supplementary Material, Section 1.

The TED-On framework is centered around an abstraction of the research process for inferring theoretical constructs or social phenomena from digital traces, outlining the potential errors that can occur when these processes are executed. As there is no unified convention for how to conduct such research—potentially involving a wide range of data and methods—the actual workflows likely differ in practice, for example, depending on disciplinary backgrounds.

First, as in a survey, a researcher needs to define the theoretical construct and a conceptual link to the ideal measurement that will quantify it (cf. "construct definition" in figure 2). In survey-based research, however, the outcome of this step is a measurement instrument that can be tailored to the research question before data are generated, including a stimulus (question). In contrast, non-designed but "found" (or "organic") digital traces are non-reactive; that is, they exist independently of any research design. The researcher is neither defining nor administering a stimulus but is observing the traces of platform users in the field, the origins of which are unknown and may—or may not—be related to the envisioned construct. Similarly, this means that when pre-existing digital traces are gathered, individuals are usually unaware that their online activity is subject to scientific investigation. This has the advantage that researchers can directly observe how people behave or express their attitudes, rather than relying on retrospective self-reports in surveys, which are prone to recall problems, social desirability bias, and other misreporting.

Since there is no explicit stimulus from a survey administrator, there are also no explicit responses or respondents in the creation of digital traces. Therefore, we introduce traces and users in the TED-On. Traces can be of two types: (i) user-generated content (Johnson, Safadi, and Faraj 2015), containing rich—but often noisy—information in the form of textual and visual content (Schober et al. 2016), for example, posts, photos, and biographical details in users' profiles; and (ii) records of online activity from users who do not post content but interact with existing posts, other content, and users, for example, by liking, viewing, downloading, or following.

Traces are generated by users, who ideally represent humans in the target population. Users are typically identified via user accounts on a digital platform but might also be captured by IP addresses (e.g., in the case of search engine users) or other ephemeral IDs. In practice, users are selected who can reasonably be believed to represent a human actor, a group of human actors, or an institution. Users representing organizations or automated
programs are at risk of being mistaken for proxies of individual human actors and might have to be removed (cf. Section "Data Preprocessing").

Second, researchers select one or more platforms as a source of digital traces. The selection is driven by how adequately the platform's userbase represents the target population, but it is also influenced by how well the target construct can be measured with the available platform data. The platform's affordances (Malik and Pfeffer 2016) play a central role in how users leave traces, a new source of error for which there is no analogous survey error. The chosen platform(s) also serve(s) as a sampling frame for both (i) traces and (ii) users simultaneously. Since the sampling frame of users is likely influenced by self-selected use of the platform, digital trace data are inherently nonprobabilistic and therefore challenging when generalizing to an off-platform population (Schober et al. 2016).

For data collection, researchers can choose to sample traces and/or users (say, based on their location). Further, they may have access to the entire userbase and all traces on a platform. Sampling from these users and traces is done primarily to narrow the data down to a subset that is most relevant to the construct of interest, and secondarily because it can be logistically challenging to work with the full set of data. Sampling digital traces or users differs from sampling survey respondents because not all traces or users are relevant to the research question, and so they are usually not randomly sampled. Instead, digital traces and users are typically selected via queries, restricting sampled units to those with certain attributes. To capture this distinction, we refer to the extraction of traces and users as a selection, rather than a sample; the selection is conducted by querying the available traces.

The traces and users that compose a selection are preprocessed, routinely with automated methods due to the scale of the data. Preprocessing often involves inferring user attributes (demographics, bot-vs.-human, etc.) and excluding users and traces based on attributes, thus creating new sources of error specific to digital traces. Depending on the research goals, data points would then be aggregated, modeled, and analyzed to produce a final estimate, as is done for survey estimates.

Ideally, the research pipeline as outlined in figure 2 would be followed sequentially, but in practice researchers might revisit certain steps. For instance, researchers are often inspired by preliminary research findings and then iteratively refine either their research question and construct or the data collection process (Howison, Wiggins, and Crowston 2011). This differs from the process in survey-based research, which has less flexibility for repeating the data collection process. Finally, the errors in digital traces, like errors in surveys, have both bias and variance components. In a survey, the questions, the interviewer, and the respondents can be sources of both. Similarly, for digital traces, the researcher's design choices can impact variance and bias: queries (e.g., only via certain popular keywords for a topic),
preprocessing (e.g., inferring demographics only from profile pictures), and analysis techniques (e.g., ignoring outliers) can all impact variance and bias. Platforms and their design further impact bias regarding whom they attract to use the system and how they shape user behavior.

Ethical challenges: While research based on digital traces of humans2 must be conducted ethically, ethical decision making might also limit the research design. Users have typically not consented to be part of a specific study and may be unaware that researchers are using their data (Fiesler and Proferes 2018), but greater awareness of research activities may also lead users to change their behavior. Moreover, research designs may be potentially harmful to certain user groups and not others. For example, automatic classification of user groups with machine learning (ML) methods can lead to a reinforcement of social inequalities, for example, for racial minorities (see "user augmentation error" below). A third challenge to balancing ethics with research efficacy arises when researchers are restricted to only those platform data that are publicly available, potentially reducing the representativeness of the selected traces and users.

A Total Error Framework for Digital Traces of Human Behavior on Online Platforms (TED-On)

We now map the different stages of research using digital traces to those of survey research. We account for the differences between the two and accordingly adapt the error framework to describe and document the different kinds of errors in each step (figure 2).

CONSTRUCT DEFINITION

Given that digital traces from any particular platform are not deliberately produced for the inferential study, researchers must establish a link between the data that are observable on the platform and the theoretical construct of interest. Constructs are abstract "elements of information" (Groves et al. 2011) that survey scientists attempt to measure by recording responses through the survey instrument and, finally, by analyzing responses. The first step of transforming a construct into a measurement involves defining the construct; this requires thinking about competing and related constructs, ideally rooted in theory. A vague definition of the construct or a mismatch between the construct and what can realistically be measured can undermine validity.

2. For a general discussion of ethical considerations when working with web and social media data, see Zimmer and Kinder-Kurlanda (2017) and Olteanu et al. (2019). For practical guidance on designing research that is ethically informed, see Franzke et al. (2020).
Next, researchers have to think about how to operationalize the construct. This entails deliberating about whether a potential measurement sufficiently captures the construct and whether it does not also—or instead—capture other constructs. Because the data largely depend on what the platform captures, what is available to the public, and/or what can be accessed (e.g., via Application Programming Interfaces, APIs), operationalizing the construct may require substantially rethinking it as the available data are explored. An alternative is what Salganik (2017) calls the "ready-made" approach, in which researchers develop constructs based on specific data known to be available from a platform.

Examples of validity: Consider measuring the construct of presidential approval with tweets. In a questionnaire one can directly measure this construct by asking, for example, "Do you approve or disapprove of the way Donald Trump is handling his job as president?"3 Although tweets about the president containing positive words could be considered approval (e.g., O'Connor et al. 2010; Pasek et al. 2020), they could also be sarcastic or directed at a target besides the president. Finally, it can be difficult to ascertain whether a tweet is commenting on how the president is handling their job (a valid measure) or their private life (not necessarily valid).

Consider, as our second example, measuring the construct of "Influenza-Like Illness" (ILI) from Wikipedia usage (McIver and Brownstein 2014) or Google search queries (Preis and Moat 2014). The construct can be straightforwardly defined, and the measurement for cases of ILI traditionally consists of reports from local medical institutions, recorded in the United States by a central agency like the Centers for Disease Control and Prevention. To develop the appropriate measurement, we have to ask what the act of accessing ILI-related Wikipedia pages or searching for ILI-related information on Google implies. Do we assume that these individuals suffer from influenza or related symptoms, have affected peers, or are interested in learning about the disease, potentially inspired by media coverage? Even if the majority of Google or Wikipedia users believe themselves to be infected, feeling sick does not necessarily imply that a user has contracted an ILI.

PLATFORM SELECTION

In selecting a platform, the researcher needs to ensure there is a link between digital traces on the platform and the theoretical construct of interest.

3. This survey question has remained largely unchanged since its inception, aside from updating the question to ask about the current president: https://news.gallup.com/poll/160715/gallup-daily-tracking-questions-methodology.aspx.
They also, however, need to account for the impact of the platform and its community on what traces are observable and the likely divergence between the target and platform populations. Below, we discuss the errors that may occur due to the chosen platform(s).

Platform affordances error: Just as the design, content, and modes of a survey may introduce measurement error,4 so the behavior of users and their traces can be affected by (i) platform-specific sociocultural norms as well as (ii) the platform's design and technical constraints (Wu and Taneja 2020), leading to a type of measurement error we call platform affordances error. For example, Facebook recommends "people you may know," thereby impacting the friendship links that Facebook users create (Malik and Pfeffer 2016), while Twitter's 280-character limit on tweets influences users' writing style (Gligoric, Anderson, and West 2018). "Trending topic" features can shift users' attention, and feedback buttons may deter polarizing utterances. Also, perceived or explicit norms such as community-created guidelines or terms of service can influence what and how users post, for example, politically conservative users being less open about their opinions on a platform they regard as unwelcoming of conservative statements, or contributors self-censoring to avoid being banned.5 Similarly, perceived privacy risks can influence user behavior. A major challenge for digital trace-based studies is, therefore, disentangling what Ruths and Pfeffer (2014) call "platform-driven behavior" from behavior that would occur independently of the platform design and norms.

Evolving norms or technical settings may also affect the validity of longitudinal studies (Bruns and Weller 2016), since these changes may cause "system drifts" as well as "behavioral drifts" (Salganik 2017), contributing to reduced reliability of measurement over time (Lazer 2015). Ideally, researchers will thoroughly investigate how particular platform affordances may affect the measurement they are planning.

Examples of platform affordances error: Because of character limits, users may have to be terse when tweeting, for example about the president, or may have to post multiple, threaded tweets to express their opinion on this topic.

4. For example, questions can be asked about topics for which some responses are more socially desirable than others, and data collection modes can promote or inhibit accurate reporting of sensitive behaviors.
5. Emerging ideologically extreme platforms like Gab and Parler have the potential to promote "migration" of entire user groups away from mainstream platforms, polarizing the platform landscape (Ribeiro et al. 2020).
Similarly, users may be more likely to post about a topic (hashtag) that Twitter's "trends" feature indicates is popular, while refraining from expressing opinions that they expect to be unpopular on the platform. In the case of ILI, the way individuals arrive at and interact with the articles in Wikipedia is shaped by the site's interface design, the inter-article link structure, and the way search results connect with this structure. Users frequently arrive at Wikipedia articles from a Google search (McMahon, Johnson, and Hecht 2017; Dimitrov et al. 2019), implying that Google's ranking of Wikipedia's ILI-related articles, given particular queries, largely determines what digital traces we observe.

Platform coverage error: In general, the userbase of a platform is not aligned with any target population (unless the platform itself is being studied). To the extent this mismatch leads to incorrect estimates about the population, there is platform coverage error, much like undercoverage can be related to coverage error in the TSE framework (e.g., Eckman and Kreuter 2017). Further, different online platforms exhibit varying inclusion probabilities, as they attract specific audiences because of topical or technological idiosyncrasies (Smith and Anderson 2018). Twitter's demographics, for example, differ from those of the general population (Mislove et al. 2011), as do Reddit's (Duggan and Smith 2013). Population discrepancies might also arise from differences in internet penetration or social media adoption rates across socio-demographic or geographical groups, independent of the particular platform (Wang et al. 2019). The change of a platform's user composition over time may also reduce the reliability of a study's results (Salganik 2017).6 Finally, some users may not produce any traces relevant to the research question, making them analogous to survey non-respondents; in practice, users who do not post content are indistinguishable from members of the population who are not users, potentially inflating platform coverage error.

Examples of platform coverage error: Assume the researcher has full access to all relevant data on Twitter. Then, any collection of tweets is necessarily restricted to users who have self-selected to register and express their opinions on this specific social media platform. These respondents likely do not represent any particular population (e.g., US voters), which potentially introduces coverage error that may produce misleading estimates. In addition, non-humans (e.g., bots, businesses) act as users but do not represent individual members of any human population.

Similarly, the users of Wikipedia and search engines more likely represent male and younger individuals than the "average citizen" (Rainie and Tancer 2007). Relying on information from Wikipedia (instead of, say, a local media outlet) might lead to a selection of users who differ from the population of interest, likely increasing potential platform coverage error.

6. One member of the target population may correspond to two users, depending on the platform's membership criteria.
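To make platform coverage error concrete, the following minimal Python sketch compares a platform's demographic composition with census benchmarks and reports how strongly each group is over- or underrepresented. All shares below are invented for illustration and are not estimates for any real platform or population.

```python
# Hypothetical example: quantify platform coverage skew by comparing the
# (invented) demographic composition of a platform's userbase with (invented)
# census benchmarks. A ratio of 1.0 would mean proportional coverage.

census_share = {"18-29": 0.20, "30-49": 0.33, "50-64": 0.25, "65+": 0.22}
platform_share = {"18-29": 0.38, "30-49": 0.40, "50-64": 0.16, "65+": 0.06}

def coverage_ratios(platform, census):
    """Return platform share / census share for each group."""
    return {group: platform[group] / census[group] for group in census}

for group, ratio in coverage_ratios(platform_share, census_share).items():
    status = "over" if ratio > 1 else "under"
    print(f"{group}: coverage ratio {ratio:.2f} ({status}represented)")
```

Ratios like these only describe the skew; whether and how it can be corrected is taken up in the "Analysis and Inference" section under adjustment error.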
DATA COLLECTION

After choosing a platform, the next step in a digital trace study is collecting data. This is commonly done through APIs, web scraping, or collaborations with platform or data providers. Then, even if all recorded traces are available from the platform, researchers often select a subset of traces, users, or both,7 to maximize the relevance of the traces to the construct and of the users to the target population. This is usually done, for traces, by querying explicit features such as keywords or hashtags (O'Connor et al. 2010; Diaz et al. 2016; Stier et al. 2018) and, for users, by collecting their location, demographic characteristics, or inclusion in lists (Chandrasekharan et al. 2017). Each strategy comes with different pitfalls (see Supplementary Material, Section 2).

While for some platforms, like GitHub or Stack Exchange, the entire history of traces is available and the researcher can select from this set freely, others like Facebook Ads or Google Trends only share aggregated traces, and platforms such as Twitter allow researchers access to only a subset of their data. Both platforms and researchers may also decide to limit what data can be collected in order to protect user privacy.

Trace selection error: Typically, researchers query the available traces to select those that are broadly relevant to the construct of interest. To the extent these queries fail to capture all relevant posts or include irrelevant posts, they create a type of measurement error we call trace selection error.8

Example of trace selection error: Assume that we aim to capture all queries about the US president entered into a search engine by its users. If searches mentioning the keyword "Trump" are collected, searches that refer to "Melania Trump" or "playing a trump card" could be included, introducing noise. Likewise, relevant queries might be excluded that simply refer to "the president," "Donald," or acronyms such as POTUS. Excluding eligible or including ineligible traces in this way might harm measurement quality.

7. If a researcher selects traces, they may have to collect supplementary information about the user leaving the trace, for example, information from the user's profile and other posts authored by the same user (Schober et al. 2016).
8. Trace selection error is loosely related to "measurement error" in the TSE—the difference between the true value and the actual response to a survey question—in that traces might be included that do not carry information about the construct, or informative traces might be excluded.
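The keyword-query example above can be made tangible with a small sketch. Assuming a hand-labeled set of queries were available (the terms, texts, and relevance labels below are invented for illustration), the precision and recall of the selection query quantify the two sides of trace selection error: included-but-irrelevant and relevant-but-excluded traces.

```python
# Hypothetical illustration of trace selection error: a keyword query over a
# tiny invented set of search queries with "ground truth" relevance labels.
import re

QUERY_TERMS = [r"\btrump\b", r"\bpotus\b"]          # the selection query
PATTERN = re.compile("|".join(QUERY_TERMS), re.I)

# (text, is_about_the_president): labels a researcher rarely has in practice
traces = [
    ("trump approval rating today", True),
    ("is potus speaking tonight", True),
    ("how to play a trump card in bridge", False),    # included, irrelevant
    ("melania trump fashion", False),                  # included, irrelevant
    ("what did the president say about taxes", True),  # relevant, excluded
    ("donald's speech transcript", True),              # relevant, excluded
]

selected = [(text, relevant) for text, relevant in traces if PATTERN.search(text)]
true_positives = sum(relevant for _, relevant in selected)
precision = true_positives / len(selected)
recall = true_positives / sum(relevant for _, relevant in traces)
print(f"precision={precision:.2f} recall={recall:.2f}")  # both 0.50 here
```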
User selection error: While traces are selected according to their likely ability to measure the construct of interest, their exclusion usually also removes the users that produced them, if no other traces of these users remain among the selected traces. One consequence is that users with specific attributes might be unintentionally excluded (e.g., teenagers may be excluded because they use different terms to refer to the president). The error due to the gap between the type of users selected and the type of users comprising the platform's userbase is user selection error, related to coverage error in the TSE framework.

This error can also occur if researchers intentionally exclude users based on their features (not their traces), for instance, by removing user profiles deemed irrelevant for inferences to the target population based on their indicated location or age. This user selection error is especially complex because groups of users differ in how they voluntarily self-report such attributes; for example, certain demographic groups may be less prone to reveal their location or gender in their user profiles or posts (Pavalanathan and Eisenstein 2015). Such self-reports can additionally be unreliable because their true values vary over time (Salganik 2017).

Aggravating trace and user selection errors, the subset of posts to which some platforms restrict researchers may not represent all traces created by the platform population. Such an arbitrary restriction can be closely linked to sampling error in the TSE framework, since a nonprobability sample is drawn from the platform. For example, a popular choice for obtaining Twitter data is the 1 percent "spritzer" API, yet this free 1 percent sample differs significantly from the commercial 10 percent "gardenhose" API (Morstatter et al. 2013). One reason users might be misrepresented in a selection is that highly active users may be more likely to be selected than others, simply because they produce more traces (Schober et al. 2016).

Examples of user selection error: Certain groups of users (e.g., Spanish-speaking residents of the United States) may be underrepresented if they are selected based on keywords used mainly by English-speaking Americans to refer to political topics. Second, when measuring ILI from aggregate views or searches, there are no explicit user accounts of Wikipedia viewers or Google searchers.9 Different subgroups of the target population (e.g., medically trained individuals) might turn to different articles for their information needs about ILI that are not included in the selected traces (Wikipedia pages or search engine queries). As these traces are excluded, so are these members of the target population.

9. It is not necessary to have an account on Wikipedia or Google (Gmail) to use either.
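The first example above (keyword-based selection underrepresenting Spanish-speaking users) can be sketched as follows. All userbase shares and match rates are invented for illustration; the point is only that group-specific match rates make the composition of the selection diverge from the composition of the platform.

```python
# Hypothetical illustration of user selection error: an English-language query
# retains English-dominant users at a higher rate, so the selected users no
# longer mirror the platform's userbase. All numbers are invented.
platform_userbase = {"english_dominant": 80_000, "spanish_dominant": 20_000}
# assumed probability that a user posts at least one trace matching the query
match_rate = {"english_dominant": 0.30, "spanish_dominant": 0.05}

selected = {g: int(n * match_rate[g]) for g, n in platform_userbase.items()}
total_users = sum(platform_userbase.values())
total_selected = sum(selected.values())

for g in platform_userbase:
    platform_pct = platform_userbase[g] / total_users
    selected_pct = selected[g] / total_selected
    print(f"{g}: {platform_pct:.0%} of platform vs. {selected_pct:.0%} of selection")
```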
DATA PREPROCESSING

Data preprocessing refers to the process of reducing noise by removing ineligible traces and users from the original dataset, as well as augmenting the data with auxiliary information (e.g., inferring demographic features or linking with external sources).

Trace preprocessing: Trace preprocessing can add additional information to traces or discard mistakenly collected traces. Although the goal of trace preprocessing is to improve measurement, it can inadvertently introduce a type of measurement error.

Trace augmentation error: Traces can be augmented in several ways: for instance, sentiment detection, named entity recognition, or stance detection in the textual content of a post; the annotation of "like" actions with receiver and sender information; or the annotation of image material. Due to the large number of data elements in most digital trace datasets, augmentation is mainly automated. Trace augmentation error may be introduced in this step due to false positives and negatives created by the automated method. For example, supervised machine learning (ML) methods might be based on a training set that includes erroneous labels, or they might pick up on spurious signals in the training data, reducing generalizability. The annotation of natural language texts has been particularly popular in digital trace studies; and, despite a large body of research, many challenges remain (Puschmann and Powell 2018), including a lack of algorithmic interpretability and a lack of transferability to domains for which the methods were not originally developed (Sen, Flöck, and Wagner 2020). Study designers need to carefully assess the context for which an annotation method was built before applying it.

Example of trace augmentation error: Researchers often augment political tweets with sentiment measures in studies of public opinion (e.g., O'Connor et al. 2010; Barberá 2016). However, the users of different social media platforms might use different terminology (e.g., "Web-born" slang and abbreviations) than that included in popular sentiment lexicons, and may even use different words on different platforms or in different contexts, leading to misclassification on a certain platform or for a subcommunity on that platform (Hamilton et al. 2016).
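A toy lexicon makes the vocabulary-mismatch problem visible. The lexicon and posts below are invented for illustration and are far smaller than any real sentiment lexicon; the point is that posts written in platform slang fall outside the lexicon's vocabulary and are silently scored as neutral.

```python
# Hypothetical illustration of trace augmentation error: a toy sentiment
# lexicon misses platform-specific slang, so slang-heavy posts are scored as
# neutral even when their stance would be clear to a human reader.
LEXICON = {"great": 1, "good": 1, "love": 1, "bad": -1, "terrible": -1, "hate": -1}

def lexicon_score(text):
    """Sum of word polarities; words missing from the lexicon contribute 0."""
    return sum(LEXICON.get(token, 0) for token in text.lower().split())

posts = [
    "the president did a great job tonight",    # scored +1, as intended
    "this speech is fire, absolute W",           # positive slang, scored 0
    "ratioed again, what an L for the potus",    # negative slang, scored 0
]
for post in posts:
    print(lexicon_score(post), "|", post)
```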
Trace reduction error: Finally, certain traces may be ineligible for a variety of reasons, such as spam or hashtag hijacking, or because they are irrelevant to the task at hand. The error incurred due to the removal of potentially eligible traces—and the non-removal of ineligible traces—is trace reduction error.

Example of trace reduction error: Researchers might decide to use a classifier that removes spam but has a high false positive rate, inadvertently removing non-spam tweets. They might likewise discard tweets without textual content, thereby ignoring relevant statements made through hyperlinks or embedded pictures/videos.

USER PREPROCESSING

In this step, the digital representations of humans are also augmented with auxiliary information, or they are removed if they do not represent members of the target population.

User augmentation error: Later in the research process, researchers may want to reweight digital traces by socio-demographic attributes and/or activity levels of individuals, as is done in surveys to mitigate representation errors (see the "Analysis and Inference" section). However, since such attributes are rarely available from the platform, it is common to infer demographic attributes for individual users, generally using ML methods (Sap et al. 2014; Zhang et al. 2016; Wang et al. 2019). Naturally, such attribute inference can be error-prone (McCormick et al. 2017) and especially problematic if the accuracy of the inference differs systematically between demographic strata (Buolamwini and Gebru 2018). Platforms may also offer aggregate information they have inferred about their users; this information is prone to the same kind of errors, unbeknownst to the researcher (Zagheni, Weber, and Gummadi 2017). The overall error incurred due to user augmentation methods is user augmentation error.

Example of user augmentation error: Twitter users' gender, ethnicity, or location may be inferred to understand how presidential approval differs across demographic groups. Yet automated gender inference methods based on images have higher error rates for African Americans than for Whites (Buolamwini and Gebru 2018); therefore, gender inferred through such means may inaccurately estimate approval rates among African American males versus females compared to their White counterparts, which raises serious ethical concerns.
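The consequence of group-dependent inference accuracy can be simulated directly. The sketch below is a minimal, entirely synthetic simulation: the two groups, their true approval levels, and their misclassification rates are all invented. It only demonstrates the mechanism by which unequal attribute-inference accuracy distorts subgroup estimates.

```python
# Hypothetical simulation of user augmentation error: when an inferred group
# label is wrong more often for one stratum than another, subgroup estimates
# computed on the *inferred* labels drift away from the true values.
import random

random.seed(0)
N = 100_000
TRUE_APPROVAL = {"A": 0.60, "B": 0.40}   # invented true approval per group
MISCLASS_RATE = {"A": 0.05, "B": 0.25}   # group B is misclassified far more often

counts = {"A": [0, 0], "B": [0, 0]}      # [approving, total] per inferred group
for _ in range(N):
    group = random.choice(["A", "B"])
    approves = random.random() < TRUE_APPROVAL[group]
    # attribute inference flips the label with a group-specific probability
    if random.random() < MISCLASS_RATE[group]:
        inferred = "B" if group == "A" else "A"
    else:
        inferred = group
    counts[inferred][0] += approves
    counts[inferred][1] += 1

for g in ("A", "B"):
    estimate = counts[g][0] / counts[g][1]
    print(f"group {g}: true approval {TRUE_APPROVAL[g]:.2f}, estimated {estimate:.2f}")
```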
User reduction error: Preprocessing steps usually include removing spammers and non-human users (mostly bots or organizations). Users are also filtered based on criteria such as tenure on the platform, location, or the presence of a profile picture. These steps are comparable to removing "ineligible units" in surveys, as these users do not represent members of the target population. The performance of particular methods for detecting and removing users may well depend on the characteristics of the data (cf. Echeverría et al. 2018; Wu et al. 2018). Therefore, it is important that, to the extent possible, researchers analyze the accuracy of the user reduction method to estimate the user reduction error.

Examples of user reduction error: While estimating presidential approval from tweets, researchers are usually uninterested in posts by bots or organizations. One can detect such accounts (Alzahrani et al. 2018; Wang et al. 2019) but may not detect all of these ineligible users, or may erroneously remove eligible users that show bot- or organization-like behavior.

ANALYSIS AND INFERENCE

After preprocessing the dataset, the researcher can estimate the prevalence of the construct in the selected traces and, by extension, in the target population.

Trace measurement error: In this step, the construct is concretely measured, for example, with sentiment scores aggregated over traces, users, days, and so on. It is therefore distinct from trace augmentation, where traces are annotated with auxiliary information.10 The measurement can be taken using many different techniques, from simple counting and heuristics to complex ML models. While simpler methods may be less powerful, they are often more interpretable and less computationally intensive. On top of errors incurred in previous stages, the model used might only detect certain aspects of the construct or be affected by spurious artifacts of the data being analyzed. For example, an ML model spotting sexist attitudes may work well for detecting obviously hostile attacks based on gender identity but fail to capture "benevolent sexism," though they are both dimensions of the construct of sexism (Jha and Mamidi 2017). Thus, even if the technique is developed for the specific data collected, it can still suffer from low validity. Any error that distorts how the construct is estimated due to modeling choices is denoted as trace measurement error.

10. In practice, these steps can be highly intertwined, but it is still useful to distinguish them conceptually.
Even if validity is high for the given study, a replication of the measurement model on a different dataset—a future selection of data from the same platform or a completely different dataset—can fail, that is, have low reliability (Conrad et al. 2019; Sen, Flöck, and Wagner 2020).

Amaya, Biemer, and Kinyon (2020) as well as West and colleagues (West, Sakshaug, and Aurelien 2016; West, Sakshaug, and Kim 2017) introduce the concept of analytic error, which West, Sakshaug, and Kim (2017, p. 489) characterize as an "important but understudied aspect of the total survey error (TSE) paradigm." Following Amaya, Biemer, and Kinyon (2020, p. 101), analytic error broadly covers all errors made by researchers when analyzing the data and interpreting the results, especially when applying automated analytic methods. A consequence of analytic error can be false conclusions and incorrectly specified models (Baker 2017). Thus, error incurred due to the choice of modeling techniques is trace measurement error; error due to the choice of survey weights is adjustment error, described below.

Examples of trace measurement error: Assume the construct is defined as presidential approval and traces are augmented with sentiment scores. The researcher obtains traces such as tweets, counts the positive and negative words each contains, and may then define a final link function that combines all traces into a single aggregate to measure "approval." They may aggregate the tweets by counting the normalized positive words per day (Barberá 2016), computing the ratio of positive to negative words per tweet, or dividing the difference between positive and negative words by the total words per day (Pasek et al. 2020). The first calculation, counting positive words per day, may underestimate the negative sentiment of a particular day, while pooling positive and negative words over a day's tweets might overestimate the effect of tweets containing multiple positive (or negative) words. See Conrad et al. (2019) for an illustration of how different ways of calculating daily sentiment can produce dramatically different results.

In some cases, aggregated traces cannot be matched to the users who produced them, making it difficult to identify whether multiple traces refer to the same user (e.g., multiple views of ILI-related Wikipedia pages viewed by the same or different persons). As a result, power users' Wikipedia page visits may be counted much more often than the average readers' visits.
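The sensitivity of the final estimate to the aggregation rule can be demonstrated on a single invented day of data. The three rules below are simplified stand-ins for the aggregation choices just described (they are not the exact formulas used in the cited studies), and the word counts are invented; the point is only that the same traces yield different daily readings under different rules.

```python
# Hypothetical illustration of trace measurement error: one day of toy tweets,
# each represented as (positive_words, negative_words, total_words), scored
# under three plausible aggregation rules.
day_tweets = [(4, 0, 10), (0, 1, 8), (0, 1, 12)]

# Rule 1: positive words per tweet, averaged over the day
rule1 = sum(p for p, n, t in day_tweets) / len(day_tweets)

# Rule 2: mean per-tweet share of positive among sentiment-bearing words
rule2 = sum(p / (p + n) for p, n, t in day_tweets) / len(day_tweets)

# Rule 3: (positive - negative) words pooled over the day, divided by all words
positives = sum(p for p, n, t in day_tweets)
negatives = sum(n for p, n, t in day_tweets)
rule3 = (positives - negatives) / sum(t for p, n, t in day_tweets)

print(f"rule 1: {rule1:.2f}  rule 2: {rule2:.2f}  rule 3: {rule3:.2f}")
# One enthusiastic tweet keeps rules 1 and 3 positive, while rule 2 reports
# that two of the three tweets that day were negative.
```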
Adjustment error: In order to reduce platform coverage error (possibly aggravated by user selection or preprocessing), researchers might adapt techniques designed to reduce bias when estimating population parameters from opt-in (nonprobability) web surveys, particularly adjustment using model-based poststratification (e.g., Goel, Obeng, and Rothschild 2017). Because selections of digital traces are inherently non-probabilistic, reweighting via calibration or poststratification has been explored in studies using digital traces (Zagheni and Weber 2015; Barberá 2016; Pasek et al. 2018; Pasek et al. 2020; Wang et al. 2019). However, reweighting techniques for digital traces have not yet been studied to the extent they have for non-probability survey samples (e.g., Kohler 2019; Kohler, Kreuter, and Stuart 2019; Cornesse and Blom 2020; Cornesse et al. 2020). Broadly, there are two approaches to reweighting in digital trace-based studies. The first approach reweights the data according to a known population, for example census proportions (Zagheni and Weber 2015; Yildiz et al. 2017; Wang et al. 2019). The second approach reweights general population surveys according to the demographic distribution of the online platform itself, found through usage surveys (Pasek et al. 2018, 2020). The lack of available user demographics, reweighting on the basis of attributes that are not actually associated with the platform coverage error, or choosing an adjustment method that is not well suited to the data can all cause adjustment error, in line with Groves et al. (2011).11 Thus, researchers using digital traces may consider developing and/or applying adjustment techniques that support generalization beyond the platform from which the traces are collected.

Examples of adjustment error: When comparing presidential approval on Twitter with survey data, Pasek et al. (2020) reweight the survey estimates with Twitter usage demographics but fail to find alignment between the two measures.12 In this case, the researchers assume that the demographics of Twitter users are the same as for the subset of users tweeting about the president, an assumption that might lead to adjustment error. Previous research indeed indicates that users talking about politics on Twitter tend to have different characteristics than random users (Cohen and Ruths 2013) and tend to be younger and more likely to be white men (Bekafigo and McBride 2013).

11. Corrections solely based on socio-demographic attributes are unlikely to be a panacea for reducing coverage or non-response issues (Schnell, Noack, and Torregroza 2017).
12. Cf. Smith and Anderson (2018) for Twitter's demographic composition.
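A minimal sketch of the first reweighting approach (adjusting platform-derived estimates toward known census proportions) is given below. The demographic cells, their shares, and the within-cell approval rates are all invented for illustration; in a real study, the cell memberships would themselves come from error-prone user augmentation, and the adjustment only helps if within-cell approval on the platform resembles within-cell approval in the target population.

```python
# Hypothetical poststratification sketch: reweight cell-level platform
# estimates from the platform's composition to census proportions.
cells = {
    #              (platform share, census share, approval estimated on platform)
    "male, 18-29":   (0.30,          0.10,         0.35),
    "male, 30+":     (0.35,          0.38,         0.45),
    "female, 18-29": (0.15,          0.10,         0.40),
    "female, 30+":   (0.20,          0.42,         0.55),
}

unadjusted = sum(p_share * approval for p_share, _, approval in cells.values())
poststratified = sum(c_share * approval for _, c_share, approval in cells.values())

print(f"unadjusted platform estimate: {unadjusted:.3f}")
print(f"poststratified to census:     {poststratified:.3f}")
# The two estimates differ because the platform over-represents younger users,
# whose (invented) approval rates differ from those of older groups.
```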
Related Work

Evidence that errors can arise in using online digital traces to study social phenomena has been mounting (Boyd and Crawford 2012; Gayo-Avello 2012; Morstatter et al. 2013; Pavalanathan and Eisenstein 2015; Malik and Pfeffer 2016; Olteanu et al. 2019). Several error discovery strategies for digital-trace-based studies have been developed in data and computer science (Tufekci 2014; Olteanu et al. 2019). Concrete efforts to identify and document errors include "Datasheets for Datasets" (Gebru et al. 2018) and "Model Cards" (Mitchell et al. 2019) for discussing the specifications of particular datasets and ML models, respectively. These advances are compatible with the TED-On.

Related to our approach, Schober et al. (2016) reflect on how survey and social media data differ in several ways, including the possibility that a topic may happen to be covered in a social media corpus much as in a representative survey sample, even if the corpus does not have good population coverage. Hsieh and Murphy (2017) explicitly build on the TSE framework in their Total Twitter Error framework, which describes three errors that can be mapped to survey errors specifically for Twitter. Japec et al. (2015) caution that potential error frameworks will have to account for errors specific to Big Data and outline a way to extend the TSE framework to such data.

Closest to our work is that of Amaya, Biemer, and Kinyon (2020), who apply the Total Error Framework (TEF) to surveys and Big Data. The TEF offers a broader framework than the one presented here, while the TED-On specifically addresses novel errors encountered when working with online digital trace data.

Conclusion

The use of digital traces of human behavior, attitudes, and opinion, especially web and social media data, is gaining traction in the social sciences. In a growing number of scenarios, online digital traces are considered valuable complements to surveys. To make research on online traces more transparent and easier to evaluate, it is important to describe potential error sources systematically. We have therefore introduced a framework for distinguishing the various kinds of errors that may be part of making inferences from these observational data, related to specific platforms and their affordances, and due to the researchers' design choices. Drawing on the TSE framework, our proposed TED-On framework is intended to (i) develop a shared vocabulary to facilitate dialogue among scientists from heterogeneous disciplines, and (ii) aid in pinpointing, documenting, and ultimately avoiding errors in research based on online digital traces. It is our hope that the TED-On might also lay the foundation for better designs by calling attention to the distinct sources of error and where they can occur in the process of using traces for social and behavioral research.

Supplementary Material

Supplementary material may be found in the online version of this article: https://doi.org/10.1093/poq/nfab018.
References

Alzahrani, Sultan, Chinmay Gore, Amin Salehi, and Hasan Davulcu. 2018. "Finding Organizational Accounts Based on Structural and Behavioral Factors on Twitter." In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation, edited by Halil Bisgin, Robert Thomson, Ayaz Hyder, and Christopher Dancy, 164–75. Cham: Springer.

Amaya, Ashley, Paul P. Biemer, and David Kinyon. 2020. "Total Error in a Big Data World: Adapting the TSE Framework to Big Data." Journal of Survey Statistics and Methodology 8(1):89–119.

Baker, Reg. 2017. "Big Data: A Survey Research Perspective." In Total Survey Error in Practice, edited by P. P. Biemer et al., 47–69. Hoboken, NJ: John Wiley and Sons, Inc.

Barberá, Pablo. 2016. "Less Is More? How Demographic Sample Weights Can Improve Public Opinion Estimates Based on Twitter Data." Working Paper, NYU.

Bekafigo, Marija Anna, and Allan McBride. 2013. "Who Tweets about Politics? Political Participation of Twitter Users during the 2011 Gubernatorial Elections." Social Science Computer Review 31(5):625–43.

Biemer, Paul P. 2010. "Total Survey Error: Design, Implementation, and Evaluation." Public Opinion Quarterly 74(5):817–48.

Boyd, Danah, and Kate Crawford. 2012. "Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon." Information, Communication and Society 15(5):662–79.

Bruns, Axel, and Katrin Weller. 2016. "Twitter as a First Draft of the Present: And the Challenges of Preserving It for the Future." In Proceedings of the 8th ACM Conference on Web Science, 183–89, Hannover, Germany.

Buolamwini, Joy, and Timnit Gebru. 2018. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91, New York, NY.

Chandrasekharan, Eshwar, Umashanthi Pavalanathan, Anirudh Srinivasan, Adam Glynn, Jacob Eisenstein, and Eric Gilbert. 2017. "You Can't Stay Here: The Efficacy of Reddit's 2015 Ban Examined through Hate Speech." Proceedings of the ACM on Human-Computer Interaction 1(CSCW):1–22.

Cohen, Raviv, and Derek Ruths. 2013. "Classifying Political Orientation on Twitter: It's Not Easy!" Proceedings of the International AAAI Conference on Web and Social Media 7(1). Retrieved from https://ojs.aaai.org/index.php/ICWSM/article/view/14434.

Conrad, Frederick G., Johann A. Gagnon-Bartsch, Robyn A. Ferg, Michael F. Schober, Josh Pasek, and Elizabeth Hou. 2019. "Social Media as an Alternative to Surveys of Opinions about the Economy." Social Science Computer Review 0894439319875692.

Cornesse, Carina, and Annelies G. Blom. 2020. "Response Quality in Nonprobability and Probability-Based Online Panels." Sociological Methods and Research 0049124120914940. doi: 10.1177/0049124120914940.

Cornesse, Carina, Annelies G. Blom, David Dutwin, Jon A. Krosnick, Edith D. De Leeuw, Stéphane Legleye, Josh Pasek, Darren Pennay, Benjamin Phillips, Joseph W. Sakshaug, Bella Struminskaya, and Alexander Wenz. 2020. "A Review of Conceptual Approaches and Empirical Evidence on Probability and Nonprobability Sample Survey Research." Journal of Survey Statistics and Methodology 8(1):4–36.

Diaz, Fernando, Michael Gamon, Jake M. Hofman, Emre Kıcıman, and David Rothschild. 2016. "Online and Social Media Data as an Imperfect Continuous Panel Survey." PloS One 11(1):e0145406.
Duggan, Maeve, and Aaron Smith. 2013. "6% of Online Adults Are Reddit Users." Pew Internet and American Life Project 3:1–10.
Echeverría, Juan, Emiliano De Cristofaro, Nicolas Kourtellis, Ilias Leontiadis, Gianluca Stringhini, and Shi Zhou. 2018. "Lobo: Evaluation of Generalization Deficiencies in Twitter Bot Classifiers." In Proceedings of the 34th Annual Computer Security Applications Conference, 137–46, San Juan, PR, USA.

Eckman, Stephanie, and Frauke Kreuter. 2017. "The Undercoverage-Nonresponse Trade-Off." In Total Survey Error in Practice, edited by Paul P. Biemer, Edith D. de Leeuw, Stephanie Eckman, Brad Edwards, Frauke Kreuter, Lars E. Lyberg, N. Clyde Tucker, and Brady T. West, 95–113. Hoboken, NJ: Wiley.

Fiesler, Casey, and Nicholas Proferes. 2018. "'Participant' Perceptions of Twitter Research Ethics." Social Media + Society 4(1):2056305118763366.

Franzke, Aline Shakti, Anja Bechmann, Michael Zimmer, and C. Ess. 2020. "Internet Research: Ethical Guidelines 3.0." Association of Internet Researchers.

Gayo-Avello, Daniel. 2012. "'I Wanted to Predict Elections with Twitter and All I Got Was This Lousy Paper'—A Balanced Survey on Election Prediction Using Twitter Data." arXiv preprint arXiv:1204.6441.

Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. "Datasheets for Datasets." arXiv preprint arXiv:1803.09010.

Gligoric, Kristina, Ashton Anderson, and Robert West. 2018. "How Constraints Affect Content: The Case of Twitter's Switch from 140 to 280 Characters." Proceedings of the International AAAI Conference on Web and Social Media 12(1). Retrieved from https://ojs.aaai.org/index.php/ICWSM/article/view/15079.

Goel, Sharad, Adam Obeng, and David Rothschild. 2017. Online, Opt-In Surveys: Fast and Cheap, but Are They Accurate? Working Paper. Stanford, CA: Stanford University.

Groves, Robert M. 2011. "Three Eras of Survey Research." Public Opinion Quarterly 75(5):861–71.

Groves, Robert M., and Lars Lyberg. 2010. "Total Survey Error: Past, Present, and Future." Public Opinion Quarterly 74(5):849–79.

Groves, Robert M., Floyd J. Fowler Jr., Mick P. Couper, James M. Lepkowski, Eleanor Singer, and Roger Tourangeau. 2011. Survey Methodology, vol. 561. Hoboken, NJ: John Wiley and Sons.

Hamilton, William L., Kevin Clark, Jure Leskovec, and Dan Jurafsky. 2016. "Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora." In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 595, Austin, TX.

Howison, James, Andrea Wiggins, and Kevin Crowston. 2011. "Validity Issues in the Use of Social Network Analysis with Digital Trace Data." Journal of the Association for Information Systems 12(12):2.

Hsieh, Yuli Patrick, and Joe Murphy. 2017. "Total Twitter Error." In Total Survey Error in Practice, 23–46. Hoboken, NJ: John Wiley.

Jacobs, Abigail Z., and Hanna Wallach. 2021. "Measurement and Fairness." In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 375–85.

Japec, Lilli, Frauke Kreuter, Marcus Berg, Paul Biemer, Paul Decker, Cliff Lampe, Julia Lane, Cathy O'Neil, and Abe Usher. 2015. "Big Data in Survey Research: AAPOR Task Force Report." Public Opinion Quarterly 79(4):839–80.

Jha, Akshita, and Radhika Mamidi. 2017. "When Does a Compliment Become Sexist?
Analysis and Classification of Ambivalent Sexism Using Twitter Data." In Proceedings of the Second Workshop on NLP and Computational Social Science, 7–16, Vancouver, Canada.

Johnson, Steven L., Hani Safadi, and Samer Faraj. 2015. "The Emergence of Online Community Leadership." Information Systems Research 26(1):165–87.