Review
Methods to Establish Race or Ethnicity of Twitter Users: Scoping
Review
Su Golder1, PhD; Robin Stevens2, PhD; Karen O'Connor3, MSc; Richard James4, MSLIS; Graciela Gonzalez-Hernandez3, PhD
1 Department of Health Sciences, University of York, York, United Kingdom
2 School of Communication and Journalism, University of Southern California, Los Angeles, CA, United States
3 Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
4 School of Nursing Liaison and Clinical Outreach Coordinator, University of Pennsylvania, Philadelphia, PA, United States
Corresponding Author:
Su Golder, PhD
Department of Health Sciences
University of York
Heslington
York, YO10 5DD
United Kingdom
Phone: 44 01904321904
Email: su.golder@york.ac.uk
Abstract
Background: A growing amount of health research uses social media data. Those critical of social media research often cite
that it may be unrepresentative of the population; however, the suitability of social media data in digital epidemiology is more
nuanced. Identifying the demographics of social media users can help establish representativeness.
Objective: This study aims to identify the different approaches or combination of approaches to extract race or ethnicity from
social media and report on the challenges of using these methods.
Methods: We present a scoping review to identify methods used to extract the race or ethnicity of Twitter users from Twitter
data sets. We searched 17 electronic databases from the date of inception to May 15, 2021, and carried out reference checking
and hand searching to identify relevant studies. Sifting of each record was performed independently by at least two researchers,
with any disagreement discussed. Studies were required to extract the race or ethnicity of Twitter users using either manual or
computational methods or a combination of both.
Results: Of the 1249 records sifted, we identified 67 (5.36%) that met our inclusion criteria. Most studies (51/67, 76%) have
focused on US-based users and English language tweets (52/67, 78%). A range of data was used, including Twitter profile
metadata, such as names, pictures, information from bios (including self-declarations), or location or content of the tweets. A
range of methodologies was used, including manual inference, linkage to census data, commercial software, language or dialect
recognition, or machine learning or natural language processing. However, not all studies evaluated these methods; those that did found accuracy to vary from 45% to 93%, with significantly lower accuracy in identifying categories of people of color. The inference of race or ethnicity raises important ethical questions, which can be exacerbated by the data and methods used. The comparative accuracies of the different methods are also largely unknown.
Conclusions: There is no standard accepted approach or current guidelines for extracting or inferring the race or ethnicity of
Twitter users. Social media researchers must carefully interpret race or ethnicity and not overpromise what can be achieved, as
even manual screening is a subjective, imperfect method. Future research should establish the accuracy of methods to inform
evidence-based best practice guidelines for social media researchers and be guided by concerns of equity and social justice.
(J Med Internet Res 2022;24(4):e35788) doi: 10.2196/35788
KEYWORDS
twitter; social media; race; ethnicity
Introduction

Research Using Twitter Data

Twitter data are increasingly being used as a surveillance and data collection tool in health research. When millions of users post on Twitter, it translates into a vast amount of publicly accessible, timely data about a variety of attitudes, behaviors, and preferences in a given population. Although these data were not originally intended as a repository of individual information, Twitter data have been retrofitted in infodemiology to investigate population-level health trends [1-15]. Researchers often use Twitter data in concert with other sources to test the relationship between web-based discourse and offline health behavior, public opinion, and disease incidence.

The appeal of Twitter data is clear. Twitter is one of the largest public-facing social media platforms, with an ethnically diverse user base [16,17] of more than 68 million US Twitter users, with Black users accounting for 26% of that base [18]. This diverse user base gives researchers access to people they may have difficulty reaching using more traditional approaches [19]. However, promising insights that can be derived from Twitter data are often limited by what is missing, specifically the basic sociodemographic information of each Twitter user. The demographic attributes of users are often required in health research for subpopulation analyses, to explore differences, and to identify inequity. Without evidence of the distal and proximal factors that lead to racial and ethnic health disparities, it is impossible to address and correct these drivers. Insights from social media data can be used to inform service provision as well as to develop targeted health messaging by understanding public perspectives from diverse populations.

Extracting Demographics From Twitter

However, to use social media and digital health research to address disparities, we need to know not only what is said on Twitter but also who is saying what [20]. Although others have discussed extracting or estimating features, such as location, age, gender, language, occupation, and class, no comprehensive review of the methods used to extract race or ethnicity has been conducted [20]. Extracting the race and ethnicity of Twitter users is particularly important for identifying trends, experiences, and attitudes of racially and ethnically diverse populations [21]. As race is a social construction and not a genetic categorization [22,23], the practice of defining race and ethnicity in health research has been an ongoing, evolving challenge. Traditional research has the advantage of identifying the person in the study and allowing them to systematically identify their racial and ethnic identities. In digital health research [22,23], determining a user's race or ethnicity by extracting data from a user's Twitter profile, metadata, or tweets is a process that is inevitably challenging, complex, and not without ethical questions.

Furthermore, although Twitter is used for international research, an international comparative study of methods to determine race or ethnicity is difficult, practically impossible, given that societies use different standardized categories that describe their own populations [24]. A common approach in the United States is based on the US Census Bureau practice of allowing participants to identify with as many as 5-6 large racial groupings (Black, White, Asian, Pacific Islander, Native, and other), while separately choosing one ethnicity (Hispanic) [25]. However, race and ethnicity variables continue to be misused in study design or when drawing conclusions. For example, race or ethnicity is often incorrectly treated as a predictor of poor health rather than as a proxy for the impact that being a particular race or ethnicity has on that person's experience with the health system [26]. Simply put, health disparities are driven by racism, not race [27-29]. Although race or ethnicity affiliation is an important factor in understanding diverse populations, digital research must tread lightly and thoughtfully in both the collection and assignment of race or ethnicity.

Objectives

The lack of basic sociodemographic data on Twitter users has led researchers to apply a variety of approaches to better understand the characteristics of the people behind each tweet. The breadth of the landscape of approaches to extracting race or ethnicity is currently unknown. Our overall aim was to summarize and assess the range of computational and manual methods used in research based on Twitter data to determine the race or ethnicity of Twitter users.

Methods

Overview

We conducted a comprehensive scoping review of extraction methods and offer recommendations and cautions related to these approaches [30]. We selected Twitter, as it is currently the most commonly used social media platform in health care research, and it has some unique intrinsic characteristics that drive the methods used for mining it. Thus, we felt that the methods, type of data, and social media platform used are related in such a way that comparing methods for different social media would add too many variables and would not be truly comparing like with like. A detailed protocol was designed for the methods used in our scoping review, but we were unable to register it, as scoping reviews cannot be registered on PROSPERO. We report our methods according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) scoping review statement [30].

Inclusion Criteria

Overview

We devised strict inclusion criteria for our review based on the Population, Intervention, Comparators, Outcomes, and Study design (PICOS) format. Although this was not a review of effectiveness, we felt that the PICOS question breakdown [31] was still the most appropriate one available for our question format. The inclusion criteria are described in the following sections.

Population

We included only data sets of Twitter users. Studies were eligible for inclusion if they collected information to extract or infer race or ethnicity directly from the users' tweets, their profile details (such as the users' photo or avatar, their name, location, and biography [bio]), or their followers.
We excluded studies that extracted race or ethnicity from social media platforms other than Twitter, from unspecified social media platforms, or from multiple social media platforms that included Twitter but where the data relating to Twitter were not presented separately.

Intervention

Studies were included where the methods used to extract or infer the race or ethnicity of Twitter users were stated. Articles that used machine learning (ML), natural language processing (NLP), human-in-the-loop, or other computationally assisted methods to predict the race or ethnicity of users were included, as were manual or noncomputational methods, including photo recognition or linking to census data. We excluded studies for which we were unable to determine the methods used or which extracted data solely on other demographic characteristics, such as age, gender, or geographic location.

Comparator

A comparison of the methods used was not required. A method could be compared with another (such as a gold standard), or no comparison could be undertaken.

Outcome

The extraction or inference of the race or ethnicity of Twitter users was the primary or secondary outcome of the study. As this was a scoping review in which we aimed to demonstrate the full landscape of the literature, no particular measurement of the performance of the method used was required in our included studies.

Study Design

Any type of research study design was considered relevant. Discussion papers, commentaries, and letters were excluded.

Limits

No restrictions on date, language, or publication type were applied to the inclusion criteria. However, no potentially relevant studies were identified in any non-English language, and the period by default was since 2006, the year of the inception of Twitter.

Search Strategy

A database search strategy was derived by combining three facets: facet 1 consisted of free-text terms related to Twitter (Twitter OR Tweet* OR Tweeting OR Retweet* OR Tweep*); facet 2 consisted of terms for race or ethnicity; and facet 3 consisted of terms for methods of prediction, such as ML, NLP, and artificial intelligence–related terms (Table S1 in Multimedia Appendix 1 [3,10,12,18,20,21,32-96]). All ethnology-related subject terms were adapted for different database taxonomies and syntax, with standard methods for predicting subject terms in MEDLINE and other database indexing. The methods-of-prediction facet was expanded using a list of specific text analysis tools and software names extracted from the study by Hinds and Joinson [97], which included a comprehensive list of automated ML processes used to predict demographic markers in social media. Additional terms were added from a related study [98].
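As a rough illustration of how such a faceted strategy translates into a single Boolean search string, the sketch below ORs the terms within each facet and ANDs the facets together. The facet 1 terms are those listed above; the facet 2 and facet 3 lists here are illustrative placeholders, and the full lists are in Table S1 of Multimedia Appendix 1.

```python
# A minimal sketch (not the authors' exact strategy) of combining the three
# search facets into one Boolean query. Facet 1 terms are from the text;
# the other two lists are illustrative placeholders.

def or_block(terms):
    """Join one facet's terms with OR, quoting phrases, in parentheses."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"

facet_twitter = ["Twitter", "Tweet*", "Tweeting", "Retweet*", "Tweep*"]
facet_race = ["race", "ethnicity", "ethnic"]                     # placeholder
facet_methods = ["machine learning", "natural language processing",
                 "artificial intelligence"]                      # placeholder

# A record must match at least one term from every facet.
query = " AND ".join(or_block(f) for f in (facet_twitter, facet_race, facet_methods))
print(query)
```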
Sources Searched

A wide range of bibliographic and gray literature databases covering computer science, health, and the social sciences were selected. The databases (Table 1) were last searched on May 15, 2021, with no date or other filter applied.

Reference checking of all included studies and of any related systematic reviews identified by the searches was conducted. We browsed the Journal of Medical Internet Research, as this is a key journal in this field, and hand searched the proceedings of 2 relevant conferences, the International Conference on Weblogs and Social Media and the Association for Computational Linguistics.

Citations were exported to a shared EndNote library, and duplicates were removed. The deduplicated records were then imported into Rayyan to facilitate independent blinded screening by the authors. Using the inclusion criteria, at least two screeners (SG, RS, KO, or RJ) from the research team independently screened each record, with disputes on inclusion discussed and a consensus decision reached.

Only the first 50 records from ACL and the first 100 records from a Google Scholar search were screened during two searches (March 11, 2020, and May 24, 2021), as these records are displayed in order of relevance, and it was felt that after this number no relevant studies were being identified [12,21,32-95,99].
Table 1. Databases searched with number of records retrieved.
Database Total results, n
ACL Anthology Screened first 50 records from 2 searches
ACM Digital Library 150
CINAHL 200
Conference Proceedings Citation Index—Science 84
Conference Proceedings Citation Index—Social Science 7
Emerging Sources Citation Index 41
Google Scholar Screened first 100 records from 2 searches
IEEE Xplore 186
Library and Information Science Abstracts 120
LISTA 79
OpenGrey 0
ProQuest dissertations and theses—United Kingdom and Ireland 195
PsycINFO 72
PubMed 84
Science Citation Index 56
Social Science Citation Index 111
Zetoc 50
Data Extraction

For each included study, we extracted the following data onto an Excel spreadsheet: year of publication, study country and language, race or ethnicity categories extracted (such as, for race, Black, White, or Asian and, for ethnicity, Hispanic or European), and paper type (journal, conference, or thesis). We also extracted details on extraction methods (such as classification models or software used), features and predictors used in extraction (tweets, profiles, and pictures), number of Twitter users, number of tweets or images used, performance measures used to evaluate the methods (validation), and results of any evaluation (such as accuracy). All performance measure metrics were reported as stated in the included studies. All the extracted data were checked by 2 reviewers.

Quality Assessment

There was no formally approved quality assessment tool for this type of study. As this was a scoping review, we did not carry out any formal assessment. However, we assessed any validation performed and whether the methods were reproducible.

Data Analysis

We have summarized the stated performance of the papers that included validation. However, we could not compare approaches using the stated performance, as the performance measures and validation approaches varied considerably. In addition, there is no recognized gold standard data set for comparison.

Results

Overview

A total of 1735 records were entered into an EndNote library (Clarivate), and duplicates were removed, leaving 1249 (72%) records for sifting (Figure 1). A total of 1080 records were excluded based on the title and abstract screening alone. A total of 169 references were deemed potentially relevant by one of the independent sifters (RS, GG, RJ, SG, and KO). The full text of these articles was screened independently; 67 studies [12,21,32-95,99] met our inclusion criteria, and 102 references were excluded [77,97,100-198]. The main reason for exclusion was that although the abstract indicated that demographic data were collected, these did not include race or ethnicity (most commonly, other demographic attributes such as gender, age, or location were collected). Other reasons for exclusion were that the researchers collected demographic data through surveys or questionnaires administered via Twitter (but not from data posted on Twitter) or that the researchers used a social media platform other than Twitter.
Figure 1. Flow diagram for included studies.
Characteristics of the Included Studies

Most of the studies (51/67, 76%) stated or implied that they were based solely or predominantly in the United States and were limited to English language bios or tweets. A total of 6 studies were multinational [38,41,56,66,83,86]; 1 was UK based (also in English) [59], another was based in Qatar [55], and 12% (8/67) of studies extracted data from tweets in multiple languages [32,38,52,55,56,66,83,86] (Table S2 in Multimedia Appendix 1).

The most common race examined was White (58/67, 87%), followed by Black or African American (56/67, 84%) and Asian (45/67, 67%); the most common ethnicity examined was Hispanic/Latino (43/67, 64%).

Some studies (12/67, 18%) treated race as a binary classification, such as African American or not, or African American or White, whereas others created a multiclass classifier of 3 (15/67, 22%) or 4 classes (33/67, 49%) or a combination of classes. A total of 6 studies identified >4 classes; however, these often included ethnicity or nationality classifiers as well as race [38,48,54,66,83,95]. Wang and Chi [77] was a conference paper that did not report the race types extracted.

The data objects from Twitter used to extract race or ethnicity varied, with profile pictures and Twitter users' names being the most common. Others also used tweets in the users' timeline, information from Twitter bios, or Twitter users' locations. Most studies (39/67, 58%) used more than one data object from Twitter data. In addition, the data sets within the studies varied in size between 392 and 168,000,000, with those using manual methods having smaller data sets, ranging from just 392 [50] to 4900 [65].

Unfortunately, although performance was measured in 67% (45/67) of studies, it was inconsistently measured (Table 2). The metrics used to report results were particularly varied for studies using ML or NLP and included the F1 score (which combines precision and recall), accuracy, area under the curve, or mean average precision. Table 2 lists the methods, features, and reported performance of the top model from each study.
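For readers unfamiliar with the metric, the F1 score reported in Table 2 is the harmonic mean of precision and recall, as in the minimal sketch below.

```python
# F1 is the harmonic mean of precision and recall, so it is high only when
# both are high; this is one of the metrics reported in Table 2.

def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.80, 0.65), 2))  # -> 0.72
```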
Table 2. Top system performance within studies using machine learning or natural language processing (result metrics are reflected here as reported in the original publications).

Study | Classifier | ML model | Features | Accuracy | F1 score | Area under curve
Pennacchiotti and Popescu, 2011 [68] | Binary | GBDT | Images, text, topics, and sentiment | N/A | 0.66 | N/A
Pennacchiotti and Popescu, 2011 [67] | Binary | GBDT | Images, text, topics, sentiment, and network | N/A | 0.70 | N/A
Bergsma et al, 2013 [38] | Binary | SVM | Names and name clusters | 0.85 | N/A | N/A
Ardehaly and Culotta, 2017 [35] | Binary | DLLP | Text and images | N/A | 0.95 (image); 0.92 (text) | N/A
Volkova and Bachrach, 2018 [76] | Binary | LR | Text, sentiment, and emotion | N/A | N/A | 0.97
Wood-Doughty et al, 2018 [79] | Binary | CNN | Name | 0.73 | 0.72 | N/A
Saravanan, 2017 [72] | Ternary | CNN | Text | NR | NR | NR
Ardehaly and Culotta, 2017 [33] | Ternary | DLLP | Text and images | N/A | 0.84 (image); 0.83 (text) | N/A
Gunarathne et al, 2019 [94] | Ternary | CNN | Text | N/A | 0.88 | N/A
Wood-Doughty et al, 2018 [79] | Ternary | CNN | Name | 0.62 | 0.43 | N/A
Culotta et al, 2016 [47] | Quaternary | Regression | Network and text | N/A | 0.86 | N/A
Chen et al, 2015 [46] | Quaternary | SVM | n-grams, topics, self-declarations, and image | 0.79 | 0.79 | 0.72
Markson, 2017 [61] | Quaternary | CNN | Synonym expansion and topics | 0.76 | N/A | N/A
Wang et al, 2016 [189] | Quaternary | CNN | Images | 0.84 | N/A | N/A
Xu et al, 2016 [82] | Quaternary | SVM | Synonym expansion and topics | 0.76 | N/A | N/A
Ardehaly and Culotta, 2015 [34] | Quaternary | Multinomial logistic regression | Census, name, network, and tweet language | 0.83 | N/A | N/A
Ardehaly, 2014 [64] | Quaternary | LR | Census and image tweets | 0.82 | 0.81 | N/A
Barbera, 2016 [37] | Quaternary | LR with EN | Tweets, emojis, and network | 0.81 | N/A | N/A
Wood-Doughty, 2020 [81] | Quaternary | CNN | Name, profile metadata, and text | 0.83 | 0.46 | N/A
Preotiuc-Pietro and Ungar, 2018 [96] | Quaternary | LR with EN | Text, topics, sentiment, part-of-speech tagging, name, perceived race labels, and ensemble | N/A | N/A | 0.88 (African American), 0.78 (Latino), 0.83 (Asian), and 0.83 (White)
Mueller et al, 2021 [91] | Quaternary | CNN | Text and accounts followed | N/A | 0.25 (Asian), 0.63 (African American or Black), 0.28 (Hispanic), and 0.90 (White) | N/A
Bergsma et al, 2013 [38] | Multinomial (>4) | SVM | Name and name clusters | 0.81 | N/A | N/A
Nguyen et al, 2018 [66] | Multinomial (>4) | Neural network | Images | 0.53 | N/A | N/A

ML: machine learning; GBDT: gradient-boosted decision tree; N/A: not applicable; SVM: support vector machine; DLLP: deep learning from label proportions; LR: logistic regression; CNN: convolutional neural network; NR: not reported; EN: elastic net.
Manual Screening

A total of 12 studies used manual techniques to classify Twitter users into race or ethnicity categories [21,36,40,49-51,57,65,87-90]. These studies generally combined qualitative interpretations of recent tweets, information in user bios affirming a racial or ethnic identity, or photographs or images in the user timeline or profile.

In most cases, tweets were first identified by text matching based on terms of interest in the research topic, such as having a baby with a birth defect [50], commenting on a controversial topic [57,89], or using potentially gang- or drug-related language [40]. Researchers then identified the tweet authors and, in most cases, assigned race or ethnicity through hand coding based on profile and timeline content. Some studies coded primarily based on self-identifying statements of race used in a tweet or in users' bios, such as people stating that they are a Black American [49,50,88,90], or hashtags [36] (such as #BlackScientist). Others coded exclusively based on the research team's attribution of racial identity through the examination of profile photographs [21,57] or avatars [87]. Some authors coded primarily with self-declarations, with secondary indicators such as profile pictures, language, usernames, or other content [40,51,65,88,89]. In most cases, it appears reasonable to infer that coding was performed by the study authors or members of their research teams, with the exception of those using the crowdsourcing marketplace Amazon Mechanical Turk [21,90].

The agreement among coders was sometimes measured, but validity and accuracy measurements were not generally included. One study [65], however, documented 78% reliability for coding race compared with census demographics, with Black and White users being coded accurately 90% of the time and Hispanic or Asian users being accurately coded between 45% and 60% of the time. The high accuracy for Black users was attributed to the higher likelihood of Black users to self-identify.

Census-Driven Prediction

Another approach to predicting race or ethnicity is to use demographic information from the national census and census-like data and transfer it to the social media cohort. The US-based studies largely used census-based race and ethnicity categories: Asian and Pacific Islander, Black or African American, Latino or Hispanic, Native American, and White. A UK-based study included the categories British and Irish, West European, East European, Greek or Turkish, Southeast Asian, other Asian, African and Caribbean, Jewish, Chinese, and other minorities [83].

We identified 14 studies [39,48,52,54,60,63,70,71,74,77,83-85,95] that used census geographic data, census surname classification, or a combination of both. A total of 6 studies incorporated geographic census data [39,52,63,74,83,84]. For example, Blodgett et al [39] created a simple probabilistic model to infer a user's ethnicity by matching geotagged tweets with census block information. They averaged the demographic values of all tweets by the user and assumed this to be a rough proxy for the user's demographics. Stewart [74] collected tweets tagged with geolocation information (longitude and latitude). The ZIP code of the user was derived from this geolocation information and matched with the demographic information found in the ZIP Code Tabulation Area defined by the Census Bureau. This information was used to find a correlation between ethnicity and African American vernacular English syntax [74].
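A minimal sketch of this tweet-averaging idea appears below; get_block_demographics is a hypothetical stand-in for a lookup against Census Bureau block or ZIP Code Tabulation Area tables, and the returned proportions are illustrative.

```python
# A minimal sketch of the census-block averaging approach described above
# (eg, Blodgett et al [39]). get_block_demographics is a hypothetical
# stand-in for a real lookup from coordinates to area-level proportions.

from collections import defaultdict

def get_block_demographics(lat, lon):
    # Would query a census block or ZCTA table; values here are illustrative.
    return {"white": 0.60, "black": 0.20, "hispanic": 0.15, "asian": 0.05}

def infer_user_demographics(geotagged_tweets):
    """Average area-level proportions over a user's geotagged tweets and
    treat the mean as a rough proxy for the user's demographics."""
    totals, n = defaultdict(float), 0
    for lat, lon in geotagged_tweets:
        for group, share in get_block_demographics(lat, lon).items():
            totals[group] += share
        n += 1
    return {group: share / n for group, share in totals.items()} if n else {}

print(infer_user_demographics([(39.95, -75.16), (40.44, -79.99)]))
```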
Other studies used census-derived name classification to determine race or ethnicity based on user names. We identified 12 studies that predicted user race or ethnicity using surnames [48,54,60,63,70,71,77,83-85,95,189]. Surnames were used to assign race or ethnicity using either a US census-based name classification system or, less commonly, an in-house classification system generated by the authors. Of these 12 studies, 7 (58%) relied solely on the user's last name [48,54,60,63,70,71,85]. Validation of the name-based system alone was generally not reported, but 4 (33%) of the 12 studies reported an accuracy between 71.8% and 81.25% [63,70,71,83]. Of note, one study reported vastly different accuracies in predicting whiteness versus blackness (94% for White users vs 33% for African American or Black users) [83]. The remaining 2 studies augmented name-based predictions with aggregate demographic data from the American Community Survey or equivalent surveys. For example, statistical and text mining methods have been used to extract surnames from Twitter profiles, combining this information with census block information based on geolocated tweets to assess the probability of the user's race or ethnicity [60]. However, these studies did not report validation or accuracy.
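The core of such surname classification can be sketched as a simple table lookup, as below; the two rows mimic the structure of the US Census Bureau surname tables but carry illustrative values.

```python
# A minimal sketch of census surname classification as described above.
# SURNAME_TABLE mimics the shape of the US Census Bureau surname tables
# (share of bearers in each group); these two rows are illustrative only.

SURNAME_TABLE = {
    "garcia": {"white": 0.05, "black": 0.00, "asian": 0.01, "hispanic": 0.92},
    "smith":  {"white": 0.71, "black": 0.23, "asian": 0.01, "hispanic": 0.02},
}

def classify_by_surname(last_name, threshold=0.5):
    """Return the majority group for a surname, or None if the surname is
    unknown or no group's share reaches the threshold."""
    shares = SURNAME_TABLE.get(last_name.lower())
    if not shares:
        return None
    group, share = max(shares.items(), key=lambda item: item[1])
    return group if share >= threshold else None

print(classify_by_surname("Garcia"))  # -> "hispanic"
```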
Ad Hoc ML or NLP

A total of 24 papers [33-35,37,38,46,47,61,64,66-68,72,76,78-82,91-94,99] used ML or NLP to automatically classify users based on their race or ethnicity. ML and NLP methods were used to process the data made available by Twitter users, such as profile images, tweets, and location of residence. These studies almost invariably consisted of larger cohorts, with considerable variation in the specific methods used.
Supervised ML models (in which some annotated data were used to train the system) were used in 12 (50%) of the 24 studies. The models used include support vector machines [38,46,61], gradient-boosted decision trees [67,68], and regression models [33,34,37,76,96].

Semisupervised models (where a large set of unannotated data is also used for training the system, in addition to annotated data) or fully unsupervised models using neural networks or regression were used for classification in 10 (42%) of the 24 studies [33,35,66,72,78,79,81,92-94].

A total of 2 studies used an ensemble of previously published race or ethnicity classifiers by processing the data through 4 extant models and using a majority rule approach to classify users based on the output of each classifier [80,91].
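Majority-rule ensembling of this kind reduces to counting votes across models, as in the sketch below; the four lambda classifiers are hypothetical stand-ins for the previously published models.

```python
# A minimal sketch of the majority-rule ensemble described above: each
# extant classifier votes, and the most common label wins. The lambda
# models are hypothetical stand-ins, not the published classifiers.

from collections import Counter

def ensemble_predict(user, models):
    """Return the majority label across models (ties broken arbitrarily)."""
    votes = Counter(model(user) for model in models)
    return votes.most_common(1)[0][0]

models = [lambda u: "black", lambda u: "white", lambda u: "black", lambda u: "black"]
print(ensemble_predict({"id": 1}, models))  # -> "black"
```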
ML models use features or data inputs to predict desired outputs. Features derived from textual information in the user's profile description, such as name or location, have been used in some studies [34,35,38,60,67,68,79,81,92,93]. Other studies included features related to images, including but not exclusively profile images [46,67,68,189], and facial features in those images [66]. Some studies used linguistic features to classify a user's race or ethnicity [37,38,46,47,61,67,68,72,76,78,81,92-94,96]. Specific linguistic features used in the models include n-grams [38,46,72,91-94], topic modeling [46,61,78], sentiment and emotion [76], and self-reports [67,68,81]. Information about a user's followers or network of friends was included as a feature in some studies under the assumption that members of these networks have similar traits [34,37,46,47,91].

Labeled data sets are used to train and test supervised and semisupervised ML models and to validate the output of unsupervised learning methods. Some of the studies used previously created data sets that contained demographic information, such as the MORPH longitudinal face database of images [189], a database of mugshots [38], or manually annotated data from previous studies [79,81]. Others created ground truth data sets from surveys [96] or by semiautomatic means, such as matching Twitter users to voter registrations [37], extracting self-identification from user profiles or tweets [67,68,81], or using celebrities with known ethnicities [66]. Manual annotation of Twitter users was also used, based on profile metadata [34,35,46,76], self-declarations in the timeline [61,82], or user images [35,94]. Table 2 summarizes the best performing ML approach, the features used, and the reported results for each study that used automatic classification methods. In the table, the classifier is the number of race or ethnicity classification groups, the ML model is the top performing algorithm reported, and the features are the variables used in the predictions.
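To make the typical setup concrete, the sketch below pairs n-gram features with a linear classifier in scikit-learn; the texts and group labels are toy placeholders rather than data from any included study.

```python
# A minimal sketch (no included study's actual pipeline) of the common
# supervised setup: n-gram text features from a user's tweets feeding a
# linear classifier. Texts and labels here are toy placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["toy tweet history for user one", "toy tweet history for user two",
         "more toy text for user three", "more toy text for user four"]
labels = ["group_a", "group_b", "group_a", "group_b"]  # eg, from self-reports

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigram and bigram features
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["toy text for a new user"]))
```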
Data from Twitter are inherently imbalanced in terms of race and ethnicity. In ML, it is important to attempt to mitigate the effects of this imbalance, as models have difficulty learning from a few examples and will tend to classify to the majority class and ignore the minority class. Few studies (12/67, 18%) directly addressed this imbalance. Some opted to make the task binary, focusing only on their group of interest versus all others [67,68,94] or only on the majority classes [38,76]. Others chose modified performance metrics that account for imbalance when reporting their results [33,61,82]. One group, which classified based on images, supplemented their training set from an additional data source for the minority classes [33,35]. Only 2 studies experimented with comparator models trained on balanced data sets: Wood-Doughty et al [81] undersampled the majority class in their training sets, and Preotiuc-Pietro and Ungar [96] oversampled the minority classes. In both cases, the overall performance of the models decreased (in accuracy from 0.83 to 0.41 [81] and from 0.84 to 0.68 [96], relative to their best performing unbalanced models), as the performance boost from superior performance on the majority class was eradicated.
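Both rebalancing strategies can be sketched in a few lines, as below; the toy label counts are illustrative.

```python
# A minimal sketch of the two rebalancing strategies mentioned above:
# undersample the majority class or oversample the minority classes.

import random
from collections import Counter

def rebalance(examples, mode="undersample", seed=0):
    """examples: list of (features, label) pairs; returns a balanced list."""
    rng = random.Random(seed)
    by_label = {}
    for example in examples:
        by_label.setdefault(example[1], []).append(example)
    sizes = [len(items) for items in by_label.values()]
    target = min(sizes) if mode == "undersample" else max(sizes)
    balanced = []
    for items in by_label.values():
        if mode == "undersample":
            balanced += rng.sample(items, target)  # drop surplus examples
        else:
            balanced += [rng.choice(items) for _ in range(target)]  # resample
    rng.shuffle(balanced)
    return balanced

data = [("x", "white")] * 80 + [("x", "black")] * 15 + [("x", "asian")] * 5
print(Counter(label for _, label in rebalance(data)))  # 5 of each class
```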
Off-the-shelf Software

A total of 17 studies [12,32,41-45,53,55,56,58,59,62,69,73,75,86] used off-the-shelf software packages to derive race or ethnicity. Of these, 10 studies [32,44,45,53,55,56,58,62,69,75] used Face++ [199], 5 studies [12,41-43,73] used Demographics Pro [200], and 2 studies used Onomap [201] to determine ethnicity [59,86]. Face++ is a validated ML face detection service that analyzes features with confidence levels for inferred race attributes. Specifically, it uses deep learning to identify whether profile pictures contain a single face and then the race of the face (limited to Asian, Black, and White); it does not infer ethnicity (eg, Hispanic) [199]. Demographics Pro estimates demographic characteristics based on Twitter behavior or use, using NLP, entity identification, image analyses, and network theory [200]. Onomap is a software tool for classifying names [201]. A total of 3 studies that used Face++ used the same baseline data set [45,62,75], and one used a partial subset of the same data set [69].

In total, 2 studies that used Face++ [32,58] did not measure its performance. Another study [44] stated that Face++ could identify race with 99% confidence or higher for 9% of total users. In addition, 2 studies [53,55] used Face++ along with other methods. One of these used Face++ in conjunction with demographics, using a given name or full name from a database that contains US census data. This study simply measured the percentage of Twitter users for which race data could be extracted (46% of college students and 92% of role models) but did not measure the performance of Face++ [53]. The other study [55] built a classifier model on top of Face++ and recorded an accuracy of 83.8% when compared with users who stated their nationality.

A total of 4 studies [45,62,69,75] (with the same data set in full or in part) used the average confidence level reported by Face++ for race, which was 85.97 (SD 0.024%), 85.99 (SD 0.03%), and 86.12 (SD 0.032%), respectively, with a CI of 95%. When one of these studies [45] carried out its own accuracy assessment, it found an accuracy score of 79% for race when compared with 100 manually annotated pictures. Huang et al [56] also carried out an accuracy assessment and found that Face++ achieved an average accuracy score of 88.4% for race when compared with 250 manually annotated pictures.

A total of 5 studies [12,41-43,73] used Demographics Pro, and although they reported on Demographics Pro success in general, they did not directly report any metrics of its success. The 2 studies using Onomap provided no validation of the software [59,86].

In light of our results, we have compiled our recommendations for best practice, which are summarized in Figure 2 and further examined in the Discussion section.
Figure 2. Summary of our best practice recommendations.
Discussion

Principal Findings

As there are no currently published guidelines or even best practice guidance, it is no surprise that researchers have used a variety of methods for estimating the race or ethnicity of Twitter users. We identified four categories of methods: manual screening, census-based prediction, ad hoc ML or NLP, and off-the-shelf software. All these methods exhibit particular strengths, as well as inherent biases and limitations.

Comparing the validity of methods for the purpose of deriving race or ethnicity is difficult, as classification models differ not only in approach but also in the definition of the classification of race or ethnicity itself [112,202,203]. There is also a distinct lack of evaluation or validation of the methods used. Those that measured the performance of the methods used found accuracy to vary from 45% to 93%, with significantly lower accuracy in identifying categories of people of color.

This review sheds little light on the performance of commercial software packages. Previous empirical comparisons of facial recognition application programming interfaces have found that Face++ achieves 93% accuracy [204] and works comparatively better for men with lighter skin [205]. The studies included in our review suggested a lower accuracy. However, data on accuracy were not forthcoming in any of the included studies using Demographics Pro [200]. Even when performance is assessed, the methodology used may be biased if there are issues with the gold standard used to train the model.

In addition to the 4 overarching methods used, the studies varied in terms of the features used to determine or define race or ethnicity, and the reliability of these features for this purpose is questionable. Specifically, the use of Twitter users' profile pictures, names, and locations; the use of unvalidated linguistic features attributed to racial groups (such as slang words, African American vernacular English, Spanglish, or Multicultural London English); and the use of training data that are prone to perpetuate biases (eg, police booking photos or mug shots) were all of particular concern.

Issues Related to the Methods Used

Approaches that include or rely solely on profile pictures to determine race or ethnicity can introduce bias. First, not all users have a photograph as their profile picture, nor is it easy to determine whether the picture used is that of the user. A study on the feasibility of using Face++ found that only 30.8% of Twitter users had a detectable single face in their profile. A manual review of automatically detected faces determined that 80% could potentially be of the user (ie, not a celebrity) [206]. Human annotation may introduce additional bias, and studies have found systematic biases in the classification of people into racial or ethnic groups based on photographs [207,208]. Furthermore, humans tend to perceive their own race more readily than others [209,210]. Thus, the race or ethnicity of the annotation team has an impact on the accuracy of their race or ethnicity labels, potentially skewing the sample labels toward the race or ethnicity of the annotators [211,212]. Given that ML and NLP methods are trained on these data sets, the human biases transfer to the automated methods, leading to poorly supervised ML and training, which has been shown to result in discrimination by the algorithm [213-215]. These concerns did not appear to be interrogated by the study designers. Without exception, they present categorization of persons into race or ethnicity assuming that a subjective reading of facial features or idiomatic speech is the gold standard, both for coding of race or ethnicity and for training and evaluation of automated methods.

Other methods, such as using geography or names as indicators of race, may also be unreliable. One could argue that the demographic profile of a geographic region better represents the demographic environment than any individual's race or ethnicity. Problems in using postcodes or locations to decipher individual social determinants are well documented [216].
The use of census data from an area that is too large may skew the results. Among the studies reviewed, some used census block data, which are granular, whereas others extrapolated from larger areas, such as city- or county-level data. For example, Saravanan [72] inferred the demographics of users in a city as a certain ethnic group based on the city having a large population of that group; however, no fine-grained analysis was performed either for the city chosen or for the geolocation of the Twitter user. Thus, the validity of their assumption that a user in Los Angeles County is of Mexican descent [72] is questionable. As these data were then used to create a race or ethnicity dictionary of terms used by that group to train their model, the questionable assumption further taints downstream applications and results. The models also do not consider the differences between the demographics of Twitter users and the general demographics of the population.

In addition, census demographic data that use names are also questionable because of name-taking in marriage and indiscernible names.

The practice of using a Twitter user's self-reported race or ethnicity would provide a label with high confidence but restrict the amount of usable data and introduce a margin of error depending on the method used to extract such self-reports. For example, in a sample of 14 million users, >0.1% matched precise regular expressions created to detect self-reported race or ethnic identity [128]. Another study used mentions of keywords related to race or ethnicity in a user's bio; however, limited validation was conducted to ensure that the mention was actually related to the user's race or ethnicity [67,68]. This lack of information gathered from the profile information leads to sampling bias in the training of the models [152].
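The sketch below illustrates the kind of precise self-report pattern such studies describe; the pattern and group list are illustrative, not those used in study [128].

```python
# A minimal sketch of regex-based detection of explicit self-reports in a
# bio or tweet, in the spirit of the precise patterns described above.
# The pattern and group list are illustrative, not those of study [128].

import re

SELF_REPORT = re.compile(
    r"\bi(?:'m| am)\s+(?:a\s+)?(black|white|asian|latino|latina|hispanic)\b",
    re.IGNORECASE,
)

def detect_self_report(text):
    """Return the self-reported label if the text matches, else None."""
    match = SELF_REPORT.search(text)
    return match.group(1).lower() if match else None

print(detect_self_report("Proud dad. I'm a Black scientist in Philly."))  # black
print(detect_self_report("Coffee lover and runner."))                     # None
```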
Some models trained on manually annotated data did not have high interannotator agreement; for example, Chen et al [46] crowdsourced annotation with agreement measured at 0.45. This can be interpreted as weak agreement, with the percentage of reliable data being 15% to 35% [217]. Training a model on such weakly labeled data produces uncertain results.

It is not possible to assume the accuracy of black box proprietary tools and algorithms. The only race or ethnicity measure that seems empirically reliable is self-report, but this has considerable limitations. Thus, faulty methods continue to underpin digital health research, and researchers are likely to become increasingly dependent on them. The gold standard data required to know the demographic characteristics of a Twitter user are difficult to ascertain.

The methods that we highlight as best practices include directly asking the Twitter users. This can be achieved, for example, by asking respondents of a traditional survey for both their demographic data and their Twitter handles so that the data can be linked [96]. This was undertaken in the NatCen Social Research British Social Attitudes Survey 2015, which has the added benefit of allowing the study of the accuracy of further methods for deriving demographic data [20]. Contacting Twitter users may also provide a gold standard but is impractical, given that the current terms of use of Twitter might consider such contact a form of spamming [72,204,205,216]. A limitation of extracting race or ethnicity from social media is the necessity to oversimplify the complexity of racial identity. The categories were often limited to Black, White, Hispanic, or Asian. Note that Hispanic is considered an ethnicity by the US census, but most ML studies used it as a race category, more so than Asian (because of low numbers in this category). Multiple racial identities exist, particularly from an international perspective, and such simplified schemes overlook multiracial or primary and secondary identities. In addition, inferred identities may differ from self-identity, raising further issues.

Given the sensitive nature of the data, it is important as a best practice for the results of studies that derive race or ethnicity from Twitter data to be reproducible for validation and future use. Reproducing most of the studies in this review would be difficult or impossible, as only 5 studies were linked to available code or data [38,47,79,81,108]. Furthermore, there is limited information regarding the coding of the training data. None of the studies detailed their annotation schemas or made their annotation guidelines available. Detailed guidelines as a best practice may allow recreation or extension of data sets in situations where the original data may not be shared or where there is data loss over time. This is particularly true of data collected from Twitter, where the terms of use require that shared data sets consist of only tweet IDs, not tweets, and that best efforts be made to delete IDs from the data set if the original tweet is removed or made private by the user. Additional restrictions are placed on special use cases for sensitive information, prohibiting the storage of such sensitive information if detected or inferred from the user. Twitter explicitly states that information on racial or ethnic origin cannot be derived or inferred for an individual Twitter user and allows academic research studies to use only aggregate-level data for analysis [218]. It may be argued that this policy is more likely to be targeted at commercial activities.

Strengths and Limitations

We did not limit our database searches and other methods by study design; however, we were unable to identify any previous reviews on the subject. To the best of our knowledge, this is the first review of methods used to extract race or ethnicity from social media. We identified studies from a range of disciplines and sources and categorized and summarized the methods used. However, we were unable to obtain information on the methodologies used by private-sector companies that created software for this purpose. Marketing and targeted advertising are common on social media and are likely to use race as a part of their algorithms to derive target users.

We did not limit our included papers to those in which the extraction of race or ethnicity was the primary focus. Although this can be conceived as a strength, it also meant that reporting of the methods used was often poor. Accurate recreation of the data was hampered by not knowing how decisions were made in the original studies, including what demographic definitions of race or ethnicity were used or how accuracy was determined. This limited the assessment of the included studies. Few studies validated the methods or conducted an error analysis to assess how often race is misapplied, and those that did rarely used the most appropriate gold standard. This makes it difficult to directly compare the results of the different approaches.
Future Directions

Future studies should investigate their methodological approaches to estimating race or ethnicity, offering careful interpretations that acknowledge the significant limits of these approaches and their impact on the interpretation of the results. This may include reporting the results as a range that communicates the inherent uncertainty of the classification model. Social media data may best be used in combination with other information. In addition, we must always be mindful that race is a proxy measure for the much larger impact of being a particular race or ethnicity in a society. As a result, the variability associated with race and ethnicity might reveal more about the effects of racism and social stratification than about individual user attributes. To conduct this research ethically and rigorously, we recommend several practices that can help reduce bias and increase reproducibility.

We recommend acknowledging the researchers' biases that can influence the conceptualization and implementation of the study. Incorporating this reflexivity, as is common in qualitative research, allows for the identification of potential blind spots that weaken the research. One way to address homogeneous research teams is through the inclusion of experts in race or ethnicity or in the communities being examined. These biases can also be reduced by including members of the study population in the research process as experts and advisers [219]. Although big data from social media can be collected without ever connecting with the people who contributed the data, this does not eliminate the ethical need for researchers to include representative perspectives in research processes. Examples of patient-engaged research and patient-centered outcomes research, community-based participatory research, and citizen science (public participation in scientific research) within the health and social sciences amply demonstrate the instrumental value and ethical obligation of intentional efforts to involve nonscientist partners in cocreation of research [219]. The quality of data science can be improved by seriously heeding the imperative, "Nothing about us without us" [219]. Documenting and establishing the diverse competence attributes of a research team should become a standard. Emphasizing the importance of diverse teams within the research process will contribute to social and racial justice in ways other than improving the reliability of research.

In terms of the retrieved data, the most reliable (though imperfect) method for ascertaining race was when users self-identified their racial affiliation. Further research on overcoming the limitations of availability and sample size may be warranted. Indeed, a hybrid model with automated methods and manual extraction may be preferred. For example, automated methods could be developed to identify potential self-declarations in a user profile or timeline, which can then be manually interpreted.
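Such a hybrid workflow can be as simple as an automated first pass that queues flagged profiles for human review, as in the sketch below; detect_self_report is the illustrative detector sketched earlier.

```python
# A minimal sketch of the hybrid workflow suggested above: an automated
# first pass flags possible self-declarations, and a human annotator makes
# the final call. detect_self_report is the illustrative detector sketched
# earlier; the queue is a plain list of records awaiting review.

def build_review_queue(profiles, detector):
    """profiles: iterable of (user_id, bio). Returns records for manual review."""
    queue = []
    for user_id, bio in profiles:
        guess = detector(bio)
        if guess is not None:  # only flagged profiles reach a human
            queue.append({"user": user_id, "bio": bio,
                          "machine_guess": guess, "human_label": None})
    return queue
```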
Finally, we call for greater reporting of validation by our colleagues. Without error analysis, bias in computational techniques cannot be detected. Further research is needed to establish whether any bias is systematic or random, that is, whether inaccuracies favor one direction or another.

Conclusions

We identified major concerns that affect the reliability of the methods and bias the results. There are also ethical concerns throughout the process, particularly regarding the inference of race or ethnicity, as opposed to the extraction of self-identity. However, the potential usefulness of social media research requires thoughtful consideration of the best ways to estimate demographic characteristics such as race and ethnicity [112]. This is particularly important, given the increased access to Twitter data [202,203].

Therefore, we propose several approaches to improve the extraction of race or ethnicity from social media, including representative research teams and a mixture of manual and computational methods, as well as future research on methods to reduce bias.
Acknowledgments
This work was supported by the National Institutes of Health (NIH) National Library of Medicine under grant NIH-NLM 1R01
(principal investigator: GG, with coapplicants KO and SG) and NIH National Institute of Drug Abuse grant R21 DA049572-02
to RS. NIH National Library of Medicine funded this research but was not involved in the design and conduct of the study;
collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or decision
to submit the manuscript for publication.
Data Availability
The included studies are available on the web, and the extracted data are presented in Table S2 in Multimedia Appendix 1. A
preprint of this paper is also available: Golder S, Stevens R, O’Connor K, James R, Gonzalez-Hernandez G. 2021. Who Is
Tweeting? A Scoping Review of Methods to Establish Race and Ethnicity from Twitter Datasets. SocArXiv. February 14.
doi:10.31235/osf.io/wru5q.
Authors' Contributions
SG, RS, KO, RJ, and GG contributed equally to the study. RS and GG proposed the topic and the main idea. SG and RJ were
responsible for literature search. SG, RS, KO, RJ, and GG were responsible for study selection and data extraction. SG drafted
the manuscript. SG, RS, KO, RJ, and GG commented on and revised the manuscript. SG provided the final version of this
manuscript. All authors contributed to the final draft of the manuscript.
Conflicts of Interest
None declared.
Multimedia Appendix 1
Search strategies and characteristics of included studies.
[DOCX File, 59 KB]
References
1. Golder S, Norman G, Loke YK. Systematic review on the prevalence, frequency and comparative value of adverse events
data in social media. Br J Clin Pharmacol 2015 Oct;80(4):878-888. [doi: 10.1111/bcp.12746] [Medline: 26271492]
2. Sarker A, Ginn R, Nikfarjam A, O'Connor K, Smith K, Jayaraman S, et al. Utilizing social media data for pharmacovigilance:
a review. J Biomed Inform 2015 Apr;54:202-212 [FREE Full text] [doi: 10.1016/j.jbi.2015.02.004] [Medline: 25720841]
3. Bhattacharya M, Snyder S, Malin M, Truffa MM, Marinic S, Engelmann R, et al. Using social media data in routine
pharmacovigilance: a pilot study to identify safety signals and patient perspectives. Pharm Med 2017 Apr 17;31(3):167-174.
[doi: 10.1007/s40290-017-0186-6]
4. Convertino I, Ferraro S, Blandizzi C, Tuccori M. The usefulness of listening social media for pharmacovigilance purposes:
a systematic review. Expert Opin Drug Saf 2018 Nov;17(11):1081-1093. [doi: 10.1080/14740338.2018.1531847] [Medline:
30285501]
5. Golder S, Smith K, O'Connor K, Gross R, Hennessy S, Gonzalez-Hernandez G. A comparative view of reported adverse
effects of statins in social media, regulatory data, drug information databases and systematic reviews. Drug Saf 2021 Feb
01;44(2):167-179 [FREE Full text] [doi: 10.1007/s40264-020-00998-1] [Medline: 33001380]
6. Bychkov D, Young S. Social media as a tool to monitor adherence to HIV antiretroviral therapy. J Clin Transl Res 2018
Dec 17;3(Suppl 3):407-410 [FREE Full text] [Medline: 30873489]
7. Kalf RR, Makady A, Ten HR, Meijboom K, Goettsch WG, IMI-GetReal Workpackage 1. Use of social media in the
assessment of relative effectiveness: explorative review with examples from oncology. JMIR Cancer 2018 Jun 08;4(1):e11
[FREE Full text] [doi: 10.2196/cancer.7952] [Medline: 29884607]
8. Golder S, O'Connor K, Hennessy S, Gross R, Gonzalez-Hernandez G. Assessment of beliefs and attitudes about statins
posted on Twitter: a qualitative study. JAMA Netw Open 2020 Jun 01;3(6):e208953 [FREE Full text] [doi:
10.1001/jamanetworkopen.2020.8953] [Medline: 32584408]
9. Golder S, Bach M, O'Connor K, Gross R, Hennessy S, Gonzalez Hernandez G. Public perspectives on anti-diabetic drugs:
exploratory analysis of Twitter posts. JMIR Diabetes 2021 Jan 26;6(1):e24681 [FREE Full text] [doi: 10.2196/24681]
[Medline: 33496671]
10. Hswen Y, Naslund JA, Brownstein JS, Hawkins JB. Monitoring online discussions about suicide among Twitter users with
schizophrenia: exploratory study. JMIR Ment Health 2018 Dec 13;5(4):e11483 [FREE Full text] [doi: 10.2196/11483]
[Medline: 30545811]
11. Howie L, Hirsch B, Locklear T, Abernethy AP. Assessing the value of patient-generated data to comparative effectiveness
research. Health Aff (Millwood) 2014 Jul;33(7):1220-1228. [doi: 10.1377/hlthaff.2014.0225] [Medline: 25006149]
12. Cavazos-Rehg PA, Krauss MJ, Costello SJ, Kaiser N, Cahn ES, Fitzsimmons-Craft EE, et al. "I just want to be skinny.":
a content analysis of tweets expressing eating disorder symptoms. PLoS One 2019;14(1):e0207506 [FREE Full text] [doi:
10.1371/journal.pone.0207506] [Medline: 30650072]
13. Ahmed W, Bath PA, Sbaffi L, Demartini G. Novel insights into views towards H1N1 during the 2009 Pandemic: a thematic
analysis of Twitter data. Health Info Libr J 2019 Mar;36(1):60-72 [FREE Full text] [doi: 10.1111/hir.12247] [Medline:
30663232]
14. Cook N, Mullins A, Gautam R, Medi S, Prince C, Tyagi N, et al. Evaluating patient experiences in dry eye disease through
social media listening research. Ophthalmol Ther 2019 Sep;8(3):407-420 [FREE Full text] [doi: 10.1007/s40123-019-0188-4]
[Medline: 31161531]
15. Roccetti M, Salomoni P, Prandi C, Marfia G, Mirri S. On the interpretation of the effects of the Infliximab treatment on
Crohn's disease patients from Facebook posts: a human vs. machine comparison. Netw Model Anal Health Inform Bioinforma
2017 Jun 26;6(1). [doi: 10.1007/s13721-017-0152-y]
16. Madden ML, Cortesi S, Gasser U, Duggan M, Smith A, Beaton M. Teens, social media, and privacy. Pew Internet &
American Life Project. 2013. URL: http://www.pewinternet.org/2013/05/21/teens-social-media-and-privacy/ [accessed
2022-04-19]
17. Chou WS, Hunt YM, Beckjord EB, Moser RP, Hesse BW. Social media use in the United States: implications for health
communication. J Med Internet Res 2009;11(4):e48 [FREE Full text] [doi: 10.2196/jmir.1249] [Medline: 19945947]
18. Social media use in 2018. Pew Research Center. URL: https://www.pewresearch.org/internet/2018/03/01/social-media-use-in-2018/ [accessed 2022-04-19]